Correct mapping for gl2_message_id

Hello Graylog community,

I have noticed that Graylog’s index templates have specific mappings only for full_message, gl2_accounted_message_size, gl2_processing_timestamp, message, source, streams and timestamp.

However, there is no specific mapping for gl2_message_id, so it gets indexed by the default rule: as a keyword with both inverted index and doc values enabled.

I found that the doc values might be used for tie-breaking, but is there any use for the inverted index?

I ask because, not counting _source, the disk-usage analysis API (Add a disk-usage analysis API · Issue #68508 · elastic/elasticsearch · GitHub) flags this gl2_message_id field as the second most disk-hungry, right after “message”.
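
For readers who want to reproduce this kind of ranking, here is a minimal sketch. It assumes the per-field layout returned by Elasticsearch’s disk usage API (`POST /<index>/_disk_usage?run_expensive_tasks=true`), where each field carries `total_in_bytes`, an `inverted_index` object and `doc_values_in_bytes`. The index name `graylog_0` and all byte counts are made-up illustration values, not measurements from this thread.

```python
def rank_fields_by_disk_usage(response, index_name, exclude=("_source",)):
    """Return per-field usage rows sorted by total bytes, largest first.

    `response` is assumed to be the parsed JSON of a _disk_usage call.
    """
    fields = response[index_name]["fields"]
    rows = []
    for name, stats in fields.items():
        if name in exclude:
            continue
        rows.append({
            "field": name,
            "total": stats.get("total_in_bytes", 0),
            "inverted_index": stats.get("inverted_index", {}).get("total_in_bytes", 0),
            "doc_values": stats.get("doc_values_in_bytes", 0),
        })
    return sorted(rows, key=lambda row: row["total"], reverse=True)


# Hypothetical response fragment; the numbers are invented for illustration.
sample_response = {
    "graylog_0": {
        "fields": {
            "_source": {"total_in_bytes": 2000, "stored_fields_in_bytes": 2000},
            "message": {"total_in_bytes": 500,
                        "inverted_index": {"total_in_bytes": 500}},
            "gl2_message_id": {"total_in_bytes": 300,
                               "inverted_index": {"total_in_bytes": 150},
                               "doc_values_in_bytes": 150},
            "timestamp": {"total_in_bytes": 50, "doc_values_in_bytes": 30},
        }
    }
}

ranking = rank_fields_by_disk_usage(sample_response, "graylog_0")
for row in ranking:
    print(row["field"], row["total"], row["inverted_index"], row["doc_values"])
```

Splitting each row into inverted index and doc values makes it easy to see which of the two structures a mapping change would actually free.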

Hello,

This is a hard question to answer.
Are you trying to shorten the _id or to remove it? If either one is correct, I’m not sure how to go about it.
If this is incorrect, could you explain what you’re trying to achieve?

Hi gsmith. I see you’ve been promoted to Leader in the meantime.

_id is a unique message id created by ES that allows us to fetch individual logs.
It is not a normal field, so I don’t think its mapping can be changed. Anyway, I don’t want to mess with it.

gl2_message_id is a “random string” that Graylog puts in all messages.
This is a normal message field, so its mapping can be easily changed.

With the default mappings, this field eats disk space in three ways:

  • Actual data, stored in a field named _source,
  • Inverted index of the field, allowing using this field in search (Eating 1.5% of my whole disk!),
  • doc_values, apparently used for tie-breaking in graylog (Eating another 1.5% of my disk).

Is this mapping correct? Does Graylog use both of these disk-hungry features,
or can I disable the inverted index and free 1.5% of my disk?
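
To make the question concrete, this is a sketch of the mapping change being considered: keep doc_values but turn off the inverted index with `"index": false`. The template layout (a legacy index template with `index_patterns` and an `order` higher than Graylog’s own template) and the `graylog_*` pattern are assumptions about a typical setup. Whether Graylog still behaves correctly with this mapping is exactly the open question, so treat it as something to try on a test system only.

```python
import json


def build_custom_template(index_pattern="graylog_*"):
    """Build a template body that disables the inverted index for gl2_message_id.

    Assumption: the body is sent to the legacy template endpoint (PUT _template/<name>)
    and only affects indices created after it is installed.
    """
    return {
        "index_patterns": [index_pattern],
        # Assumed to be higher than the order of Graylog's built-in template.
        "order": 0,
        "mappings": {
            "properties": {
                "gl2_message_id": {
                    "type": "keyword",
                    "index": False,      # no inverted index for this field
                    "doc_values": True,  # keep doc values for sorting / tie-breaking
                }
            }
        },
    }


template = build_custom_template()
print(json.dumps(template, indent=2))
```

With `"index": false`, queries on gl2_message_id may fail or become much slower depending on the Elasticsearch version, so anything that looks messages up by this field would need to be checked after the change.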

Hello,

Yes, gl2_message_id is the identifier for each unique message. It will be set to a ULID during processing. As far as I know it acts like _id. Using ULIDs results in shorter IDs (26 characters for ULID vs 36 for UUID) and thus reduced storage usage.
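
The 26 vs 36 character figure can be checked with a small sketch. The encoder below follows the general ULID layout (a 48-bit millisecond timestamp followed by 80 random bits, written as 26 Crockford base32 characters). It is only an illustration, not the code Graylog uses, and `new_ulid` is a hypothetical helper name.

```python
import os
import time
import uuid

CROCKFORD_ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(timestamp_ms=None):
    """Encode a 48-bit timestamp plus 80 random bits as 26 base32 characters."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    value = (timestamp_ms << 80) | int.from_bytes(os.urandom(10), "big")
    chars = []
    for _ in range(26):  # 26 characters * 5 bits covers the 128-bit value
        chars.append(CROCKFORD_ALPHABET[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


ulid_text = new_ulid()
uuid_text = str(uuid.uuid4())
print(len(ulid_text), ulid_text)
print(len(uuid_text), uuid_text)
```

The saving comes from the denser encoding: both formats carry 128 bits, but base32 without separators needs fewer characters than hexadecimal with hyphens.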

I haven’t seen anyone do that, but if you get it to work without breaking anything, I would be curious to see how you went about it.

On that note, in the post below I showed the _id and the gl2_message_id.

As you can see, I can search with both of them. I also use ${message.id}. So again, I’m not sure. To be honest, I would use a dev VM, try to adjust it to your needs and see what happens. By chance, have you posted on GitHub about this? I would think that one of the staff members would be able to answer this question in more detail.

EDIT: I forgot to mention: if gl2_message_id is a concern, have you thought about creating a custom index?

Thank you for the replies.

I would expect ULID to reduce storage size only in _source and only if they compress better.
But they are unique values, so I don’t expect big savings on the inverted index.

Sadly, I can’t ask on GitHub anymore. I don’t want to register for a Microsoft account.

What do you mean by “custom index”? Just custom mappings on Graylog indices?
Sure. A lot of my non-built-in fields use custom mappings. I even changed gl2_remote_ip to type ip so I can search by source subnet, and message to match_only_text to save disk space.
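
As a rough sketch of those two changes, the snippet below shows the mapping properties and an Elasticsearch term query that matches a whole subnet, which the ip field type accepts in CIDR notation. The helper name `subnet_query` and the example subnet are made up for illustration. `match_only_text` needs a reasonably recent Elasticsearch version.

```python
import ipaddress
import json

# Mapping properties as described above, meant to go into a custom index template.
custom_properties = {
    "gl2_remote_ip": {"type": "ip"},
    "message": {"type": "match_only_text"},
}


def subnet_query(cidr, field="gl2_remote_ip"):
    """Build a term query matching every address inside the given CIDR block."""
    ipaddress.ip_network(cidr)  # raises ValueError if the CIDR is malformed
    return {"query": {"term": {field: cidr}}}


query = subnet_query("10.1.2.0/24")
print(json.dumps(custom_properties, indent=2))
print(json.dumps(query, indent=2))
```

Validating the CIDR locally first means a typo is caught before the query is ever sent to the cluster.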

Hello,

Yes, I was referring more to creating a new index template. Since Elasticsearch mapping is dynamic by default, you can either turn that option off or create a static index template/mapping. Just an idea for saving disk space.
