Elasticsearch custom index mapping


1. Describe your incident:
Graylog's processing buffer gets clogged by a message that contains 65,000 characters on a single line.
This relates to the 32 KB per-field limit that Elasticsearch has.
Related to the original topic: 32 kb limit per field, elasticsearch, Disabled analysis fields, index failure

2. Describe your environment:

  • OS Information: Linux RH 7.9

  • Package Version: Graylog 4.2.13

  • Service logs, configurations, and environment variables:
    Not relevant, I think; the system works until some "bad" message gets stuck in the buffer.
    Then the Graylog service is stopped, the whole journal folder is wiped (rm -rf *),
    and the service is started again. All is good after that, except a few thousand messages are lost in the process (journal/queue).

3. What steps have you already taken to try and solve the problem?
The solution seems to be a custom Elasticsearch index mapping, but I have zero experience with that.

4. How can the community help?
Good documentation or a practical example of a mapping that truncates the bad message to, let's say, 30,000 characters, so that Elasticsearch does not choke on it and simply indexes the first 30,000 characters of the line, fitting within the 32 KB limit (see the mapping sketch below).
People reading logs should not send such messages, so they will just have to live with the partial message.
The main goal is to let Graylog process the next messages in the queue.
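
For reference, a minimal sketch of what such a custom mapping could look like, assuming Elasticsearch 7.x, the default graylog_* index prefix, and that the oversized field is full_message (the field name and the 30,000 limit are illustrative). Note that ignore_above does not truncate: values longer than the limit are simply not indexed (they remain in _source), and mapping full_message as keyword instead of the default analyzed text changes how it can be searched. The template would be uploaded to Elasticsearch (for example via PUT _template/graylog-custom-mapping) and only takes effect after the active write index is rotated.

{
  "index_patterns": ["graylog_*"],
  "mappings": {
    "properties": {
      "full_message": {
        "type": "keyword",
        "ignore_above": 30000
      }
    }
  }
}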


The way you describe it, the large message is hung up in the processing queue, which happens before the message is shipped to Elasticsearch for storage and searchability. That means even if you were to create a custom mapping in Elasticsearch, where you might use ignore_above to skip indexing the oversized data, the message would never get there, because it gets hung up in processing within Graylog… at least that's how I see things happening.

You can't modify the data coming in? Where/what is it coming from? How is it being shipped (syslog, NXLog, Beats…)? What input are you using? What kind of processing are you trying to do to the data? Can you give a (reasonable) sample of the data?

I cannot modify the data, at least not without something intercepting the GELF TCP data stream from a bunch of sources.
Normally Graylog gets lines of text, but from time to time our talented programmers just dump an actual payload into a single log message to Graylog, like a JSON string or a whole document, some PDF file and such, and it gets stuck in the node's processing buffer; everything stops and no more messages are processed.

It's under System - Nodes - Details - Actions - Get process-buffer dump. I see 5 lines there inside {} and usually one is around 65,000 or 1.5 million characters long in Notepad++ when I copy the buffer.

A GELF TCP input on port X accepting messages.

Processing: perhaps a few regexp patterns on the input.

{
"ProcessBufferProcessor #0": "source: xxxxxxx | message: {
{ "displayedError":
{ "notifications": [
{ "type": "error", "code": "other", "message":
{ server: xxxxxx | apiName: xxxxxxxx | level: 3 | gl2_remote_ip: xxxxxx | gl2_remote_port: xxxxxxx|
project: xxxxxxx| className: ? | simpleClassName: ? | gl2_source_input: xxxxxxxx | environment: uat |

full_message:
600 lines with “bla bla bla expected type: String”

facility: logstash-gelf | timestamp: 2023-01-05T06:32:19.819Z }",

"ProcessBufferProcessor #1": "idle",
"ProcessBufferProcessor #2": "idle",
"ProcessBufferProcessor #3": "idle",
"ProcessBufferProcessor #4": "idle"
}

It's a normal "document" with fields; it's just that the field full_message contains 600 × 125… around 75,000 characters.

And this full_message field cannot be indexed.

So what are the options to cut that full_message before it reaches the buffer?

I know it can probably be done at the source, but I can't control what crap is fed to that GELF TCP input…

I am willing to bet that your problem is a runaway regex on files that large. Can you post those with proper formatting (</>) and obfuscation?

You can set a rule in the initial stage that uses the substring() function to catch characters 1 through x and then resets the message to the result; then have subsequent stages do your magic. Certainly x will be large, but it may be something to experiment with.
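
A minimal sketch of such a first-stage rule, assuming the oversized field is full_message and a 32,000 character cutoff (both illustrative), with a regex() guard so the rule only fires on oversized messages:

rule "truncate oversized full_message"
when
    has_field("full_message") &&
    // only fire when full_message exceeds ~32,000 characters; (?s) lets . match newlines too
    regex("(?s)^.{32001,}", to_string($message.full_message)).matches == true
then
    // keep only the first 32,000 characters
    set_field("full_message", substring(to_string($message.full_message), 0, 32000));
end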

Holy cow @anonimus, Graylog isn't a wiki, it's a log server :laughing:
Adding on to @tmacgbay's suggestion: perhaps those PDF/document files can have their own input/index, just a thought.

Are you saying that a regex on the input could be the way to go?
Something like truncating everything above 20-30k characters?
A message comes in on the input and then the defined regex patterns filter it before it goes into the buffer, is that correct?
How would such a regexp work? I am not sure there is if-then-else logic. Is there?

Yes, well, I know it, you know it, but it looks like the devs who dump whatever they want in there don't.
How would a separate input help? It's a single node with a single buffer; everything goes through this buffer no matter which input it came from.
As stated above, indexing happens after the message is read from the buffer; the issue is that bad messages just get stuck in the buffer and can't be indexed.


What I meant was… it looks like you said you were running a few regex filters on the input where these large files are coming in. If the regex gets hung up trying to find something, which is particularly easy on a large file, you are likely to lock up buffers. I posted something here a while back for someone having similar issues (Pipeline processing appears to get stuck at times - #3 by tmacgbay); they moved some things around and seemed to have solved their issue. If you are currently running regex (or anything else) via extractor or pipeline against things coming in on that input, can you post it?

My completely separate suggestion to handle/truncate the large file in the pipeline with the substring() function won't help if you have an extractor on the input that is using regex/GROK and causing the problem. I would think you could construct a regex that captures the first 32,000 characters or so (something like ^.{1,32000}) and then use the result for your message. I'm not sure what actually happens as a large message traverses to the locking point, though… so maybe post up what you have? :person_shrugging:
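
A rough sketch of that regex-capture variant as a pipeline rule, assuming the field to shorten is message and 32,000 is the cutoff (both illustrative); in Graylog's regex() result, the first unnamed capture group is available under the key "0":

rule "keep only the first 32000 characters of message"
when
    has_field("message")
then
    // capture at most the first 32,000 characters; (?s) lets . match newlines too
    let head = regex("(?s)^(.{1,32000})", to_string($message.message));
    set_field("message", to_string(head["0"]));
end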

Yes, Input - Manage extractors - there are 14 defined. Everything is on a single input.

Condition

Will only attempt to run if the message includes the string

Configuration

grok_pattern: xxxxxxxxxxxxxxx

Some have a few hits and millions of misses, some a few thousand hits.

One is different:

Condition

Will only attempt to run if the message matches the regular expression ^((?!xxx|xxx|xxx|xxx|xxx).)*$

Configuration

grok_pattern: xxxxxxxxxxxx

This last one is a heavy one: Minimum: 15 μs, Maximum: 69,856 μs, millions of hits.
Those are microseconds, as I understand it: 1/1,000,000 of a second (10^-6).

Found this thread; it seems to be exactly what I need.
I will try a regex extractor, which seems less invasive.
If that does not work, then a custom mapping.

So… here are the buffers below in a screen clip for clarification. If there were a problem with storing things in Elasticsearch, the "Output Buffer" would fill up, because Graylog is trying to output to Elasticsearch.

[screen clip of the Graylog node buffers]

But you stated your "Process Buffer" was filling, which means the issue is in Graylog when it is processing the message, before storing it in Elasticsearch for future searching. You can mess with Elasticsearch all you want, but it doesn't solve your issue as described. On an Elasticsearch side note, that article is very old and is several Graylog versions behind the 4.2.13 you are running; there have been significant architecture changes since then.

As messages come into your Graylog inputs, they are examined by Graylog (which stores its configuration in MongoDB) and then modified, as commanded, before shipping out to Elasticsearch. Here is a link to the community-built Message Process Flow.

You may want to reconsider how you are processing messages; regex/GROK can be terribly inefficient (AND EVEN LOCK UP YOUR PROCESSING BUFFER) if you are not careful. Some of what you have lightly described may have to iterate over a message multiple times to hit a match… that is a LOT of processing wasted on failures!

Here are my recommendations:

  • Create more inputs and shift things that you control to those "Unknown to Developers" ports so you have better control of what you can control. (Actually, this idea belongs to @gsmith :bat:)

  • Move all of your extractors off the Developer Input and put them into a pipeline.

  • Create pipeline rules for the Developer Input that, in the first stage, either truncate large messages or drop them altogether, using the functions mentioned earlier in the thread (see the drop-rule sketch below this list). If you move your extractors completely to pipelines, you can re-use the rules you have for both the Secret Inputs and the Developer Input… after truncation/drop, of course.
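
A rough sketch of the drop variant for that first stage, with a hypothetical 100,000 character threshold (adjust the field name and limit as needed):

rule "drop absurdly large messages"
when
    has_field("full_message") &&
    regex("(?s)^.{100001,}", to_string($message.full_message)).matches == true
then
    // discard the message entirely instead of truncating it
    drop_message();
end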

Hope that helps… :smiley:


Process buffer, yes: Node screen - Actions - Get process buffer dump.
Normally it's empty; when something gets stuck, I find it here, either the first line or all five.

[screen clip of the process-buffer dump]

And messages are filling the process buffer.

I think a very large message just clogs up the buffer, extractor or not, but I will test that and try to let you know the results. Planning on creating a custom extractor:


Both solutions were mentioned; I will try the extractor first, then maybe test the mapping without the extractor and both together. Although, if the extractor cuts the tail off after 16,000 characters, I will probably take the win and implement it :smiley:

Yes, we plan to create more inputs, so that the overall flow is divided and the extractors look through fewer messages.

I don't control the incoming messages or the extractors, just the Graylog Linux server / application itself. They created the regex extractors. We will see what can be done; as I said, I think it's the large message itself, and it would get stuck even without extractors. I will try to send some 100,000 characters on a single line into a test environment and see what happens.

Pipeline? What does it do and how is it different from an extractor?

Now that you mention it, I was wondering what the execution order of extractors is… Is it enough to just put one first alphabetically? Or is the only way a pipeline, with the order in which rules are applied?

I prefer pipelines because I feel they give me better control over managing messages, and I enjoy finding efficient ways of writing rules and stages. I don't use extractors at all. You can find documentation here. One thing to note about pipeline rules: they are set up in stages, and rules within a particular stage tend to run in parallel. If you have a rule that depends on the results of another rule, make sure the dependent rule is in a stage that follows the initial rule. There are plenty of examples of both in the community.

You can make them very efficient, applying a rule only if certain criteria exist (which you can also do with extractors), but using stages you can have one rule parse out fields, then in a following stage (because: dependency) take further actions based on the fields that were parsed out in the previous stage. I gave an example of it over here in this post.
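
As a sketch of how the stages could be wired together, assuming a pipeline connected to the stream fed by that input (the rule names are placeholders for rules like the ones sketched earlier in this thread):

pipeline "Developer input cleanup"
stage 0 match either
    rule "truncate oversized full_message";
stage 1 match either
    rule "parse developer grok fields";
end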


Still in the process of testing. I tried to kill the test Graylog with a 150,000 character line, but Filebeat was smarter than me and automatically truncated the message :smiley: There is some offset limit; I tried to find how to change it but could not find it quickly.
In general, I will try to post the results of all the experiments.


Interesting results trying to break Graylog so far: it is more resilient than it seemed at first.
I pushed a 250,000 character long line straight into a log file; Filebeat did not even blink, just chopped off the end, and Graylog processed the message (Beats input).
Then I created a GELF TCP input and pushed with echo {gelf structure, full_message 400,000 characters long} | nc localhost port.
Nothing… no end of line, just the first N characters, no issues with the buffer.

I am starting to think that our great and smart developers f-ed up their grok patterns.


If you can post them GROKs, we can make suggestions…

May give us a view into what those devs are really up to…

[image: man behind the curtain]


@anonimus

I concur with @tmacgbay

I did post them 11 days ago, on January 9; will look again in the morning.
The xxx in my post are specific names of prod services; basically the heavy one with millions of hits is combing through every message, searching for one or two or three or four of them…

Condition

Will only attempt to run if the message matches the regular expression ^((?!xxx|xxx|xxx|xxx|xxx).)*$

Does anyone see anything problematic?

{
  "extractors": [
    {
      "title": "name",
      "extractor_type": "grok",
      "converters": [],
      "order": 0,
      "cursor_strategy": "copy",
      "source_field": "full_message",
      "target_field": "",
      "extractor_config": {
        "grok_pattern": "\"*%{DATA:method} %{DATA:requestStatus} for country %{DATA:country } .*$"
      },
      "condition_type": "string",
      "condition_value": "docId"
    },
    {
      "title": "name2",
      "extractor_type": "grok",
      "converters": [],
      "order": 0,
      "cursor_strategy": "copy",
      "source_field": "full_message",
      "target_field": "",
      "extractor_config": {
        "grok_pattern": "\"*%{DATA:method} %{DATA:requestStatus} for country %{DATA:country } .*$"
      },
      "condition_type": "string",
      "condition_value": "took"
    },
    {
      "title": "name3",
      "extractor_type": "grok",
      "converters": [],
      "order": 0,
      "cursor_strategy": "copy",
      "source_field": "full_message",
      "target_field": "",
      "extractor_config": {
        "grok_pattern": "\"*%{DATA:method} %{DATA:requestStatus} for country %{DATA:country } and filename: %{DATA:filename}\"*$"
      },
      "condition_type": "string",
      "condition_value": "filename"
    },
    {
      "title": "name4",
      "extractor_type": "split_and_index",
      "converters": [],
      "order": 0,
      "cursor_strategy": "copy",
      "source_field": "full_message",
      "target_field": "name4",
      "extractor_config": {
        "index": 2,
        "split_by": " "
      },
      "condition_type": "string",
      "condition_value": " records from "
    },
    {
      "title": "name5",
      "extractor_type": "grok",
      "converters": [],
      "order": 0,
      "cursor_strategy": "copy",
      "source_field": "full_message",
      "target_field": "",
      "extractor_config": {
        "grok_pattern": "\"%{DATA:duplicate} found. Ticket %{DATA:method} in Jira: %{DATA:jiraNumber} Topic Name %{DATA:topicName}\""
      },
      "condition_type": "none",
      "condition_value": ""
    },
    {
      "title": "name6",
      "extractor_type": "grok",
      "converters": [],
      "order": 0,
      "cursor_strategy": "copy",
      "source_field": "message",
      "target_field": "",
      "extractor_config": {
        "grok_pattern": "\"Ticket %{DATA:method} in Jira: %{DATA:jiraNumber} Topic Name %{DATA:topicName}\""
      },
      "condition_type": "none",
      "condition_value": ""
    },
    {
      "title": "name7",
      "extractor_type": "grok",
      "converters": [],
      "order": 0,
      "cursor_strategy": "copy",
      "source_field": "full_message",
      "target_field": "",
      "extractor_config": {
        "grok_pattern": "Received file: %{DATA:filename} for country: %{DATA:country} .*"
      },
      "condition_type": "string",
      "condition_value": "Received file"
    },
    {
      "title": "name8",
      "extractor_type": "grok",
      "converters": [],
      "order": 0,
      "cursor_strategy": "copy",
      "source_field": "full_message",
      "target_field": "",
      "extractor_config": {
        "grok_pattern": "\"Consumed %{BASE10NUM} records from topic %{DATA:topicName}\""
      },
      "condition_type": "string",
      "condition_value": " records from topic "
    },
    {
      "title": "name9",
      "extractor_type": "grok",
      "converters": [],
      "order": 0,
      "cursor_strategy": "copy",
      "source_field": "full_message",
      "target_field": "",
      "extractor_config": {
        "grok_pattern": "Topic Name %{DATA:topicName} Consumed records %{DATA:consumedRecords}\""
      },
      "condition_type": "string",
      "condition_value": " Consumed records "
    },
    {
      "title": "name10",
      "extractor_type": "grok",
      "converters": [],
      "order": 0,
      "cursor_strategy": "copy",
      "source_field": "message",
      "target_field": "",
      "extractor_config": {
        "grok_pattern": "\"%{DATA:duplicate} found. Jira: %{DATA:jiraNumber} Tolerant score: %{DATA:score} Topic Name %{DATA:topicName}\""
      },
      "condition_type": "string",
      "condition_value": " Tolerant score: "
    },
    {
      "title": "name11",
      "extractor_type": "grok",
      "converters": [],
      "order": 0,
      "cursor_strategy": "copy",
      "source_field": "full_message",
      "target_field": "",
      "extractor_config": {
        "grok_pattern": "\"*%{DATA:method} %{DATA:requestStatus} for country %{DATA:country}\"*$"
      },
      "condition_type": "regex",
      "condition_value": "^((?!docId|took|filename|journalLines|Journal).)*$"
    },
    {
      "title": "name12",
      "extractor_type": "grok",
      "converters": [],
      "order": 0,
      "cursor_strategy": "copy",
      "source_field": "full_message",
      "target_field": "",
      "extractor_config": {
        "grok_pattern": "\\{\n  \"displayedError\": \\{\n    \"notifications\": \\[\n      \\{\n        \"type\": %{DATA:type},\n        \"code\": %{DATA:code},\n        \"message\": %{DATA:message},\n        \"logId\": %{DATA:logId},\n      \\}\n    \\]\n  \\}\\,\n  \"description\": .*?,\n  \"clientId\": %{DATA:clientId},\n  \"countryCode\": %{DATA:countryCode},\n  \"remoteAddress\": %{DATA:remoteAddress}\n\\}"
      },
      "condition_type": "string",
      "condition_value": "displayedError"
    },
    {
      "title": "name13",
      "extractor_type": "grok",
      "converters": [],
      "order": 0,
      "cursor_strategy": "copy",
      "source_field": "full_message",
      "target_field": "",
      "extractor_config": {
        "grok_pattern": "\\{\n  \"displayedError\": \\{\n    \"notifications\": \\[\n      \\{\n        \"type\": .*?,\n        \"code\": .*?,\n        \"message\": .*?,\n        \"logId\": .*?\n      \\}\n    \\]\n  \\}\\,\n  \"description\": .*?,\n  \"originalError\": \\{\n    \"notifications\": \\[\n      \\{\n        \"type\": %{DATA:type},\n        \"code\": %{DATA:code},\n        \"message\": %{DATA:message},\n        \"logId\": %{DATA:logId},\n        \"isExposable\": %{DATA:isExposable}\n      \\}\n    \\]\n  \\},\n  \"clientId\": %{DATA:clientId},\n  \"countryCode\": %{DATA:countryCode},\n  \"remoteAddress\": %{DATA:remoteAddress}\n\\}"
      },
      "condition_type": "string",
      "condition_value": "originalError"
    },
    {
      "title": "name14",
      "extractor_type": "grok",
      "converters": [],
      "order": 0,
      "cursor_strategy": "copy",
      "source_field": "full_message",
      "target_field": "",
      "extractor_config": {
        "grok_pattern": "Topic Name %{DATA:topicName} %{DATA:wrong} message in systemname"
      },
      "condition_type": "none",
      "condition_value": ""
    }
  ],
  "version": "4.2.13"
}

I am looking at this one as a potential issue…