Sudden and sustained CPU spikes

1. Describe your incident:
I have had a sudden CPU spike that has been ongoing for most of the day. Prior to this, CPU utilization was low, and no changes have been made to the deployment.

2. Describe your environment:

  • OS Information:
    Ubuntu 20.04.5 LTS

  • Package Version:
    4.3.9

  • Service logs, configurations, and environment variables:
    In the server.log file I see the following:

@graylog:/var/log/graylog-server$ tail server.log
        at org.graylog.shaded.elasticsearch7.org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:496)
        at org.graylog.shaded.elasticsearch7.org.elasticsearch.ElasticsearchException.fromXContent(ElasticsearchException.java:407)
        at org.graylog.shaded.elasticsearch7.org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:437)
        ... 33 more
Caused by: ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [message] in order to load field data by uninverting the inverted index. Note that this can use significant memory.]]
        at org.graylog.shaded.elasticsearch7.org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:496)
        at org.graylog.shaded.elasticsearch7.org.elasticsearch.ElasticsearchException.fromXContent(ElasticsearchException.java:407)
        at org.graylog.shaded.elasticsearch7.org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:437)
        ... 35 more

3. What steps have you already taken to try and solve the problem?
I ran ‘htop’ and saw that the graylog process is using all cores. I am considering a reboot but have not done one yet.

4. How can the community help?

Can someone point to why CPU utilization would suddenly increase after weeks of no changes to the deployment?

Hello @michmoor && welcome

From what I can see, the relevant part of the logs is this:

reason=Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [message] in order to load field data by uninverting the inverted index. Note that this can use significant memory.]]

I need some more information to help you further. Chances are something is going on with Elasticsearch.
It could be a bad GROK pattern, pipeline, dashboard widget, index template, etc.
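
To see how often this error occurs and which field it points at, the "Caused by:" lines in server.log can be tallied. Below is a minimal sketch in Python, assuming the log lines have the same shape as the one quoted above. The function name `summarize_causes` and both regular expressions are made up for illustration; they are not part of Graylog.

```python
import re
from collections import Counter

# Matches the "[type=..., reason=..." portion of a "Caused by:" line
# shaped like the one quoted in this thread (assumed format).
CAUSE_RE = re.compile(r"Caused by: .*?\[type=(?P<type>\w+), reason=(?P<reason>.*)")
# Pulls the field name out of "set fielddata=true on [message]".
FIELD_RE = re.compile(r"fielddata=true on \[(?P<field>[^\]]+)\]")


def summarize_causes(lines):
    """Count (exception type, field name) pairs across 'Caused by:' lines."""
    counts = Counter()
    for line in lines:
        cause = CAUSE_RE.search(line)
        if cause is None:
            continue  # not a "Caused by:" line with a type/reason pair
        field = FIELD_RE.search(cause.group("reason"))
        key = (cause.group("type"), field.group("field") if field else None)
        counts[key] += 1
    return counts
```

It could be run against the file from the original post, for example `with open("/var/log/graylog-server/server.log") as f: print(summarize_causes(f))`. A high count for one field suggests that a search or widget keeps aggregating or sorting on that text field.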

Hi @gsmith. Thanks for responding! My Graylog is still in the POC/QA stage, so this is just a one-node cluster. For months it has been operating flawlessly. I’ve done the application upgrades without issues, so it’s strange that this issue cropped up all of a sudden. Interestingly, the CPU has dialed down just a bit but is still very high, so the alerts aren’t as constant. I do have some dashboards but no pipelines. What information can I start giving you? I’m the only one in charge of this deployment, so I know there haven’t been any modifications to the Graylog configuration.

Do you use regex for parsing? Are you sure that none of your logs trigger a regexploit in your parsing?
Did you try restarting your Graylog service?
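
To illustrate why a regex can suddenly burn CPU even though nothing in the configuration changed, here is a minimal sketch in Python. It assumes a deliberately bad pattern with a nested quantifier; both patterns and the function names are hypothetical and are not the extractor from this thread. Python's `re` module is used only for the demonstration, and the backtracking behaviour of other regex engines may differ in detail.

```python
import re
import time

# Hypothetical patterns for illustration only.
VULNERABLE = re.compile(r"^(a+)+$")  # nested quantifier: backtracks exponentially on a near miss
SAFE = re.compile(r"^a+$")           # accepts the same strings without nesting


def time_match(pattern, text):
    """Return (matched, seconds) for a single match attempt."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start


def compare(n):
    """Time both patterns on n 'a' characters followed by one 'b'."""
    # The trailing 'b' makes the match fail only at the very end,
    # which forces the vulnerable pattern to try every way of
    # splitting the run of 'a' characters between the two quantifiers.
    text = "a" * n + "b"
    return time_match(VULNERABLE, text), time_match(SAFE, text)
```

Calling `compare(18)` and then `compare(22)` should show the time for the vulnerable pattern growing steeply with each extra character while the safe pattern stays fast. This is why a small change in incoming log content can be enough to trigger the problem without any change to the extractor.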

I did restart the Graylog service, but that didn’t help.
I am using regex for parsing. I’m reviewing your link now to see if there’s a potential issue.

Interesting development. I shut down one of my inputs, since I know it receives more data than the others, and wouldn’t you know it, CPU utilization dropped immediately afterwards. This is very strange, as I made no changes to the configured extractors and this was all working fine. I will continue to debug, but I honestly don’t know where to begin.

Do you do a lot of parsing on that input? Regex and Grok patterns can be very expensive when a high volume of logs is passing through.

I had this phenomenon once, and after hours of digging deep, it turned out that one of the Graylog users had left a Firefox tab open running a heavy “All time” query with the “Play” button enabled.

I do indeed do parsing on that input.
So for everyone watching, this turned out to be the extractor I’ve been using. What’s interesting is that absolutely nothing has changed. Perhaps the volume of logs increased, which is something I would need to investigate more. But for now that seems to be the root cause.
Thank you to everyone for chiming in and offering assistance. Appreciate you!

@michmoor Thanks for chiming back in on this issue. I was also watching this post.

@ihe I really like the link you posted 👍

Maybe I should write up a little “how to parse stuff at scale”. That link is one of the important sources 🙂

That would be awesome.
