Graylog-ES Communications

Is there a quick command to determine if ES is telling GL to wait or maybe which ES data node is rejecting graylog? For instance, when my graylog stops sending messages into the elasticsearch database, the only way to get it flowing again appears to be recycling the elasticsearch service on each of the data nodes one at a time.

You should watch the log files of Graylog and Elasticsearch - that process should be visible there.
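
If you want something quicker than tailing logs, a couple of one-liners against the Elasticsearch REST API usually show whether the cluster is pushing back. This is just a rough sketch, assuming the default HTTP port 9200 on one of the ES nodes:

# Overall cluster state: red/yellow/green, plus relocating/unassigned shard counts
curl -s 'http://localhost:9200/_cluster/health?pretty'

# Per-node thread pools: a growing "rejected" count on the bulk/index pools
# means that node is telling clients such as Graylog to back off
curl -s 'http://localhost:9200/_cat/thread_pool?v'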

We have experienced the same issue, and during these periods both the node and index stats in Elasticsearch indicate that no throttling is taking place, either for store or indexing. I have my Graylog and Elasticsearch logs pulled into Graylog via Sidecar, and there is nothing logged in either as to why this is happening. My 2nd node backed up by over 10 million events yesterday morning, and not a thing was logged to explain why. No throttling, no real load on any of the systems, and no errors logged during this time regarding this or anything else going on in the clusters.
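
For reference, the throttle counters I was checking live in the node stats; roughly something like this (default port assumed, grep just trims the JSON) should show non-zero values if ES had actually throttled indexing or store operations:

curl -s 'http://localhost:9200/_nodes/stats/indices?pretty' | grep throttle_time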

I’ve been running nload on the Graylog servers so I can easily visualize when they stop sending (because the output no longer matches the input), but I think I may just set a cron job to recycle the service on each data node until I can figure out why. The funny thing is that once you recycle the service, the messages fly out at tens of thousands per second. I set up rsyslog to dump the Elasticsearch and Graylog server logs to a single server for ease of troubleshooting. Besides GC intervals, I am just seeing some unprocessed messages because the Docker logs use the word "info" instead of an integer. I occasionally get Java OutOfMemory errors but haven’t been able to nail down exactly why yet.

I have also been plagued by similar problems, and they seem to have come from some malfunction in the elasticsearch cluster.
If you don’t replicate your indices in Elasticsearch, then just one bad Elasticsearch node will ruin your indexing, because the current writable index is spread across all nodes.
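
A quick way to see which nodes hold shards of the current write index, and whether any are unassigned, is the cat shards API (this assumes the default "graylog" index prefix and port):

curl -s 'http://localhost:9200/_cat/shards/graylog_*?v'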

My problems were also memory, plus poor network/virtualisation performance at times, which would degrade an Elasticsearch node enough that it was kicked out of the cluster.

I fixed the memory issue with the following parameter:
indices.fielddata.cache.size: 20%
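
To confirm that fielddata really is what is filling the heap before capping it, the cat fielddata API gives a per-node breakdown (default port assumed):

curl -s 'http://localhost:9200/_cat/fielddata?v'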

And I moved all ES nodes closer together in a VLAN instead of dispersed over 2 data centers. And remember to upgrade!

Brgds. Martin

I found a query that never completed on a dashboard although it was properly defined and limited.
Also, the following helped greatly:
• Increased the tolerance for health checks between the nodes:
  - timeout from 30s to 60s
  - interval from 1s to 3s
  - retries set to 2 attempts
• Increased max_bytes_per_sec from 150mb to 250mb

Graylog has never functioned so fast; over 335 million records returned in just 5 seconds.


@JoeG

Would you mind writing a few more details down in the forum, so that people can reference your experience and arrive at a solution that works for them?

That would include the configuration files and the changed settings.

Thank you!

To begin with, we have a virtual environment with 10Gb NICs. We use jumbo frames, so our MTU is set to 9000 in the /etc/sysconfig/network-scripts/ifcfg-* files. The GL servers have 16 CPUs with 32GB RAM; the ES servers have 4 CPUs with 16GB RAM for the master-eligible nodes, and 8 CPUs with 64GB RAM for the data nodes. After ruling out the system and the network as factors, we attempted to tune the system further.
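
As a side note, if you want to verify that jumbo frames actually survive end to end between the Graylog and ES hosts, a ping with the don't-fragment flag is a quick sanity check; 8972 is the 9000-byte MTU minus 28 bytes of IP/ICMP headers, and node-01 is just a placeholder hostname:

ping -M do -s 8972 -c 3 node-01
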
The pertinent information in the /etc/elasticsearch/cluster1/elasticsearch.yml file is as follows:

discovery:
  zen:
    minimum_master_nodes: 2 	# We have three master-eligible nodes and 10 data nodes
    fd:
      ping_interval: 3s
      ping_timeout: 60s
      ping_retries: 2
    ping:
      unicast:
        hosts: node-01:9300,node-02:9300,node-03:9300,...
network:
  host: [ "_eth0_", "127.0.0.1" , "[::1]" ]
  publish_host: _site_
bootstrap:
  mlockall: true
indices:
  fielddata:
    cache:
      size: 20%
  memory:
    index_buffer_size: 50%
  store:
    throttle:
      max_bytes_per_sec: 250mb
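
Most of these settings need a rolling restart to take effect, but if you want to try a different store throttle limit first, it can also be applied to a running cluster through the cluster settings API (this assumes ES 2.x, where the setting still exists):

curl -s -XPUT 'http://localhost:9200/_cluster/settings' -d '
{ "transient": { "indices.store.throttle.max_bytes_per_sec": "250mb" } }'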

In the /etc/graylog/server/server.conf, we have the following:
output_flush_interval = 1

# As stream outputs are loaded only on demand, an output which is failing to initialize will be tried over and
# over again. To prevent this, the following configuration options define after how many faults an output will
# not be tried again for an also configurable amount of seconds.
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30

# The number of parallel running processors.
# Raise this number if your buffers are filling up.
processbuffer_processors = 16
outputbuffer_processors = 8

stream_processing_timeout = 3000
stream_processing_max_faults = 50

elasticsearch_index_optimization_jobs = 50
elasticsearch_request_timeout = 2m
message_journal_max_size = 10gb

ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 8
processor_wait_strategy = blocking
output_batch_size = 2000

With this, I can query over 187 million messages in 1524ms.


Very interesting. Out of curiosity what is the storage on your Elasticsearch data nodes? Disk type/quantity/config? Also are you adhering to any upper limit on per-node data before adding additional data nodes? Oh also, what are you using for shard/replica count?

I am still waiting for the answers to your questions from our storage team. Are you thinking that we can decrease the search time even more? As far as shard/replica count goes, I have 10 data nodes, so I have a total of 10 shards per index (5 primary shards, each with 1 replica). My total shard count is around 900 due to different retention limits and log types.
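
While waiting on the storage details, the cat allocation API is an easy way to see how those ~900 shards and their data are spread across the 10 data nodes (default port assumed):

curl -s 'http://localhost:9200/_cat/allocation?v'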

No, I was thinking more along the lines of “how are others configuring their systems/clusters”. Your results look great; ours aren’t that responsive, but my current data nodes use large but slow MDL-SAS drives, and I’m way over the recommended size for per-node data. I’m running 2 shards / 1 replica for my indexes, averaging around 150 million events per day, so I’m also well over the per-shard data size I’ve seen recommended. I’m planning on moving to a hot-warm Elasticsearch architecture after Graylog 2.3 is out, using all SSDs on my hot nodes, keeping no more than 30 days of data on them, and repurposing my existing 4 data nodes as the warm nodes, keeping a lot more data there.

While I’m not sure if this will apply to everyone experiencing the situation where a Graylog node stops sending messages to Elasticsearch, I cautiously say I think I have ours greatly improved. It turned out to be poorly written regex/replace-with-regex extractors. One of the most egregious examples was extracting the source port from access logs sent via syslog UDP from our IPS. When I started tuning that regex string, the mean value of its execution-time metric was in the 40,000 - 45,000 microsecond range. After all was said and done, I had that down to 44 microseconds, a pretty significant improvement.

In going through all of our extractors on all of our inputs, I’ve added run conditions on everything, and have avoided using a regex condition if at all possible.
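
To give a made-up example of the kind of rewrite that made the difference (the log format here is invented, not our real IPS output): an unanchored pattern wrapped in wildcards forces the regex engine to scan and backtrack across the whole message, while matching only the literal context you need does not.

# Slow: the leading/trailing .* make the engine walk and re-walk the entire message
^.*src_port=(\d+).*$

# Fast: match just the token you actually want to capture
src_port=(\d+)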

Our scenario was that I’d notice output events per second dropping well below input events per second, and that one or more of our nodes had “stopped sending to Elasticsearch”. Input and Output buffers were empty, but the Process buffer was full. We thought initially that our Elasticsearch cluster wasn’t keeping up, but when I dug in I found no evidence of throttling or high I/O wait times. I’m running 7200rpm midline-SAS drives on them, in 3- and 4-drive RAID 0 groups, which do cause some bottlenecking during heavier ingest/search periods, but ultimately that was not the cause of our problem.
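
(For the I/O side, what I was watching was just iostat from the sysstat package on the data nodes, e.g.:)

iostat -x 5    # extended per-device stats (%util, await) refreshed every 5 seconds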

Now we do occasionally receive some very large messages, up to 7MB or so, but our poorly/lazily written regex patterns seem to be what was causing the node(s) to bog down so badly as to appear to have stopped processing.

This blog post from Loggly helped me understand what I was doing wrong, and got mean execution time down from 45k, to 4k, to 44 microseconds. Over the past 6 weeks I’d had times when my nodes backed up by as much as 35 million messages each.

https://www.loggly.com/blog/regexes-the-bad-better-best/

John

