Process Buffer Flooding, Processor at 100%

I have a problem with Graylog: after 6 hours of normal operation the process buffer floods and the processor sits at 100% usage. I have already made the following changes:

inputbuffer_processors = 2
output_batch_size = 4000
outputbuffer_processors = 4
processbuffer_processors = 10

GRAYLOG_SERVER_JAVA_OPTS="-Xms6g -Xmx6g"

Restarting Graylog solves the problem, but after 6 hours it comes back.

I don’t know where to see the average number of messages per second, but I believe there are not many.
The virtual machine has 16 cores and 32 GB of RAM.
I have 1 node and 1 input with 7 configured extractors.
The index set uses the rotation strategy “Index Message Count” with max documents per index: 20,000,000.
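(I guess the per-node throughput could also be read from the REST API, if your version exposes the system/throughput resource in the API browser — a sketch with placeholder host, port, and credentials:)

curl -u admin:PASSWORD -H 'Accept: application/json' 'http://127.0.0.1:9000/api/system/throughput'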

System
Version: 3.2.4+a407287, codename Ethereal Elk
JVM: PID 28820, Oracle Corporation 1.8.0_242 on Linux 3.10.0-1062.18.1.el7.x86_64

Can someone help me?

Hey @spawnzao,

do you have MongoDB, Elasticsearch, and Graylog all running on the same server?

Yes, on the same server; it worked for over 2 years…
MongoDB uses 0.6 CPU cores on average, whether things are normal or in trouble.
Elasticsearch uses 3.8 CPU cores on average, whether normal or in trouble.
Graylog uses 1.2 CPU cores on average normally and 36.2 when in trouble.

Does anyone know what’s going on or where I can debug?

Hey @spawnzao,

I guess your Elasticsearch is filled with data?

Graylog and Elasticsearch are both fighting for the computing power.

Raise the index refresh interval to 30 seconds, so Elasticsearch refreshes less often, and pin a specific number of CPU cores to Elasticsearch.
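For example, a rough sketch (untested values, assuming Elasticsearch on localhost:9200, the default graylog_* index prefix, and a systemd-managed elasticsearch service):

# raise the refresh interval on the existing graylog_* indices
curl -XPUT 'http://localhost:9200/graylog_*/_settings' -H 'Content-Type: application/json' -d '{ "index" : { "refresh_interval" : "30s" } }'

# /etc/systemd/system/elasticsearch.service.d/cpu.conf
# pin Elasticsearch to a subset of the 16 cores (core numbers are only an illustration)
[Service]
CPUAffinity=0 1 2 3 4 5

# then reload systemd and restart Elasticsearch
systemctl daemon-reload && systemctl restart elasticsearch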


If you have any extractors, check them also… you could have some poorly performing extractors causing issues as well. From what I see, you’ve also oversubscribed your CPUs: if you have 16 cores and all of them are allocated to something in Graylog, then what is Elasticsearch using?
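As an illustration only (these numbers are assumptions, not values tuned to your load), a more balanced split on a 16-core box shared with Elasticsearch and MongoDB might look like this in server.conf:

# 2 + 5 + 2 = 9 Graylog processor threads,
# leaving the remaining cores for Elasticsearch, MongoDB and the OS
inputbuffer_processors = 2
processbuffer_processors = 5
outputbuffer_processors = 2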

Also, check your hypervisor to make sure your disk IO isn’t a bottleneck.
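A quick way to watch that from inside the VM, assuming the sysstat package is installed:

# extended device statistics every 5 seconds;
# watch the await (latency) and %util (saturation) columns
iostat -x 5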

Hey @jan, @cawfehman, thank you for all your support!

I store logs from servers and services (firewall, proxy, HTTP, IDS, HIDS, FTP, WiFi, Windows, and Linux).
How do I assign a specific number of CPU cores to Elasticsearch?
How do I see the index refresh interval?
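I assume I could read the current interval back from Elasticsearch with something like this (localhost:9200 and the default graylog_* index prefix are assumptions):

# prints index.refresh_interval for each graylog index (empty if it is still the default)
curl -XGET 'http://localhost:9200/graylog_*/_settings/index.refresh_interval?pretty'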

Disk I/O is not a bottleneck (I guess), as the virtual machine did not trigger an incident. I checked, and the average latency is 0.394 milliseconds, with a few peaks of 7 milliseconds (3 in the 24-hour interval), and an average of 294.85 KBps. I have servers with much higher disk performance.

I increased HEAP_SIZE to 8g, -Xms to 8g and -Xmx to 8g, but it didn’t work.

I don’t know where to look anymore… CPU usage, memory, disk I/O, interrupts, and load average all stay normal until the moment the JVM chokes and the process buffer begins to flood. I don’t know if the problem is Graylog or Elasticsearch.
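One thing I can check the next time it happens is which Java threads are actually burning CPU, on both sides (a sketch; localhost:9200 is an assumption, and 28820 is the Graylog JVM PID from the System page):

# Elasticsearch: show the hottest threads per node
curl -XGET 'http://localhost:9200/_nodes/hot_threads'

# Graylog: per-thread CPU usage of the JVM
top -H -p 28820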

My confs:

elasticsearch

ES_HEAP_SIZE=8g

jvm.options

-Dfile.encoding=UTF-8
-Dio.netty.noKeySetOptimization=true
-Dio.netty.noUnsafe=true
-Dio.netty.recycler.maxCapacityPerThread=0
-Djava.awt.headless=true
-Djna.nosys=true
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-XX:+AlwaysPreTouch
-XX:+HeapDumpOnOutOfMemoryError
-XX:+PrintGCDetails
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseConcMarkSweepGC
-XX:-OmitStackTraceInFastThrow
-XX:CMSInitiatingOccupancyFraction=75
-Xloggc:/var/log/elasticsearch/graylog/gc.log
-Xms8g
-Xmx8g
-Xss1m
-server

gc.log

3823.362: [GC (Allocation Failure) 3823.362: [ParNew: 900401K->16177K(996800K), 0.0241180 secs] 2020317K->1136252K(8277888K), 0.0242835 secs] [Times: user=0.16 sys=0.13, real=0.03 secs]
3837.587: [GC (Allocation Failure) 3837.587: [ParNew: 901959K->13669K(996800K), 0.0175451 secs] 2022034K->1133750K(8277888K), 0.0177333 secs] [Times: user=0.20 sys=0.01, real=0.01 secs]
3855.801: [GC (Allocation Failure) 3855.801: [ParNew: 899542K->6015K(996800K), 0.0106909 secs] 2019623K->1126260K(8277888K), 0.0108646 secs] [Times: user=0.11 sys=0.01, real=0.01 secs]
3885.989: [GC (Allocation Failure) 3885.989: [ParNew: 891645K->8303K(996800K), 0.0246127 secs] 2011890K->1128836K(8277888K), 0.0248031 secs] [Times: user=0.18 sys=0.12, real=0.02 secs]
3915.259: [GC (Allocation Failure) 3915.259: [ParNew: 894124K->14924K(996800K), 0.0108161 secs] 2014657K->1135617K(8277888K), 0.0110169 secs] [Times: user=0.12 sys=0.00, real=0.01 secs]
3924.441: [GC (Allocation Failure) 3924.441: [ParNew: 901004K->22460K(996800K), 0.0138323 secs] 2021697K->1143158K(8277888K), 0.0139922 secs] [Times: user=0.16 sys=0.00, real=0.02 secs]
3946.732: [GC (Allocation Failure) 3946.732: [ParNew: 908540K->12905K(996800K), 0.0222552 secs] 2029238K->1133925K(8277888K), 0.0224075 secs] [Times: user=0.17 sys=0.11, real=0.02 secs]

[root@xxxxxx user]# curl -XGET "http://xxxx:9200/_cluster/stats?pretty=true"

{
“_nodes” : {
“total” : 1,
“successful” : 1,
“failed” : 0
},
“cluster_name” : “graylog”,
“cluster_uuid” : “sx7BYMScs8g”,
“timestamp” : 1586744687356,
“status” : “green”,
“indices” : {
“count” : 25,
“shards” : {
“total” : 100,
“primaries” : 100,
“replication” : 0.0,
“index” : {
“shards” : {
“min” : 4,
“max” : 4,
“avg” : 4.0
},
“primaries” : {
“min” : 4,
“max” : 4,
“avg” : 4.0
},
“replication” : {
“min” : 0.0,
“max” : 0.0,
“avg” : 0.0
}
}
},
“docs” : {
“count” : 402319292,
“deleted” : 0
},
“store” : {
“size_in_bytes” : 179782902411
},
“fielddata” : {
“memory_size_in_bytes” : 898672,
“evictions” : 0
},
“query_cache” : {
“memory_size_in_bytes” : 481928,
“total_count” : 58924,
“hit_count” : 863,
“miss_count” : 58061,
“cache_size” : 96,
“cache_count” : 124,
“evictions” : 28
},
“completion” : {
“size_in_bytes” : 0
},
“segments” : {
“count” : 154,
“memory_in_bytes” : 455647912,
“terms_memory_in_bytes” : 361462168,
“stored_fields_memory_in_bytes” : 88130160,
“term_vectors_memory_in_bytes” : 0,
“norms_memory_in_bytes” : 19712,
“points_memory_in_bytes” : 5742720,
“doc_values_memory_in_bytes” : 293152,
“index_writer_memory_in_bytes” : 0,
“version_map_memory_in_bytes” : 0,
“fixed_bit_set_memory_in_bytes” : 0,
“max_unsafe_auto_id_timestamp” : -1,
“file_sizes” : { }
}
},
“nodes” : {
“count” : {
“total” : 1,
“data” : 1,
“coordinating_only” : 0,
“master” : 1,
“ingest” : 1
},
“versions” : [
“6.8.7”
],
“os” : {
“available_processors” : 16,
“allocated_processors” : 16,
“names” : [
{
“name” : “Linux”,
“count” : 1
}
],
“pretty_names” : [
{
“pretty_name” : “CentOS Linux 7 (Core)”,
“count” : 1
}
],
“mem” : {
“total_in_bytes” : 33564647424,
“free_in_bytes” : 1377382400,
“used_in_bytes” : 32187265024,
“free_percent” : 4,
“used_percent” : 96
}
},
“process” : {
“cpu” : {
“percent” : 0
},
“open_file_descriptors” : {
“min” : 1215,
“max” : 1215,
“avg” : 1215
}
},
“jvm” : {
“max_uptime_in_millis” : 25524018,
“versions” : [
{
“version” : “1.8.0_242”,
“vm_name” : “OpenJDK 64-Bit Server VM”,
“vm_version” : “25.242-b08”,
“vm_vendor” : “Oracle Corporation”,
“count” : 1
}
],
“mem” : {
“heap_used_in_bytes” : 1619999104,
“heap_max_in_bytes” : 8476557312
},
“threads” : 121
},
“fs” : {
“total_in_bytes” : 1275511922688,
“free_in_bytes” : 1092712706048,
“available_in_bytes” : 1092712706048
},
“plugins” : [ ],
“network_types” : {
“transport_types” : {
“netty4” : 1
},
“http_types” : {
“netty4” : 1
}
}
}
}

I found the problem: it was one of the extractors.

In the API browser I chose the process buffer dump (GET /system/processbufferdump).
It shows which processor was locked, and with that I found and fixed the buggy extractor.
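The same call can be made from the command line (a sketch with placeholder host and credentials, assuming the API listens on port 9000 under the /api prefix):

curl -u admin:PASSWORD -H 'Accept: application/json' 'http://127.0.0.1:9000/api/system/processbufferdump'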

The output looked like this:

{
  "bb39": {
    "processbuffer_dump": {
      "ProcessBufferProcessor #0": "source: server | message: server pureftp-log Thu Apr 16 00:55:38 2020 1 message log i { level: 6 | gl2_remote_ip: xxxx | gl2_remote_port: 37267 | gl2_source_node: bb39 | _id: f971 | gl2_source_input: 5ac6 | facility: local4 | timestamp: 2020-04-16T00:55:44.000-03:00 }",
      "ProcessBufferProcessor #1": "source: server | message: server pureftp-log Thu Apr 16 00:36:42 2020 1 message log * i { level: 6 | gl2_remote_ip: xxxx | gl2_remote_port: 37267 | gl2_source_node: bb39 | _id: f971 | gl2_source_input: 5ac6 | facility: local4 | timestamp: 2020-04-16T00:36:43.000-03:00 }",
      "ProcessBufferProcessor #2": "idle",
      "ProcessBufferProcessor #3": "idle",
      "ProcessBufferProcessor #4": "idle",
      "ProcessBufferProcessor #5": "idle"
    }
  }
}

Thank you so much for all your support
