Process Buffer Flooding, Processor at 100%

I have a problem with Graylog: after 6 hours of normal operation the process buffer floods and the processor sits at 100% usage. I have already made the following changes:

inputbuffer_processors = 2
output_batch_size = 4000
outputbuffer_processors = 4
processbuffer_processors = 10

GRAYLOG_SERVER_JAVA_OPTS="-Xms6g -Xmx6g"

Restarting Graylog solves the problem, but after 6 hours it comes back.

I don’t know where to see the average number of messages per second, but I believe there are not many.
The virtual machine has 16 cores and 32 GB of RAM.
I have 1 node and 1 input with 7 configured extractors.
The index set uses the rotation strategy “Index Message Count” with max documents per index: 20,000,000.
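(I guess the per-node throughput could also be read from the REST API, if your version exposes the system/throughput resource in the API browser — a sketch with placeholder host, port, and credentials:)

curl -u admin:PASSWORD -H 'Accept: application/json' 'http://127.0.0.1:9000/api/system/throughput'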

System
Version: 3.2.4+a407287, codename Ethereal Elk
JVM: PID 28820, Oracle Corporation 1.8.0_242 on Linux 3.10.0-1062.18.1.el7.x86_64

Can someone help me?

Hey @spawnzao,

do you have MongoDB, Elasticsearch, and Graylog all running on the same server?

Yes, on the same server; it worked for over 2 years…
MongoDB uses 0.6 CPU cores on average, whether things are normal or in trouble.
Elasticsearch uses 3.8 CPU cores on average, whether normal or in trouble.
Graylog uses 1.2 CPU cores on average normally and 36.2 when in trouble.

Does anyone know what’s going on or where I can debug?

Hey @spawnzao,

I guess your Elasticsearch is filled with data?

Graylog and Elasticsearch are both fighting for the computing power.

Raise the index refresh interval to 30 seconds, so Elasticsearch refreshes less often, and pin a specific number of CPU cores to Elasticsearch.
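For example, a rough sketch (untested values, assuming Elasticsearch on localhost:9200, the default graylog_* index prefix, and a systemd-managed elasticsearch service):

# raise the refresh interval on the existing graylog_* indices
curl -XPUT 'http://localhost:9200/graylog_*/_settings' -H 'Content-Type: application/json' -d '{ "index" : { "refresh_interval" : "30s" } }'

# /etc/systemd/system/elasticsearch.service.d/cpu.conf
# pin Elasticsearch to a subset of the 16 cores (core numbers are only an illustration)
[Service]
CPUAffinity=0 1 2 3 4 5

# then reload systemd and restart Elasticsearch
systemctl daemon-reload && systemctl restart elasticsearch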


If you have any extractors, check them also… you could have some poorly performing extractors causing issues as well. From what I see, you’ve also oversubscribed your CPUs: if you have 16 cores and all of them are allocated to something in Graylog, then what is Elasticsearch using?
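As an illustration only (these numbers are assumptions, not values tuned to your load), a more balanced split on a 16-core box shared with Elasticsearch and MongoDB might look like this in server.conf:

# 2 + 5 + 2 = 9 Graylog processor threads,
# leaving the remaining cores for Elasticsearch, MongoDB and the OS
inputbuffer_processors = 2
processbuffer_processors = 5
outputbuffer_processors = 2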

Also, check your hypervisor to make sure your disk IO isn’t a bottleneck.
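A quick way to watch that from inside the VM, assuming the sysstat package is installed:

# extended device statistics every 5 seconds;
# watch the await (latency) and %util (saturation) columns
iostat -x 5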

Hey @jan, @cawfehman, thank you for all your support!

I store logs from servers and services (firewall, proxy, HTTP, IDS, HIDS, FTP, WiFi, Windows, and Linux).
How do I assign a specific number of CPU cores to Elasticsearch?
How do I see the index refresh interval?
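I assume I could read the current interval back from Elasticsearch with something like this (localhost:9200 and the default graylog_* index prefix are assumptions):

# prints index.refresh_interval for each graylog index (empty if it is still the default)
curl -XGET 'http://localhost:9200/graylog_*/_settings/index.refresh_interval?pretty'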

Disk I/O is not a bottleneck (I guess), as the virtual machine did not trigger an incident. I checked, and the average latency is 0.394 milliseconds, with a few peaks of 7 milliseconds (3 in the 24-hour interval), and an average of 294.85 KBps. I have servers with much higher disk performance.

I increased HEAP_SIZE to 8g, -Xms to 8g and -Xmx to 8g, but it didn’t work.

I don’t know where to look anymore… CPU usage, memory, disk I/O, interrupts, and load average all stay normal until the moment the JVM chokes and the process buffer begins to flood. I don’t know if the problem is Graylog or Elasticsearch.
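One thing I can check the next time it happens is which Java threads are actually burning CPU, on both sides (a sketch; localhost:9200 is an assumption, and 28820 is the Graylog JVM PID from the System page):

# Elasticsearch: show the hottest threads per node
curl -XGET 'http://localhost:9200/_nodes/hot_threads'

# Graylog: per-thread CPU usage of the JVM
top -H -p 28820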

My confs:

elasticsearch

ES_HEAP_SIZE=8g

jvm.options

-Dfile.encoding=UTF-8
-Dio.netty.noKeySetOptimization=true
-Dio.netty.noUnsafe=true
-Dio.netty.recycler.maxCapacityPerThread=0
-Djava.awt.headless=true
-Djna.nosys=true
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-XX:+AlwaysPreTouch
-XX:+HeapDumpOnOutOfMemoryError
-XX:+PrintGCDetails
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseConcMarkSweepGC
-XX:-OmitStackTraceInFastThrow
-XX:CMSInitiatingOccupancyFraction=75
-Xloggc:/var/log/elasticsearch/graylog/gc.log
-Xms8g
-Xmx8g
-Xss1m
-server

gc.log

3823.362: [GC (Allocation Failure) 3823.362: [ParNew: 900401K->16177K(996800K), 0.0241180 secs] 2020317K->1136252K(8277888K), 0.0242835 secs] [Times: user=0.16 sys=0.13, real=0.03 secs]
3837.587: [GC (Allocation Failure) 3837.587: [ParNew: 901959K->13669K(996800K), 0.0175451 secs] 2022034K->1133750K(8277888K), 0.0177333 secs] [Times: user=0.20 sys=0.01, real=0.01 secs]
3855.801: [GC (Allocation Failure) 3855.801: [ParNew: 899542K->6015K(996800K), 0.0106909 secs] 2019623K->1126260K(8277888K), 0.0108646 secs] [Times: user=0.11 sys=0.01, real=0.01 secs]
3885.989: [GC (Allocation Failure) 3885.989: [ParNew: 891645K->8303K(996800K), 0.0246127 secs] 2011890K->1128836K(8277888K), 0.0248031 secs] [Times: user=0.18 sys=0.12, real=0.02 secs]
3915.259: [GC (Allocation Failure) 3915.259: [ParNew: 894124K->14924K(996800K), 0.0108161 secs] 2014657K->1135617K(8277888K), 0.0110169 secs] [Times: user=0.12 sys=0.00, real=0.01 secs]
3924.441: [GC (Allocation Failure) 3924.441: [ParNew: 901004K->22460K(996800K), 0.0138323 secs] 2021697K->1143158K(8277888K), 0.0139922 secs] [Times: user=0.16 sys=0.00, real=0.02 secs]
3946.732: [GC (Allocation Failure) 3946.732: [ParNew: 908540K->12905K(996800K), 0.0222552 secs] 2029238K->1133925K(8277888K), 0.0224075 secs] [Times: user=0.17 sys=0.11, real=0.02 secs]

[root@xxxxxx user]# curl -XGET "http://xxxx:9200/_cluster/stats?pretty=true"

{
“_nodes” : {
“total” : 1,
“successful” : 1,
“failed” : 0
},
“cluster_name” : “graylog”,
“cluster_uuid” : “sx7BYMScs8g”,
“timestamp” : 1586744687356,
“status” : “green”,
“indices” : {
“count” : 25,
“shards” : {
“total” : 100,
“primaries” : 100,
“replication” : 0.0,
“index” : {
“shards” : {
“min” : 4,
“max” : 4,
“avg” : 4.0
},
“primaries” : {
“min” : 4,
“max” : 4,
“avg” : 4.0
},
“replication” : {
“min” : 0.0,
“max” : 0.0,
“avg” : 0.0
}
}
},
“docs” : {
“count” : 402319292,
“deleted” : 0
},
“store” : {
“size_in_bytes” : 179782902411
},
“fielddata” : {
“memory_size_in_bytes” : 898672,
“evictions” : 0
},
“query_cache” : {
“memory_size_in_bytes” : 481928,
“total_count” : 58924,
“hit_count” : 863,
“miss_count” : 58061,
“cache_size” : 96,
“cache_count” : 124,
“evictions” : 28
},
“completion” : {
“size_in_bytes” : 0
},
“segments” : {
“count” : 154,
“memory_in_bytes” : 455647912,
“terms_memory_in_bytes” : 361462168,
“stored_fields_memory_in_bytes” : 88130160,
“term_vectors_memory_in_bytes” : 0,
“norms_memory_in_bytes” : 19712,
“points_memory_in_bytes” : 5742720,
“doc_values_memory_in_bytes” : 293152,
“index_writer_memory_in_bytes” : 0,
“version_map_memory_in_bytes” : 0,
“fixed_bit_set_memory_in_bytes” : 0,
“max_unsafe_auto_id_timestamp” : -1,
“file_sizes” : { }
}
},
“nodes” : {
“count” : {
“total” : 1,
“data” : 1,
“coordinating_only” : 0,
“master” : 1,
“ingest” : 1
},
“versions” : [
“6.8.7”
],
“os” : {
“available_processors” : 16,
“allocated_processors” : 16,
“names” : [
{
“name” : “Linux”,
“count” : 1
}
],
“pretty_names” : [
{
“pretty_name” : “CentOS Linux 7 (Core)”,
“count” : 1
}
],
“mem” : {
“total_in_bytes” : 33564647424,
“free_in_bytes” : 1377382400,
“used_in_bytes” : 32187265024,
“free_percent” : 4,
“used_percent” : 96
}
},
“process” : {
“cpu” : {
“percent” : 0
},
“open_file_descriptors” : {
“min” : 1215,
“max” : 1215,
“avg” : 1215
}
},
“jvm” : {
“max_uptime_in_millis” : 25524018,
“versions” : [
{
“version” : “1.8.0_242”,
“vm_name” : “OpenJDK 64-Bit Server VM”,
“vm_version” : “25.242-b08”,
“vm_vendor” : “Oracle Corporation”,
“count” : 1
}
],
“mem” : {
“heap_used_in_bytes” : 1619999104,
“heap_max_in_bytes” : 8476557312
},
“threads” : 121
},
“fs” : {
“total_in_bytes” : 1275511922688,
“free_in_bytes” : 1092712706048,
“available_in_bytes” : 1092712706048
},
“plugins” : [ ],
“network_types” : {
“transport_types” : {
“netty4” : 1
},
“http_types” : {
“netty4” : 1
}
}
}
}

I found the problem: it was one of the extractors.

In the API browser I chose the process buffer dump (GET /system/processbufferdump).
It shows which processor was locked, and with that I found and fixed the buggy extractor.
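The same call can be made from the command line (a sketch with placeholder host and credentials, assuming the API listens on port 9000 under the /api prefix):

curl -u admin:PASSWORD -H 'Accept: application/json' 'http://127.0.0.1:9000/api/system/processbufferdump'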

The output looked like this:

{
  "bb39": {
    "processbuffer_dump": {
      "ProcessBufferProcessor #0": "source: server | message: server pureftp-log Thu Apr 16 00:55:38 2020 1 message log i { level: 6 | gl2_remote_ip: xxxx | gl2_remote_port: 37267 | gl2_source_node: bb39 | _id: f971 | gl2_source_input: 5ac6 | facility: local4 | timestamp: 2020-04-16T00:55:44.000-03:00 }",
      "ProcessBufferProcessor #1": "source: server | message: server pureftp-log Thu Apr 16 00:36:42 2020 1 message log * i { level: 6 | gl2_remote_ip: xxxx | gl2_remote_port: 37267 | gl2_source_node: bb39 | _id: f971 | gl2_source_input: 5ac6 | facility: local4 | timestamp: 2020-04-16T00:36:43.000-03:00 }",
      "ProcessBufferProcessor #2": "idle",
      "ProcessBufferProcessor #3": "idle",
      "ProcessBufferProcessor #4": "idle",
      "ProcessBufferProcessor #5": "idle"
    }
  }
}

Thank you so much for all your support
