Terrible search performance



Currently I am using a single-node setup (everything is on that one node) and have no significant performance issues. The old production server has 8 vCPUs, 64GB of vRAM, and 8 SAS disks. One primary shard per index. Graylog 2.4 and Elasticsearch 2.x.

Now I have prepared a cluster of Graylog 3 (two nodes) and Elasticsearch 6.7 (three nodes, two of them data/master and one master-only).
I have also reindexed from remote, from the old production to this new production.
Everything is running on CentOS 7.6. Everything except one Elasticsearch data node is on VMware virtualization.
The Graylog node I'm testing with has 16 vCPUs and ~120GB of vRAM (the other node, which is not actively used, has 8 vCPUs and 64GB of vRAM).
Elasticsearch nodes:
- es01 (VMware): 32 vCPUs, 64GB vRAM, volume on a SAN with more than 100 SAS disks
- es02 (physical): 64 threads, 64GB RAM, 8 SAS disks
- es03 (VMware, no data): 8 vCPUs, 64GB vRAM

Both openjdk-11 and openjdk-1.8.0 are installed, but JAVA_HOME is set to openjdk-11 for Elasticsearch (JAVA_HOME=/usr/lib/jvm/java-11-openjdk-
lsof -p 48818 | grep open
java 48818 elasticsearch txt REG 253,0 11480 117456399 /usr/lib/jvm/java-11-openjdk-
java 48818 elasticsearch mem REG 253,0 18100224 8536631 /usr/lib/jvm/java-11-openjdk-

It looks like a problem with Elasticsearch, but maybe someone can give some ideas.
The new Elasticsearch environment has ~600 indices with 3 primary + 1 replica shards each, so in total ~600*6=3600 shards and ~4.6TB (primary+replica). ~4M-10M messages in each index (~4GB-10GB per index).
The node on the physical server (es02) is more or less performant, though still worse than the old production server. es01 performs badly on searches, with very high CPU usage, load, and very high IO reads. Also, when all nodes are running and I'm performing searches, es01's IO reads are tens of times larger (near 1GB/s) than those of es02, and its load is significantly heavier. The first thing that came to my mind was that Elasticsearch is simply using one node much more than the other for some reason, but the number of primary and replica shards is not much different between the two nodes, so I would say it is quite balanced.
I switched off es01 for a test and observed that search was then more or less performant (though still worse than on old production). Interestingly, IO reads and load on the remaining node did not change significantly while the other data node was offline, when logically it should have performed all those IOs on that single node and they should have jumped to ~1GB/s. So that is probably not a fault of the cluster as a whole. When switching vice versa (only es01 online), searches still performed very badly, just as with two nodes online. So it looks like one node is causing most of the problems (es01).
Maybe I'm also not doing those searches in the correct way, because I imported all dashboards from old production and maybe there are some incompatibilities. Some dashboards are very slow only on that one node, while others retrieve results at acceptable speed.
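To pin down whether the imbalance comes from shard placement or from raw IO, the standard cat/stats APIs can help. The following commands are a sketch, assuming the cluster is reachable on the default HTTP port 9200 (hostnames taken from this post):

```shell
# Shard and disk allocation per node - check whether es01 holds
# disproportionately many shards or bytes
curl -s 'http://es01:9200/_cat/allocation?v'

# Detailed shard placement (index, shard, prirep, size, node)
curl -s 'http://es01:9200/_cat/shards?v'

# Filesystem read/write stats per node - capture before and after a
# slow search and compare the deltas between es01 and es02
curl -s 'http://es01:9200/_nodes/stats/fs?pretty'

# What the busy node is actually doing while a slow search runs
curl -s 'http://es01:9200/_nodes/hot_threads?threads=5'
```

If es01's fs stats show far more bytes read per search than es02's for a similar shard count, the problem is below Elasticsearch (storage/virtualization layer) rather than in the cluster's balancing.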

EDIT: I've also tried describing only es02 in Graylog, but that did not give any result.
Also, only a few test clients are sending data to the Graylog inputs, with a few hundred documents per day, so the Graylog node and the Elasticsearch cluster are not loaded with ingest traffic. For test purposes I've also set "use_adaptive_replica_selection" : "true".
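For reference, that setting can be applied cluster-wide like this (a sketch; in 6.x the full key is `cluster.routing.use_adaptive_replica_selection`, and note it only influences which replica serves a search, not raw disk throughput):

```shell
curl -s -X PUT -H 'Content-Type: application/json' \
  'http://es01:9200/_cluster/settings' -d '
{
  "persistent": {
    "cluster.routing.use_adaptive_replica_selection": true
  }
}'
```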
Elasticsearch is running with the following parameters:
elastic+ 9309 31.4 62.7 2297341424 41300208 ? SLsl May14 319:04 /usr/lib/jvm/java-11-openjdk- -Xms30g -Xmx30g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.io.tmpdir=/tmp/elasticsearch-13924678276891718870 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/lib/elasticsearch -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m -Djava.locale.providers=COMPAT -XX:UseAVX=2 -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/etc/elasticsearch -Des.distribution.flavor=oss -Des.distribution.type=rpm -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -p /var/run/elasticsearch/elasticsearch.pid --quiet

Almost all indices have the following settings:
"graylog_555" : {
  "settings" : {
    "index" : {
      "refresh_interval" : "30s",
      "number_of_shards" : "3",
      "provided_name" : "graylog_555",
      "creation_date" : "1556006937036",
      "analysis" : {
        "analyzer" : {
          "analyzer_keyword" : {
            "filter" : "lowercase",
            "tokenizer" : "keyword"
          }
        }
      },
      "number_of_replicas" : "1",
      "uuid" : "Kgc58t7CSAaPqPtCOOPfzw",
      "version" : {
        "created" : "6070199"
      }
    }
  }
}

Can anyone suggest any troubleshooting possibilities and well-known "tuning for dummies"?

Thank you and sorry for my English!

es01 https://pastebin.com/CYUrNYNW
es02 https://pastebin.com/znDpPF9b
es03 _https://pastebin.com/vwa6HYLy because of newbie limit
gl01 _https://pastebin.com/e3G4jSRH because of newbie limit

(Jan Doberstein) #2

you want to read: https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster


Thanks for the information!
I have read it before, but together with other sources and suggestions, so I placed my bet on an average shard setup, taking into account the provided default value for a single node, various suggestions of number of nodes × 3 (or 2), and suggestions regarding index size and number of documents. An additional reason for this setup was to have 1 shard per data node plus one for a future data node, so each query would load all servers.
I did not see clear indications in this article that my selected number of shards is incorrect; each index is 2-4GB in size. The only possibly inappropriate sizing consideration was the suggested maximum number of shards per GB of heap memory, which works out to 600 per node with the heap size I have on my nodes, so a total of 1200 shards on two nodes (I have ~3500), though I don't know whether this count should include replicas.
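A quick back-of-the-envelope check of that rule of thumb (~20 shards per GB of JVM heap, per the Elastic article), using the heap and node count from this setup:

```shell
heap_gb=30      # -Xms30g/-Xmx30g on each data node
data_nodes=2    # es01 and es02
shards_per_node=$((heap_gb * 20))               # ~20 shards per GB of heap
cluster_limit=$((shards_per_node * data_nodes))
echo "recommended max shards per node: $shards_per_node"   # 600
echo "recommended cluster maximum:     $cluster_limit"     # 1200
```

So ~3500 shards (the rule counts replicas too, since each shard copy carries its own overhead) is roughly 3x what this heap budget comfortably supports.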
I'm possibly missing some basic principles :frowning:
I've installed an additional data node and am waiting for it to rebalance. I am also considering increasing the number of vCPUs to match es02, though that is overkill and probably won't help, because the old production environment had just 8 vCPUs for all components.
Maybe someone can suggest some troubleshooting steps to find the reason for that one node being many times slower than the other?


I rebalanced onto that new node and the same weird things are going on. Load and IO reads on both virtualized data nodes are tens of times higher than on the physical one. Either I have missed some parameter, and the physical server has some magical setting different from the virtualized ones that I'm not aware of, or it is some kind of bug with virtualization.


Maybe there is something wrong with my guest OS filesystem setup? On the VMs I created the LVM volume directly on the disk block device, but on the physical machine I created the LVM volume on a partition instead, and the mount parameters also differ. Could that somehow burden the system with read overhead, or is that not possible?
On the physical machine it is like this:
/dev/mapper/centos-srv /srv xfs rw,relatime,attr2,inode64,logbsize=256k,sunit=512,swidth=3072,noquota 0 0

└─sda3 8:3 0 4.9T 0 part
├─centos-root 253:0 0 50G 0 lvm /
├─centos-swap 253:1 0 31.4G 0 lvm [SWAP]
└─centos-srv 253:2 0 4.9T 0 lvm /srv

But on the VMs it is like this:
/dev/mapper/vg_es-lv_es /srv/data xfs rw,relatime,attr2,inode64,noquota 0 0
sdb 8:16 0 4.7T 0 disk
└─vg_es-lv_es 253:2 0 4.5T 0 lvm /srv/data
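One way to rule the storage layer in or out independently of Elasticsearch would be to run the same read benchmark on each node's data volume. A sketch using fio (the directory differs per node: /srv on the physical machine, /srv/data on the VMs; sizes and runtimes are arbitrary assumptions):

```shell
# Sequential read test against the ES data volume; fio creates a
# temporary test file, so ~4G of free space is needed on the volume
fio --name=seqread --directory=/srv/data --rw=read --bs=256k \
    --size=4G --direct=1 --ioengine=libaio --runtime=60 \
    --time_based --group_reporting

# Random read test, closer to what search-time segment reads look like
fio --name=randread --directory=/srv/data --rw=randread --bs=8k \
    --size=4G --direct=1 --ioengine=libaio --runtime=60 \
    --time_based --group_reporting
```

If the VMs show similar throughput to the physical node here but still read ~1GB/s during searches, the extra reads are coming from Elasticsearch's access pattern (e.g. page cache misses) rather than from a slow disk.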

(Jan Doberstein) #6

If you have two nodes with 64GB RAM, you can assign 32GB to the JVM heap for Elasticsearch - that can hold ~600 shards (including replicas) per node, which makes a total of ~1200 shards your system can hold without doing crazy things.

You have ~3500 shards, which is more than what your two nodes can work with by a factor of ~2.9 - so you would need roughly 3 times the resources you currently have to handle the number of shards in your cluster. That is: add 4 more nodes, or reduce the shards.

We are not speaking about disk space or IO here - this is just the RAM and the ability to work with the metadata.


OK, I will stop fighting for a while. I have deleted all indices and will reindex everything again with 1 primary + 1 replica. That will take 9 days and hopefully will help. I will get back then and post the results.
I still don't know why the physical server was loaded tens of times less than the VMs... even while it was the only data node left online.
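If the 1-primary + 1-replica layout is applied via an index template rather than through Graylog itself, a sketch could look like this (template name and pattern are assumptions; Graylog 3 also exposes shards/replicas per index set in its UI, which is the usual place to change this so that newly rotated indices pick it up):

```shell
curl -s -X PUT -H 'Content-Type: application/json' \
  'http://es01:9200/_template/graylog-custom' -d '
{
  "index_patterns": ["graylog_*"],
  "order": 10,
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}'
```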