Continuation of my topic Graylog heap size maximum
@Totally_Not_A_Robot, @jan
Next level of my tests.
I got an additional server (256 GB / 24 cores / 48 TB) and deployed 3 ES instances there, giving each ES 32 GB RAM, 6 cores, and a separate 16 TB RAID0 partition.
So server A is running only Graylog (16 GB heap, processbuffer_processors = 16, outputbuffer_processors = 24, inputbuffer_processors = 4) and server B is running only ES.
And I haven’t got any significant improvement compared to the initial configuration (it can process only about 20K msg/s); then the output buffer becomes full and the journal keeps growing.
CPU/RAM/IO are only ~25% utilized, so I can’t even tell where the bottleneck is.
Which logs should I check to find the bottleneck?
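(If anyone else lands here with the same question, a first pass could look like the sketch below. The log paths assume package installs, and the Graylog API address and credentials are assumptions.)

    # Watch Graylog's own log for buffer/journal warnings
    tail -f /var/log/graylog-server/server.log

    # Watch the Elasticsearch log on the ES box
    tail -f /var/log/elasticsearch/elasticsearch.log

    # Journal utilization straight from the Graylog REST API
    curl -s -u admin:PASSWORD http://graylog-host:9000/api/system/journal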
Those processor counts are too high… You should use only about 3/4 of your cores for these processors, and if you also run ES on the same server, even less…
Monitor your environment and you will catch the bottleneck.
(I don’t want answers…) E.g.: do you have a fast enough Ethernet card to handle this network traffic? Does the ES have enough IO to write its data? ES heap usage? GL buffers and journal usage? ES indexing time? RAID card IO speed? GL heap usage? etc…
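(A minimal host-level sketch for chasing exactly those numbers, using stock sysstat tooling and the ES node stats API; the ES address is an assumption.)

    # Disk: await and %util per device, sampled every 5 seconds (sysstat)
    iostat -x 5

    # Network throughput per interface, every 5 seconds
    sar -n DEV 5

    # ES heap and thread-pool pressure from the node stats API
    curl -s 'http://es-host:9200/_nodes/stats/jvm,thread_pool?pretty'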
Nope. I use separate servers for Graylog and ES…
EDIT: Nevermind, I think I misread your post. I thought you were using one box to run four VMs, but it looks like you have two boxen with (1+3) VMs. Correct?
Which logs should I check to find the bottleneck?
Check Elasticsearch - it will tell you something like “no threads left” or similar. You are simply overwhelming your ES with data.
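(Those rejections can be seen directly; a sketch assuming ES on localhost:9200. On the older ES versions Graylog commonly ran against, the write pool is called “bulk”.)

    # Non-zero "rejected" counts mean ES is refusing indexing work
    curl -s 'http://localhost:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected'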
So basically?.. Add more ES nodes?..
At what point are three nodes not going to be enough anymore? Interesting question…
I have
- 1 server (256 GB / 56 cores / 18 TB) for Graylog.
- 1 server (256 GB / 24 cores / 48 TB, split into 3 RAID0 partitions) for 3 ES instances in Docker containers.
My confusion is that I can’t identify the bottleneck - according to Cerebro there is no high pressure on ES; its CPU is no higher than 11%.
And compared to my initial configuration (Graylog and ES on the same server), performance has not improved at all. That’s pretty weird.
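(Cerebro’s view can be cross-checked from the command line; a sketch, assuming direct access to any ES node.)

    # Per-node heap, RAM, CPU and load as ES itself reports them
    curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m'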
At what point are three nodes not going to be enough anymore? Interesting question…
Based only on the numbers we have here, we can only guess what the issue might be.
My statement came straight from my gut - based on what I have seen in the wild.
Without checking the Elasticsearch logs and looking at what is going on inside the JVM, no real answer can be given. But trust me, ingesting 20k messages per second with a refresh_interval of 1 sec (the default) will need more.
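(For reference, that refresh interval can be relaxed per index; a sketch assuming the indices match graylog_*. Since Graylog rotates indices, an index template is the durable place for this setting.)

    # Relax the refresh interval on the existing Graylog indices
    curl -s -X PUT 'http://localhost:9200/graylog_*/_settings' \
         -H 'Content-Type: application/json' \
         -d '{"index": {"refresh_interval": "30s"}}'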
@zoulja we are talking about JVM software - its behavior is not always visible in load, CPU, or RAM usage - you need to monitor the JVM from the inside.
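(A stock-JDK way to watch that JVM from the inside; the pgrep pattern assumes the standard ES bootstrap class name.)

    # GC activity and heap occupancy of the ES process, sampled every 5 s
    jstat -gcutil "$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)" 5000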
So far I have checked the Cerebro info in depth and found that the load average (LA) is pretty high.
It seems splitting into 3 RAID0 partitions didn’t help, and IO is the bottleneck.
Ahh! You’re saying that the CPU usage for user processes was low, but now you notice a higher load average? As in, your CPUs are stuck in IOWAIT a lot? Man, that’s a shame.
Not exactly…
I never used Docker before and installed it just for this test.
I saw that CPU usage was low, but LA is high, so I assume it’s an IO issue; I just don’t know how to dig deeper into Docker.
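(Docker does expose per-container counters; a minimal sketch, where the container name is an assumption.)

    # Live CPU, memory, network and block-I/O per container
    docker stats --no-stream

    # Look around inside one of the ES containers
    docker exec -it es1 /bin/bash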
As of now I see that LA is low and CPU is low, but ingestion is extremely slow and unstable…
Overall I got worse performance than on a single server.
I give up; this solution is too hard to scale and maintain, so I have no arguments for recommending it to my company.
I guess you mixed yourself up with a stack of products you do not know. First you did not know Docker, and then you decided to run a complex environment like Graylog (including MongoDB and Elasticsearch) inside a Docker network/host.
After that was successful, you tried to push that environment to its limits - successfully - and now complain that it does not go as far as you hoped. But what did you compare it to?
The solution would be: set up the environment (read: Graylog, Elasticsearch) on infrastructure you understand and are able to control. Then do your load tests.
I guess from everything you have written in this community that your Docker host is even virtualized, which makes it more complicated. Do you understand the layers that sit between the application and your hardware?
- Server
- Host OS
- Hypervisor
- VM
- Guest OS
- Docker Engine
- Container
- JVM
- Application
All of these pieces influence how the application in the JVM behaves; do not ‘guess’ that the final application is not working in Docker just because you can’t control or understand all 8 layers from your bare metal to the application.
Not exactly like this, @jan…
I have 2 (good enough) physical servers at hand.
No VMs (I never said a word about VMs in this topic, c’mon people), just good old bare-metal power.
From my first topic I came to two conclusions: 1) I need to separate Graylog and ES, and 2) I should go with horizontal scaling, since modern (Java-based) software can’t utilize a single powerful server well enough; that’s why I deployed 3 ES Docker containers.
And even though I had never used Docker before, my setup is really, really simple, so I’m 99% sure no super fine tuning is needed here (I don’t believe in “JVM memory management black magic”).
Anyway, I will try once again and will avoid Docker: one physical machine with all resources for Graylog, and one physical machine with all resources for ES.
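(For reference, the per-container pinning described above - heap, cores, and a dedicated RAID0 volume per ES - would typically look like the sketch below with the official image; names, version and paths are assumptions.)

    # One of three ES containers: pinned cores, fixed heap, its own data volume
    docker run -d --name es1 \
      --cpuset-cpus 0-5 \
      -e "ES_JAVA_OPTS=-Xms31g -Xmx31g" \
      -v /mnt/raid0-1/es1:/usr/share/elasticsearch/data \
      -p 9200:9200 -p 9300:9300 \
      docker.elastic.co/elasticsearch/elasticsearch:6.8.23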
That’s where I went wrong. In your first post in this thread you wrote:
I got an additional server (256 GB / 24 cores / 48 TB) and deployed 3 ES instances there, giving each ES 32 GB RAM, 6 cores, and a separate 16 TB RAID0 partition.
Which I understood to mean that you had followed my/our advice of running ES in separate VMs. But apparently you didn’t, and it’s all on one host OS with three Elastic instances inside Docker on the same kernel. Yeah, that’s one Linux kernel handling the I/O for three Elastic instances. That could affect your IOWAIT times…
I’m no expert, but I can imagine that it’s a different world when you use the same hardware to run three VMs, each running one Elastic. Of course it’s the same hardware bottleneck towards your disks, but perhaps a VM layer like ESX can handle the I/O better than pushing it all through one Linux kernel.
Ahh, no… That’s where you’re going back to your first case and your previous thread. That brings back the memory limitations that you ran into the first time. Elasticsearch will not play happily with 128 GB RAM for one instance.
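(The rule of thumb behind that statement: keep the ES heap at or below ~31 GB so the JVM keeps using compressed object pointers, and never above half of RAM - the rest should go to the filesystem cache. In jvm.options terms, a sketch assuming a package install:)

    # /etc/elasticsearch/jvm.options
    # Stay under the ~32 GB compressed-oops threshold; at most 50% of RAM
    -Xms31g
    -Xmx31g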
You could, for fun and profit, try:
- Phys1
  - ESX / KVM / Proxmox / Hyper-V
    - Graylog1
    - Graylog2, if you want more fun later
    - Graylog3, if you want more fun later
- Phys2
  - ESX / KVM / Proxmox / Hyper-V
    - ES1
    - ES2
    - ES3