Weird Issues with two-node Cluster

Hello everyone. I set up a two-node cluster by following the documentation here. In addition to those steps, I also enabled Elasticsearch on the primary node. So, in summary, I have one node running all services (ES, MongoDB, Graylog, etc.) and another node running only ES.

When I go to the kopf plugin site it does show both ES nodes and all shards are assigned. However, I’m having this problem now:

  • When I go to search messages and select a range of, let’s say, the last hour, no messages show. But if I choose the last 8 hours, all messages show, including those coming in in real time. By the way, this varies: sometimes choosing the last five minutes WILL show the messages correctly, and sometimes it won’t.

  • Under System/Nodes I only see my master node’s information. The secondary node (running only ES) does appear, but no other information is shown for it (e.g., no heap memory usage), and sometimes the second node doesn’t appear at all.

  • Finally, and I don’t know if this is normal and I’m only noticing it now, but I see that many messages are marked as not processed. My journal never fills up, though, so I think that’s fine (it usually grows to 8% at most, from what I’ve seen).

Please let me know what other information I should provide to make my problem more clear to you (logs, screenshots, whatever).

Do the messages have the correct timestamps? What do the index details on the System / Indices / Index Set page say?

Regarding System/Nodes: yes, because that page only displays the details of Graylog nodes, not those of Elasticsearch nodes.

Is there anything unusual in the logs of your Graylog and Elasticsearch nodes?
→ http://docs.graylog.org/en/2.2/pages/configuration/file_location.html#omnibus-package

They do have the correct timestamp.
These are the details for the current index:

Thank you for pointing that out!

I have a few warnings like this in the log of the Graylog server (which is also a slave ES node):

2017-07-18_13:05:35.45455 [2017-07-18 09:05:35,452][WARN ][env] [Derrick Slegers Speed] max file descriptors [64000] for elasticsearch process likely too low, consider increasing to at least [65536]

Also have a few like these in both ES nodes:
> 2017-07-18_13:05:05.29675 [2017-07-18 09:05:05,296][INFO ][cluster.service] [Veil] removed {{graylog-e12c96d2-6edc-4982-b476-a82a4238ce3c}{kjUZKScOQiOGPWFER9TD9g}{secondary_ES_NODE_IP}{secondary_ES_NODE_IP:9350}{client=true, data=false, master=false},}, reason: zen-disco-node_failed({graylog-e12c96d2-6edc-4982-b476-a82a4238ce3c}{kjUZKScOQiOGPWFER9TD9g}{secondary_ES_NODE_IP}{secondary_ES_NODE_IP:9350}{client=true, data=false, master=false}), reason transport disconnected

I also noticed this on the kopf plugin page. It says there are 3 nodes when I know for a fact there are only two.

So when I went to the NODES tab I saw this. Notice that the first two nodes listed have the same IP, although different ports. They also seem to be running different ES versions. I have no idea how that happened or why it’s listing two different ES nodes under the same IP:

Thank you for your reply, and I hope what I posted is useful.

Are you sure about that? Do you have some examples?

How exactly do you send the messages to Graylog and how is the input receiving the messages configured?

Graylog joins the Elasticsearch cluster as a client node (i.e. not master-eligible and not storing data). That’s what you see in the Elasticsearch cluster state with Kopf (or any other UI).
You can identify the Graylog client node by the “graylog-” prefix followed by the Graylog node ID.
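If you want to double-check from the command line, here’s a minimal sketch in Python (assuming the Elasticsearch HTTP API is reachable on port 9200 without authentication) that lists the nodes the cluster knows about, their reported versions, and flags the Graylog client node by that prefix:

    # List every node in the Elasticsearch cluster with its reported version,
    # and flag the Graylog client node by its "graylog-" name prefix.
    # Assumption: the ES HTTP API is reachable on 127.0.0.1:9200 without auth.
    import requests

    ES_URL = "http://127.0.0.1:9200"  # adjust to one of your ES nodes

    nodes = requests.get(ES_URL + "/_nodes", timeout=10).json()["nodes"]
    for node_id, info in nodes.items():
        kind = "Graylog client node" if info["name"].startswith("graylog-") else "Elasticsearch node"
        print("{:<45} version {:<8} {}".format(info["name"], info.get("version", "?"), kind))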

Here’s a screenshot I just took now. I have highlighted both the original message’s timestamp and the one that Graylog puts in it:

For that particular message (coming from a Cisco ASA) this is the configuration in the ASA:

And this is the UDP input configuration. Also, I’m using that one input for many other network devices that use syslog:

Thank you for elaborating on that. I still don’t quite understand why it lists the Graylog server as using a different ES version.

That looks correct (assuming that your Cisco ASA is configured to use timezone UTC).

Could you please elaborate on the issue you have (or had)?

Because that’s the version of Elasticsearch embedded in the version of Graylog you’re using.

Ok, let’s take the screenshot with the message above as an example:

Notice how I selected to search in the last 2 hours? If I had selected the last 5 minutes, the result would have been no messages at all (even though the message shown arrived in real time). This issue happens at random, but almost constantly. Moreover, at times I have to search messages in the last day in order to see messages coming in in real time.

I never had this issue until I joined the new ES node to the cluster. When it was just one node, the search function worked fine.

Please let me know if you need any more specific details and thanks again for your help.

Which timezone is the system running Graylog on?
Which timezone is the system running Elasticsearch on?
Which timezone is configured for the Graylog user you’re logged in with?
Which timezone is configured on your Cisco ASA devices?
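If it helps, here’s a small standalone Python check (nothing Graylog-specific, just the standard library) you can run on each machine to compare its local time and UTC offset against UTC; the ASA’s timezone still has to be checked on the device itself:

    # Print this machine's local time (with UTC offset) and the current UTC
    # time, so the values can be compared between the Graylog and ES hosts.
    from datetime import datetime, timezone

    print("local:", datetime.now().astimezone().isoformat())
    print("utc  :", datetime.now(timezone.utc).isoformat())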

I believe they’re all in the same timezone but I’ll post screenshots, just in case.

UPDATE: I just found out that recalculating the index ranges manually (Indices > Maintenance > Recalculate index ranges) fixes the problem temporarily. It fails again exactly 5 minutes after the recalculation finishes (because I’m searching messages in the last 5 minutes, of course). If I choose to search in the last 15 minutes, then messages will show… until 15 minutes have passed since the last recalculation.
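For reference, the same recalculation can also be triggered outside the UI; here’s a rough sketch against the Graylog REST API (the URL, port, and credentials are placeholders, and the endpoint path may vary between Graylog versions):

    # Trigger a rebuild of all index ranges via the Graylog REST API
    # (the same action as Indices > Maintenance > Recalculate index ranges).
    # URL, port, and credentials below are placeholders for this setup.
    import requests

    GRAYLOG_API = "http://127.0.0.1:9000/api"
    AUTH = ("admin", "password")  # placeholder credentials

    resp = requests.post(
        GRAYLOG_API + "/system/indices/ranges/rebuild",
        auth=AUTH,
        headers={"X-Requested-By": "cli"},  # Graylog expects this header on POSTs
        timeout=30,
    )
    resp.raise_for_status()
    print("Rebuild triggered, HTTP", resp.status_code)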

Any thoughts?

At the very least you should increase the number of file descriptors, as the log suggests; those broken connections may well be caused by that. You could use something much larger, such as 260000.

Also, check the memory usage of the ES nodes with Kopf, just to be sure.
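If you’d rather check from the command line than from Kopf, a sketch like this (again assuming the HTTP API on port 9200 is reachable without authentication) prints both the file descriptor and heap numbers per node via the nodes stats API:

    # Print open/max file descriptors and heap usage for every node in the
    # Elasticsearch cluster, using the nodes stats API.
    # Assumption: the ES HTTP API is reachable on 127.0.0.1:9200 without auth.
    import requests

    ES_URL = "http://127.0.0.1:9200"  # adjust to one of your ES nodes

    stats = requests.get(ES_URL + "/_nodes/stats/process,jvm", timeout=10).json()["nodes"]
    for node in stats.values():
        print("{}: file descriptors {}/{}, heap {}% used".format(
            node["name"],
            node["process"]["open_file_descriptors"],
            node["process"]["max_file_descriptors"],
            node["jvm"]["mem"]["heap_used_percent"],
        ))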

Also, I’ll look into increasing the number of file descriptors.

I found someone having the same issue here. For one of the people in that thread, manually cycling the deflector fixed their issues. Do you know what unexpected consequences that might have?

Hi,

if you have set up retention based on the number of indices, you might lose data sooner than you intended. You can counter this by keeping a few spare indices in the retention settings, so that you still have room to manually cycle the deflector index.

Hi, and thanks for your reply. These are my current retention settings. What would you suggest I change? Sorry, I’m not too clear on all this; I’ve never really had problems with Graylog before.

That depends on your policy.

For example, if your policy is to hold data for 30 days, you could use time-based index rotation and choose a rotation period that is convenient for you (such as 1 day or 4 hours). The max number of indices would then be 30 days / rotation period, plus some spare for manual deflector cycling.
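As a rough worked example of that arithmetic (the numbers are only illustrative):

    # Worked example: 30-day retention with time-based rotation every 4 hours
    # and 3 spare indices kept free for manual deflector cycling.
    retention_days = 30
    rotation_hours = 4
    spare_indices = 3

    max_indices = retention_days * 24 // rotation_hours + spare_indices
    print(max_indices)  # 30 * 24 / 4 + 3 = 183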

Here your policy seems to be to hold the last 800,000,000 messages. In that case you can add a few extra indices to the max number of indices, so that you have some spare for deflector cycling.

For example, if you add 3 to the max number of indices, you can manually cycle your deflector 3 times without breaking your company policy.

Got it, thanks for the explanation.

FYI, I manually cycled the deflector early this afternoon and the problem seems to be gone. Thank you both for your useful insights.
