MongoDB or Elasticsearch crash

Hello,
I'm running the latest OVA version as a proof of concept.

I've started collecting logs from a few Windows servers with NXLog - about 1,000 msg/s.

I have an 8 CPU / 100 GB RAM / 1.5 TB VM. All services are running on the same VM, as provided by the OVA.

For an unknown reason, Elasticsearch or MongoDB goes down from time to time. It can run without a hitch for 8 days and then all of a sudden stop working.

I had a look at /var/log/graylog but did not find any clue.

Do you have an idea where I should start debugging?

Edit: It happened last night around 10:00 PM. Messages were going in but not out, and the mongodb service was down (according to graylog-ctl status). When I restarted all services, I still had no messages going out. I had to delete the journal folder to get the services running again.
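(For reference, the recovery steps look roughly like this on the omnibus/OVA install; the journal path is an assumption based on the /var/opt/graylog/data layout, and deleting it discards any unprocessed messages:)

sudo graylog-ctl status                      # which services are up or down
sudo graylog-ctl stop                        # stop everything before touching the journal
sudo rm -rf /var/opt/graylog/data/journal    # assumed journal location - this drops unprocessed messages
sudo graylog-ctl start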

You might want to check your log files and the available disk space:

http://docs.graylog.org/en/2.4/pages/configuration/file_location.html#omnibus-package
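For example (assuming the omnibus layout described on that page):

df -h                      # free disk space per filesystem
ls -lt /var/log/graylog/   # one log directory per service (elasticsearch, mongodb, server, ...)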

I have space available.

I had already looked at that page and found my log files, but I don't actually know which one to open.

I have all of these - which one should I open first?

The latest log entries are in the file named current.

Also see https://cr.yp.to/daemontools/multilog.html for details about the other log files.
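So, assuming the omnibus layout, following the live logs looks like:

tail -f /var/log/graylog/elasticsearch/current
tail -f /var/log/graylog/mongodb/current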

Thank you :slightly_smiling_face:

I was able to see what happened.

For ES:
2018-02-26_06:26:57.17170 vm.max_map_count = 262144
2018-02-26_06:26:57.30152 Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00007fe7c4990000, 62669651968, 0) failed; error='Cannot allocate memory' (errno=12)
2018-02-26_06:26:57.30202 #
2018-02-26_06:26:57.30251 # There is insufficient memory for the Java Runtime Environment to continue.
2018-02-26_06:26:57.30336 # Native memory allocation (mmap) failed to map 62669651968 bytes for committing reserved memory.
2018-02-26_06:26:57.30376 # Can not save log file, dump to screen...
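That failed mmap is the JVM asking for 62669651968 bytes, i.e. roughly 58 GiB of heap, in one allocation; with MongoDB, the Graylog server, and the OS page cache sharing the same 100 GB of RAM, such a request can fail even though the machine looks big enough. A quick sanity check with generic commands (not OVA-specific):

free -h                                                    # total/used/free memory and swap
ps aux | grep '[e]lasticsearch' | grep -o '\-Xm[sx][^ ]*'  # the -Xms/-Xmx flags the ES JVM was started with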

For MongoDB - quite surprising, as df -h tells me I still have 5% free.
2018-03-06_00:12:55.23096 2018-03-06T01:12:55.145+0100 W FTDC [ftdc] Uncaught exception in 'FileStreamFailed: Failed to write to interim file buffer for full-time diagnostic data capture: /var/opt/graylog/data/mongodb/diagnostic.data/metrics.interim.temp' in full-time diagnostic data capture subsystem. Shutting down the full-time diagnostic data capture subsystem.
2018-03-06_00:12:59.57949 2018-03-06T01:12:59.579+0100 E STORAGE [thread2] WiredTiger error (28) [1520295179:577723][4623:0x7f6135c92700], file:WiredTiger.wt, WT_SESSION.checkpoint: /var/opt/graylog/data/mongodb/WiredTiger.turtle.set: handle-write: pwrite: failed to write 1014 bytes at offset 0: No space left on device
2018-03-06_00:12:59.61032 2018-03-06T01:12:59.582+0100 E STORAGE [thread2] WiredTiger error (28) [1520295179:582663][4623:0x7f6135c92700], file:WiredTiger.wt, WT_SESSION.checkpoint: /var/opt/graylog/data/mongodb/WiredTiger.turtle.set: handle-write: pwrite: failed to write 1014 bytes at offset 0: No space left on device
2018-03-06_00:12:59.67587 2018-03-06T01:12:59.589+0100 E STORAGE [thread2] WiredTiger error (0) [1520295179:582743][4623:0x7f6135c92700], file:WiredTiger.wt, WT_SESSION.checkpoint: WiredTiger.turtle: encountered an illegal file format or internal value
2018-03-06_00:12:59.67626 2018-03-06T01:12:59.589+0100 E STORAGE [thread2] WiredTiger error (-31804) [1520295179:589780][4623:0x7f6135c92700], file:WiredTiger.wt, WT_SESSION.checkpoint: the process must exit and restart: WT_PANIC: WiredTiger library panic
2018-03-06_00:12:59.67669 2018-03-06T01:12:59.605+0100 I - [thread2] Fatal Assertion 28558 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 361
2018-03-06_00:12:59.67799 2018-03-06T01:12:59.605+0100 I - [thread2]
2018-03-06_00:12:59.67848
2018-03-06_00:12:59.74972 ***aborting after fassert() failure
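"No space left on device" while df -h still shows 5% free is not necessarily a contradiction: ext filesystems reserve about 5% of blocks for root by default (so non-root processes hit the wall early), and a filesystem can also run out of inodes while blocks remain. Both are easy to check (the device name below is a placeholder):

df -h /var/opt/graylog                                    # free blocks
df -i /var/opt/graylog                                    # free inodes
sudo tune2fs -l /dev/<device> | grep -i 'reserved block'  # reserved-block count on ext2/3/4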

I will keep checking these logs and see what happened if it crashes again.

I had the issue again yesterday around 11:00 PM.

No output messages were being processed, even though all services were up. I had 10% free memory, the swap partition was full, and about 200 GB free on the 1.4 TB drive.
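When it happens again, it is worth capturing the memory and swap picture at that moment, e.g.:

free -h       # memory and swap usage
swapon -s     # per-device swap usage
vmstat 5 5    # the si/so columns reveal active swapping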

I have this warning in the ES log:
2018-03-14_07:55:09.71565 [WARN ][o.e.c.r.a.DiskThresholdMonitor] [YrJKTCm] high disk watermark [90%] exceeded on [YrJKTCmrTTmmWH1eg-b2rg][YrJKTCm][/var/opt/graylog/data/elasticsearch/nodes/0] free: 91.1gb[6%], shards will be relocated away from this node
2018-03-14_07:55:39.71665 [WARN ][o.e.c.r.a.DiskThresholdMonitor] [YrJKTCm] high disk watermark [90%] exceeded on [YrJKTCmrTTmmWH1eg-b2rg][YrJKTCm][/var/opt/graylog/data/elasticsearch/nodes/0] free: 91.1gb[6%], shards will be relocated away from this node
2018-03-14_07:55:39.71972 [INFO ][o.e.c.r.a.DiskThresholdMonitor] [YrJKTCm] rerouting shards: [high disk watermark exceeded on one or more nodes]

For the server log, I only have .s files, which I cannot open with vim or with cat.
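Per the multilog documentation linked above, rotated files have names like @400000003b4a39c23294b13c.s, and they are plain text despite the odd name; the .s suffix just means the file was safely written to disk. Assuming the server logs live under /var/log/graylog/server, something like this should work:

ls /var/log/graylog/server/
cat /var/log/graylog/server/@*.s | less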

You might want to read about how Elasticsearch allocates indices (shards) on disk:
https://www.elastic.co/guide/en/elasticsearch/reference/5.6/disk-allocator.html
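In short: once a node crosses the high watermark (90% disk used by default), Elasticsearch starts relocating shards away from it, and on a single-node setup there is nowhere to relocate them to. The real fix is freeing disk (e.g. shorter index retention in Graylog); as a stopgap, the thresholds can be raised at runtime. A sketch against ES 5.6, with illustrative values:

curl -XPUT -H 'Content-Type: application/json' 'http://127.0.0.1:9200/_cluster/settings' -d '
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%"
  }
}'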
