So I just had multiple major cluster-wide crashes of Graylog while configuring a CSV Lookup table.
Update: I no longer get this error message about parsing the CSV; the nodes simply spin.
2018-01-25T10:29:29.707-06:00 ERROR [CSVFileDataAdapter] Couldn't parse CSV file /etc/graylog/ip-to-subnet-csv (settings separator=<,> quotechar=<'> key_column=<ipaddr> value_column=<network>)
java.lang.ArrayIndexOutOfBoundsException: 1
at org.graylog2.lookup.adapters.CSVFileDataAdapter.parseCSVFile(CSVFileDataAdapter.java:156) [graylog.jar:?]
at org.graylog2.lookup.adapters.CSVFileDataAdapter.doStart(CSVFileDataAdapter.java:91) [graylog.jar:?]
at org.graylog2.plugin.lookup.LookupDataAdapter.startUp(LookupDataAdapter.java:59) [graylog.jar:?]
at com.google.common.util.concurrent.AbstractIdleService$DelegateService$1.run(AbstractIdleService.java:62) [graylog.jar:?]
at com.google.common.util.concurrent.Callables$4.run(Callables.java:122) [graylog.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
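An aside on that first error: an ArrayIndexOutOfBoundsException: 1 in parseCSVFile smells like some row parsing into fewer than two columns (a blank line, or an unbalanced quote swallowing a separator). That's just my guess. A quick sanity check with Python 3's csv module, using the same separator and quote character as the adapter (Graylog's own parser may not behave identically):

import csv

with open("/etc/graylog/ip-to-subnet-csv", newline="") as f:
    reader = csv.reader(f, delimiter=",", quotechar="'")
    for lineno, row in enumerate(reader, 1):
        if len(row) != 2:
            print(lineno, row)  # flag any row that isn't exactly key,value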
Then the nodes sit and spin like this…
2018-01-25T10:40:37.706-06:00 WARN [NodePingThread] Did not find meta info of this node. Re-registering.
2018-01-25T10:40:45.110-06:00 WARN [NodePingThread] Did not find meta info of this node. Re-registering.
2018-01-25T10:40:52.719-06:00 WARN [NodePingThread] Did not find meta info of this node. Re-registering.
2018-01-25T10:41:02.603-06:00 WARN [NodePingThread] Did not find meta info of this node. Re-registering.
2018-01-25T10:41:07.410-06:00 WARN [NodePingThread] Did not find meta info of this node. Re-registering.
2018-01-25T10:41:20.178-06:00 WARN [NodePingThread] Did not find meta info of this node. Re-registering.
2018-01-25T10:41:28.923-06:00 WARN [NodePingThread] Did not find meta info of this node. Re-registering.
2018-01-25T10:42:04.614-06:00 WARN [NodePingThread] Did not find meta info of this node. Re-registering.
2018-01-25T10:42:19.772-06:00 WARN [NodePingThread] Did not find meta info of this node. Re-registering.
2018-01-25T10:42:38.571-06:00 WARN [NodePingThread] Did not find meta info of this node. Re-registering.
2018-01-25T10:43:29.526-06:00 WARN [NodePingThread] Did not find meta info of this node. Re-registering.
2018-01-25T10:45:45.185-06:00 WARN [NodePingThread] Did not find meta info of this node. Re-registering.
I actually had to go into MongoDB, manually delete the LUT entry from the collection, and run systemctl restart graylog-server on every node.
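For anyone else who gets wedged like this, the cleanup was along these lines. The collection name lut_data_adapters and the adapter name here are my best recollection/assumption, so verify against your own instance before deleting anything:

# mongo graylog --eval 'db.lut_data_adapters.remove({name: "ip-to-subnet"})'
# systemctl restart graylog-server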
I'm alarmed that my simple configuration turned it to dust. Is there something I'm missing? My CSV is loaded in the same place on every node…
# ls -lh /etc/graylog
-rw-r--r--. 1 root root 250M Jan 25 10:30 ip-to-subnet-csv
# wc -l /etc/graylog/ip-to-subnet-csv
8250428 /etc/graylog/ip-to-subnet-csv
# file -bi /etc/graylog/ip-to-subnet-csv
text/plain; charset=us-ascii
Note that file -bi returns us-ascii because US-ASCII is a subset of UTF-8. My call in Python follows; note that I'm not using 'utf-8-sig', as the BOM appears not to be recognized by Graylog.
import codecs
o = codecs.open("/etc/graylog/ip-to-subnet-csv", "w", "utf-8")
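A quick way to double-check that no BOM snuck in (just my own sanity check, not anything Graylog-specific):

with open("/etc/graylog/ip-to-subnet-csv", "rb") as f:
    print(f.read(3) == b"\xef\xbb\xbf")  # True would mean a UTF-8 BOM is present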
The file itself looks like this…
'ipaddr','network'
'10.81.2.0','10.81.2.0/24'
'10.81.2.1','10.81.2.0/24'
'10.81.2.2','10.81.2.0/24'
'10.81.2.3','10.81.2.0/24'
'10.81.2.4','10.81.2.0/24'
'10.81.2.5','10.81.2.0/24'
'10.81.2.6','10.81.2.0/24'
'10.81.2.7','10.81.2.0/24'
'10.81.2.8','10.81.2.0/24'
'10.81.2.9','10.81.2.0/24'
'10.81.2.10','10.81.2.0/24'
'10.81.2.11','10.81.2.0/24'
'10.81.2.12','10.81.2.0/24'
'10.81.2.13','10.81.2.0/24'
'10.81.2.14','10.81.2.0/24'
'10.81.2.15','10.81.2.0/24'
'10.81.2.16','10.81.2.0/24'
......
Configured like so…
File path : /etc/graylog/ip-to-subnet-csv
Check interval: 3600
Separator: ,
Quote character: '
Key column: ipaddr
Value column: network
Allow case-insensitive lookups: true
Is this not all OK? Is my CSV file too big? Do I need more than 2 GB of heap for Graylog to accommodate a file this size? I'm not even caching it, since it's a local file.
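Back-of-envelope on the heap question, with rough assumptions on my part (a Java 8 String at ~40 bytes of overhead plus 2 bytes per char, plus ~48 bytes per map entry; actual numbers depend on the JVM and on how the adapter stores rows):

entries = 8250428
key_bytes = 40 + 2 * 12    # '10.81.2.123' is ~11-12 chars
value_bytes = 40 + 2 * 14  # '10.81.2.0/24' is ~12-14 chars
entry_overhead = 48        # rough guess at map-entry + reference cost
total = entries * (key_bytes + value_bytes + entry_overhead)
print(total / 2.0**30)     # ~1.4 GiB, before any temporary strings during parsing

If that's anywhere near right, the resident map alone would eat most of a 2 GB heap.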
How I got here: normally I would query our Infoblox directly for this information. However, doing that for every single IP address is very, very slow and jams up my pipeline immensely. To speed this up, I figured I could use a Python script on the master node to query for a list of all the existing networks (much faster, especially all in one call), have Python expand that into a full IP-to-subnet listing in a proper CSV file, and then SCP it to the other nodes (this is all cron'd and automated; it updates daily). This way all the information needed for the lookup is pre-loaded onto the Graylog nodes for fast access. The expansion step is sketched below.
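A minimal sketch of that expansion step, assuming Python 3 and the stdlib ipaddress module; the Infoblox query is omitted and the network list here is a placeholder:

import codecs
import ipaddress

networks = ["10.81.2.0/24"]  # really the full list from one Infoblox call

with codecs.open("/etc/graylog/ip-to-subnet-csv", "w", "utf-8") as o:
    o.write("'ipaddr','network'\n")
    for net in networks:
        # iterate every address in the block, network/broadcast included,
        # matching the sample rows above
        for ip in ipaddress.ip_network(net):
            o.write("'%s','%s'\n" % (ip, net))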
Anyone who can help solve this mystery, your help would be greatly appreciated! @jochen? @lennart?
Update: I severely truncated the file (down to a few dozen lines) and that appears to work, so formatting doesn't seem to be the issue. Going by the error message, the LUT mechanism appears to jam everything into an array, which in Java can hold roughly 2.1 billion elements (Integer.MAX_VALUE). The file's roughly 8.25 million lines should fit well within that limit, which leaves me utterly perplexed at this point.
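For what it's worth, a crude way to narrow down the threshold from that truncation test (the test file path is just an example):

# head -n 100000 /etc/graylog/ip-to-subnet-csv > /etc/graylog/ip-to-subnet-test-csv

then point a throwaway adapter at the test file and double the line count until it breaks.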