Graylog 2.4.3 Crash from Lookup Table


(Jason Keller) #1

So I just had multiple major cluster-wide crashes of Graylog while configuring a CSV Lookup table.

Update: I no longer get this error message about parsing the CSV, the nodes simply spin.

2018-01-25T10:29:29.707-06:00 ERROR [CSVFileDataAdapter] Couldn't parse CSV file /etc/graylog/ip-to-subnet-csv (settings separator=<,> quotechar=<'> key_column=<ipaddr> value_column=<network>)
java.lang.ArrayIndexOutOfBoundsException: 1
        at org.graylog2.lookup.adapters.CSVFileDataAdapter.parseCSVFile(CSVFileDataAdapter.java:156) [graylog.jar:?]
        at org.graylog2.lookup.adapters.CSVFileDataAdapter.doStart(CSVFileDataAdapter.java:91) [graylog.jar:?]
        at org.graylog2.plugin.lookup.LookupDataAdapter.startUp(LookupDataAdapter.java:59) [graylog.jar:?]
        at com.google.common.util.concurrent.AbstractIdleService$DelegateService$1.run(AbstractIdleService.java:62) [graylog.jar:?]
        at com.google.common.util.concurrent.Callables$4.run(Callables.java:122) [graylog.jar:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]

Then the nodes sit and spin like this…

2018-01-25T10:40:37.706-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-01-25T10:40:45.110-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-01-25T10:40:52.719-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-01-25T10:41:02.603-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-01-25T10:41:07.410-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-01-25T10:41:20.178-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-01-25T10:41:28.923-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-01-25T10:42:04.614-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-01-25T10:42:19.772-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-01-25T10:42:38.571-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-01-25T10:43:29.526-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-01-25T10:45:45.185-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.

I actually had to get into MongoDB and manually delete the LUT entry in the collection, and systemctl restart graylog-server on every node.

I’m alarmed that my simple configuration turned it to dust. Is there something I’m missing? My CSV is loaded in the same place on every node…

# ls -lh /etc/graylog
-rw-r--r--. 1 root root 250M Jan 25 10:30 ip-to-subnet-csv

# wc -l /etc/graylog/ip-to-subnet-csv
8250428 /etc/graylog/ip-to-subnet-csv

# file -bi /etc/graylog/ip-to-subnet-csv
text/plain; charset=us-ascii

Note that file -bi will return us-ascii as it’s a UTF-8 subset; my call in Python follows…note that I’m not using ‘utf-8-sig’ as the BOM appears to not be recognized by Graylog.

o = codecs.open("/etc/graylog/ip-to-subnet-csv", "w", "utf-8")

Looks like this…

'ipaddr','network'
'10.81.2.0','10.81.2.0/24'
'10.81.2.1','10.81.2.0/24'
'10.81.2.2','10.81.2.0/24'
'10.81.2.3','10.81.2.0/24'
'10.81.2.4','10.81.2.0/24'
'10.81.2.5','10.81.2.0/24'
'10.81.2.6','10.81.2.0/24'
'10.81.2.7','10.81.2.0/24'
'10.81.2.8','10.81.2.0/24'
'10.81.2.9','10.81.2.0/24'
'10.81.2.10','10.81.2.0/24'
'10.81.2.11','10.81.2.0/24'
'10.81.2.12','10.81.2.0/24'
'10.81.2.13','10.81.2.0/24'
'10.81.2.14','10.81.2.0/24'
'10.81.2.15','10.81.2.0/24'
'10.81.2.16','10.81.2.0/24'
......

Configured like so…

File path : /etc/graylog/ip-to-subnet-csv
Check interval: 3600
Separator: ,
Quote character: '
Key column: ipaddr
Value column: network
Allow case-insensitive lookups: true

Is this not all ok? Is my CSV file too big? Do I need more than 2GB of heap for Graylog to accommodate a file this size? I’m not even caching it since it’s a local file.

How I got here - normally I would query our Infoblox directly for this information. However, for every single IP address, this is very, very slow to do - jams up my pipeline immensely. In order to speed this up, I figured I could use a Python script on the master node to query for a list of all the existing networks (much faster, especially all in one call) and then use Python to generate a full listing of IPs to subnets in a proper CSV file, then SCP to the other nodes (I have this all cron’d and automated; updates daily). This way I have all the information needed for the lookup pre-loaded onto the Graylog nodes for fast access.

Anyone that can help solve this mystery, your help would be greatly appreciated! @jochen? @lennart?

Update: I severely truncated the file (down to a few dozen lines) and that appears to work, so formatting etc doesn’t appear to be the issue. By the look of the error message, it appears that the LUT mechanism is jamming it all into an array (of which we should have max size of roughly 2.1 billion). The total line size above of about 8.25 million lines should fit well within that threshold. Which leaves me utterly perplexed at this point.


Mapping IPs to subnets
(Jochen) #2

A CSV file read by the CSV adapter, no matter how corrupt or big, should never crash the Graylog node it’s running on.

Please make sure to create a bug report with all the information you have on https://github.com/Graylog2/graylog2-server/issues.

From a technical standpoint, the CSV File adapter currently reads in the complete CSV file and tries to parse it to put it into an in-memory structure. So yes, 2 GB of heap memory might be too little for a 256 MB big CSV file, when it has to suffice for all other things that Graylog is doing. Personally, I consider this a bug.

Depending on what you want to achieve, you might want to take a look at the to_ip() and cidr_match() functions.


(Jason Keller) #3

@jochen

I noticed something curious late yesterday about the behavior of this - it pegs one core, and only one core, while it appears to completely stop responding. This screamed one thing loudly to me - constant garbage collection, with stop-the-world consequences. I suspect that all the memory that these tables are loaded into are heap memory.

So after taking this all to my lab (which has about 4-7 msgs/s, vs the 4000-8000 msgs/s in prod), and starting with a heap size double what we have in prod (2GB instead of the default 1GB), I noticed that initially the nodes looked like they were very agitated, and about to stop responding, but then after a minute or two, they seemed to stabilize, however the CPUs were significantly busier and the heap never went below 1.6GB even after a GC. Overall responsiveness seemed sluggish.

So I increased the node memory from 4 > 6 GB and the HEAP memory from 2 > 4 GB. Upon rebooting the nodes, they seemed far happier, CPU is back where it should be and the HEAP size post-GC is about 1.9-2.1GB. Lookups to the CSV seem imperceptibly quick, which is good. The bad news is how much HEAP memory this can eat, clearly about 6 times the size you see on disk even uncompressed.

I’m trying to get more memory in our production nodes today to see if I have the same results (I expect so). So right now it looks like the best path forward is dire warnings via clear, loud documentation for sizing HEAP memory to accommodate beastly CSV files until a guard rail of some sort can be constructed to prevent enthusiastic fools like me from skidding off the track.

I had also been thinking for a long time about trying to achieve this a different way, but even with the to_ip() and cidr_match() functions, I can’t think of another way in Graylog to take an IP and convert it into a subnet (you still need a full list of all subnets to compare against, as the masks are variable and cannot be divined from other means). I suppose a custom lookup table engine could be made that would take an IP against a list of subnets using a CIDR comparison function with a loop, but this is beyond the scope of pipelines and basic lookup tables.

I suppose what would ultimately supplant this whole thing is making a custom lookup table function that would ingest the ${key} element and take it through a cidr_match() function against a simple list of subnets to see which it would match against, instead of having to parse out every single IP address manually. We could pre-harvest all the network elements with a single call, or preload a CSV table (right now I’m using a python script to grab all network addresses with a single WAPI call to Infoblox).

I’m just thinking out loud here. Please let me know what you think. Personally I’d love to see a lookup table function like that in core, as I’d imagine in large networks like ours this would be a common issue (one that I’m at the moment using a hammer to solve instead of a scalpel).


(Jochen) #4

What are you exactly trying to achieve? Can you provide some examples?

For reference:


(Jason Keller) #5

@jochen

So in Infoblox we have a lot of information (in this case, dhcp domain-name) that we are trying to enrich messages with. Unfortunately, I have two issues…

First, querying Infoblox for each and every IP over WAPI right now is horrendously slow (read: 8000msgs/s > 12 msg/s). In other words, not sustainable.

Second, even if I could, that particular attribute is only on the network subnet, and isn’t heritable to the actual IP address object when I query it. And it isn’t the only non-heritable attribute in this instance.

So right now I’m stuck doing two lookups - one to identify the network the IP belongs to, the next querying Infoblox for the information that I want out of that network object.

Once I have the network identified, I can pull all sorts of interesting and crazy stuff I’m about to put in Infoblox (vertical, customer, VLAN, etc). This also has the side-effect of quelling additional lookups to Infoblox, since each IP will resolve to its respective subnet, so when one IP hits that network and we get that information, the next IP in the subnet doesn’t hit Infoblox for it, it hits cache, majorly increasing our cache hit rate and making our lookups thousands of times more efficient.


(Jason Keller) #6

@jochen

I’ve now gotten this running in our production environment. I’ve actually had to increase each node to 7GB w/ 4GB HEAP, so far it appears to be working (6GB w/ 4GB HEAP actually ended up swapping and started choking nodes). I will let you know if that changes.


(Jason Keller) #7

@jochen

Strangely, while this all seems to work ok so far, I’m still apparently running foul of something occasionally…

Node1
2018-02-02T01:43:21.754-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-02-02T01:43:31.475-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-02-02T01:43:42.296-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-02-02T01:43:52.041-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-02-02T01:44:00.143-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.

Node2
2018-02-02T01:08:51.809-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-02-02T01:09:01.837-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-02-02T01:09:11.115-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-02-02T01:09:22.575-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-02-02T01:09:33.206-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-02-02T01:09:45.484-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-02-02T01:09:52.076-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.

Node3
2018-02-02T01:20:23.526-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-02-02T01:20:33.840-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-02-02T01:20:44.022-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-02-02T01:20:54.866-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-02-02T01:21:09.321-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-02-02T01:21:18.187-06:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.

I’ve noticed that I’m still really close on memory…

top - 10:37:35 up 4 days,  2:55,  1 user,  load average: 1.74, 1.42, 1.32
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s): 10.5 us,  2.3 sy,  0.0 ni, 86.8 id,  0.1 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem :  6960820 total,   221064 free,  5888820 used,   850936 buff/cache
KiB Swap:  1047548 total,  1015612 free,    31936 used.   337664 avail Mem

I haven’t found any other errors, so I’m pretty sure memory pressure is causing some operations in MongoDB to take a bit longer than normal (currently taking 592MB for MongoDB). Biggest question is, how is it continuing to take so much memory?


(Jason Keller) #8

@jochen @lennart

So after further increasing memory, I’ve now caught the meta info errors happening without any swapping to disk; simply higher CPU usage. However, most notably I’m see this crop up usually during periods where the CSV file Data Adapter is refreshing after changing on disk. Is the loading thread for this a stop-world event while it’s being processed into the heap?


(system) #9

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.