Use of ${QUOTEDSTRING} causes Graylog to seize - Part II

1. Describe your incident:
We are creating a grok pattern to parse a complex message that contains quoted strings with commas inside the quotes. Whenever we use the default %{QUOTEDSTRING} or %{QS} grok patterns against these logs, Graylog stops indexing all logs to OpenSearch until the Graylog server is rebooted. The problem does not occur when %{QUOTEDSTRING} or %{QS} is removed from the grok pattern.

2. Describe your environment:
We’re running Graylog 6.1 and OpenSearch 2.15 in a two-server configuration on a security-hardened Oracle Linux 8 OS.

3. What steps have you already taken to try and solve the problem?

  • We found nothing in the Graylog server logs or OpenSearch cluster logs related to the issue.
  • We created our own pattern called %{QUOTEDSTR} with the regex “(.*?” but using this pattern also caused Graylog to seize with the exact same symptoms.

4. How can the community help?

  • How can we grok quoted strings without using %{QUOTEDSTRING} or %{QS}?
  • How can we use %{QUOTEDSTRING} or %{QS} without Graylog seizing?

Howdy!

Are you able to provide both your grok pattern and a couple of sample texts that the grok pattern is meant to be applied to?

I ask because grok (which, for the purposes of this reply, I’ll treat as regular expressions) is highly sensitive to both how the patterns are constructed and the text being parsed. Use of .*? can be very expensive (CPU-wise) due to backtracking: when a match attempt fails, the regular expression engine retries the lazy quantifier at each successive position, testing characters one by one. If the text is long, this can be VERY CPU-intensive.

This can be demonstrated using a regular expression debugger, like the one provided via https://regex101.com/ that shows the number of steps required to have the regular expression parse the text.
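You can also see the effect locally. The following sketch uses Python’s `re` engine rather than the Java engine behind Graylog’s grok library, so the absolute numbers differ, but the backtracking behaviour is the same in kind (the sample strings here are made up for illustration):

```python
import re
import time

line = 'field1,"quoted, with, commas",field3'

# Two ways to capture a quoted string: a lazy wildcard (as in the custom
# QUOTEDSTR attempt above) versus a negated character class.
lazy = re.compile(r'"(.*?)"')
safe = re.compile(r'"([^"]*)"')

# On well-formed input, both extract the same value.
assert lazy.search(line).group(1) == safe.search(line).group(1)
print(safe.search(line).group(1))  # quoted, with, commas

# But patterns that can backtrack blow up when a match FAILS, because the
# engine must try every way of splitting the input before giving up.
evil = re.compile(r'(a+)+$')  # classic catastrophic-backtracking example
start = time.perf_counter()
assert evil.match("a" * 20 + "b") is None  # ~2^20 attempts before failing
print(f"failed match took {time.perf_counter() - start:.3f}s")
```

Note that the cost shows up on the messages that do NOT match, which is exactly the situation you’d be in if the pattern and the live data have drifted apart.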

Lastly, can you further quantify or qualify what “causes Graylog to seize” means? In technical terms, does this mean that all CPUs available to Graylog are at 100% utilization and Graylog stops ingesting messages? I would also expect the process buffer to become 100% filled in this scenario. Depending on message throughput (both messages per second AND total size of messages per second) the Graylog server(s) may need additional CPU, but I’d like to debug the regex pattern to see if there is an easy fix there.

Thanks!

Sure - here is a sanitized version:

Pattern:

%{WORD:logsrc}.div.company.com %{NUMBER:num},%{DATA:receive_time},%{NUMBER:serial},%{WORD:type},%{WORD:subtype},%{NUMBER:port},%{DATA:time_generated},%{DATA:src},%{DATA:dst},%{DATA:natsrc},%{DATA:natdst},%{DATA:rule},%{DATA:srcuser},%{DATA:dstuser},%{DATA:app},%{DATA:vsys},%{DATA:from},%{DATA:to},%{DATA:inbound_if},%{DATA:outboundif},%{DATA:logset},%{DATA:unknown_time},%{NUMBER:sessionid},%{NUMBER:repeatcnt},%{NUMBER:sport},%{NUMBER:dport},%{NUMBER:natsport},%{NUMBER:natdport},%{DATA:flags},%{DATA:proto},%{DATA:action},%{NUMBER:bytes},%{NUMBER:bytes_sent},%{NUMBER:bytes_received},%{NUMBER:packets},%{DATA:start},%{NUMBER:elapsed},%{DATA:category},%{DATA:unknown_field1},%{DATA:seqno},%{DATA:actionflags},%{DATA:srcloc},%{DATA:dstloc},%{DATA:unknown_field2},%{DATA:pkts_sent},%{DATA:pkts_received},%{DATA:session_end_reason},%{DATA:dg_hier_level_1},%{DATA:dg_hier_level_2},%{DATA:dg_hier_level_3},%{DATA:dg_hier_level_4},%{DATA:vsys_name},%{DATA:device_name},%{DATA:action_source},%{DATA:src_uuid},%{DATA:dst_uuid},%{DATA:tunnelid},%{DATA:monitortag},%{DATA:parent_session_id},%{DATA:parent_start_time},%{DATA:tunnel},%{NUMBER:assoc_id},%{NUMBER:chunks},%{NUMBER:chunks_sent},%{NUMBER:chunks_received},%{DATA:rule_uuid},%{DATA:http2_connection},%{DATA:link_change_count},%{DATA:policy_id},%{DATA:link_switches},%{DATA:sdwan_cluster},%{DATA:sdwan_device_type},%{DATA:sdwan_cluster_type},%{DATA:sdwan_site},%{DATA:dynusergroup_name},%{DATA:xff_ip},%{DATA:src_category},%{DATA:src_profile},%{DATA:src_model},%{DATA:src_vendor},%{DATA:src_osfamily},%{DATA:src_osversion},%{DATA:src_host},%{DATA:src_mac},%{DATA:dst_category},%{DATA:dst_profile},%{DATA:dst_model},%{DATA:dst_vendor},%{DATA:dst_osfamily},%{DATA:dst_osversion},%{DATA:dst_host},%{DATA:dst_mac},%{DATA:container_id},%{DATA:pod_namespace},%{DATA:pod_name},%{DATA:src_edl},%{DATA:dst_edl},%{DATA:hostid},%{DATA:serialnumber},%{DATA:src_dag},%{DATA:dst_dag},%{DATA:session_owner},%{DATA:high_res_timestamp},%{DATA:nssai_sst},%{DATA:nssai_sd},%{DATA:subcategory_of_app},%{DATA:category_of_app},%{DATA:technology_of_app},%{DATA:risk_of_app},%{QUOTEDSTRING:characteristic_of_app},%{DATA:container_of_app},%{DATA:tunneled_app},%{DATA:is_saas_of_app},%{DATA:sanctioned_state_of_app},%{DATA:offloaded},%{DATA:traffic_type},

Message:

panorama.div.company.com 1,2024/12/02 15:18:02,024101003988,TRAFFIC,end,2817,2024/12/02 15:18:02,10.101.19.7,10.101.159.130,0.0.0.0,0.0.0.0,inside-in_46,msrpc-base,vdiv1,PROD.Internal,PROD.DIVNET,ethernet1/2,ethernet1/1,default,2024/12/02 15:18:02,655765,1,61636,49681,0,0,0x401a,tcp,allow,4896,3798,1098,16,2024/12/02 15:17:19,29,any,7432514093394733498,0x8000000000000000,10.0.0.0-10.255.255.255,10.0.0.0-10.255.255.255,9,7,tcp-rst-from-client,20,11,0,0,PA2.NJ17,from-policy,0,0,N/A,0,0,0,0,1d654101-d3e3-774e-df69-82f43a334666,0,0,2024-12-02T15:18:02.940-05:00,infrastructure,networking,network-protocol,2,“has-known-vulnerability,tunnel-other-application,pervasive-use”,msrpc,untunneled,no,no,0,NonProxyTraffic,

It’s a Palo Alto Networks PANOS TRAFFIC log. The “characteristic_of_app” field is a long-ish quoted string that includes commas within the quotes. When we use the Test with Sample Data button it groks the sample log successfully, but whenever we save the grok pattern with %{QUOTEDSTRING:characteristic_of_app} defined, the output meter goes to zero and no further logs are indexed to OpenSearch, even though the input meter still runs as usual. We noticed using the top command that CPU utilization is normally 20-30% but when “seized” it will jump to over 400%. If we remove %{QUOTEDSTRING:characteristic_of_app} from the grok pattern and then either reboot or restart the graylog-server service, the output to OpenSearch will resume as normal.

When I looked into this the first time you reported it, I found that the grok pattern simply did not match the message. The mismatch occurred before the quoted-string pattern was encountered.

Have you verified that the match actually fails at the quoted string?

To narrow this down, can you reproduce this with a shorter test message and simpler grok pattern?

Thank you @marziglt much appreciated.

Can you share the context that this grok pattern is used? For example, via an extractor? Via a pipeline rule?

Also, to expand on what Patrick said, it appears the grok pattern does not match the sample log message. Specifically, it matches only up to %{NUMBER:natsport}:

%{WORD:logsrc}.div.company.com %{NUMBER:num},%{DATA:receive_time},%{NUMBER:serial},%{WORD:type},%{WORD:subtype},%{NUMBER:port},%{DATA:time_generated},%{DATA:src},%{DATA:dst},%{DATA:natsrc},%{DATA:natdst},%{DATA:rule},%{DATA:srcuser},%{DATA:dstuser},%{DATA:app},%{DATA:vsys},%{DATA:from},%{DATA:to},%{DATA:inbound_if},%{DATA:outboundif}%{DATA:logset},%{DATA:unknown_time},%{NUMBER:sessionid},%{NUMBER:repeatcnt},%{NUMBER:sport},%{NUMBER:dport},%{NUMBER:natsport},

which matches

panorama.div.company.com 1,2024/12/02 15:18:02,024101003988,TRAFFIC,end,2817,2024/12/02 15:18:02,10.101.19.7,10.101.159.130,0.0.0.0,0.0.0.0,inside-in_46,msrpc-base,vdiv1,PROD.Internal,PROD.DIVNET,ethernet1/2,ethernet1/1,default,2024/12/02 15:18:02,655765,1,61636,49681,0,0,

The next part of the grok pattern, %{NUMBER:natdport}, fails to match the next part of the message, which is 0x401a. It seems the grok pattern is off by one? The text it’s trying to match appears to belong to the next grok pattern, %{DATA:flags}, which would correctly match 0x401a.

I’m not sure if this is the cause of your issue though.

For what it’s worth, Graylog Illuminate parses Palo Alto logs out of the box.

Thanks,
Drew

We don’t have a shorter message but we encounter the exact same issue with similar fields in the PANOS THREAT and SYSTEM logs (separate patterns/rules), with the same symptoms - if we add %{QUOTEDSTRING} to the grok pattern, all indexing ceases until it’s removed and restarted.

It’s implemented via a pipeline rule. In a previous stage, we determine and set the type of log (TRAFFIC|THREAT|SYSTEM|CONFIG|HIPMATCH|CORRELATION|ALERT|APP|AUTH|SAML|USERID) as palo_type and then in the next stage we use multiple rules to key on palo_type and apply the appropriate grok pattern using Extract grok to fields via the Rule Builder.
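For other readers, a trimmed-down sketch of one of those stage-two rules written by hand (the palo_type field is as described above; the rule name is made up and the pattern is truncated for brevity):

```
rule "PANOS parse TRAFFIC"
when
  has_field("palo_type") && to_string($message.palo_type) == "TRAFFIC"
then
  // Extract grok to fields, keeping only the named captures
  set_fields(
    grok(
      pattern: "%{WORD:logsrc}.div.company.com %{NUMBER:num},%{DATA:receive_time},%{NUMBER:serial},%{GREEDYDATA:rest}",
      value: to_string($message.message),
      only_named_captures: true
    )
  );
end
```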

There are similar quoted string fields in the THREAT and SYSTEM log types and they also produce the same symptoms whenever %{QUOTEDSTRING} is added to the grok pattern.

As for the pattern not matching, that’s strange - in Graylog it works (and aligns) via the Test with Sample Data button all the way through to risk_of_app field (just before the quoted string field characteristic_of_app). We keep the pattern truncated at that point and have been processing the logstream that way for weeks now.

Today we were able to successfully implement a grok pattern on the PANOS SYSTEM log type which also has a quoted string containing commas as a field called opaque. Instead of using %{QUOTEDSTRING:opaque}, we used “%{DATA:opaque}”, which works fine with no issues.

When we applied the same solution to the grok pattern of TRAFFIC logs using “%{DATA:characteristic_of_app}”, the seize problem returned until that part of the pattern was removed and the graylog-server service restarted. We even tried “%{DATA:char_of_app}”, in case the field name was too long, but same result.

One notable difference between these two log types is that in a five minute time-frame, Graylog will receive via input around five SYSTEM log messages and over 50,000 TRAFFIC log messages. It’s able to handle anything we’ve thrown at it with ease, including the majority of the TRAFFIC grok pattern, but it seizes whenever we try to expand the pattern to grok the characteristic_of_app field, regardless of method.

Today we expanded the grok pattern for the PANOS THREAT log type, which includes a quoted string field with commas named url_category_list which we attempted to grok as “%{DATA:url_category_list}”, and which worked when using the Test with Sample Data button. We normally input around 20,000 THREAT messages in a five-minute period.

Shortly after saving the grok pattern, Graylog seized in the same fashion - using the top command we noticed CPU utilization was around 600% until we removed url_category_list from the grok pattern and restarted graylog-server.

We also noticed after restarting graylog-server that on the Graylog console page when the output meter begins moving again, it will initially show an abnormally large output (thousands of messages) for about five or six frames (frame = update every two seconds) before resuming a normal level of output (hundreds of messages). It appears to be emptying out a buffer that built up during the seizure.

We left the THREAT grok pattern truncated just before the offending url_category_list field and the server is now happily puttering along at about 25-40% CPU utilization.

I don’t have a simple answer. The messages and patterns are complex, and I have no idea how your pipeline rules and stages are structured. Clearly grok matching is failing; I can’t tell yet whether there is a bug in the pipeline rule processor.

I think you need to start with the basics and work up from there:

  • The sample pattern does not match the sample message string. There are multiple discrepancies. Please use something like https://grokdebugger.com/ to ensure that your pattern matches the sample messages.
  • Make sure the string presented to grok matches your expectation, e.g. by outputting it with the debug function from within the rule, before calling grok.
  • Validate the pipeline by applying it to test messages that you manually submit before using it in the production system.
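For the second bullet, a throwaway rule along these lines can dump the exact string to the graylog-server log before any grokking happens (assuming your rules key on palo_type as described; the rule name is illustrative):

```
rule "debug grok input"
when
  has_field("palo_type")
then
  // Writes the raw message string to the server log for comparison
  debug(concat("grok input: ", to_string($message.message)));
end
```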

Meanwhile I discovered a related issue: Grok processor may fail silently due to stack overflow when processing large messages.
The log then contains something like this:

2024-07-12 09:35:25,618 WARN : org.glassfish.jersey.server.ServerRuntime$Responder - An exception mapping did not successfully produce and processed a response. Logging the exception propagated to the default exception mapper.
java.lang.StackOverflowError: null

GL obviously needs to do better at bubbling up this problem, but eliminating the limitation itself will be trickier.

If this is the root cause of your problem, you can mitigate it by splitting up grok extraction into several steps. Or switching to regex.
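A split along those lines might look like the following two rules in consecutive stages (rule names and the panos_rest scratch field are illustrative, and the patterns are truncated):

```
rule "PANOS TRAFFIC head"
when
  to_string($message.palo_type) == "TRAFFIC"
then
  // Stage 1: parse the fixed head, stash the remainder in one field
  set_fields(grok(
    pattern: "%{WORD:logsrc}.div.company.com %{NUMBER:num},%{DATA:receive_time},%{NUMBER:serial},%{WORD:type},%{WORD:subtype},%{GREEDYDATA:panos_rest}",
    value: to_string($message.message),
    only_named_captures: true
  ));
end

rule "PANOS TRAFFIC tail"
when
  has_field("panos_rest")
then
  // Stage 2 (a later stage): grok the remainder, then drop the scratch field
  set_fields(grok(
    pattern: "%{NUMBER:port},%{GREEDYDATA:panos_rest2}",
    value: to_string($message.panos_rest),
    only_named_captures: true
  ));
  remove_field("panos_rest");
end
```

Each individual grok call is then much shallower, which keeps it well clear of the stack-depth problem described above.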

Here is a sanitized version of what we’re currently using (up to risk_of_app) that we confirmed works via Grok Debugger:

Pattern:

%{WORD:logsrc}.div.company.com %{NUMBER:num},%{DATA:receive_time},%{NUMBER:serial},%{WORD:type},%{WORD:subtype},%{NUMBER:unknown_num},%{DATA:time_generated},%{DATA:src},%{DATA:dst},%{DATA:natsrc},%{DATA:natdst},%{DATA:rule},%{DATA:srcuser},%{DATA:dstuser},%{DATA:app},%{DATA:vsys},%{DATA:from},%{DATA:to},%{DATA:inbound_if},%{DATA:outboundif},%{DATA:logset},%{DATA:unknown_time},%{NUMBER:sessionid},%{NUMBER:repeatcnt},%{NUMBER:sport},%{NUMBER:dport},%{NUMBER:natsport},%{NUMBER:natdport},%{DATA:flags},%{DATA:proto},%{DATA:action},%{NUMBER:bytes},%{NUMBER:bytes_sent},%{NUMBER:bytes_received},%{NUMBER:packets},%{DATA:start},%{NUMBER:elapsed},%{DATA:category},%{DATA:unknown_field1},%{DATA:seqno},%{DATA:actionflags},%{DATA:srcloc},%{DATA:dstloc},%{DATA:unknown_field2},%{DATA:pkts_sent},%{DATA:pkts_received},%{DATA:session_end_reason},%{DATA:dg_hier_level_1},%{DATA:dg_hier_level_2},%{DATA:dg_hier_level_3},%{DATA:dg_hier_level_4},%{DATA:vsys_name},%{DATA:device_name},%{DATA:action_source},%{DATA:src_uuid},%{DATA:dst_uuid},%{DATA:tunnelid},%{DATA:monitortag},%{DATA:parent_session_id},%{DATA:parent_start_time},%{DATA:tunnel},%{NUMBER:assoc_id},%{NUMBER:chunks},%{NUMBER:chunks_sent},%{NUMBER:chunks_received},%{DATA:rule_uuid},%{DATA:http2_connection},%{DATA:link_change_count},%{DATA:policy_id},%{DATA:link_switches},%{DATA:sdwan_cluster},%{DATA:sdwan_device_type},%{DATA:sdwan_cluster_type},%{DATA:sdwan_site},%{DATA:dynusergroup_name},%{DATA:xff_ip},%{DATA:src_category},%{DATA:src_profile},%{DATA:src_model},%{DATA:src_vendor},%{DATA:src_osfamily},%{DATA:src_osversion},%{DATA:src_host},%{DATA:src_mac},%{DATA:dst_category},%{DATA:dst_profile},%{DATA:dst_model},%{DATA:dst_vendor},%{DATA:dst_osfamily},%{DATA:dst_osversion},%{DATA:dst_host},%{DATA:dst_mac},%{DATA:container_id},%{DATA:pod_namespace},%{DATA:pod_name},%{DATA:src_edl},%{DATA:dst_edl},%{DATA:hostid},%{DATA:serialnumber},%{DATA:src_dag},%{DATA:dst_dag},%{DATA:session_owner},%{DATA:high_res_timestamp},%{DATA:nssai_sst},%{DATA:nssai_sd},%{DATA:subcategory_of_app},%{DATA:category_of_app},%{DATA:technology_of_app},%{DATA:risk_of_app},“%{DATA:characteristic_of_app}”,%{DATA:container_of_app},%{DATA:tunneled_app},%{DATA:is_saas_of_app},%{DATA:sanctioned_state_of_app},%{DATA:offloaded},%{DATA:traffic_type},

Message:

panorama.div.company.com 1,2024/12/02 15:18:02,024680993766,TRAFFIC,end,2897,2024/12/02 15:18:02,10.88.151.8,10.99.101.17,0.0.0.0,0.0.0.0,inernal-in_37,msrpc-base,vdiv1,PROD.Internal,PROD.DIVNET,ethernet1/2,ethernet1/1,default,2024/12/02 15:18:02,655765,1,61636,49681,0,0,0x401a,tcp,allow,4896,3798,1098,16,2024/12/02 15:17:19,29,any,7432514093394733498,0x8000000000000000,10.0.0.0-10.255.255.255,10.0.0.0-10.255.255.255,9,7,tcp-rst-from-client,20,11,0,0,PA2.FL17,from-policy,0,0,N/A,0,0,0,0,1c563515-f2e7-227d-dc61-76f72a114777,0,0,2024-12-02T15:18:02.940-05:00,infrastructure,networking,network-protocol,2,“has-known-vulnerability,tunnel-other-application,pervasive-use”,msrpc,untunneled,no,no,0,NonProxyTraffic,

Same symptoms are happening: if we truncate the pattern after %{DATA:risk_of_app}, it works fine and groks the messages up to that point - but if we add “%{DATA:characteristic_of_app}”, or %{QUOTEDSTRING:characteristic_of_app}, Graylog will seize.

When the seizure happens there are no WARN or ERROR entries in the graylog-server log - the CPU usage spikes to around 400% and remains that high until the characteristic_of_app grok is removed and the graylog-server service is restarted.

I’m puzzled, because when I paste this into grokdebugger it still starts failing at natsport.

Please check your server log for the stack overflow exception I mentioned above (or other uncaught exceptions arising in pipeline processing). If that’s the main problem, then it’s not about the patterns as such, but the complexity of the data and pattern.

We searched the graylog-server logs, but found no stack overflow exception or any entry related to the seizures.

These were triple-checked and verified as working in Grok Debugger.

Pattern:
%{WORD:logsrc}.div.company.com %{NUMBER:num},%{DATA:receive_time},%{NUMBER:serial},%{WORD:type},%{WORD:subtype},%{NUMBER:unknown_num},%{DATA:time_generated},%{DATA:src},%{DATA:dst},%{DATA:natsrc},%{DATA:natdst},%{DATA:rule},%{DATA:srcuser},%{DATA:dstuser},%{DATA:app},%{DATA:vsys},%{DATA:from},%{DATA:to},%{DATA:inbound_if},%{DATA:outboundif},%{DATA:logset},%{DATA:unknown_time},%{NUMBER:sessionid},%{NUMBER:repeatcnt},%{NUMBER:sport},%{NUMBER:dport},%{NUMBER:natsport},%{NUMBER:natdport},%{DATA:flags},%{DATA:proto},%{DATA:action},%{NUMBER:bytes},%{NUMBER:bytes_sent},%{NUMBER:bytes_received},%{NUMBER:packets},%{DATA:start},%{NUMBER:elapsed},%{DATA:category},%{DATA:unknown_field1},%{DATA:seqno},%{DATA:actionflags},%{DATA:srcloc},%{DATA:dstloc},%{DATA:unknown_field2},%{DATA:pkts_sent},%{DATA:pkts_received},%{DATA:session_end_reason},%{DATA:dg_hier_level_1},%{DATA:dg_hier_level_2},%{DATA:dg_hier_level_3},%{DATA:dg_hier_level_4},%{DATA:vsys_name},%{DATA:device_name},%{DATA:action_source},%{DATA:src_uuid},%{DATA:dst_uuid},%{DATA:tunnelid},%{DATA:monitortag},%{DATA:parent_session_id},%{DATA:parent_start_time},%{DATA:tunnel},%{NUMBER:assoc_id},%{NUMBER:chunks},%{NUMBER:chunks_sent},%{NUMBER:chunks_received},%{DATA:rule_uuid},%{DATA:http2_connection},%{DATA:link_change_count},%{DATA:policy_id},%{DATA:link_switches},%{DATA:sdwan_cluster},%{DATA:sdwan_device_type},%{DATA:sdwan_cluster_type},%{DATA:sdwan_site},%{DATA:dynusergroup_name},%{DATA:xff_ip},%{DATA:src_category},%{DATA:src_profile},%{DATA:src_model},%{DATA:src_vendor},%{DATA:src_osfamily},%{DATA:src_osversion},%{DATA:src_host},%{DATA:src_mac},%{DATA:dst_category},%{DATA:dst_profile},%{DATA:dst_model},%{DATA:dst_vendor},%{DATA:dst_osfamily},%{DATA:dst_osversion},%{DATA:dst_host},%{DATA:dst_mac},%{DATA:container_id},%{DATA:pod_namespace},%{DATA:pod_name},%{DATA:src_edl},%{DATA:dst_edl},%{DATA:hostid},%{DATA:serialnumber},%{DATA:src_dag},%{DATA:dst_dag},%{DATA:session_owner},%{DATA:high_res_timestamp},%{DATA:nssai_sst},%{DATA:nssai_sd},%{DATA:subcategory_of_app},%{DATA:category_of_app},%{DATA:technology_of_app},%{DATA:risk_of_app},“%{DATA:characteristic_of_app}”,%{DATA:container_of_app},%{DATA:tunneled_app},%{DATA:is_saas_of_app},%{DATA:sanctioned_state_of_app},%{DATA:offloaded},%{DATA:traffic_type},

Sample:
panorama.div.company.com 1,2024/12/02 15:18:02,024680993766,TRAFFIC,end,2897,2024/12/02 15:18:02,10.88.151.8,10.99.101.17,0.0.0.0,0.0.0.0,inernal-in_37,msrpc-base,vdiv1,PROD.Internal,PROD.DIVNET,ethernet1/2,ethernet1/1,default,2024/12/02 15:18:02,655765,1,61636,49681,0,0,0x401a,tcp,allow,4896,3798,1098,16,2024/12/02 15:17:19,29,any,7432514093394733498,0x8000000000000000,10.0.0.0-10.255.255.255,10.0.0.0-10.255.255.255,9,7,tcp-rst-from-client,20,11,0,0,PA2.FL13,from-policy,0,0,N/A,0,0,0,0,1c563515-f2e7-227d-dc61-76f72a114777,0,0,2024-12-02T15:18:02.940-05:00,infrastructure,networking,network-protocol,2,“has-known-vulnerability,tunnel-other-application,pervasive-use”,msrpc,untunneled,no,no,0,NonProxyTraffic,

The problem appears to be in how the community forum software formats the text being pasted into it:

We noticed that if we test the text in Grok Debugger by copying from the left side entry pane, it will work successfully - but if we test by copying from the right side preview pane, it fails at natsport.

Good catch - the regular double-quotes got converted to left and right double-quotes.

However, I am still unable to get beyond natsport. The NUMBER pattern consumes the leading 0 of 0x401a, so the following patterns fail to match. Drew already pointed this out earlier: Use of ${QUOTEDSTRING} causes Graylog to seize - Part II - #5 by drewmiranda-gl
The data simply does not match the pattern.

Your screenshot of grokdebugger cuts off the interesting fields. Do you actually see natdport and subsequent fields being assigned values on the right side?

Do you actually see natdport and subsequent fields being assigned values on the right side?

Yes - if we use our original sources for the pattern and sample (normally kept in a text file via Notepad++), the entire pattern groks successfully, completely, and accurately in both Graylog Test with Sample Data and in Grok Debugger. If we copy the pattern/sample text from the community forum page, it fails on Grok Debugger after natsport.

Is there perhaps a way to send/upload a text file?

You could file an issue in our public github repo GitHub - Graylog2/graylog2-server: Free and open log management

I notice that the results I see diverge somewhere before outboundif, because that is being assigned the timestamp that should be matched with unknown_time.

{
  "logsrc": "panorama",
  "num": 1,
  "receive_time": "2024/12/02 15:18:02",
  "serial": 24680993766,
  "type": "TRAFFIC",
  "subtype": "end",
  "unknown_num": 2897,
  "time_generated": "2024/12/02 15:18:02",
  "src": "10.88.151.8",
  "dst": "10.99.101.17",
  "natsrc": "0.0.0.0",
  "natdst": "0.0.0.0",
  "rule": "inernal-in_37",
  "srcuser": "msrpc-base",
  "dstuser": "vdiv1",
  "app": "PROD.Internal",
  "vsys": "PROD.DIVNET",
  "from": "ethernet1/2",
  "to": "ethernet1/1",
  "inbound_if": "default",
  "outboundif": "2024/12/02 15:18:02",
  "logset": 655765,
  "unknown_time": 1,
  "sessionid": 61636,
  "repeatcnt": 49681,
  "sport": 0,
  "dport": 0,
  "natsport": 0
}

You could file an issue in our public github repo GitHub - Graylog2/graylog2-server: Free and open log management

We will look into the feasibility of doing this.

I notice that the results I see diverge somewhere before outboundif, because that is being assigned the timestamp that should be matched with unknown_time.

Don’t know what else to tell ya there. We only get those results when copying the pattern/sample text from the community message board posts here. When we use our own local copy of the pattern/sample text, it works fine - both in Graylog and on Grok Debugger.

We’ve been successfully/accurately parsing over 10,000 logs per minute with the truncated version of it for weeks now. The trouble only starts when we try to grok a quoted string, and even that works on the SYSTEM logs which only come in once per minute.

It seems like Drew’s idea about the quoted string grok causing massive CPU utilization (due to backtracking or some other issue) is the prevailing theory. When Graylog seizes, we notice via the top command that CPU utilization jumps to over 400%. Then when we remove the grok and restart the service, it appears to empty out a process buffer that built up during the seizure, then returns to a normal level of CPU utilization (around 25-40%).

We just don’t know how to grok the quoted strings from these logs at this volume without the seizures happening.

We use io.krakens.grok.api. QUOTEDSTRING is defined there as

(?>(?<!\\)(?>"(?>\\.|[^\\"]+)+"|""|(?>'(?>\\.|[^\\']+)+')|''|(?>`(?>\\.|[^\\`]+)+`)|``))

Maybe you can use a simpler pattern to match the specific strings you encounter in your TRAFFIC logs.
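For example, since the quoted values in these PANOS logs never seem to contain embedded quote characters, a plain negated character class matches them without any of QUOTEDSTRING’s nested alternation, so there is nothing for the engine to backtrack over. A sketch (the PANQS name and the named group are illustrative, and the sample line is abbreviated):

```python
import re

# Hypothetical replacement for %{QUOTEDSTRING}: "([^"]*)" has no alternation
# and no lazy wildcard, so a failed match cannot trigger heavy backtracking.
PANQS = re.compile(r'"(?P<characteristic_of_app>[^"]*)"')

line = ',2,"has-known-vulnerability,tunnel-other-application,pervasive-use",msrpc,'
match = PANQS.search(line)
print(match.group("characteristic_of_app"))
# -> has-known-vulnerability,tunnel-other-application,pervasive-use
```

The same class should also work inline in the grok pattern, e.g. as a named capture like "(?<characteristic_of_app>[^"]*)", provided the straight-quote characters survive the copy/paste into Graylog.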