And again, Graylog does not output anything

As per previous messages here, I just got done trying to deal with 180+ million messages being stuck in the journal, with 0 messages being read or processed. Elasticsearch is fine (i.e. no read-only indices, plenty of disk space).

When requesting a processbuffer dump, there are still idle process buffers present, so it doesn't seem to be getting stuck there; a thread dump shows all output buffer processors in a waiting state (with a full output buffer), as in the following excerpt:

"outputbufferprocessor-14" id=52 state=WAITING
    - waiting on <0x39071caf> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x39071caf> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at com.lmax.disruptor.BlockingWaitStrategy.waitFor(BlockingWaitStrategy.java:47)
    at com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
    at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:148)
    at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66)
    at java.lang.Thread.run(Thread.java:748)

This is the 3rd or 4th time in a few months this has happened, we’ve lost a fair amount of logs in the process, and the higher-ups are starting to make “maybe it’s time to swap this out for something else” noises. I’ve checked the metrics and they don’t show anything untoward. I’ve been over our Elastic cluster with a fine-toothed comb and it’s working fine. I’ve been through all variations of disabling pipelines, extractors, and so on - but this problem keeps occurring in some form or other.

I’m sort of at my wits’ end here because I can’t figure out why on earth all outputbuffer processors would be in a waiting state unless there’s a deadlock occurring somewhere.

Still worth specifically asking whether Elastic faulted to read-only, because that’s the usual cause. What versions of things are you running? Nothing interesting in the Elasticsearch log that we can blankly stare at? :smiley:

Have to run - will think of more Q’s on ride home.

@benvanstaveren

I just had this happen last week. I was screwing around with some pipelines and extractors, and the next thing I noticed was my journal filling up quickly, within an hour. From what I can remember I had a bad log from my screwing around and had to dump my journal. While all that was going on my output buffer was in a waiting state, and nothing was shown on the dashboard during that time.

First thing I checked :smiley: It’s not in read-only mode, because there is the occasional little spurt of data that does get stored, and our other tooling has no issues. A check of the cluster state also shows that nothing is set to read-only.

Huh, if the journal got corrupted due to mucking with pipelines and extractors that’s a big fat uh-oh; if it’s because there are log entries that lock up the pipeline, that’s also a bit of an uh-oh in that there is no mechanism to force it to “fuhgeddaboutit” and just store the damn log entry (or drop it, with appropriately loud complaining in the logs).

Hey @benvanstaveren

You’ve probably heard this before, but do you have extractors or pipelines by chance? If so, did you check them? That’s the only thing I can think of for why those output buffers are in a waiting state and the journal is full.

That is interesting… jammed up somewhere so it’s slowed to a crawl. Is there anything unique about what is getting through, like just one stream of data, or in the timing of how it gets through? Do you have failure processing turned on (gl2_processing_error in the message fields)? BTW - the process thread dump has a bunch of the same WAITING states as you initially posted… not sure if that is an error or just an odd standard for their states… What versions are you on, for all the things? Not that that will help much… perhaps with searching for known bugs… I seem to recall you are clustered? Does it make any difference dropping a node out of the cluster (if so)? Reaching for straws and trying to jigger anything new into troubleshooting.

Hello @benvanstaveren

I’ve been going over your other posts, where I think the same issue was occurring. TBH I’m out of ideas.

@H2Cyber @ihe @tfpk @cawfehman @patrickmann @jan @aaronsachs Hey guys, sorry to bother you all, but this situation is kind of odd; from the links below you can follow what has been going on with @benvanstaveren’s setup over this past year. I’m kind of out of ideas, and was wondering if any of you have a suggestion on this?

I do, but they’re very simple; 4 extractors on the input: 1 tries to parse a field as JSON, 1 just regexes the “file” field (on a Beats input) to extract a fragment, and 2 take that fragment and go to a lookup table that uses an HTTP adapter to resolve values. The caches on the lookup table show a 99.99% hit rate, and the service that provides the responses is only doing about 2 to 3 requests per second.

However it seems it may have been a pipeline issue - we have 2 pipelines attached to the same stream. For legacy reasons most of our logs come in as JSON-formatted messages that get picked up by the extractor (it attempts to parse the message field), and some logs are still plain text. I can check whether or not the original message was JSON by looking for 2 fields that only appear in the JSON-based logs, so that is rule #1 in both pipelines (one for “yes it was JSON”, one that decides “okay it isn’t JSON”). The latter one (plain text) uses a Grok pattern to attempt to parse data. This pattern is fully anchored (^ and $), and has worked reliably for… oh… at least 4 years.

When I disconnected that particular pipeline from the stream, things started moving along again; the backlogged messages were all stored relatively quickly after that and it’s keeping up with whatever comes in now. This isn’t ideal since we do use some of the log data to aggregate numbers for the beancounters, but it is what it is. It still doesn’t explain why it’s misbehaving after this much time of working fine; it could also be the messages themselves that come in, but…

Unfortunately, debugging a pipeline and its rules to see where things go tits up isn’t easy: in the metrics I can only see the rule ID, not the name associated with it, and as far as I can tell there’s no metric that shows a cumulative/average time-spent-on-rule value so you can narrow down which rule is taking more time than it should. Counts for msg/sec are fine, but due to the pipeline chain skipping certain rules under certain conditions it’s a bit of a wild goose chase.

I would like to add that it would be brilliant if we could configure some sort of high-water mark for things like time spent on any single rule - as in, if a rule takes more than 500ms to complete, stop the pipeline and store the message with a gl2_rule_timeout field or something indicating which rule it was and how much time it took. IMO it’s better to have semi-processed logs stored than to bring things to a grinding halt :smiley: Maybe even a pipeline-wide option so I can say “if the total time spent by a message in a pipeline is >2s, do a similar thing”, perhaps even adding a field with all the rule timings so it’s easier to see what’s up. Right now rules and pipelines are a bit too black-box for my liking :wink:

See my reply to @gsmith above - it seems to be pipeline-related, but for the sake of completeness: the messages that do make it through are from multiple streams; it seems occasionally a set of process buffers will get flushed and it’s a mixed bag. Most of our logs come from a single set of applications so it’s not super easy to see.

And yes we are clustered, 3 nodes. Dropping one out makes no difference, I’ve jiggled them all in various combinations - what does happen is that if you restart a node, it will come back up and will gleefully start outputting at a high rate, and then it just slumps down and eventually output is at 0 again. No ES errors at that point, no failed indexing attempts, just… pfft, zip :slight_smile:

I am running a rather old Graylog version (3.3.2 off the top of my head) due to a custom plugin we use that hasn’t gotten ported to 4.x yet, with ES 6.7 for storage. We run on OpenJDK 1.8.0_292, which may be a little outdated but, again, has worked fine in the past. Also, the hardware is AMD EPYC-based; not entirely sure that’d make a difference but it might.

I also tried changing the various processor wait strategies from yielding to blocking (and back), but that didn’t seem to make a difference at the time.

Actually that is an awesome idea.

Occasionally I have one :smiley: I can go put it in as a feature request on Github if y’all like?

@benvanstaveren

Yes please :slight_smile:

I am thinking that perhaps sticking a debug() function in the rule might give you an inkling of the timing and even what the last thing it was working on was when it locks up. You have a high rate, so it may be a little hard to work with all of that pushing through the logs. An idea anyway.
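
For anyone following along, a rough sketch of what that could look like - the rule name and message text here are made up, and debug() output lands in the Graylog server log, not on the message itself:

rule "debug probe (sketch)"
when
    has_field("message")
then
    # writes a line to the server log so you can see what the pipeline was chewing on, and when
    debug(concat("pipeline probe saw: ", to_string($message.message)));
end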

If it is possible, you could post the rules and GROKs for review… just in case.

@gsmith raised all the flags and I raised a few too to see if we get more eyes on this

v3.3.2 tho… :crazy_face:

I know, right, about the 3.3.2 - I’ve got to upgrade at some point, but it’s unfortunately become such a critical part of our business logic that it needs careful planning and execution - I can do the latter, suck at the former.

debug() isn’t an option; we’re pushing well over 2k messages per second through that pipeline normally, and spread across 3 nodes it’s an absolute pain in the yannowhat to work with :slight_smile: Anyway, I’ve put the pipeline rules for the potentially troubled pipeline down below; come to think of it, after seeing them again there are a few potential problem points… but for completeness’ sake, here they are:

Pipeline stage 0; 1 rule:

rule "check if not json"
when
  # these fields only appear when the JSON extractor on the input worked
  has_field("short_message") == false && has_field("full_message") == false
then
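  # intentionally empty: this rule just needs to match so non-JSON messages continue on to the later stages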
end

Stage 1: 2 rules, match any

This rule is the one that can blow up because it’s not anchored fully; it’s a hold-over from some legacy log solutions that added a JSON-encoded block to the end of a plaintext line:

rule "extract data from message (GROK_META)"
when
    regex("^\\{.*\\}$", to_string($message.message)).matches == false &&
    regex(".*\\{.*?\\}$", to_string($message.message)).matches == true
then
    let fields = grok("^\\[%{WORD:xxx_kernel_id}\\]\\[%{WORD:xxx_yyy_id}\\]\\[%{WORD:xxx_log_level}\\]\\[%{XXXTIMESTAMP:xxx_timestamp}\\]\\[%{XXXJAVACLASS:xxx_java_class}\\] %{GREEDYDATA:xxx_message} %{JSON:xxx_metadata_raw}$", to_string($message.message), true);
    set_fields(fields);
    set_field("message", to_string($message.xxx_message));
    remove_field("xxx_message");
    set_field("xxx_parsed_as", "GROK_META");
end

Patterns in this rule:
XXXJAVACLASS = .?
XXXTIMESTAMP = %{MONTHDAY}/%{MONTHNUM}/%{YEAR}_%{HOUR}:%{MINUTE}:%{SECOND}:%{NUMBER}
JSON = {.?}

And this rule basically duplicates the above but should only trigger when that encoded JSON block isn’t present. (I realise it could be done with a single rule, since that would just add a blank field for the missing block, but legacy stuff requires the field to be filled only when present and to have the xxx_parsed_as field set to an appropriate value, and rules have no flow control, so… this was the solution.)

rule "extract data from message (GROK_NO_META)"
when
    regex("^\\{.*\\}$", to_string($message.message)).matches == false &&
    regex(".*\\{.*?\\}$", to_string($message.message)).matches == false
then
    let fields = grok("\\[%{WORD:xxx_kernel_id}\\]\\[%{WORD:xxx_yyy_id}\\]\\[%{WORD:xxx_log_level}\\]\\[%{XXXTIMESTAMP:xxx_timestamp}\\]\\[%{XXXJAVACLASS:xxx_java_class}\\] %{GREEDYDATA:xxx_message}", to_string($message.message), true);
    set_field("xxx_parsed_as", "GROK_NO_META");
    # rest identical to rule above
end

Stage 2: 10 rules, match any

rule "fix warn/warning log level"
when
    has_field("xxx_log_level") &&
    contains(to_string($message.xxx_log_level), "WARNING")
then
    set_field("xxx_log_level", "WARN");
end
rule "convert timestamp"
when
    has_field("xxx_timestamp")
then
    let ts = parse_date(value: to_string($message.xxx_timestamp), pattern: "dd/MM/YYYY_HH:mm:ss:SSS");
    set_field("timestamp", ts);
    remove_field("xxx_timestamp");
end
rule "check for action: XXX"
when
    has_field("message") &&
    contains(to_string($message.message), "WS_ReactiveHavetoRedactThis", true) &&
    contains(to_string($message.message), "REST") && 
    regex("(PUT|POST)", to_string($message.message)).matches == true &&
    regex("XXX", to_string($message.message)).matches == true
then
    set_field("is_action", true);
    set_field("action_type", "XXX");
end

The above rule is duplicated twice, just with different values for the XXX part.

rule "check sensor action"
when
    has_field("message") && contains(to_string($message.message), "sensor action", true) && contains(to_string($message.message), "EVENT", true)
then
    set_field("sensor_action", true);
end
rule "check for agenda switch"
when
    has_field("message") && contains(to_string($message.message), "switching agenda", true)
then
    set_field("agenda_switch", true);
end
rule "lookup brand"
when
    has_field("xxx_yyy_id")
then
    let result = lookup_value("xxx-yyy-brand", to_string($message.xxx_yyy_id), "redacteddefault");
    set_field("xxx_brand", result);
end

And the last one:

rule "parse legacy json block"
when
    has_field("xxx_metadata_raw")
then
    let json = parse_json(to_string($message.xxx_metadata_raw));
    let json_fields = select_jsonpath(json, { 
        x_logged_user: "$.X-Logged-User",
        x_client_id: "$.X-Client-Id",
        x_client_version: "$.X-Client-Version",
        x_brand_id: "$.X-Brand-Id",
        x_application_id: "$.X-Application-Id"
    });
    set_fields(json_fields);
end

This is the only pipeline that uses grok patterns, and generally catches about 40% of our total traffic.

Done, filed as Add configurable options for pipeline and pipeline rule timeouts · Issue #13583 · Graylog2/graylog2-server · GitHub

Still looking at this… can we assume the message coming in with the plaintext is a single line, at least until it hits the JSON? I don’t think the GROK is problematic… but you have JSON = {.?} - is that right, or did you mean JSON = {.*?} (EFFIN forum formatting)? Being greedy without being lazy is less expensive there - maybe parse it out as JSON rather than a GROK? See below…

I have been focusing on the WHEN parts with regex. I am thinking we can break out the plaintext from the JSON much more efficiently with the below. Letting things fail over to “default” can be powerful…

   //Grab first found group - assumes all text before json is a single line (\A)
   let the_text          = regex("\\A(.*)\\{", to_string($message.message))["0"];
   //Subtract the the_text to have the JSON
   let the_json          = replace(to_string($message.message),to_string(the_text), "" );
   
   // if you find anything in the_json, make the whole thing to be GROK_META otherwise let it be NULL
   let json_exist = regex_replace(".+",to_string(the_json),"GROK_META");
   
   // set the field to json_exist (GROK_META) unless it's null, then use "GROK_NO_META"
   set_field("xxx_parsed_as", json_exist, "GROK_NO_META");

SIDE-NOTE: I had a lot of house animal interruptions while trying to convolute the META/NO_META stuff so … it may not work. :crazy_face:

Not sure if you need more verification of the message coming in (verifying the plaintext or JSON parts).
From there you can GROK or parse_json(), whichever is more efficient (I haven’t researched that); my guess is JSON parsing, since you are already doing that towards the end…
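
Something like this, maybe - purely an untested sketch that builds on the the_text/the_json idea above, reuses your existing WHEN regexes, and borrows a couple of the select_jsonpath keys from your “parse legacy json block” rule; the plaintext prefix would still need its GROK, just without the %{JSON} bit:

rule "parse trailing json block (sketch)"
when
    regex("^\\{.*\\}$", to_string($message.message)).matches == false &&
    regex(".*\\{.*?\\}$", to_string($message.message)).matches == true
then
    // strip everything before the trailing {...} block, then hand it straight to the JSON parser
    let the_text = regex("\\A(.*)\\{", to_string($message.message))["0"];
    let the_json = replace(to_string($message.message), to_string(the_text), "");
    let json = parse_json(to_string(the_json));
    let json_fields = select_jsonpath(json, {
        x_logged_user: "$.X-Logged-User",
        x_client_id: "$.X-Client-Id"
    });
    set_fields(json_fields);
    set_field("xxx_parsed_as", "GROK_META");
end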

With 2k messages per second, every efficiency counts!! (and yeah, debug() would be super messy). I am trying to avoid GREEDYDATA and multiple greedies (".*\\{.*?\\}$"), even if they are lazy or locked to the start or end… You could dump things into fields and clean them up in the next stage, but even that feels expensive.
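
If your version has the starts_with()/ends_with() string functions (an assumption - worth checking against the pipeline functions documentation for 3.3.2), the two WHEN regexes could probably be replaced with plain string checks; a rough, untested equivalent:

rule "extract data from message (GROK_META) - cheaper WHEN (sketch)"
when
    // roughly the same as the two regexes: contains a "{" and ends with "}",
    // but is not itself a bare JSON object starting with "{"
    contains(to_string($message.message), "{") &&
    ends_with(to_string($message.message), "}") &&
    starts_with(to_string($message.message), "{") == false
then
    // the THEN block from the original GROK_META rule would go here unchanged
    set_field("xxx_parsed_as", "GROK_META");
end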

More thoughts as I look through:

small one - can you shorten up your warning cleanup?

rule "fix warn/warning log level"
when
    to_string($message.xxx_log_level) == "WARNING"
then
    set_field("xxx_log_level", "WARN");
end

I wonder if contains or regex is more efficient in “check for action” - only because they are mixed. This is where your idea of watching rule time would turn up some interesting data.
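
For what it’s worth, a contains-only variant of that WHEN clause might look roughly like this (untested, same caveat as above; whether it’s actually faster is exactly what per-rule timing would tell you):

rule "check for action: XXX (contains-only sketch)"
when
    has_field("message") &&
    contains(to_string($message.message), "WS_ReactiveHavetoRedactThis", true) &&
    contains(to_string($message.message), "REST") &&
    (contains(to_string($message.message), "PUT") || contains(to_string($message.message), "POST")) &&
    contains(to_string($message.message), "XXX")
then
    set_field("is_action", true);
    set_field("action_type", "XXX");
end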

LAST CAVEAT - this is all conjecture; I was not able to test it…

Hey everyone,

jumping right in here to drop my 2 cents. :wink:

When reading this thread, the first thing that comes to my mind is the focus on pipelines and functions, when the original errors happened at the output queue. From my experience, I would expect the process buffers to fill up when there is a problem in message processing, not the output buffers (although I’ve seen both fill up when there was a backend error back-pressuring the processing queue).

@benvanstaveren, would it make sense to have another, closer look at the ES side? From reading here I already learned you run 3 Graylog nodes; what does the backend side look like (how many nodes, what storage devices)? How is your index configured? And, most importantly, do you keep an eye on node performance? ES can be quite a bottleneck when indexing, cluster syncing and search query load all come together. Under high load you would still see a green cluster state even though ES isn’t performing at all. Do you have all your messages in a single index set?

18 nodes, consisting of 3 masters, 3 coordinating nodes, 4 hot nodes and 8 warm nodes - the 4 hot nodes all have SSD drives and run on AMD Ryzen 7 CPUs - they can actually handle more throughput than required. They only contain “today’s” indices (an external tool ensures they get moved at the appropriate time). The other 8 nodes serve today+1…n searches (the majority of our search traffic). Indices have 1 primary and 1 replica shard when they’re hot, and 1 primary and 2 replica shards when they’re warm.

And things are split into an index set per stream; we have about 20 streams.

Thanks for the intel, you’ve got a pretty decent setup there. :slight_smile: Sorry for bringing up the question, just wanted to clear things up a bit.

So if it does seem to actually be a processing pipeline problem, maybe a closer look at the “Rule Metrics” would be helpful. I know it’s an awful step to have to extract the needed IDs from the metric names (sic!), but you can get rule runtimes for the evaluation and execution of the rules. Maybe that could help identify the problem source?