Syslog -> input -> stream -> pipeline: JSON sources appear to be problematic

Specifics:

  • OS = Debian 12
  • Graylog = 7.0.3
  • OpenSearch = 2.19.4
  • MongoDB = 8.0.18

Flow:

zeek (JSON output) → syslog-ng → input → stream → pipeline → rule stages

Notes:

  • There is only one input extractor, which uses a regex replace to add a field "is_json" with the value "json" so that the pipeline rule knows to process the message.
  • It appears that once the JSON conversion is done (tried both an input extractor and a pipeline rule), all of the JSON-converted fields are non-modifiable. For example, in a pipeline rule, remove_single_field("<field_name>"); does NOT work: the field is not removed, and there is no error in the Graylog log.
  • The key requirement is that four values need to be extracted into fields for the bulk of the processing: source IP, source port, destination IP, and destination port.
  • The structure in $message.message is “ {}”, which should be very easy to convert as it is purely flat key/value pairs (a hypothetical example of such a payload follows below).
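
For illustration only, a flat payload of that shape might look something like the following (these field names and values are made up, not taken from the actual Zeek output):

{"ts":"2024-01-01T00:00:00Z","src_ip":"10.0.0.5","src_port":51234,"dst_ip":"192.0.2.7","dst_port":443,"proto":"tcp"}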

Performing the JSON processing (from message to fields) in the extractor phase or as a pipeline rule appears to make no difference with respect to this issue. Combinations tried:

  • JSON conv in extractor, add fields in extractor
  • JSON conv in extractor, add fields in pipeline
  • JSON conv in pipeline, add fields in pipeline

The conversion to JSON itself obviously works: deleting the extractor or removing the pipeline rule leaves only the “ ” in the message field and no JSON fields. Enable either the extractor or the pipeline rule and all the fields appear on every message, but you cannot do ANYTHING with those fields. Have tried running various tests (is_null, is_not_null, is_string, etc.) just to see what the backend "thinks" about them: is_string and is_not_null come back as true, is_null returns false. Those results are what one would expect, but the data itself cannot actually be read. You can see the field and its value when viewing messages in the stream, but when you run debug(concat("<fieldname>: ", to_string($message.<fieldname>))); in a later stage rule (to confirm that the conversion was done in a prior stage), you get NOTHING after the ":" in the server log. As a result, all of the JSON fields are also non-searchable via the search/filter in the stream view. It is as though they are present and nothing more, which means there is very little value in terms of analytics.
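
For reference, that later-stage check amounts to something like the following sketch (the rule name and the "src_ip" field name are placeholders, not the actual names in use):

rule "debug converted field"
when
  has_field("is_json")
then
  // Write the field value to the Graylog server log for inspection.
  debug(concat("src_ip: ", to_string($message.src_ip)));
end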

Interestingly, the "is_json" field mentioned earlier behaves normally: its value can be read and the field can be deleted without issue (whereas neither operation works on any field created through any means of conversion from JSON to fields). "is_json" is a means of identifying which messages need to be parsed, and it gets removed (cleanup) in a later stage rule. The premise was to minimize extraneous data being stored, and it has also become validation that the later rule is firing: the "is_json" field is no longer present once remove_single_field("is_json") runs in that later stage rule.
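
That cleanup is essentially the following minimal sketch (the rule name is a placeholder):

rule "cleanup is_json marker"
when
  has_field("is_json")
then
  // Drop the marker field once the message has been processed.
  remove_single_field("is_json");
end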

At this juncture, not sure what to think of it, and after spending a good number of hours trying different things to analyze and get it working, some fresh eyes/ideas on this are needed.

Conversion to JSON pipeline rule contents (based on a posting from these forums):

rule "JSON Parser"
when
  has_field("is_json") AND
  to_string($message.is_json) == "json"
then
  // Drop the leading non-space token (and the space after it) so only the JSON object remains.
  let prepjson = regex_replace("^\\S* ", to_string($message.message), "");
  // Parse the JSON and promote every key/value pair to a message field.
  let the_json = parse_json(to_string(prepjson));
  let the_map = to_map(the_json);
  set_fields(the_map);
end

Have also tried the above without the second to_string on the "let the_json" line, just in case something was having heartburn with to_string→to_string. (It made no difference.)

The temporary work-around was to revert everything to being extractor-based and create regex extractors to begin working with a small fraction of the data. The problem is the significant variance in the JSON structure, as it is 10-20 different logs from the one source. The goal was to be able to analyze certain types of events through "_exists_:" + ":" statements to limit which items are retrieved, knowing which field is present for a given type of element.
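
For illustration, with hypothetical field names, such a search would look something like:

_exists_:dst_port AND dst_port:443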

Any ideas would be greatly appreciated.

Thanks!

Try using the flatten_json function in your rule and see what happens.
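
A minimal sketch of that suggestion, adapted from the rule posted above (the "json" array-handling mode and the rule name are assumptions):

rule "JSON Parser (flattened)"
when
  has_field("is_json") AND
  to_string($message.is_json) == "json"
then
  let prepjson = regex_replace("^\\S* ", to_string($message.message), "");
  // flatten_json collapses any nested objects to a single level before the
  // key/value pairs are promoted to message fields.
  let the_json = flatten_json(to_string(prepjson), "json");
  set_fields(to_map(the_json));
end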

My guess off the top of my head is that, because of how you are doing it, the field names are ending up with dots in them. That's not valid for OpenSearch to store, so it will convert the dots to underscores when it stores the message. This means that, yes, what the pipeline sees and what you see in Graylog afterwards can be different, which causes issues exactly like this.

You can also do a set_field whose value is to_string(to_map(<the parsed JSON>)), and that will let you see the raw field names it is using at the time of the rule.
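
That debug trick would look roughly like this as a stand-alone rule (the "raw_field_names" target field is made up for illustration):

rule "show raw JSON field names"
when
  has_field("is_json") AND
  to_string($message.is_json) == "json"
then
  let prepjson = regex_replace("^\\S* ", to_string($message.message), "");
  let the_json = parse_json(to_string(prepjson));
  // Store the map's string form in a scratch field so the raw key names
  // (dots included) are visible on the stored message.
  set_field("raw_field_names", to_string(to_map(the_json)));
end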

That appears to have solved it!!! A HUGE Thank you!

The source JSON was structured as “.” for some of the data fields coming across. Would not have expected a period to be an issue like that. It would be really nice if Graylog were a little more verbose about bad characters, as opposed to just accepting the data and concealing the change. E.g.: what was seen in the GUI was "id_name_ext", while the source was "id.name_ext". While one can certainly appreciate the translation effect, there wasn't anything that gave a clue that the condition was present.