Failing JSON Extractor

mulgurul · December 7, 2018, 12:24pm

Hi

I’m quite new to Graylog configuration. I’m trying to set up a Filebeat input where the log files contain single-line JSON events, like this:

```
{ "time": "2018-12-07 12:56:12.2948", "level": "INFO", "message": "This is log event number 93 generated randomly in batch 374fa6d5-253a-4111-b274-f780a4255d47", "CorrelationID": "36835d49-c68e-4bb8-91e2-28954217e0a2", "SessionID": "6869282", "Component": "APP", "ComponentVersion": "1.5.0.0", "Action": "SEARCH", "Method": "SeachBook", "MemberID": "10820048", "TimeSpent": "" }
```

The sidecar is working. When I submit logs, I can see from the in/out metric counter that Graylog receives them, and they appear in the input stream when searching, as long as I don’t add a JSON extractor to the input.

I understand that I need to add a JSON extractor on the message field, but when I do, no messages get through.

The extractor seems to parse fine using the “Try” button.

I have looked at some of the topics on this. There are some restrictions on key name characters, but that does not seem to be my problem. Then there’s date parsing: I don’t know whether my time field is parsed as a date object and whether that is the problem.

Can anyone please help me troubleshoot this, or maybe even point me to a solution?
Thanks a lot.

And great product by the way:-)

benvanstaveren · December 7, 2018, 1:16pm

Well, first, don’t select “Flatten” - that just tries to stuff it all into a single field with a weird format; so uncheck that. Then there’s the issue that it may not want to work after all due to the JSON object also containing a field named “message”, and I’m not sure how that plays along with Graylog JSON extractor (especially in copy mode).

An alternative option is to do this in a pipeline, a bit more work but if you create a rule as follows:

```
rule "parse the json log entries"
when
  true
then
  let json_tree = parse_json(to_string($message.message));
  let json_fields = select_jsonpath(json_tree, {
    time: "$.time",
    level: "$.level",
    message: "$.message",
    CorrelationID: "$.CorrelationID",
    SessionID: "$.SessionID",
    Component: "$.Component",
    ComponentVersion: "$.ComponentVersion",
    Action: " $.Action",
    Method: "$.Method",
    MemberID: "$.MemberID",
    TimeSpent: "$.TimeSpent"
  });

  set_field("level", to_string(json_fields.level));
  set_field("message", to_string(json_fields.message));
  set_field("CorrelationID", to_string(json_fields.CorrelationID));
  # etc. etc. etc.

  set_field(timestamp, flex_parse_date(json_fields.time));
  remove_field("time");
end
```

That will do the trick too. Instead of setting each field individually you can also call set_fields(json_fields), but doing it individually means you can do the proper to_string/to_bool/to_long/to_double typecasting where it’s required. The rule also grabs the “time” field from your message and replaces Graylog’s timestamp with the “proper” time at which the event occurred, since Graylog will not do that automatically for you.

(Also, if you use a JSON extractor, you would still need a second extractor to move the time field into the timestamp field.)
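
A minimal sketch of the set_fields variant mentioned above, assuming the field names from the sample event. Only a few keys are selected to keep it short, and the rule name is made up for illustration:

```
rule "parse the json log entries (set_fields variant)"
when
  true
then
  let json_tree = parse_json(to_string($message.message));
  let json_fields = select_jsonpath(json_tree, {
    level: "$.level",
    CorrelationID: "$.CorrelationID",
    SessionID: "$.SessionID",
    MemberID: "$.MemberID"
  });

  // Copy every selected key into a message field in one call.
  set_fields(json_fields);

  // Override individual fields afterwards where a specific type is wanted.
  set_field("MemberID", to_long(json_fields.MemberID));
end
```

The trade-off is the one described above: set_fields is shorter, while per-field set_field calls keep control over each field's type.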

mulgurul · December 10, 2018, 12:19pm

Hi Ben
First of all, a huge thanks for helping out with my issue, and for even providing a complete example of a pipeline rule. Much appreciated…

I tried to submit the pipeline rule that you provided, but as a newbie I stumbled upon some problems.

There were a few syntax errors. Two I could fix, but this one I’m not sure how to fix:

set_field(timestamp, flex_parse_date(json_fields.time));

I tried making timestamp a string constant:

set_field("timestamp", flex_parse_date(json_fields.time));

But then I got the error: Expected type “string” for argument value but found object value in call to flex_parse_date etc… Can you give me a hint on how to fix this one?

Commenting out that line and testing with a raw JSON string in message, I get unchanged messages. I can’t figure out what I’m doing wrong.

Hope you can guide me here, thanks. //Peter

benvanstaveren · December 10, 2018, 10:46pm

I think I made a typo somewhere, you’re right when you do a set_field("timestamp", ...) - not sure about the second error, I may have mistyped the name of the function (flex_parse_date), not sure if it’s actually called that…it’ll be in the documentation though. Other than that I’m not entirely sure - I’ve used this pattern in some of my own pipeline rules, but since I’m on vacation I can’t really check right now
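
For reference, a minimal sketch of the timestamp line with both reported problems addressed: the field name is quoted, and the JSONPath result is wrapped in to_string() so that flex_parse_date receives a string rather than an object. Field names follow the sample event. Whether flex_parse_date copes with the four fractional digits in this particular time format is an assumption that should be checked in the rule simulator:

```
rule "parse the json log entries"
when
  true
then
  let json_tree = parse_json(to_string($message.message));
  let json_fields = select_jsonpath(json_tree, { time: "$.time", level: "$.level" });

  set_field("level", to_string(json_fields.level));

  // The field name is a quoted string; to_string() turns the JSONPath
  // result (an object) into the string that flex_parse_date expects.
  set_field("timestamp", flex_parse_date(to_string(json_fields.time)));
end
```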

mulgurul · December 13, 2018, 8:55am

Hi again

I got it working; it just needed some smaller adjustments.
Although I’m still wondering why the JSON extractor couldn’t do the job.
It should be faster, right? And somehow I like the idea of not depending on a lot of hardcoded field names in a pipeline rule!

Best regards, Peter

jan · December 13, 2018, 9:14am

> It should be faster, right? And somehow I like the idea of not depending on a lot of hardcoded field names in a pipeline rule!

That is the reason that in Graylog 3.0 the function set_fields can be used to write the complete map into the message’s key-value fields.

The nice thing about open-source software is that it can be extended by any user who needs something the initial developer did not think about.

benvanstaveren · December 13, 2018, 9:36am

In the long run they’re both just as fast since they more or less do the same thing, and realistically speaking, a few microseconds more or less isn’t going to hurt. Also, as @jan pointed out (and I keep pointing out his pointings out), in Graylog 3.0 (coming, as far as I recall, in February 2019) you can use set_fields() with the direct output of parse_json without the need for the intermediate select_jsonpath.
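
A rough sketch of what that could look like on Graylog 3.0 or later. The thread only names set_fields and parse_json; the to_map conversion between them is an assumption about how the parsed JSON is turned into a map, so check it against the function reference of the version in use:

```
rule "parse the json log entries (Graylog 3.0+ sketch)"
when
  true
then
  let json_tree = parse_json(to_string($message.message));

  // Assumption: to_map converts the parsed JSON object into a map
  // whose keys become message fields via set_fields.
  set_fields(to_map(json_tree));
end
```

With this variant no field names are hardcoded, but the values keep whatever types the JSON carried, and the “time” key still has to be moved into timestamp separately.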

Personally I prefer pipelines over extractors, mostly because you attach a pipeline to a stream which generally means you already “know” what you’ll be parsing. For example, our Graylog setup ingests about 2500msg/sec, and our heaviest pipeline only processes ~700 of those due to the stream routing being done before pipelines run, so we can tailor the pipelines very specifically to a subset of what we ingest.

To put that in perspective a little bit, everything comes in via Filebeat, about half of messages are generally formatted in syslog style lines, then we have about 33% with custom log formats that require additional parsing and generous use of lookup tables, the rest comes in from our microservice cluster, in various formats including a pure JSON format, where a pipeline “does the right thing” based on some fields being present (or absent) in a message.

We wouldn’t be able to run it as efficiently with only extractors. Personally speaking, I think pipelines are the more flexible feature, and while currently they do require some hardcoding of fields, you do have the benefit of knowing exactly what goes into your message once it’s parsed and processed.

mulgurul · December 14, 2018, 11:27am

Thanks for all your great and valuable feedback. I’ll continue down the pipeline path then, but I have run into some other issues. I’m having a hard time understanding exactly what the pipeline does once the “then” code has been executed. Does it just modify the original message, so that the modifications are visible when searching and viewing the message in “Search”?

What happens is that my rule seems to be able to parse the JSON in the message field in the simulator, but when receiving messages from the real stream, it fails on ALL messages. I took the message directly from a recent event and tried it in the simulator. It was parsed fine.

Here’s an example of the data:

```
{ "time": "2018-12-14 12:02:11.9210", "level": "INFO", "message": "DODP operation-entry: getBookmarks", "CorrelationID": "c472e04b-6ee2-47f1-9684-d354feb15d82", "SessionID": "UFbMbnX8i0a8BxqH6A_s-g", "Component": "DodpMobile", "ComponentVersion": "1.0.1.0", "Action": "DODP-GETBOOKMARKS", "MethodNameOrURL": "/getBookmarks", "MemberID": "", "BookNo": null, "TimeSpentInMs": "0", "Hostname": "BETADLWEB02", "InitiatorIPAddress": "192.168.0.7", "UserAgent": "DodpReader;1.0.40.25070", "ResultCode": null, "DataAsJson": null }
```

Here’s the Rule:
```
rule "parse the json log entries"
when
  true
then
  let json_tree = parse_json(to_string(message.message));
  let json_fields = select_jsonpath(json_tree, {
    time: ".time",
    level: ".level",
    message: ".message",
    CorrelationID: ".CorrelationID",
    SessionID: ".SessionID",
    Component: ".Component",
    ComponentVersion: ".ComponentVersion",
    Action: ".Action",
    ResultCode: ".ResultCode",
    DataAsJson: ".DataAsJson",
    InitiatorIPAddress: ".InitiatorIPAddress",
    MethodNameOrURL: ".MethodNameOrURL",
    MemberID: ".MemberID",
    UserAgent: ".UserAgent",
    TimeSpentInMs: ".TimeSpentInMs",
    BookNo: ".BookNo",
    SourceLine: "._source_line",
    SourceMethod: "._source_method",
    EventDateAtOrigin: ".EventDateAtOrigin"
  });

  set_field("EventDateAtOrigin", to_date(json_fields.EventDateAtOrigin));
  set_field("Level", to_string(json_fields.Level));
  set_field("CorrelationID", to_string(json_fields.CorrelationID));
  set_field("SessionID", to_string(json_fields.SessionID));
  set_field("BookNo", to_string(json_fields.BookNo));
  set_field("Hostname", to_string(json_fields.Hostname));
  set_field("UserAgent", to_string(json_fields.UserAgent));
  set_field("ResultCode", to_long(json_fields.ResultCode));
  set_field("ComponentVersion", to_string(json_fields.ComponentVersion));
  set_field("Component", to_string(json_fields.Component));
  set_field("Action", to_string(json_fields.Action));
  set_field("MemberID", to_string(json_fields.MemberID));
  set_field("TimeSpentInMs", to_long(json_fields.TimeSpentInMs));
  set_field("DataAsJson", to_string(json_fields.DataAsJson));
  set_field("MethodNameOrURL", to_string(json_fields.MethodNameOrURL));
  set_field("InitiatorIPAddress", to_string(json_fields.InitiatorIPAddress));
  set_field("time", to_string(json_fields.time));
  set_field("SourceLine", to_string(json_fields.SourceLine));
  set_field("SourceMethod", to_string(json_fields.SourceMethod));

  set_field("timestamp", parse_date(substring(to_string(json_fields.time), 0, 23), "yyyy-MM-dd HH:mm:ss.SSS"));
  remove_field("time");
end
```

Simulator shows:

Here’s where I see errors:

Or at least I think it means errors; even 0 error/s(1145) could be understood in many ways. Anyway, there are no changes to the messages when I search: no new or modified fields. (Do I need to cycle the index?)
I don’t know how to troubleshoot this. Do I need to look into the Graylog server logs?

Sorry that I have to ask for more help…

//Peter

benvanstaveren · December 14, 2018, 1:53pm

Hi Peter,

The rule looks good except for the select_jsonpath, not sure if the forum software ate it, but you need to prefix the path with “$”.

And yes your view of the errors is correct, the throughput and errors are at 0/sec (probably because you aren’t putting anything through the stream), and over the lifetime of your pipeline it has picked up 1145 errors altogether.

Another reason it may error out is that the JSON parsing failed; I think that does show up in the main server log. You may need to take a look at that, throw some messages through, and see what happens, and perhaps alter your rule afterwards to have a more targeted condition, e.g. a regex check for "^\\{.*\\}$" (keep in mind this is a little bit expensive to run) to ensure that the message field at least looks like JSON before attempting to parse it.

Other than that I can’t really think of anything that would cause the errors, unfortunately.
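
A minimal sketch of such a guarded rule, with the “$” prefix on every JSONPath and on $message. The field selection is trimmed to three keys from the sample event, and the rule name is made up for illustration. The regex assumes the whole message field is one JSON object with no trailing whitespace; if that does not hold, the condition simply never matches and the rule is skipped:

```
rule "parse json log entries (guarded)"
when
  // Only attempt to parse messages that at least look like a JSON object.
  has_field("message") &&
  regex("^\\{.*\\}$", to_string($message.message)).matches == true
then
  let json_tree = parse_json(to_string($message.message));
  let json_fields = select_jsonpath(json_tree, {
    time: "$.time",
    level: "$.level",
    CorrelationID: "$.CorrelationID"
  });

  set_field("level", to_string(json_fields.level));
  set_field("CorrelationID", to_string(json_fields.CorrelationID));

  // Keep the first 23 characters ("yyyy-MM-dd HH:mm:ss.SSS") of the time
  // value and parse them with an explicit pattern.
  set_field("timestamp", parse_date(substring(to_string(json_fields.time), 0, 23), "yyyy-MM-dd HH:mm:ss.SSS"));
end
```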

jan · December 14, 2018, 2:29pm

> The rule looks good except for the select_jsonpath, not sure if the forum software ate it, but you need to prefix the path with “$”.

That is the reason you should use code fences when posting code or rules in this community:

```
 your code / rule here
```

macko003 · December 14, 2018, 3:54pm

We use different ports for different applications/functions, because it is easier to build a stream on the input ID; you don’t need multiple rules or regexes on the hostnames. It is also easier to debug the network traffic, and if you add a web server, you don’t need to administer anything on the Graylog side: just set port 1234 and it will go to the web server stream.

But there is no good solution; it’s just my opinion.

benvanstaveren · December 14, 2018, 9:36pm

To borrow a Perl-ism, TIMTOWTDI. We are in the “lucky” position that all our input is through Beats, and that we tag things with additional fields before they get to Graylog. Our stream rules are all exact-match on one field, except 3 streams that have 2 exact-match rules. So in general it is just as fast as checking the input ID.

The reason we run it like that is that by way of the extra fields we know exactly which app is logging, and by extension the exact format it’s in, so we can tailor a pipeline exclusively to that (and extract additional data that we want).

I don’t really agree that there is “no” good solution; I’d like to rephrase that as “there are many good solutions, but which one is the best is subjective”.

macko003 · December 14, 2018, 9:45pm

You are very lucky.
We have more than 40 streams, 600+ sources, and 50+ different source types.
Only the timestamp and the message are common fields in the logs. I’m lucky when Graylog can parse the hostname.

benvanstaveren · December 14, 2018, 9:47pm

Damn son… yeah, that gets complicated fast!

system · December 28, 2018, 9:47pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
