Failing JSON Extractor


(Peter Meldgaard) #1

Hi

I’m quite new to Graylog configuration. I’m trying to set up a Filebeat input where the log files contain single-line JSON events, like this:

{ "time": "2018-12-07 12:56:12.2948", "level": "INFO", "message": "This is log event number 93 generated randomly in batch 374fa6d5-253a-4111-b274-f780a4255d47", "CorrelationID": "36835d49-c68e-4bb8-91e2-28954217e0a2", "SessionID": "6869282", "Component": "APP", "ComponentVersion": "1.5.0.0", "Action": "SEARCH", "Method": "SeachBook", "MemberID": "10820048", "TimeSpent": "" }

The sidecar is working, and when I submit logs I can see that they are received by Graylog by looking at the in/out metric counter. They also appear in the input stream when searching, as long as I don’t add a JSON extractor on the input.

I have understood that I need to add a JSON extractor on the message field, but when I do this, no messages get through.

The extractor seems to parse fine using the “Try” button;

I have looked at some of the topics regarding this. There are some restrictions on the characters allowed in key names, but that does not seem to be my problem. Then there is date parsing: I don’t know whether my time field is parsed as a date object, and whether that causes a problem here.

Can anyone please help me troubleshoot this? Or maybe even point me to a solution.
Thanks a lot.

And great product by the way:-)


(Ben van Staveren) #2

Well, first, don’t select “Flatten” - that just tries to stuff it all into a single field with a weird format; so uncheck that. Then there’s the issue that it may not want to work after all due to the JSON object also containing a field named “message”, and I’m not sure how that plays along with Graylog JSON extractor (especially in copy mode).

An alternative option is to do this in a pipeline, a bit more work but if you create a rule as follows:

rule "parse the json log entries"
when
  true
then
  let json_tree = parse_json(to_string($message.message));
  let json_fields = select_jsonpath(json_tree, { time: "$.time", level: "$.level", message: "$.message", CorrelationID: "$.CorrelationID", SessionID: "$.SessionID", Component: "$.Component", ComponentVersion: "$.ComponentVersion", Action: " $.Action", Method: "$.Method", MemberID: "$.MemberID", TimeSpent: "$.TimeSpent" });

 set_field("level", to_string(json_fields.level));
 set_field("message", to_string(json_fields.message));
 set_field("CorrelationID", to_string(json_fields.CorrelationID));
 # etc. etc. etc.

 set_field(timestamp, flex_parse_date(json_fields.time));
 remove_field("time");
end

That will do the trick too - instead of setting each field individually you can also set_fields(json_fields) - but doing it individually means you can do the proper to_string/to_bool/to_long/to_double typecasting if it’s required. It will also grab the “time” field from your message, and will replace Graylog’s timestamp with the “proper” time of the event occurring, since Graylog will not do that automatically for you.

(Also if you use a JSON extractor, you would still require a 2nd extractor to move the time field into the timestamp field)


(Peter Meldgaard) #3

Hi Ben
First of all, a huge thanks for helping out on my issue, and for even providing a complete example of a pipeline rule. Much appreciated…

I tried to submit the pipeline action that you have provided, but as a newbie stumbled upon some problems.

There were a few syntax errors. Two I could fix, but this one I’m not sure how to fix:

set_field(timestamp, flex_parse_date(json_fields.time));

I tried making timestamp a string constant:

set_field("timestamp", flex_parse_date(json_fields.time));

But then I got the error: Expected type “string” for argument value but found object value in call to flex_parse_date etc… Can you give me a hint on how to fix this one?

Commenting out that line and testing with a raw JSON string in the message field gives me an unchanged message. I can’t figure out what I’m doing wrong.

Hope you can guide me here, thanks. //Peter


(Ben van Staveren) #4

I think I made a typo somewhere, you’re right when you do a set_field("timestamp", ...) - not sure about the second error, I may have mistyped the name of the function (flex_parse_date), not sure if it’s actually called that…it’ll be in the documentation though. Other than that I’m not entirely sure - I’ve used this pattern in some of my own pipeline rules, but since I’m on vacation I can’t really check right now :frowning:
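
For reference, the type error quoted in #3 suggests that flex_parse_date wants a string argument while json_fields.time is a generic object value. A minimal sketch of the rule from #2 with the field name quoted and the value wrapped in to_string(), trimmed to a few fields. It has not been verified against a running Graylog instance, and whether flex_parse_date accepts the four-digit fractional seconds in this log format is an assumption; the rule in #8 uses parse_date with an explicit pattern instead.

```
rule "parse the json log entries"
when
  true
then
  let json_tree = parse_json(to_string($message.message));
  let json_fields = select_jsonpath(json_tree, {
    time: "$.time",
    level: "$.level",
    message: "$.message",
    CorrelationID: "$.CorrelationID"
  });

  set_field("level", to_string(json_fields.level));
  set_field("message", to_string(json_fields.message));
  set_field("CorrelationID", to_string(json_fields.CorrelationID));

  // The field name is a quoted string, and the extracted value is
  // converted with to_string() before it reaches the date function.
  set_field("timestamp", flex_parse_date(to_string(json_fields.time)));
end
```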


(Peter Meldgaard) #5

Hi again

I got it working; it just needed some smaller adjustments.
Although I’m still wondering why the JSON extractor couldn’t do the job.
It should be faster, right? And somehow I like the idea of not having a dependency on a lot of hardcoded field names in a pipeline rule!

Best regards, Peter


(Jan Doberstein) #6

It should be faster, right? And somehow I like the idea of not having a dependency on a lot of hardcoded field names in a pipeline rule!

That is the reason that in Graylog 3.0 the function set_fields can be used to write the complete map into the key-value fields.

The nice thing about OSS software is that it can be extended by any user who needs something the initial developer did not think about.


(Ben van Staveren) #7

In the long run they’re both just as fast since it more or less does the same thing - and realistically speaking, a few microseconds more or less isn’t going to hurt :wink: Also, as @jan pointed out (and I keep pointing out his pointings out), in Graylog 3.0 (coming, as far as I recall, in February 2019) you can use set_fields() with the direct output of parse_json without the need for the intermediate select_jsonpath.
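
A minimal sketch of what that could look like, assuming Graylog 3.0 or later. As far as I know, the parsed JSON tree has to go through to_map() before set_fields() accepts it; treat that as an assumption to check against the function reference for your version.

```
rule "set all json fields"
when
  has_field("message")
then
  let json_tree = parse_json(to_string($message.message));
  // to_map() converts the parsed tree into a map, and set_fields()
  // writes every top-level key as a message field, so no field
  // names are hardcoded in the rule.
  set_fields(to_map(json_tree));
end
```

Note that with the log format from #1 the JSON key “message” would replace the original message field, and “time” would arrive as a plain string, so the timestamp still has to be set separately.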

Personally I prefer pipelines over extractors, mostly because you attach a pipeline to a stream which generally means you already “know” what you’ll be parsing. For example, our Graylog setup ingests about 2500msg/sec, and our heaviest pipeline only processes ~700 of those due to the stream routing being done before pipelines run, so we can tailor the pipelines very specifically to a subset of what we ingest.

To put that in perspective a little bit, everything comes in via Filebeat, about half of messages are generally formatted in syslog style lines, then we have about 33% with custom log formats that require additional parsing and generous use of lookup tables, the rest comes in from our microservice cluster, in various formats including a pure JSON format, where a pipeline “does the right thing” based on some fields being present (or absent) in a message.

We wouldn’t be able to run it as efficiently with only extractors. Personally speaking, I think pipelines are the more flexible feature, and while currently they do require some hardcoding of fields, you do have the benefit of knowing exactly what goes into your message once it’s parsed and processed.


(Peter Meldgaard) #8

Thanks for all your great and valuable feedback. I’m continuing down the pipeline path, but have run into some other issues. I’m having a hard time understanding exactly what the pipeline does once the “then” code has been executed. Does it just modify the original message, so that the modifications are visible when searching and viewing the message in “Search”?

What happens is that my rule seems to be able to parse the JSON in the message field in the simulator, but when receiving messages from the real stream, it fails on ALL messages. I took the message directly from a recent event and tried it in the simulator. It was parsed fine.

Here’s an example of the data:

{ "time": "2018-12-14 12:02:11.9210", "level": "INFO", "message": "DODP operation-entry: getBookmarks", "CorrelationID": "c472e04b-6ee2-47f1-9684-d354feb15d82", "SessionID": "UFbMbnX8i0a8BxqH6A_s-g", "Component": "DodpMobile", "ComponentVersion": "1.0.1.0", "Action": "DODP-GETBOOKMARKS", "MethodNameOrURL": "/getBookmarks", "MemberID": "", "BookNo": null, "TimeSpentInMs": "0", "Hostname": "BETADLWEB02", "InitiatorIPAddress": "192.168.0.7", "UserAgent": "DodpReader;1.0.40.25070", "ResultCode": null, "DataAsJson": null }

Here’s the Rule:
rule "parse the json log entries"
when
true
then
let json_tree = parse_json(to_string(message.message));
let json_fields = select_jsonpath(json_tree, { time: ".time", level: ".level", message: ".message", CorrelationID: ".CorrelationID", SessionID: ".SessionID", Component: ".Component", ComponentVersion: ".ComponentVersion", Action: ".Action", ResultCode: ".ResultCode", DataAsJson: ".DataAsJson", InitiatorIPAddress: ".InitiatorIPAddress", MethodNameOrURL: ".MethodNameOrURL", MemberID: ".MemberID", UserAgent: ".UserAgent", TimeSpentInMs: ".TimeSpentInMs", BookNo: ".BookNo", SourceLine: "._source_line", SourceMethod: "._source_method", EventDateAtOrigin: ".EventDateAtOrigin" });

set_field("EventDateAtOrigin", to_date(json_fields.EventDateAtOrigin));
set_field("Level", to_string(json_fields.Level));
set_field("CorrelationID", to_string(json_fields.CorrelationID));
set_field("SessionID", to_string(json_fields.SessionID));
set_field("BookNo", to_string(json_fields.BookNo));
set_field("Hostname", to_string(json_fields.Hostname));
set_field("UserAgent", to_string(json_fields.UserAgent));
set_field("ResultCode", to_long(json_fields.ResultCode));
set_field("ComponentVersion", to_string(json_fields.ComponentVersion));
set_field("Component", to_string(json_fields.Component));
set_field("Action", to_string(json_fields.Action));
set_field("MemberID", to_string(json_fields.MemberID));
set_field("TimeSpentInMs", to_long(json_fields.TimeSpentInMs));
set_field("DataAsJson", to_string(json_fields.DataAsJson));
set_field("MethodNameOrURL", to_string(json_fields.MethodNameOrURL));
set_field("InitiatorIPAddress", to_string(json_fields.InitiatorIPAddress));
set_field("time", to_string(json_fields.time));
set_field("SourceLine", to_string(json_fields.SourceLine));
set_field("SourceMethod", to_string(json_fields.SourceMethod));

set_field("timestamp", parse_date(substring(to_string(json_fields.time), 0, 23), "yyyy-MM-dd HH:mm:ss.SSS"));
remove_field("time");
end

Simulator shows:

Here’s where I see errors:

Or at least I think it means errors, even though “0 errors/s (1145)” could be understood in many ways. Anyway, there are no changes to the messages when I search: no new or modified fields. (Do I need to cycle the index?)
I don’t know how to troubleshoot this. Do I need to look into the Graylog server logs?

Sorry that I have to ask for more help…

//Peter


(Ben van Staveren) #9

Hi Peter,

The rule looks good except for the select_jsonpath, not sure if the forum software ate it, but you need to prefix the path with “$”.
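
A trimmed sketch of the rule from #8 with the “$” prefixes in place, using only a few of the fields for brevity. Two other things stand out in the rule as posted and are adjusted here as an assumption about what was intended: it reads json_fields.Level while the map defines the key as level, and it reads json_fields.Hostname although Hostname is never selected in the map. Both would presumably come back empty.

```
rule "parse the json log entries"
when
  true
then
  let json_tree = parse_json(to_string($message.message));
  // Every JsonPath expression starts with "$", the root of the parsed tree.
  let json_fields = select_jsonpath(json_tree, {
    time: "$.time",
    level: "$.level",
    Hostname: "$.Hostname",
    TimeSpentInMs: "$.TimeSpentInMs"
  });

  // Keys are read back exactly as they are spelled in the map above.
  set_field("Level", to_string(json_fields.level));
  set_field("Hostname", to_string(json_fields.Hostname));
  set_field("TimeSpentInMs", to_long(json_fields.TimeSpentInMs));

  // Trim "2018-12-14 12:02:11.9210" to millisecond precision before parsing.
  set_field("timestamp", parse_date(substring(to_string(json_fields.time), 0, 23), "yyyy-MM-dd HH:mm:ss.SSS"));
end
```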

And yes your view of the errors is correct, the throughput and errors are at 0/sec (probably because you aren’t putting anything through the stream), and over the lifetime of your pipeline it has picked up 1145 errors altogether.

Another reason it may error out is that the JSON parsing failed; I think that does show up in the main server log. You may need to take a look at that, throw some messages through and see what happens, and perhaps alter your rule afterwards to have a more targeted condition - e.g. do a regex check for "^\\{.*\\}$" (keep in mind this is a little bit expensive to run) to ensure that the message field at least looks like JSON before attempting to parse it.
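
A sketch of such a targeted condition, assuming single-line events as in the examples in this thread. The pattern additionally tolerates surrounding whitespace, and the then block is shortened to a single field for illustration.

```
rule "parse only messages that look like json"
when
  has_field("message") &&
  regex("^\\s*\\{.*\\}\\s*$", to_string($message.message)).matches == true
then
  // Only messages that start with "{" and end with "}" reach parse_json().
  let json_tree = parse_json(to_string($message.message));
  let json_fields = select_jsonpath(json_tree, { level: "$.level" });
  set_field("level", to_string(json_fields.level));
end
```

This only checks the outer shape of the message; a line that passes the check can still be invalid JSON.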

Other than that I can’t really think of anything that would cause the errors unfortunately :frowning:


(Jan Doberstein) #10

The rule looks good except for the select_jsonpath, not sure if the forum software ate it, but you need to prefix the path with “$”.

That is the reason you should use code fences in this community when posting code/rules:

```
 your code / rule here
```

#11

We use different ports for different applications/functions, because it is easier to build a stream based on the input ID: you don’t need to use multiple rules or regexes on the hostnames. It also makes the network traffic easier to debug, and if you get a new web server you don’t need to administer anything on the Graylog side; just point it at port 1234 and it will go to the web server stream.

But there is no good solution, it’s just my opinion.


(Ben van Staveren) #12

To borrow a Perl-ism, TIMTOWTDI :slight_smile: We are in the “lucky” position that all our input is through Beats, and that we tag things with additional fields before they get to Graylog. Our stream rules are all exact-match on one field - except 3 streams that have 2 exact-match rules. So in general just as fast as checking the input ID :slight_smile:

The reason we run it like that is that by way of the extra fields we know exactly which app is logging, and by extension the exact format it’s in, so we can tailor a pipeline exclusively to that (and extract additional data that we want).

I don’t really agree that there is “no” good solution, I’d like to rephrase that as “there are many good solutions, but which one is the best is subjective” :slight_smile:


#13

You are very lucky.
We have more than 40 streams, 600+ sources, and 50+ different source types.
Only the timestamp and the message are common fields in the logs. I’m lucky when GL can parse the hostname.


(Ben van Staveren) #14

Damn son… yeah, that gets complicated fast!