Stumbled upon this gem, have a few q's about Graylog

Hello Graylog and Community,

I come from a background in leveraging an event logging and analytics product that starts with an S and requires the sacrifice of a kidney to afford. I have been looking for open source alternatives that intend to disrupt the space and keep iterating and getting better over time. I have seen ELK and it's okay, but I wanted a more out-of-the-box, easier-to-maintain solution with great documentation, an open core approach with driven engineers, and an enterprise following big enough to pay the bills and keep the lights on (and hopefully they are pro open source too!). I like a team running the project that believes in open source and enjoys a community of users acting as great QA and even helping build and PR awesome functionality once familiar with the application. I get the feeling Graylog fits into this, and I am very excited to dig into it a bit and potentially POC it in the near future. I do have a few questions. I searched older posts and think most are not repeats, but forgive me if they are and link me to an old post if possible.

I work in the API and API Gateway space, so most of my questions are geared around that.

Some features I am looking for in the open-source version are:

  1. Programmatic dashboarding: Say a customer calls tons of APIs against an API Gateway technology. For every new customer I have on the gateway, can a dashboard be created programmatically in Graylog (based on JSON/XML or something posted to an endpoint) that the customer can later view on a unique, predictable URL path, showing things like which APIs they call, the HTTP status codes they are seeing, etc.?

I see info here: http://docs.graylog.org/en/3.0/pages/dashboards.html , which visually tells me how to do it all. But say I make 1 dashboard visually (maybe it exports as JSON/XML, right?) and at that point there is 1 parameter I would like to use to define tons of custom dashboards (maybe by consumer username) to give each consumer a unique dashboard showcasing their API experience. Then an API endpoint could be exposed by Graylog, I POST that JSON format with a different consumer name embedded in it, and BAM, I get an awesome dashboard! In the works? A new idea that has not been considered?

Edit - Ooo Found the REST API Docs page - https://docs.graylog.org/en/3.1/pages/configuration/rest_api.html#graylog-rest-api , which tells me I need to install it and start browsing the swagger page lol, guess that keeps you from having to maintain all the things it supports directly in the docs :stuck_out_tongue: . I’ll keep my fingers crossed that dashboard control is present and the UI is just interacting with the API to produce dashboards!
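For anyone following along, here is a rough sketch of the "one template, many consumers" idea in Python. It only renders a per-consumer payload and prepares a request without sending it. The template fields, the `/api/dashboards` path, the `X-Requested-By` header, and the host name are all assumptions for illustration; the real schema and endpoints should be checked in the API browser of your own install.

```python
import json
import urllib.request

# Hypothetical template. The real dashboard schema should come from the
# API browser of your Graylog version, not from this sketch.
TEMPLATE = {
    "title": "API experience: {consumer}",
    "description": "APIs called and HTTP status codes seen by {consumer}",
}


def render_dashboard(consumer):
    """Fill the consumer name into every string value of the template."""
    return {key: value.format(consumer=consumer) for key, value in TEMPLATE.items()}


def build_create_request(base_url, consumer):
    """Prepare (but do not send) the POST that would create the dashboard."""
    body = json.dumps(render_dashboard(consumer)).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/dashboards",  # assumed path, check the API browser
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: the API expects this header on write requests.
            "X-Requested-By": "dashboard-script",
        },
    )


req = build_create_request("http://graylog.example.com:9000", "consumer-a")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# Sending it would be urllib.request.urlopen(req), plus authentication.
```

Looping `build_create_request` over a list of consumer names would give one request per consumer, which is the "BAM, dashboard" flow described above, assuming the API accepts such a payload.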

  2. LDAP control for admins vs read rights users. - Answering this one myself, seems a resounding yes, go Graylog!

  3. HTTP intake endpoint accepting JSON format - Yay I do see it here:
    https://docs.graylog.org/en/3.1/pages/gelf.html#example-payload

My follow-up question: much like other solutions, does Graylog allow “batching” of these JSON messages to send multiple messages in a single call? Like:

{ ... }{ ... }{ ... }{ ... }

Not a hard requirement, but it sure would be nice if I could send, say, 10-20 messages in a single call since they are not going to be too big.

Edit - Ahh, I see it here, no dice on multiple messages. The PR kinda died off once it was mentioned that a single JSON payload prepared by the client, which the server has to read fully into memory, is dangerous: https://github.com/Graylog2/graylog2-server/pull/5924 . That said, it's not a breaking change from existing behavior, and the same risk exists if a single payload is huge anyway. It is one of those features where if you use it, you should understand the risks. I would say a 20 KB or so batched log payload isn’t that big in today's world :stuck_out_tongue: . Average API call sizes I see in the wild are around 9 KB. Cool to know I can fork and use this PR though if I want batched support right now in a structure like:

[
  {"version": "1.1", "host": "..."},
  {"version": "1.1", "host": "..."},
  {"version": "1.1", "host": "..."}
]

Which is actually more conformant than our existing solution, which sends invalid JSON that is essentially multiple JSON messages concatenated, like the earlier sample I gave above.
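Without the batching PR, a client that already holds a JSON array can still fan it out as one request per message. A minimal sketch, assuming the GELF HTTP input listens on a `/gelf` path as in the docs example; the host, port, and sample messages are made up, and nothing is actually sent:

```python
import json
import urllib.request

GELF_URL = "http://graylog.example.com:12201/gelf"  # hypothetical host and port


def build_requests(batch_json, gelf_url=GELF_URL):
    """Turn a JSON array of GELF messages into one POST request per message."""
    messages = json.loads(batch_json)
    if not isinstance(messages, list):
        raise ValueError("expected a JSON array of GELF messages")
    return [
        urllib.request.Request(
            url=gelf_url,
            data=json.dumps(message).encode("utf-8"),
            method="POST",
            headers={"Content-Type": "application/json"},
        )
        for message in messages
    ]


batch = json.dumps([
    {"version": "1.1", "host": "gateway-1", "short_message": "call 1"},
    {"version": "1.1", "host": "gateway-1", "short_message": "call 2"},
    {"version": "1.1", "host": "gateway-1", "short_message": "call 3"},
])
requests_to_send = build_requests(batch)
print(len(requests_to_send), "requests prepared")
# Each one would be sent with urllib.request.urlopen(request).
```

This costs one HTTP round trip per message, so reusing a keep-alive connection matters more here than it would with true batching.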

Also it seems the JSON structure Graylog accepts is only flat JSON if I am just going by the sample?

Something like:

{
"Tries":[{"balancer_latency":0,"port":443,"balancer_start":1570940875143,"ip":"10.xxx.xxx.xxx"},...]
}

Would not work, right? I could also flatten it to something like tries1, tries2, ... tries5 max as a flat design, but it would be nice to support a more complex structure. - Yeah, seems I can answer my own Q here actually: https://github.com/Graylog2/graylog2-server/issues/5945 , and well, since I am fully in control of the JSON payload I send, I can flatten everything out :slight_smile: . No major issues here, but I think enabling structured JSON will help take Graylog to new heights!
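Flattening on the client side could look roughly like the sketch below. The naming scheme (`_Tries_0_port` and so on) is just one possible convention chosen for illustration; the leading underscore follows the GELF rule that additional fields are prefixed with `_`. The IP address in the sample is made up.

```python
def flatten(value, prefix="", out=None):
    """Flatten nested dicts and lists into a single-level dict.

    Keys are joined with underscores and list items use their index,
    so {"Tries": [{"port": 443}]} becomes {"_Tries_0_port": 443}.
    """
    if out is None:
        out = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flatten(child, f"{prefix}_{key}", out)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flatten(child, f"{prefix}_{index}", out)
    else:
        out[prefix] = value
    return out


nested = {
    "Tries": [
        {"balancer_latency": 0, "port": 443, "ip": "10.0.0.1"},
        {"balancer_latency": 3, "port": 443, "ip": "10.0.0.2"},
    ]
}
flat = flatten(nested)
for key in sorted(flat):
    print(key, "=", flat[key])
```

The flat dict can then be merged into the GELF message next to `version`, `host`, and `short_message`.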

  4. TLS / Auth for HTTP log intake: Easy enough. In my mind I will throw in a Kong gateway sidecar to serve a TLS endpoint, provide key auth or OAuth2/JWT auth, and reverse proxy on localhost to the GELF endpoint. Huzzah!

  5. User Experience: Say I use the HTTP JSON logging. In the UI, when browsing JSON events, is Graylog smart enough to natively pretty print that JSON to be pleasingly viewable?

  6. How advanced does the querying get? I understand I kinda select fields and do various operations on them. Can Graylog essentially do any of these?

Basic Wildcard searching on values.

index=XXX URI=*myurl/resource/endpoint1* OR URI=*myurl/resource/endpoint2*

Piping searches to produce basic visuals, like the below would make a bar chart of HTTP Statuses

index=XXX URI=*myurl/resource/endpoint1* OR URI=*myurl/resource/endpoint2* | chart count by HTTPStatus

Lastly something like

index=XXX URI=*myurl/resource/endpoint1* OR URI=*myurl/resource/endpoint2* | timechart p50(BackendLatency) p95(BackendLatency) p99(BackendLatency)

Which would yield a pretty p50/p95/p99 time chart to review latency results visually.

Of course I don’t need a query language exactly like this; the important part is whether Graylog can produce the same end results in its own way. I am also assuming things like greater than/less than operators are supported on numeric JSON values too (hopefully correct :slight_smile: ).

  7. Performance: Say I am generating about 200 million transactions a day, with a max around 3000 TPS. Does this seem in the ballpark of something the open-core HTTP event logging can handle? Log message size is around 600-800 bytes per message. Is anyone leveraging Graylog able to ballpark whether such volume could be handled on a single beefy CentOS/RHEL VM with X CPU, X RAM, X disk for Graylog, Elasticsearch, and MongoDB? That is up to about 0.16 TB of data per day. I would like a retention period of 3 months as a start.

  8. RESTful management / data retrieval: My understanding is there is a REST API to pull logged event data too, or to get aggregate details in a response? So is it possible to have all events logged, then do a REST query on the event data like:

  • Count of transactions over time for all consumers of the API
  • Count of transactions over time for a specific consumer of an API
  • P50/95/99 latency perf of the API

Will keep digging and edit any of these posts if I answer my own questions too, there is certainly a ton to comb over. Looking forward to getting involved!

Thanks,
Jeremy

Hmm, one hindrance I am seeing early is the role-to-dashboard management. Right now I can create a role, which can then have view or write permissions on various dashboards. It would be super nice if I could make a role that can read ALL existing dashboards without me having to programmatically reconfigure the role every time a new dashboard is created. I assume I can make an API call to do so, but if I could just do it 1 time in the admin panel with a selector bubble that grants view access to all dashboards for this role, that would be desirable.
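Until something like that exists, a script could recompute the role's permission list whenever a dashboard is added. The sketch below only does the list merging. The `dashboards:read:<id>` permission string format and the shape of the role document are assumptions that should be verified in the API browser, and the IDs are invented.

```python
def read_permissions(dashboard_ids):
    """Build one read permission string per dashboard ID (assumed format)."""
    return [f"dashboards:read:{dashboard_id}" for dashboard_id in dashboard_ids]


def grant_read_on_all(role, dashboard_ids):
    """Return a copy of the role with read access to every listed dashboard.

    Existing permissions are kept and duplicates are removed.
    """
    merged = set(role.get("permissions", [])) | set(read_permissions(dashboard_ids))
    return {**role, "permissions": sorted(merged)}


# Hypothetical role document and dashboard IDs, for illustration only.
role = {"name": "dashboard-readers", "permissions": ["dashboards:read:aaa111"]}
updated = grant_read_on_all(role, ["aaa111", "bbb222", "ccc333"])
print(updated["permissions"])
```

The updated role would then be written back with whatever role update call the API offers.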

Edit - So I can answer #5 of my q’s too after playing around with it. Everything comes back in a table view when making searches. Maybe y'all would allow a UI PR for a checkbox that enables a pretty-printed JSON view of the logs rather than a plain table view. The table naturally stretches the page sideways <- ->, whereas a JSON view of events with more details goes top down. Maybe it's just a preference because of what I am used to :smile:

3000 TPS

That is not really very much. I know multiple installations that handle double or triple that; not on a single node, but across the complete cluster. How many servers you need depends on what normalization you do and how much CPU power you need for crunching the numbers.

Thanks Jan really appreciate the feedback,

Seems reasonable to stand up a cluster if the POC shows promise. Glad to hear there are use cases in the wild handling as much as 9000 TPS in HTTP event logging. As for normalization, it is just lots of averages, percentiles of latencies, and standard key search kinda queries with 4-5 various combinations of parameter search ANDs / ORs etc. Hoping querying on, say, 3 months' worth of logs (probably 14.4 TB of data at that span) can still be fairly quick. I know with our existing solution at 3 months it gets fairly sluggish, but then again it's a shared service, so there is probably tons of data because many apps have indexes. I see the Graylog use case for me as more federated: each team running its own instance on its own infra will be a nice separation.
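The 0.16 TB per day and 14.4 TB figures are just arithmetic on the numbers from the first post, spelled out here as a quick check. It counts raw message bytes only; index overhead and replicas would come on top, and "3 months" is taken as 90 days.

```python
TX_PER_DAY = 200_000_000   # transactions per day, from the first post
MSG_BYTES = (600, 800)     # low and high message size in bytes
RETENTION_DAYS = 90        # "3 months", taken as 90 days
SECONDS_PER_DAY = 86_400
BYTES_PER_TB = 10 ** 12    # decimal terabytes

avg_tps = TX_PER_DAY / SECONDS_PER_DAY
daily_tb = [TX_PER_DAY * size / BYTES_PER_TB for size in MSG_BYTES]
retained_tb = [per_day * RETENTION_DAYS for per_day in daily_tb]

print(f"average TPS: {avg_tps:.0f}")
print(f"daily volume: {daily_tb[0]:.2f} to {daily_tb[1]:.2f} TB")
print(f"retained: {retained_tb[0]:.1f} to {retained_tb[1]:.1f} TB")
```

So the average rate is well below the 3000 TPS peak, and the retained volume lands somewhere between roughly 10.8 and 14.4 TB before any indexing overhead.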

Stumbled my way into making this so far with a number of button clicks, taking me awhile to get a feel for this but slowly I will get there!:

I wish that after making a chart like this I could capture it as a query to run from the search bar at the top, rather than only seeing it in the UI. I would like to be able to produce things like this with as little clicking, browsing around, and combining of charts as possible.

For the life of me I cannot figure out how to do percentiles. If I want to get the p99 / p95 of a given minute's data points and chart that (rather than the min/max/mean), how can I do it?

Also as another win I now know the first query question I had can be accomplished by doing something like so:

URI:*manage\/health* OR URI:*myurl\/resource\/endpoint2*

Had to enable the config value that allows leading wildcards in searches, though.
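Since the slashes have to be escaped by hand, a tiny helper can build that kind of query string from plain paths. This is only string handling; the `URI` field name is the one used in the example above.

```python
def escape_slashes(path):
    """Escape forward slashes so they are treated literally in the query."""
    return path.replace("/", "\\/")


def uri_wildcard_query(paths, field="URI"):
    """Join several contains-style wildcard terms with OR."""
    return " OR ".join(f"{field}:*{escape_slashes(path)}*" for path in paths)


query = uri_wildcard_query(["manage/health", "myurl/resource/endpoint2"])
print(query)
```

The result is the same query string as typed by hand above, so it can be pasted into the search bar or passed as the `query` parameter of a REST call.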

Okay, I think I understand how to do some of the more complicated charting: it's using views (which seem detached from dashboarding, in the sense that these charts can’t make it into a dashboard?). I managed to figure out how to do a p99 vs p95 chart. I noticed that if I tried to rename the title in the UI to anything, it didn’t work; I could add text at the end but not the front, and it always reverted to that default string:

Trying to do a p99 and p95 in the same chart yields:

While retrieving data for this widget, the following error(s) occurred:
Two sibling aggregations cannot have the same name: [ded2c9bc-f452-45ea-a4f0-c4d818e18c0d-series-percentile(BackendLatency)].

Hoping the raw data for these can be grabbed from the API too. Is there a reason dashboards can’t get these aggregation view charts? For the aggregate chart I posted earlier, I would love to be able to show p99/p95/p50 etc. in the same chart view.

@jan

Still unable to find how to do a histogram of percentile data p90/p95 etc. of numeric data via REST API. Something like:

index=XXX URI="*somePathHere*" | timechart span=10s p90(BackendLatency),  p95(BackendLatency),  p99(BackendLatency)

Is this supported in Graylog? I saw how to do percentiles in the views aggregator page but that seems to not be exposed in the REST API?

Any guidance would be appreciated.
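As a fallback while that question is open, the percentiles can be computed client side from raw values pulled out of a normal search. A minimal sketch, assuming each sample is an (epoch seconds, BackendLatency) pair; the sample data is synthetic and the nearest-rank method is just one of several percentile definitions.

```python
import math
from collections import defaultdict


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def timechart(samples, span_seconds=10, percentiles=(90, 95, 99)):
    """Group (timestamp, value) pairs into fixed spans and compute percentiles."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        buckets[timestamp - timestamp % span_seconds].append(value)
    chart = {}
    for start in sorted(buckets):
        values = sorted(buckets[start])
        chart[start] = {f"p{p}": percentile(values, p) for p in percentiles}
    return chart


# Synthetic data: latencies 1..100 spread over one 10 second span,
# plus a single slow call in the next span.
samples = [(1571164560 + i % 10, i) for i in range(1, 101)]
samples.append((1571164570, 500))

for start, row in timechart(samples).items():
    print(start, row)
```

This pulls every matching message over the wire, so it only makes sense for narrow time ranges; a server-side aggregation would be the better answer if the API exposes one.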

I did see how to do a histogram count for example:

http://server.com:9000/api/search/universal/relative/histogram?query=ServiceName%3AsomeValue&interval=minute&range=86400

Returned:

{
  "interval": "minute",
  "results": {
    "1571164560": 22,
    "1571164620": 0,
    "1571164680": 0,
    "1571164740": 0,
    "1571164800": 0,
    "1571164860": 0,
    "1571164920": 0,
    "1571164980": 0,
    "1571165040": 0,
    "1571165100": 0,
    "1571165160": 0,
    "1571165220": 0,
    "1571165280": 0,
    "1571165340": 0,
    "1571165400": 0,
    "1571165460": 0,
    "1571165520": 0,
    "1571165580": 0,
    "1571165640": 0,
    "1571165700": 0,
    "1571165760": 0,
    "1571165820": 0,
    "1571165880": 0,
    "1571165940": 0,
    "1571166000": 0,
    "1571166060": 0,
    "1571166120": 0,
    "1571166180": 0,
    "1571166240": 0,
    "1571166300": 0,
    "1571166360": 0,
    "1571166420": 0,
    "1571166480": 0,
    "1571166540": 0,
    "1571166600": 0,
    "1571166660": 0,
    "1571166720": 0,
    "1571166780": 0,
    "1571166840": 0,
    "1571166900": 0,
    "1571166960": 0,
    "1571167020": 0,
    "1571167080": 0,
    "1571167140": 0,
    "1571167200": 0,
    "1571167260": 0,
    "1571167320": 0,
    "1571167380": 0,
    "1571167440": 0,
    "1571167500": 0,
    "1571167560": 0,
    "1571167620": 0,
    "1571167680": 0,
    "1571167740": 0,
    "1571167800": 0,
    "1571167860": 0,
    "1571167920": 0,
    "1571167980": 0,
    "1571168040": 0,
    "1571168100": 0,
    "1571168160": 67,
    "1571168220": 0,
    "1571168280": 0,
    "1571168340": 2,
    "1571168400": 36
  },
  "time": 14,
  "built_query": "{\n  \"from\" : 0,\n  \"query\" : {\n    \"bool\" : {\n      \"must\" : [\n        {\n          \"query_string\" : {\n            \"query\" : \"ServiceName:someValue\",\n            \"fields\" : [ ],\n            \"use_dis_max\" : true,\n            \"tie_breaker\" : 0.0,\n            \"default_operator\" : \"or\",\n            \"auto_generate_phrase_queries\" : false,\n            \"max_determinized_states\" : 10000,\n            \"allow_leading_wildcard\" : true,\n            \"enable_position_increments\" : true,\n            \"fuzziness\" : \"AUTO\",\n            \"fuzzy_prefix_length\" : 0,\n            \"fuzzy_max_expansions\" : 50,\n            \"phrase_slop\" : 0,\n            \"escape\" : false,\n            \"split_on_whitespace\" : true,\n            \"boost\" : 1.0\n          }\n        }\n      ],\n      \"filter\" : [\n        {\n          \"bool\" : {\n            \"must\" : [\n              {\n                \"range\" : {\n                  \"timestamp\" : {\n                    \"from\" : \"2019-10-14 21:17:09.181\",\n                    \"to\" : \"2019-10-15 21:17:09.181\",\n                    \"include_lower\" : true,\n                    \"include_upper\" : true,\n                    \"boost\" : 1.0\n                  }\n                }\n              }\n            ],\n            \"disable_coord\" : false,\n            \"adjust_pure_negative\" : true,\n            \"boost\" : 1.0\n          }\n        }\n      ],\n      \"disable_coord\" : false,\n      \"adjust_pure_negative\" : true,\n      \"boost\" : 1.0\n    }\n  },\n  \"aggregations\" : {\n    \"gl2_histogram\" : {\n      \"date_histogram\" : {\n        \"field\" : \"timestamp\",\n        \"interval\" : \"1m\",\n        \"offset\" : 0,\n        \"order\" : {\n          \"_key\" : \"asc\"\n        },\n        \"keyed\" : false,\n        \"min_doc_count\" : 0\n      }\n    }\n  },\n  \"highlight\" : {\n    \"fragment_size\" : 0,\n    \"number_of_fragments\" : 0,\n    \"require_field_match\" : false,\n    \"fields\" : {\n      \"*\" : { }\n    }\n  }\n}",
  "queried_timerange": {
    "from": "2019-10-14T21:17:09.181Z",
    "to": "2019-10-15T21:17:09.181Z"
  }
}
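For scripting against that histogram endpoint, the URL can be assembled with the standard library and the mostly-zero result trimmed down to the buckets that have data. A small sketch: the endpoint path and parameters are the ones from the call above, while the sample response is shortened by hand to a few of the returned buckets.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def histogram_url(base_url, query, interval="minute", range_seconds=86400):
    """Build the relative-search histogram URL with a URL-encoded query."""
    params = urlencode({"query": query, "interval": interval, "range": range_seconds})
    return f"{base_url}/api/search/universal/relative/histogram?{params}"


def non_empty_buckets(response):
    """Return (UTC timestamp, count) pairs for buckets with at least one hit."""
    rows = []
    for epoch, count in sorted(response["results"].items(), key=lambda kv: int(kv[0])):
        if count > 0:
            when = datetime.fromtimestamp(int(epoch), tz=timezone.utc)
            rows.append((when.strftime("%Y-%m-%dT%H:%M:%SZ"), count))
    return rows


url = histogram_url("http://server.com:9000", "ServiceName:someValue")
print(url)

# Shortened copy of the response above: same keys, most zero buckets dropped.
sample = {
    "interval": "minute",
    "results": {
        "1571164560": 22,
        "1571164620": 0,
        "1571168160": 67,
        "1571168340": 2,
        "1571168400": 36,
    },
}
for when, count in non_empty_buckets(sample):
    print(when, count)
```

The bucket keys are epoch seconds, so converting them to UTC makes it easier to line them up with the `queried_timerange` in the response.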

Oof, I also tried to change some of the default index rotation and retention strategies (by indices and by messages) for deletion. I never could get anything to happen (rotate my index / delete old messages) when editing them, and when I finally gave up and tried to manually rotate the active write index, that wrecked things. Logs showed errors from the background process saying it could not index graylog_0 and such, and the index page that should show me things just kept spinning afterwards. I ended up deleting the graylog_0 index after I shut Graylog down, and then just restarted Graylog. Interesting times.

Now I am onto graylog_4, following this guide to try to get things in the POC environment somewhat functional: https://github.com/Graylog2/graylog2-server/issues/5140 . I attempted to get things back in sync and it should be logging events as 03:00:00 … UTC OCT 22, but somehow it's logging things as if they were way in the past. Very strange. Edit - Tried deleting the GELF HTTP log input and re-making it, and that didn’t help either.

How did you ingest the messages?

Do they contain a timestamp? Is the event timestamp in sync with the timestamp that is logged?

False alarm. My app was sending the correct logs by timestamp, it just somehow got like 8+ hours delayed, very odd. Restarted the app doing the logging via HTTP JSON to the GELF endpoint and things caught back up. Sorry for the trouble, was really thinking it had something to do with the fact I had to delete the graylog_0 index manually in elastic and reboot graylog and get it to generate a new index etc. Potentially taking graylog down for a bit may have been what caused the odd backup behavior on my app side, but graylog was not the culprit here :smile: .
