Searching for a complex string (like a UUID) returns no results

1. Describe your incident:
I’m reading JSON logs from a file using nxlog and sending them via GELF to graylog. The message field could look like this:

Getting authentication state user_uuid=1983dd8e-1a87-4220-9ae9-a1231c64c034

Searches in Graylog sometimes don’t return the expected results. For example, if I search authentication or getting, the aforementioned log line appears, but if I search user_uuid, /user_uuid/, @message:"user_uuid=1983dd8e-1a87-4220-9ae9-a1231c64c034", or anything else that should match that token, nothing appears in the results.

Moreover, if I search for exactly the whole string with @message:"Getting authentication state user_uuid=1980dd8e-1a87-4220-9ae9-a1239c64c0c4", for some reason it works. Changing a single character in that query, say replacing one with a ?, makes the search fail again.

I’ve researched this problem: there’s no error logged in Graylog or Elasticsearch, and there’s apparently no configuration I can change. My hypothesis is that long tokens are not indexed.

Any clue?

2. Describe your environment:
Graylog 4 and 5 in Docker, with elasticsearch-oss:7.10.2.

3. What steps have you already taken to try and solve the problem?
See above

4. How can the community help?
If anybody has any insight into why this happens or how to trace the problem (trace logs don’t seem to reveal much), it’d be extremely helpful; otherwise Graylog has little to no value for my use case.

Thanks a lot

Looks like you are searching against the message: field, which Graylog treats as a text field and does not break out (some inputs will break out SOME parts for you). @kingzacko1 answered something similar here. The short story is that you need to parse the message: field into its constituent parts, and then you can find information that way.

Thanks for the reply. I read something about that, but I don’t always have the same parts in the message; for example, user_uuid is an arbitrary key, and I can’t create a custom processing rule for every case.

Do you think it’s possible to change the way the message is indexed? Also, please note that @message is not message; it’s a field parsed by nxlog when interpreting the JSON log entry. Could it be that it’s not indexed correctly?

You can change the indexing settings for the message field, but we recommend against it. Unless you will only be collecting that one log type, it will result in a very large resource consumption.

Is there any continuity or predictability to the logs? If so, you should be able to write a JSON, GROK or RegEx parsing pipeline without knowing exactly what each message will say ahead of time.
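
For instance, a GROK-based rule roughly like the one below (a sketch only; it assumes the stock UUID Grok pattern has been imported under System → Grok Patterns, and the rule and field names are just illustrative) would pull the UUID into a field of its own:

rule "extract user_uuid"
when
  has_field("message")
then
  // UUID here refers to the standard Grok pattern for UUIDs
  let m = grok(
    pattern: "user_uuid=%{UUID:user_uuid}",
    value: to_string($message.message),
    only_named_captures: true
  );
  set_fields(m);
end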

Another option might be to take it in as JSON and parse that in Graylog instead of NXLog.
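
As a sketch of that second option (assuming the lines shipped to Graylog really are whole JSON objects; to_map() needs a reasonably recent Graylog), a rule along these lines would flatten each JSON key into its own field:

rule "parse json message"
when
  is_json(parse_json(to_string($message.message)))
then
  // turn every top-level JSON key into its own Graylog field
  let parsed = parse_json(to_string($message.message));
  set_fields(to_map(parsed));
end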

1 Like

Hey @john2893

I have done something similar with UUIDs in Graylog, just to see if I could do it. :wink:
I created a regex extractor and attached it to a lookup table.

maybe a better explanation here.

Here is the widget I created from all that.

It might be an idea to match key-value pairs by looking for the “=” sign and no spaces. Is it possible to define arbitrary keys based on a regex match? Apologies if it’s a noob question; I’m having a hard time understanding Graylog’s structure despite the good documentation.

There is a long and a short answer to your question @john2893. The short answer is that the key value function will only work if the entire field is key=value pairs with some kind of separator between them. Even a space will do, but it has to be consistent or the KV parser will fail. It can’t be mixed.

So,

field_name=value, field_name=value, field_name=value, field_name=value

works with the key value function.

But,

field_name=value Getting authentication state user_uuid=1983dd8e-1a87-4220-9ae9-a1231c64c03 field_name=value

would not work, since Getting authentication state is not part of a key/value pair.
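
For reference, such a key/value rule would look roughly like the sketch below (a space between pairs and = between key and value are assumed; the rule name is made up):

rule "kv parse message"
when
  has_field("message")
then
  // " " separates the pairs, "=" separates key from value
  let kv = key_value(
    value: to_string($message.message),
    delimiters: " ",
    kv_delimiters: "="
  );
  set_fields(kv);
end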

Though it’s a hack, if the message field is always something short like your example above, you could simply copy the contents of the message field (to_string($message.message)) to a new field, which would be indexed by default. If you do this, be sure to restrict the pipeline to only the messages in question using a very specific stream rule. If you don’t, it will copy every single message that comes in, doubling your ingestion.
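
As a rough sketch of that hack (the target field name and the guard are made up; keep the stream the pipeline is attached to as narrow as possible):

rule "copy message for searching"
when
  contains(value: to_string($message.message), search: "user_uuid=")
then
  // duplicate the raw message text into a field of its own
  set_field(field: "message_copy", value: to_string($message.message));
end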

All that said, if you can fix your parsing on NXLog, that would be the cleanest way to handle it. Then Graylog gets properly parsed info and will store that information in fields you can address directly.

1 Like

In this instance you could use regex_replace() in a pipeline rule to remove the portion of the message that messes things up (if it’s consistent enough) and then apply the key/value function to the remaining data… Just sayin…
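
Something along these lines, for instance (the pattern is only illustrative and assumes the free-text prefix is consistent):

rule "strip prefix then key_value"
when
  has_field("message")
then
  // drop the leading free text so only key=value pairs remain
  let cleaned = regex_replace(
    pattern: "^Getting authentication state\\s*",
    value: to_string($message.message),
    replacement: ""
  );
  let kv = key_value(value: cleaned, delimiters: " ", kv_delimiters: "=");
  set_fields(kv);
end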

Definitely!

Thanks for all the help, I’ve been thinking on a possible solution that requires minimal effort and isn’t a headache to maintain.

  • Parsing out text and then interpreting key/value pairs seems complex and error-prone, but it’s worth a try; I’ll keep it as a last resort
  • Since I wrote the application server, I can change the way key/value pairs are logged; maybe there’s a better way of adding context to JSON logs than writing those values into the message
  • @chris.black-gl I’m parsing the JSON on the nxlog side (which then sends it via GELF); do you think making Graylog do the parsing would be more elastic? (no pun intended)
  • It’s just so weird that the rest of the message works. Is this problem related to punctuation? Token length? What are the rules, in case something similar happens again? The documentation doesn’t seem to say much, or it redirects to the Elasticsearch docs, which are overly verbose.

Cheers

To debug the issue, I tried spinning up a standalone Elasticsearch instance:

docker run --name es01 -p 9200:9200 -it docker.elastic.co/elasticsearch/elasticsearch:8.5.3

And created an index with a custom analyzer:

curl -X "PUT" "https://localhost:9200/my_index2" \
     -H 'Content-Type: application/json; charset=utf-8' \
     -u 'elastic:FxGJegz+t3IiAtJsY3dV' \
     -d $'{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 255
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      }
    }
  }
}'

Then using the _analyze endpoint I checked which tokens were created:

curl -X "POST" "https://localhost:9200/my_index2/_analyze" \
     -H 'Content-Type: application/json; charset=utf-8' \
     -u 'elastic:FxGJegz+t3IiAtJsY4dV' \
     -d $'{
  "text": "Renewed token while checking user_uuid=aa7f0442-7706-4a6c-a840-893246a676af new_token=02b58ffd-2fca-41f2-b0da-d29911907a3a",
  "analyzer": "my_analyzer"
}'

To my surprise, the UUID tokens are there! They get split, but user_uuid is present (and doesn’t yield any result in Graylog), and the same goes for the pieces of the UUID:

{"token":"checking","start_offset":20,"end_offset":28,"type":"<ALPHANUM>","position":3},{"token":"user_uuid","start_offset":29,"end_offset":38,"type":"<ALPHANUM>","position":4},{"token":"aa7f0442","start_offset":39,"end_offset":47,"type":"<ALPHANUM>","position":5},{"token":"7706","start_offset":48,"end_offset":52,"type":"<NUM>","position":6},

Bumping (just one time): any idea why the indexing seems to work but the search doesn’t? I’d be okay with the “very large resource consumption” implied by indexing the message differently; otherwise Graylog is almost useless for the debugging queries we rely on. Is there a tested guide for that?

Thanks :slight_smile:

Hey,

Just a guess: perhaps the ES version you’re using, elasticsearch:8.5.3, does not work. The supported version is elasticsearch:7.10.x.

The UUIDs are in MongoDB; you could extract them and send them to Graylog.
Below is an example of exporting data from an input, depending on which “collection” you would like:

Graylog_Collections
auth_service_backends
cluster_config
cluster_events
collector_configurations
collectors
content_packs
content_packs_installations
dashboards
decorators
event_definitions
event_notification_status
event_notifications
event_processor_state
export_jobs
forwarder_input_profiles
forwarder_input_states
forwarders
grants
graylog
grok_patterns
index_failures
index_field_types
index_ranges
index_sets
inputs
ldap_settings
licenses
lut_caches
lut_data_adapters
lut_mongodb_data_adapter
lut_tables
nodes
notifications
opensearch_anomaly_detectors
outputs
pipeline_processor_pipelines
pipeline_processor_pipelines_streams
pipeline_processor_rules
processing_status
roles
saved_searches
scheduler_job_definitions
scheduler_triggers
searches
security_views
sessions
sidecar_collector_actions
sidecar_collectors
sidecar_configuration_variables
sidecar_configurations
sidecars
streamrules
streams
system.profile
system_messages
team_sync_backend_configs
teams
traffic
users
view_sharings
views
mongoexport  -u mongo_admin -p  password --collection=inputs --db=graylog --pretty --out=inputs.json

It seems as though you are looking to feed blocks of information in JSON format into Graylog and search against them, like a regex front end. Graylog isn’t really built for that. It is built around processing a message into its constituent parts (fields) and then storing and searching against those fields. Graylog can do some searching of data inside fields, but it wasn’t really designed for that.

If you are writing and formatting your own logs, you should be able to build in enough consistency to extract fields. Perhaps I am missing something? Can you show how your data looks in Graylog after it is received/stored?

1 Like

If you wrote the application server, why not output GELF directly and skip the parsing headache? I am not a dev, nor do I play one on TV, but I know there are modules out there that allow GELF output. Take a look at GELF; it’s essentially JSON itself. You should be able to use it for the same thing.

The golden rule of logging is that the closer the solution is to the source of the message, the better. If you can fix the source, all the secondary issues vanish in a puff of formatting.
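
For illustration, a GELF payload is just JSON; assuming a GELF HTTP input listening on the default port 12201 (the hostname below is a placeholder), a message with the UUID carried as its own field could be sent like this:

curl -X "POST" "http://graylog.example.com:12201/gelf" \
     -H 'Content-Type: application/json; charset=utf-8' \
     -d $'{
  "version": "1.1",
  "host": "app-server-01",
  "short_message": "Getting authentication state",
  "level": 6,
  "_user_uuid": "1983dd8e-1a87-4220-9ae9-a1231c64c034"
}'

The underscore-prefixed _user_uuid then shows up in Graylog as a user_uuid field you can search directly.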

2 Likes

You know what, that could actually be the smartest idea. Parsing log files is already a hassle, and having one less moving part sounds enticing. That way I can pass the k/v data, which I’m already separating.

@tmacgbay I can understand why you say that; it’s funny, though, that from a product perspective I can run (slow but practical) queries on a SQL database that I apparently can’t run against this backend. Extracting fields is an excellent idea, but sometimes it’s just impractical because somebody else is writing the logs…

Thanks all for the answers, it helps to get a better understanding.

1 Like

Update: I found something suspicious: the message is stored truncated. No idea if that’s the cause of the missed indexing, but it fits the hypothesis well.
I’m going to switch to GELF and see how it works. Hope this helps others in a similar situation.


Yup, after switching to GELF everything works, even when the UUID is written in the message instead of as k/v pairs. This is a lot better, faster, and less error-prone than using log file readers. Thank you all for the suggestions.

1 Like

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.