Searching for a complex string (like a UUID) returns no results

1. Describe your incident:
I’m reading JSON logs from a file using nxlog and sending them via GELF to graylog. The message field could look like this:

Getting authentication state user_uuid=1983dd8e-1a87-4220-9ae9-a1231c64c034

Searches in Graylog sometimes don’t return the expected results. For example, if I search authentication or getting, the aforementioned log line appears, but if I search user_uuid, /user_uuid/, @message:"user_uuid=1983dd8e-1a87-4220-9ae9-a1231c64c034", or anything else that should match that token, nothing appears in the results.

Moreover, if I search for exactly the whole string with @message:"Getting authentication state user_uuid=1980dd8e-1a87-4220-9ae9-a1239c64c0c4", for some reason it works. Changing a single character in that query, say replacing one with a ?, makes the search fail again.

I’ve researched this problem: there’s no error logged in Graylog or Elasticsearch, and there’s apparently no configuration I can change. My hypothesis is that long tokens are not indexed.

Any clue?

2. Describe your environment:
Graylog 4 and 5 in Docker, with elasticsearch-oss:7.10.2.

3. What steps have you already taken to try and solve the problem?
See above

4. How can the community help?
If anybody has any insight into why this happens or how to trace the problem (trace logs don’t seem to reveal much), it’d be extremely helpful; otherwise Graylog has little to no value for my use case.

Thanks a lot

Looks like you are searching against the message: field, which Graylog treats as a text field and does not break out (some inputs will break out SOME parts for you). @kingzacko1 answered something similar here. The short story is that you need to parse the message: field into its constituent parts, and then you can find information that way.

Thanks for the reply. I read something about that, but I don’t always have the same parts in the message; for example, user_uuid is an arbitrary key, and I can’t create a custom processing rule for every case.

Do you think it’s possible to change the way the message is indexed? Also, please note that @message is not message; it’s a field parsed by nxlog when interpreting the JSON log entry. Could it be that it’s not indexed correctly?

You can change the indexing settings for the message field, but we recommend against it. Unless you will only be collecting that one log type, it will result in a very large resource consumption.

Is there any continuity or predictability to the logs? If so, you should be able to write a JSON, GROK or RegEx parsing pipeline without knowing exactly what each message will say ahead of time.
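
For instance, a GROK-based rule roughly like the one below (a sketch only; it assumes the stock UUID Grok pattern has been imported under System → Grok Patterns, and the rule and field names are just illustrative) would pull the UUID into a field of its own:

rule "extract user_uuid"
when
  has_field("message")
then
  // UUID here refers to the standard Grok pattern for UUIDs
  let m = grok(
    pattern: "user_uuid=%{UUID:user_uuid}",
    value: to_string($message.message),
    only_named_captures: true
  );
  set_fields(m);
end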

Another option might be to take it in as JSON and parse that in Graylog instead of NXLog.
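
As a sketch of that second option (assuming the lines shipped to Graylog really are whole JSON objects; to_map() needs a reasonably recent Graylog), a rule along these lines would flatten each JSON key into its own field:

rule "parse json message"
when
  is_json(parse_json(to_string($message.message)))
then
  // turn every top-level JSON key into its own Graylog field
  let parsed = parse_json(to_string($message.message));
  set_fields(to_map(parsed));
end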

1 Like

Hey @john2893

I have done something similar with UUIDs in Graylog, just to see if I could do it. :wink:
I created a regex extractor and attached it to a lookup table.

maybe a better explanation here.

Here is the widget I created from all that.

It might be an idea to match key-value pairs by looking for the “=” sign and no spaces. Is it possible to define arbitrary keys based on a regex match? Apologies if it’s a noob question; I’m having a hard time understanding Graylog’s structure despite the good documentation.

There is a long and a short answer to your question @john2893. The short answer is that the key value function will only work if the entire field is key=value pairs with some kind of separator between them. Even a space will do, but it has to be consistent or the KV parser will fail. It can’t be mixed.

So,

field_name=value, field_name=value, field_name=value, field_name=value

works with the key value function.

But,

field_name=value Getting authentication state user_uuid=1983dd8e-1a87-4220-9ae9-a1231c64c03 field_name=value

would not work, since Getting authentication state is not part of a key/value pair.
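
For reference, such a key/value rule would look roughly like the sketch below (a space between pairs and = between key and value are assumed; the rule name is made up):

rule "kv parse message"
when
  has_field("message")
then
  // " " separates the pairs, "=" separates key from value
  let kv = key_value(
    value: to_string($message.message),
    delimiters: " ",
    kv_delimiters: "="
  );
  set_fields(kv);
end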

Though it’s a hack, if the message field is always something short like your example above, you could simply copy the contents of the message field (to_string($message.message)) to a new field, which would be indexed by default. If you do this, be sure to restrict the pipeline to only the messages in question using a very specific stream rule. If you don’t, it will copy every single message that comes in, doubling your ingestion.
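
As a rough sketch of that hack (the target field name and the guard are made up; keep the stream the pipeline is attached to as narrow as possible):

rule "copy message for searching"
when
  contains(value: to_string($message.message), search: "user_uuid=")
then
  // duplicate the raw message text into a field of its own
  set_field(field: "message_copy", value: to_string($message.message));
end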

All that said, if you can fix your parsing on NXLog, that would be the cleanest way to handle it. Then Graylog gets properly parsed info and will store that information in fields you can address directly.

1 Like

In this instance you could use regex_replace() in a pipeline rule to remove the portion of the message that messes things up (if it’s consistent enough) and then apply the key/value function to the remaining data… Just sayin…
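
Something along these lines, for instance (the pattern is only illustrative and assumes the free-text prefix is consistent):

rule "strip prefix then key_value"
when
  has_field("message")
then
  // drop the leading free text so only key=value pairs remain
  let cleaned = regex_replace(
    pattern: "^Getting authentication state\\s*",
    value: to_string($message.message),
    replacement: ""
  );
  let kv = key_value(value: cleaned, delimiters: " ", kv_delimiters: "=");
  set_fields(kv);
end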

Definitely!

Thanks for all the help, I’ve been thinking on a possible solution that requires minimal effort and isn’t a headache to maintain.

  • Parsing out text and then interpreting key/value pairs seems complex and error-prone, but it’s worth a try; I’ll keep it as a last resort
  • Since I wrote the application server, I can change the way key/value pairs are logged; maybe there’s a better way of adding context to JSON logs than writing those values into the message
  • @chris.black-gl I’m parsing the JSON on the nxlog side (which then sends it via GELF); do you think making Graylog do the parsing would be more elastic? (no pun intended)
  • It’s just so weird that the rest of the message works. Is this problem related to punctuation? Token length? What are the rules, in case something similar happens again? The documentation doesn’t seem to say much, or it redirects to the Elasticsearch docs, which are overly verbose.

Cheers

To debug the issue, I tried spinning up a standalone Elasticsearch instance:

docker run --name es01 -p 9200:9200 -it docker.elastic.co/elasticsearch/elasticsearch:8.5.3

And created an index with a custom analyzer:

curl -X "PUT" "https://localhost:9200/my_index2" \
     -H 'Content-Type: application/json; charset=utf-8' \
     -u 'elastic:FxGJegz+t3IiAtJsY3dV' \
     -d $'{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 255
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      }
    }
  }
}'

Then using the _analyze endpoint I checked which tokens were created:

curl -X "POST" "https://localhost:9200/my_index2/_analyze" \
     -H 'Content-Type: application/json; charset=utf-8' \
     -u 'elastic:FxGJegz+t3IiAtJsY4dV' \
     -d $'{
  "text": "Renewed token while checking user_uuid=aa7f0442-7706-4a6c-a840-893246a676af new_token=02b58ffd-2fca-41f2-b0da-d29911907a3a",
  "analyzer": "my_analyzer"
}'

To my surprise, the UUID tokens are there! They get split, but user_uuid is present (and doesn’t yield any result in Graylog), and the same goes for the pieces of the UUID:

{"token":"checking","start_offset":20,"end_offset":28,"type":"<ALPHANUM>","position":3},{"token":"user_uuid","start_offset":29,"end_offset":38,"type":"<ALPHANUM>","position":4},{"token":"aa7f0442","start_offset":39,"end_offset":47,"type":"<ALPHANUM>","position":5},{"token":"7706","start_offset":48,"end_offset":52,"type":"<NUM>","position":6},

Bumping (just one time): any idea why the indexing seems to work but the search doesn’t? I’d be okay with the “very large resource consumption” implied by indexing the message differently; otherwise Graylog is almost useless for the debugging queries we rely on. Is there a tested guide for that?

Thanks :slight_smile:

Hey,

Just a guess: perhaps the ES version you’re using, elasticsearch:8.5.3, does not work. The supported version is elasticsearch:7.10.x.

The UUIDs are in MongoDB; you could extract them and send them to Graylog.
Below is an example of exporting data from an input, depending on which “collection” you would like:

Graylog_Collections
auth_service_backends
cluster_config
cluster_events
collector_configurations
collectors
content_packs
content_packs_installations
dashboards
decorators
event_definitions
event_notification_status
event_notifications
event_processor_state
export_jobs
forwarder_input_profiles
forwarder_input_states
forwarders
grants
graylog
grok_patterns
index_failures
index_field_types
index_ranges
index_sets
inputs
ldap_settings
licenses
lut_caches
lut_data_adapters
lut_mongodb_data_adapter
lut_tables
nodes
notifications
opensearch_anomaly_detectors
outputs
pipeline_processor_pipelines
pipeline_processor_pipelines_streams
pipeline_processor_rules
processing_status
roles
saved_searches
scheduler_job_definitions
scheduler_triggers
searches
security_views
sessions
sidecar_collector_actions
sidecar_collectors
sidecar_configuration_variables
sidecar_configurations
sidecars
streamrules
streams
system.profile
system_messages
team_sync_backend_configs
teams
traffic
users
view_sharings
views
mongoexport  -u mongo_admin -p  password --collection=inputs --db=graylog --pretty --out=inputs.json

It seems as though you are looking to feed blocks of information in JSON format into Graylog and search against them, like a regex front end. Graylog isn’t really built for that. It is built around processing a message into its constituent parts (fields) and then storing and searching against those fields. Graylog can do some searching of data inside fields, but it wasn’t really designed for that.

If you are writing and formatting your own logs, you should be able to build in enough consistency to extract fields. Perhaps I am missing something? Can you show how your data looks in Graylog after it is received/stored?

1 Like

If you wrote the application server, why not output GELF directly and skip the parsing headache? I am not a dev, nor do I play one on TV, but I know there are modules out there that allow GELF output. Take a look at GELF; it’s essentially JSON itself. You should be able to use it for the same thing.

The golden rule of logging is that the closer the solution is to the source of the message, the better. If you can fix the source, all the secondary issues vanish in a puff of formatting.
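
For illustration, a GELF payload is just JSON; assuming a GELF HTTP input listening on the default port 12201 (the hostname below is a placeholder), a message with the UUID carried as its own field could be sent like this:

curl -X "POST" "http://graylog.example.com:12201/gelf" \
     -H 'Content-Type: application/json; charset=utf-8' \
     -d $'{
  "version": "1.1",
  "host": "app-server-01",
  "short_message": "Getting authentication state",
  "level": 6,
  "_user_uuid": "1983dd8e-1a87-4220-9ae9-a1231c64c034"
}'

The underscore-prefixed _user_uuid then shows up in Graylog as a user_uuid field you can search directly.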

2 Likes

You know what, that could actually be the smartest idea. Parsing log files is already a hassle, and having one less moving part sounds enticing. That way I can pass the k/v data, which I’m already separating.

@tmacgbay I can understand why you say that; it’s funny, though, that from a product perspective I can run (slow but practical) queries on a SQL database that I apparently can’t run against this backend. Extracting fields is an excellent idea, but sometimes it’s just impractical because somebody else is writing the logs…

Thanks all for the answers, it helps to get a better understanding.

1 Like

Update: I found something suspicious: the message is stored truncated. No idea if that’s the cause of the missed indexing, but it fits the hypothesis well.
I’m going to switch to GELF and see how it works. Hope this helps others in a similar situation.


Yup, after switching to GELF everything works, even when the UUID is written in the message instead of as k/v pairs. This is a lot better, faster, and less error-prone than using log file readers. Thank you all for the suggestions.

1 Like

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.