OK, so I figured out why my searches were not returning what I was expecting. I will do my best to boil it down to a few short lines for people to understand (this is how I will describe it to my end users), but the document that really made it all click for me is right here. I would suggest reading that to understand some more of the details.
I’m hoping that people smarter than me can confirm my understanding:
- When a log entry is stored in Elasticsearch, a set of searchable terms is stored alongside that log entry, but the exact log entry itself is NOT DIRECTLY SEARCHABLE.
- To generate the searchable terms, all of the non-alphanumeric characters except periods are removed, and the remaining terms are all lowercased. Note: it’s actually a little more complicated than this, but that is the easiest rule for me to remember.
- For example, this log entry:
10.111.111.111 - - [09/Sep/2017:01:43:37 +0000] "GET /index.html?test123=1111 HTTP/1.1" 200 14 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36"
- Actually gets indexed and searched with these terms:
10.111.111.111 09 sep 2017 01 43 37 0000 get index.html test123 1111 http 1.1 200 14 mozilla 5.0 x11 linux x86_64 applewebkit 537.36 khtml like gecko chrome 60.0.3112.101 safari 537.36
- That means you cannot search for punctuation, brackets, slashes, etc. So in my use case, the following log entries are all identical as far as Elasticsearch searching is concerned: script, <script>, /script/, [script], ;script;, etc. They all just get indexed and searched as the single term script.
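To make the rule above concrete, here is a rough Python simulation of the tokenization as I understand it. To be clear, this is only my approximation: the real Elasticsearch standard analyzer uses Unicode text segmentation and is more nuanced, so treat this as a sketch of the rule, not the actual implementation.

```python
import re

def approximate_standard_analyzer(text):
    """Rough approximation of the rule described above: split on anything
    that is not a letter, digit, period, or underscore, then lowercase.
    (The real standard analyzer is based on Unicode text segmentation,
    so this is only an illustration.)"""
    tokens = re.split(r"[^A-Za-z0-9._]+", text.lower())
    # drop empty pieces and any stray leading/trailing periods
    return [t.strip(".") for t in tokens if t.strip(".")]

# every one of these variants collapses to the same single term
for variant in ["script", "<script>", "/script/", "[script]", ";script;"]:
    print(approximate_standard_analyzer(variant))  # ['script'] each time

print(approximate_standard_analyzer("GET /index.html?test123=1111 HTTP/1.1"))
```

Running it on the example request line produces the same terms I listed above (get, index.html, test123, 1111, http, 1.1), which is what convinced me my mental model is at least close.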
So my questions for people smarter than me:
- Is my understanding above correct?
- Is there a way to do a regex or similar search on the actual stored log message, similar to how I would grep log files on my file system, even if it’s slower?
- Assuming I cannot search the original log message itself, what is the best way for me to store terms with non-alphanumeric characters? Can I store some fields with the default analyzer/tokenizer, while storing other fields with a custom analyzer/tokenizer that leaves things like URIs intact?
- If there is a way to store terms with non-alphanumeric characters intact (in URIs, for example), is it a bad idea for some reason? I can absolutely understand why, in most applications of Lucene, you wouldn’t want to search for non-alphanumeric characters, but in the world of computer log files these characters matter - a lot.
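For what it’s worth, I believe Elasticsearch supports exactly what I’m asking about via multi-fields, where the same source field is indexed both analyzed and as a raw, untouched keyword. Here is a sketch of what I think the mapping and an exact-ish search would look like (the index name logs and the field name message are made up for illustration; this syntax is for newer Elasticsearch versions, and I believe older versions use "type": "string" with "index": "not_analyzed" instead of keyword - someone please correct me if I have this wrong):

```json
PUT /logs
{
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" }
        }
      }
    }
  }
}

GET /logs/_search
{
  "query": {
    "regexp": {
      "message.raw": ".*<script>.*"
    }
  }
}
```

My understanding is that a regexp query on the keyword sub-field runs against the whole original string (and the pattern has to match the entire string, hence the leading and trailing .*), whereas a regexp query on the analyzed field only runs term by term, so it still cannot match across punctuation.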
P.S. I think this needs to be called out explicitly in the Graylog documentation. It’s mentioned on the search page, but it’s glossed over and assumes people know what “analyzed” and “non-analyzed” fields are. Additionally, in all the threads I read from people with search questions, they were just linked to (IMO) confusing documentation that assumed they were familiar with the details of how Elasticsearch queries work, which I think is an unreasonable assumption for most people. I’m hoping my description will be helpful to people, and if it’s not accurate, I’m hoping someone will correct me in a way that is easy for folks to understand.