First guess is that the field was picked up by Elasticsearch as type keyword (string) when you wanted it to be a long. A simple solution may be to rotate the index so that it gets redefined as numeric on the next ingested message. It seems unlikely to be needed, but you could also create a custom mapping to enforce a numeric type (long, for instance).
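If you do go the custom mapping route, here is a rough sketch of a legacy index template that forces a field to long (this assumes the default graylog_* index prefix, Elasticsearch/OpenSearch on localhost:9200, and a hypothetical field name; it only takes effect when the next index is created):

curl -X PUT "http://localhost:9200/_template/graylog-custom-mapping?pretty" -H 'Content-Type: application/json' -d '{
  "index_patterns": ["graylog_*"],
  "order": 0,
  "mappings": {
    "properties": {
      "my_numeric_field": { "type": "long" }
    }
  }
}'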
Not sure that has anything to do with my issue. I have a .NET app sending in GELF-formatted messages, and I am pretty sure the library doing the GELF formatting is sending the timestamp as a string instead of a number. I have updated my appender to use <param name="SendTimeStampAsString" value="false"/> and will see if that prevents the warning.
EDIT: Re-reading the first post, the error is happening on the input receiving the message? If so, you can ignore all of the below because that is after the fact (leaving it because I spent some good time writing it). Perhaps it is an extractor on the Input that thinks it is a string? It should come in as a number… here is a post that is semi-relevant and also has a pipeline rule to change the epoch to something more helpful.
I believe what is going on here is that Graylog receives the message and gives it a Graylog-defined message ID like 8ef9d1c0-27c9-11ed-8e61-0ee8466ead25, but when Graylog goes to store it in Elasticsearch the type is different. So Graylog sees the GELF timestamp as numeric or date or something like that and tries to send it to Elasticsearch as such. Elasticsearch has that field mapped as keyword (string) and then likely rejects the whole message. Assuming your GELF-formatted messages are going to the correct Graylog GELF Input?
When you send data to Elasticsearch from Graylog, Elasticsearch makes assumptions about the type of data coming in, and from what I see it defaults to keyword. The thing is, it only makes this assumption once, when the index is first created… so if it guesses keyword at index start, that field stays keyword and messages will be rejected if they don't match. I think newer versions (which are incompatible with Graylog) care a lot less about this. (You can always go to OpenSearch… which is the future of Graylog.)
So how did this happen? Usually it's because, as you were starting to pull in data, the fields were off or something wasn't explicitly defined as numeric initially, so Elasticsearch found/guessed a keyword and is sticking with it until the index is rotated. Maybe your input wasn't GELF initially, or maybe…
What can you do?
Rotate the index by going to System/Indices, clicking on the index set you should be receiving the GELF messages on, clicking on "Maintenance" in the upper right, and choosing "Rotate active write index". This will make Graylog tell Elasticsearch to close the current index and start a new one… at which point Elasticsearch will re-evaluate each field that comes in, and "timestamp" should be evaluated correctly. (Be aware of what your rotation/retention strategy is when you do this.)
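If you want to confirm the rotation actually created a fresh index, something like this against Elasticsearch/OpenSearch should show it (assuming it listens on localhost:9200 and you kept the default graylog_* prefix):

curl -s "http://localhost:9200/_cat/indices/graylog_*?v&s=index"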
You could create a custom mapping in Elasticsearch… you shouldn't need to do this though; by default the timestamp should be stored correctly GELF to GELF, etc. I did a post a while back about custom mappings and correcting historical information… here, though, if you are not getting the message stored, there is no historical information to correct. Still, there are curl commands in there to look at what is being stored in Elasticsearch, so you can modify them to your index names and the timestamp field to get more detail on what is going on there.
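I don't have that post in front of me, but as a rough example of the kind of curl I mean, you can check how the current write index actually mapped a given field (this assumes the default graylog_deflector write alias and Elasticsearch/OpenSearch on localhost:9200; substitute your own index name and field):

curl -s "http://localhost:9200/graylog_deflector/_mapping/field/timestamp?pretty"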
You didn't mention anything about this, but it is possible that an extractor or pipeline rule could be setting up the fields incorrectly.
GELF is nothing new to Graylog, and neither are timestamps… it's just a matter of tracking down where the processing isn't meeting expectations… don't lose the faith!
so what’s the point of that message ID field??? how am I supposed to use it to debug anything???
I am getting GELF messages from my hosts… many messages. So that tells me the input is set up correctly. I sure would hope that if my message is getting dropped I would get more than a WARN-level message about the format of the timestamp.
How can I see the raw message before GL tries to send it to OpenSearch (which, by the way, is what I am using)?
Edited the above message. Graylog usually shows errors like this on the System/Overview page. It would be better if there were alerts… which is likely possible, but that's a different question.
Per the edit - looks like it's an issue at the input, with a particular host? Maybe with an extractor… are you using them? Or with the host sending in the wrong type? You could use tcpdump/Wireshark to see the data coming in…
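For example, something along these lines on the Graylog node would show the raw payloads as they arrive (assuming your GELF input listens on port 12201; narrow it further with "and host <sender-ip>" if you suspect a particular machine, and note that GELF over UDP is often compressed, so a TCP input is easier to read this way):

tcpdump -i any -A -n port 12201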
My cluster has been up for weeks and rotation is daily, so it has rolled indices many times, and this warning accounts for 99% of all Graylog-generated log messages. Makes debugging real issues impossible.
Again I'll ask: how/where can I use that message ID to see the actual message at the time it hits GL?
As you can see, there are multiple posts here in the forum on this issue. As @tmacgbay said, troubleshooting this with Graylog Open may take a bit to find the culprit.
Graylog Enterprise/Operations does have a feature that helps track these failures down.
So yeah, it can be done, and so long as you're under 2 GB a day it's free.
Example: marked in the red box is the node that has failed, so I know what device it is.
I know of a couple of ways to solve this, and how long it will take depends on how big an environment you have.
Can I ask what you have done besides rotating your indices?
Since MongoDB holds all the metadata, have you checked MongoDB and/or set MongoDB up for debugging? You may be able to find which source it is, but I'm not 100% sure. My workaround was sending new devices to a Raw/Plaintext input to see if any came through.
I was looking into this, not sure if it will help but I accessed Mongo.
Steps I took for GL Open 4.3:
Log on to the Mongo shell:
root# mongo
> use graylog
> show collections
> db.index_failures.find();
Results:
MongoDB shell version v4.4.16
connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("f4fee408-5236-4ece-ab65-353da23cba6b") }
MongoDB server version: 4.4.16
---
The server generated these startup warnings when booting:
2022-08-19T20:43:44.375-05:00: You are running on a NUMA machine. We suggest launching mongod like this to avoid performance problems: numactl --interleave=all mongod [other options]
---
> use graylog
> show collections
> db.index_failures.find();
{ "_id" : ObjectId("62ad2ae3777f8d5ed9cbfb1f"), "index" : "linuxserver_11", "letter_id" : "d26ee42b-eea5-11ec-8588-00155d601d11", "type" : "indexing", "message" : "ElasticsearchException[Elasticsearch exception [type=mapper_parsing_exception, reason=failed to parse field [graylog] of type [boolean] in document with id 'd26ee42b-eea5-11ec-8588-00155d601d11'. Preview of field's value: '{}']]; nested: ElasticsearchException[Elasticsearch exception [type=json_parse_exception, reason=Current token (START_OBJECT) not of boolean type\n at [Source: (byte)"eption[Elasticsearch exception [type=mapper_parsing_exception, reason=failed to parse field [graylog] of type [boolean] in document with id 'b34daeff-eea5-11ec-8588-00155d601d11'. Preview of field's value: '{}']]; nested: ElasticsearchException[Elasticsearch exception [type=json_parse_exception\"[truncated 523788 bytes]; line: 1, column: 854]]];]","gl2_source_input":"5e265ada83d72ec570ab5fe2","EventReceivedTime":"2022-06-17 20:27:04","SourceModuleName":"graylog","graylog":{},"gl2_source_node":"8"[truncated 523788 bytes]; line: 1, column: 893]]];", "timestamp" : ISODate("2022-06-18T01:27:04Z") }
Type “it” for more
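If that collection is large, you can narrow it down to the most recent failures from the same shell (sort/limit/pretty are standard mongo shell helpers):

> db.index_failures.find().sort({ "timestamp": -1 }).limit(5).pretty()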
This leads me to think the sender is just not sending a well-formatted GELF message. GL is getting the message and "dealing with it", but dropping a warning in the logs in an attempt at being "nice". Problem is, the details in that warning don't help you actually figure out what's happening.
I have Java apps and .NET apps sending GELF messages to the same input, and all the Java ones get processed without issue. Every warning generated is coming from the .NET apps. I did notice the versions of log4net and gelf4net are very out of date… so I'm looking into that.
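One way I may try to test that theory (a sketch, assuming a GELF UDP input listening on localhost:12201): hand-craft two GELF messages with netcat, one with a numeric timestamp and one with a string timestamp, and see which one triggers the warning.

echo -n '{"version":"1.1","host":"gelf-test","short_message":"numeric ts","timestamp":1660000000.0}' | nc -u -w1 localhost 12201
echo -n '{"version":"1.1","host":"gelf-test","short_message":"string ts","timestamp":"1660000000.0"}' | nc -u -w1 localhost 12201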
In the meantime, I still would like more clarification on what actually happened. Like, was the message received by the input and actually indexed? In other words, are the warning messages "paired" with each of my indexed messages?
I understand, BUT if a message is not being indexed because of a timestamp issue, then this becomes an "indexing failure".
Yes, I understand that also, but to find the device that is sending the messages creating those logs, what we're doing is suggesting where to look and offering different ideas you can try to solve this issue.
I believe we answered your question already: you have a timestamp issue. This means that you have a device sending logs to your GELF input that are incompatible.
Good question; chances are it may have been dropped. What needs to happen is to find the device or devices creating this issue. We covered that in the posts above.
Sorry @jasonrberk, we can only offer suggestions. @tmacgbay and I offered suggestions on what and where to look; there's not much more we can do from here. My apologies. Maybe someone else here has a better idea.
The messages were dropped at the Input because the incoming message had a string where a numeric was expected. A message has to pass through all of Graylog before it is pushed back out to Elasticsearch and potentially get an index error.
Your messages are not making it into Graylog, so you can't get an index error like the whole "ignore that" section was talking about.