1. Describe your incident:
This all started with some missing logs from our DCs; then I noticed a fairly high error rate across all nodes.
A quick look at the Graylog server log turned up this repeating pattern:
2022-01-20 07:19:12,659 ERROR o.g.s.b.p.DecodingProcessor [processbufferprocessor-4] Unable to decode raw message RawMessage{id=2d56c725-79eb-11ec-a97d-0024e8754cf8, journalOffset=15377474872, codec=gelf, payloadSize=307, timestamp=2022-01-20T12:19:12.658Z, remoteAddress=/10.0.0.14:44116} on input <59b541e99b755d65b77fe8f6>.
2022-01-20 07:19:12,659 ERROR o.g.s.b.p.DecodingProcessor [processbufferprocessor-4] Error processing message RawMessage{id=2d56c725-79eb-11ec-a97d-0024e8754cf8, journalOffset=15377474872, codec=gelf, payloadSize=307, timestamp=2022-01-20T12:19:12.658Z, remoteAddress=/10.0.0.14:44116}
com.fasterxml.jackson.core.JsonParseException: Unexpected character ('?' (code 65533 / 0xfffd)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: (String)"????l'??#??#x?:?&o*mu?:?a?V????C??e?|,?dU??/?,M?l?(_??9y??kRbt?N?^)&????Vf?lu#?A?? &?d?????? ?y??YS?3????;x?(&?Q???7???y??o?u?xWv??AO+l?{???
?I}??Ii?I-:?????? ???9P?f?4 ?G
?x??9?BF????????oM??GF=y???@??g??k????0-?O?????
??"; line: 1, column: 2] ?c??xq??) n?Z?\J??ZA??? ?
The input in question is a TCP input used specifically for Windows Event logs from the DCs. These logs were previously sent to a UDP input, but I switched to TCP to get a better idea of what was happening to the missing logs.
Logs are forwarded from Windows by NXLog, which sends them in GELF format using the TCP output module.
All logs sent to GL pass through a frontend load balancer running haproxy (TCP) and nginx (UDP). This is where I first noticed the RSTs.
A few packet captures later, I found that the RSTs are always initiated by the GL nodes and are roughly evenly distributed across all 3 nodes. I do not see this on any other TCP inputs.
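To rule out the sender, the input could be tested with a hand-built message. The GELF specification describes GELF over TCP as uncompressed, unchunked JSON, with each message terminated by a null byte. The sketch below builds such a frame in Python. The function names, the host name `graylog.example.invalid`, and port 12201 are illustrative assumptions, not values taken from this setup.

```python
import json
import socket
import time


def build_gelf_tcp_frame(host, short_message, **extra):
    """Build one GELF 1.1 message framed for a GELF TCP input."""
    record = {
        "version": "1.1",
        "host": host,
        "short_message": short_message,
        "timestamp": time.time(),
    }
    # Additional GELF fields must be prefixed with an underscore.
    for key, value in extra.items():
        record["_" + key] = value
    # GELF over TCP: plain JSON followed by a single null byte delimiter.
    return json.dumps(record).encode("utf-8") + b"\x00"


def send_gelf_tcp(frame, server, port=12201, timeout=5.0):
    """Send one frame to a GELF TCP input (server and port are assumptions)."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(frame)


if __name__ == "__main__":
    frame = build_gelf_tcp_frame("test-host", "GELF TCP test message",
                                 source_test="manual")
    print(frame)
    # Uncomment and point at a real input to send the message:
    # send_gelf_tcp(frame, "graylog.example.invalid")
```

If a message built this way is accepted while the NXLog traffic still triggers the decode error, that would suggest the problem is in how the sender frames or encodes its payload rather than in the input itself.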
2. Describe your environment:
- OS Information: FreeBSD 12.2-RELEASE
- Package Version: GL 4.0.6, Elastic 6.8.15
- Service logs, configurations, and environment variables:
3. What steps have you already taken to try and solve the problem?
Disabled the local host firewall - no change in behaviour.
Reviewed the sysctl TCP knobs to see if anything is sub-optimal.
Took multiple packet captures to look for anything obvious.
4. How can the community help?
Any ideas on how I can determine the content of the message that is causing the decode error would be helpful.
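One possible approach is to extract the raw TCP payload from a packet capture and inspect its first bytes. The parser error above shows that the payload does not start with valid JSON, and GELF as sent over UDP may be gzip- or zlib-compressed or chunked, each with a recognisable header. The Python sketch below classifies a payload by those headers and decodes it where it can. The function names are hypothetical, and the idea that the payload is compressed is only a guess, not something confirmed by the log.

```python
import gzip
import json
import zlib


def classify_payload(data: bytes) -> str:
    """Guess the encoding of a captured GELF payload from its first bytes."""
    if data[:2] == b"\x1e\x0f":
        return "gelf-chunked"      # GELF chunk magic bytes
    if data[:2] == b"\x1f\x8b":
        return "gzip"              # gzip magic bytes
    # zlib header (RFC 1950): low nibble of the first byte is 8 (deflate)
    # and the first two bytes, read big-endian, are a multiple of 31.
    if (len(data) >= 2 and data[0] & 0x0F == 8
            and int.from_bytes(data[:2], "big") % 31 == 0):
        return "zlib"
    if data.lstrip()[:1] == b"{":
        return "json"              # plain, uncompressed GELF JSON
    return "unknown"


def decode_payload(data: bytes) -> dict:
    """Return the GELF message as a dict, decompressing if needed."""
    kind = classify_payload(data)
    if kind == "gzip":
        raw = gzip.decompress(data)
    elif kind == "zlib":
        raw = zlib.decompress(data)
    elif kind == "json":
        raw = data.rstrip(b"\x00")  # drop the TCP null-byte delimiter
    else:
        raise ValueError("cannot decode payload of kind: " + kind)
    return json.loads(raw.decode("utf-8"))
```

Feeding the captured bytes of one failing message to `classify_payload` would show whether the sender is emitting something other than plain JSON on this connection. A result of `unknown` would point toward something else, such as TLS or truncated data.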