High Error Rate and TCP RSTs (oh my!)


1. Describe your incident:

This all started with some missing logs from our DCs; then I noticed a fairly high error rate across all nodes.

[image]

A quick visit to the Graylog log (log?) - netted this repeating pattern:

2022-01-20 07:19:12,659 ERROR o.g.s.b.p.DecodingProcessor [processbufferprocessor-4] Unable to decode raw message RawMessage{id=2d56c725-79eb-11ec-a97d-0024e8754cf8, journalOffset=15377474872, codec=gelf, payloadSize=307, timestamp=2022-01-20T12:19:12.658Z, remoteAddress=/10.0.0.14:44116} on input <59b541e99b755d65b77fe8f6>.
2022-01-20 07:19:12,659 ERROR o.g.s.b.p.DecodingProcessor [processbufferprocessor-4] Error processing message RawMessage{id=2d56c725-79eb-11ec-a97d-0024e8754cf8, journalOffset=15377474872, codec=gelf, payloadSize=307, timestamp=2022-01-20T12:19:12.658Z, remoteAddress=/10.0.0.14:44116}
com.fasterxml.jackson.core.JsonParseException: Unexpected character ('?' (code 65533 / 0xfffd)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
 at [Source: (String)"????l'??#??#x?:?&o*mu?:?a?V????C??e?|,?dU??/?,M?l?(_??9y??kRbt?N?^)&????Vf?lu#?A?? &?d?????? ?y??YS?3????;x?(&?Q???7???y??o?u?xWv??AO+l?{???
?I}??Ii?I-:?????? ???9P?f?4 ?G
                              ?x??9?BF????????oM??GF=y???@??g??k????0-?O?????
??"; line: 1, column: 2]                                                     ?c??xq??) n?Z?\J??ZA??? ?

Input in question is a TCP input for Windows Event logs specifically from DCs. They were previously sent to a UDP input, but I wanted to be able to get a better idea of what was happening to the missing logs.

Logs are forwarded from Windows by NXLog sending logs in GELF format using the TCP output module.

All logs sent to GL are passed through a frontend load-balancer running haproxy(tcp) and nginx(udp) - this is where I first noticed the RSTs:

[image]

A few packet captures later and I found the RSTs are always initiated from the GL nodes and roughly evenly distributed across all 3 nodes. I do not see this on any other TCP inputs.

2. Describe your environment:

  • OS Information:
    FreeBSD 12.2-RELEASE

  • Package Version:

GL 4.0.6
Elastic: 6.8.15

  • Service logs, configurations, and environment variables:

3. What steps have you already taken to try and solve the problem?

disabled local host firewall - no change in behaviour
review sysctl tcp knobs to see if anything is sub-optimal
multiple packet captures to review for anything obvious

4. How can the community help?

Any ideas on how I can determine the content of the message that is causing the decode error would be helpful.
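
One possible way to approach this, as a rough sketch only: pull the raw payload bytes out of a packet capture and look at the first bytes before trying to parse them as JSON. The helper name `describe_payload` and the sample message are hypothetical. The magic bytes checked are the usual ones for chunked GELF (0x1e 0x0f), gzip (0x1f 0x8b), and zlib (0x78). A payload cut mid-stream by TCP segmentation may not decompress cleanly.

```python
import gzip
import json
import zlib


def describe_payload(data: bytes) -> str:
    """Guess how a captured GELF payload is encoded and return a readable summary."""
    if data[:2] == b"\x1e\x0f":
        # Magic bytes of a chunked GELF datagram, which only makes sense over UDP.
        return "GELF chunk: reassemble the chunks before decoding"
    if data[:2] == b"\x1f\x8b":
        try:
            return "gzip: " + gzip.decompress(data).decode("utf-8", errors="replace")
        except (OSError, EOFError, zlib.error):
            return "gzip header, but the data is incomplete or corrupt"
    if data[:1] == b"\x78":
        try:
            return "zlib: " + zlib.decompress(data).decode("utf-8", errors="replace")
        except zlib.error:
            return "zlib-like header, but the data is incomplete or corrupt"
    try:
        # GELF over TCP is uncompressed JSON terminated by a null byte.
        text = data.rstrip(b"\x00").decode("utf-8")
        json.loads(text)
        return "plain JSON: " + text
    except (UnicodeDecodeError, json.JSONDecodeError):
        return "unknown: first bytes " + data[:8].hex()


# Hypothetical sample message, used only to exercise the helper.
sample = {"version": "1.1", "host": "dc-101", "short_message": "hello"}
raw = json.dumps(sample).encode("utf-8")

print(describe_payload(zlib.compress(raw)))
print(describe_payload(raw + b"\x00"))
```

If the payload on a TCP input turns out to be compressed or chunked, the sender is most likely emitting the UDP flavour of GELF.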


Hello @chavez243ca

Can you show your NXLog config?
Can you show your configuration for your INPUT?

Next, does the output format in NXLog match your Graylog GELF TCP input?
For example, I also have multiple AD DCs in my environment using NXLog.
Here is my configuration; perhaps it will help.

nxlog_config
define ROOT C:\Program Files (x86)\nxlog
define CERTDIR %ROOT%\cert
define LOGFILE C:\Program Files (x86)\nxlog\data\nxlog.log


Moduledir %ROOT%\modules
CacheDir %ROOT%\data
Pidfile %ROOT%\data\nxlog.pid
SpoolDir %ROOT%\data
LogFile %ROOT%\data\nxlog.log

<Extension _fileop>
    Module xm_fileop
    # Check the log file size every hour and rotate if larger than 5 MB
    <Schedule>
        Every 1 hour
        <Exec>
            if (file_exists('%LOGFILE%') and file_size('%LOGFILE%') >= 5M)
                file_cycle('%LOGFILE%', 8);
        </Exec>
    </Schedule>
    # Rotate log file every week on Sunday at midnight
    <Schedule>
        When    @weekly
        Exec    if file_exists('%LOGFILE%') file_cycle('%LOGFILE%', 8);
    </Schedule>
</Extension>

<Extension gelf>
    Module      xm_gelf
</Extension>

<Input  DC-101>
    Module      im_msvistalog    
</Input>

<Output out>
    Module      om_ssl 
    Host        graylog.domain.com
    Port        51412
    OutputType  GELF_TCP 
    CertFile    %CERTDIR%/graylog3-certificate.pem
    CertKeyFile %CERTDIR%/graylog3-key.pem
    CAFile      %CERTDIR%/cert3.pem
    KeyPass     secret 
    AllowUntrusted  true   
    Exec $Hostname = hostname_fqdn();
    Exec $FullMessage = $raw_event;
    #Exec        to_syslog_snare();
</Output>

<Route >
    Path        DC-101 => out
</Route>

INPUT configuration

[image]

So somewhere you are sending logs that are not formatted correctly for the type of input you're using. That would be the first place I'd look. Next would be NXLog: see what else it is grabbing from Windows and sending to Graylog (application logs, database logs, etc.).
The reason I ask is that I see this in your logs:

com.fasterxml.jackson.core.JsonParseException: Unexpected character

Also, if I may add, do you have any extractors or pipelines on that input?

Configs:

win_tcp

define ROOT C:\Program Files (x86)\nxlog

Moduledir %ROOT%\modules
CacheDir %ROOT%\data
Pidfile %ROOT%\data\nxlog.pid
SpoolDir %ROOT%\data
LogFile %ROOT%\data\nxlog.log

<Extension _syslog>
    Module      xm_gelf
</Extension>

<Extension json>
	Module	xm_json
</Extension>

<Extension dhcp_csv_parser>
    Module      xm_csv
    Fields      ID, Date, Time, Description, IPAddress, Hostname, MACAddress, \
                UserName, TransactionID, QResult, ProbationTime, CorrelationID, \
                DHCID, VendorClassHex, VendorClassASCII, UserClassHex, \
                UserClassASCII, RelayAgentInformation, DnsRegError
</Extension>

<Extension dhcpv6_csv_parser>
    Module      xm_csv
    Fields      ID, Date, Time, Description, IPv6Address, Hostname, ErrorCode, \
                DuidLength, DuidBytesHex, UserName, Dhcid, SubnetPrefix
</Extension>

<Input inWindowsAudit>
    Module      im_msvistalog
    ReadFromLast	True
    Query <QueryList> \
	<Query Id="0"> \
	<Select Path="Application">*</Select> \
	<Select Path="Microsoft-Windows-Sysmon/Operational">*</Select> \
	<Select Path="Security">*</Select> \
	<Select Path="System">*[System[(level='4')]]</Select> \
	<Suppress Path="Application">*[System[(EventID=258)]]</Suppress> \
	<Select Path="Microsoft-Windows-PowerShell/Operational"> \
                    *[System[(Level=0 or Level=1 or Level=2 or Level=3 or Level=4) \
                             and ((EventID &gt;= 4104 and EventID &lt;= 4106))]] \
        </Select> \
	</Query> \
	</QueryList>

</Input>

<Input dhcp_server_audit>
    Module  im_file
    file "c:\\dhcp\DhcpSrvLog-???.log"
    <Exec>
        # Only process lines that begin with an event ID
        if $raw_event =~ /^\d+,/
        {
                dhcp_csv_parser->parse_csv();
                $QResult = integer($QResult);
                if $QResult == 0 $QMessage = "NoQuarantine";
                else if $QResult == 1 $QMessage = "Quarantine";
                else if $QResult == 2 $QMessage = "Drop Packet";
                else if $QResult == 3 $QMessage = "Probation";
                else if $QResult == 6 $QMessage = "No Quarantine Information";
            $EventTime = strptime($Date + ' ' + $Time, '%m/%d/%y %H:%M:%S');
            $ID = integer($ID);
            # DHCP Event IDs
            if $ID == 0 $Message = "The log was started.";
            else if $ID == 1 $Message = "The log was stopped.";
            else if $ID == 2
                $Message = "The log was temporarily paused due to low disk space.";
            else if $ID == 10 $Message = "A new IP address was leased to a client.";
            else if $ID == 11 $Message = "A lease was renewed by a client.";
            else if $ID == 12 $Message = "A lease was released by a client.";
            else if $ID == 13
                $Message = "An IP address was found to be in use on the network.";
            else if $ID == 14
                $Message = "A lease request could not be satisfied because the " +
                           "scope's address pool was exhausted.";
            else if $ID == 15 $Message = "A lease was denied.";
            else if $ID == 16 $Message = "A lease was deleted.";
            else if $ID == 17
                $Message = "A lease was expired and DNS records for an expired " +
                           "leases have not been deleted.";
            else if $ID == 18
                $Message = "A lease was expired and DNS records were deleted.";
            else if $ID == 20
                $Message = "A BOOTP address was leased to a client.";
            else if $ID == 21
                $Message = "A dynamic BOOTP address was leased to a client.";
            else if $ID == 22
                $Message = "A BOOTP request could not be satisfied because the " +
                           "scope's address pool for BOOTP was exhausted.";
            else if $ID == 23
                $Message = "A BOOTP IP address was deleted after checking to see " +
                           "it was not in use.";
            else if $ID == 24
                $Message = "IP address cleanup operation has began.";
            else if $ID == 25
                $Message = "IP address cleanup statistics.";
            else if $ID == 30
                $Message = "DNS update request to the named DNS server.";
            else if $ID == 31 $Message = "DNS update failed.";
            else if $ID == 32 $Message = "DNS update successful.";
            else if $ID == 33
                $Message = "Packet dropped due to NAP policy.";
            else if $ID == 34
                $Message = "DNS update request failed as the DNS update request " +
                           "queue limit exceeded.";
            else if $ID == 35 $Message = "DNS update request failed.";
            else if $ID == 36
                $Message = "Packet dropped because the server is in failover " +
                           "standby role or the hash of the client ID does not " +
                           "match.";
            else if ($ID >= 50 and $ID < 1000)
                $Message = "Codes above 50 are used for Rogue Server Detection " +
                           "information.";
            # DHCPv6 Event IDs
            else if $ID == 11000 $Message = "DHCPv6 Solicit.";
            else if $ID == 11001 $Message = "DHCPv6 Advertise.";
            else if $ID == 11002 $Message = "DHCPv6 Request.";
            else if $ID == 11003 $Message = "DHCPv6 Confirm.";
            else if $ID == 11004 $Message = "DHCPv6 Renew.";
            else if $ID == 11005 $Message = "DHCPv6 Rebind.";
            else if $ID == 11006 $Message = "DHCPv6 Decline.";
            else if $ID == 11007 $Message = "DHCPv6 Release.";
            else if $ID == 11008 $Message = "DHCPv6 Information Request.";
            else if $ID == 11009 $Message = "DHCPv6 Scope Full.";
            else if $ID == 11010 $Message = "DHCPv6 Started.";
            else if $ID == 11011 $Message = "DHCPv6 Stopped.";
            else if $ID == 11012 $Message = "DHCPv6 Audit log paused.";
            else if $ID == 11013 $Message = "DHCPv6 Log File.";
            else if $ID == 11014 $Message = "DHCPv6 Bad Address.";
            else if $ID == 11015 $Message = "DHCPv6 Address is already in use.";
            else if $ID == 11016 $Message = "DHCPv6 Client deleted.";
            else if $ID == 11017 $Message = "DHCPv6 DNS record not deleted.";
            else if $ID == 11018 $Message = "DHCPv6 Expired.";
            else if $ID == 11019
                $Message = "DHCPv6 Leases Expired and Leases Deleted.";
            else if $ID == 11020 $Message = "DHCPv6 Database cleanup begin.";
            else if $ID == 11021 $Message = "DHCPv6 Database cleanup end.";
            else if $ID == 11022 $Message = "DNS IPv6 Update Request.";
            else if $ID == 11023 $Message = "DNS IPv6 Update Failed.";
            else if $ID == 11024 $Message = "DNS IPv6 Update Successful.";
            else if $ID == 11028
                $Message = "DNS IPv6 update request failed as the DNS update " +
                           "request queue limit exceeded.";
            else if $ID == 11029 $Message = "DNS IPv6 update request failed.";
            else if $ID == 11030
                $Message = "DHCPv6 stateless client records purged.";
            else if $ID == 11031
                $Message = "DHCPv6 stateless client record is purged as the " +
                           "purge interval has expired for this client record.";
            else if $ID == 11032
                $Message = "DHCPV6 Information Request from IPV6 Stateless Client.";
            else $Message = "No message specified for this Event ID.";
        }
        # Discard header lines (which do not begin with an event ID)
        else drop();
    </Exec>
</Input>


<Output outGraylogTCP>
    Module      om_tcp
    Host        x.x.x.x
    Port        12201
    OutputType	GELF
</Output>

<Output outGraylogUDP>
    Module      om_udp
    Host        x.x.x.x
    Port        12201
    OutputType	GELF
</Output>

<Route 1>
    Path        inWindowsAudit => outGraylogTCP
</Route>

<Route 2>
    Path        dhcp_server_audit => outGraylogUDP
</Route>

And to answer your final question - no extractors on that input, but I do have pipelines for the Event Logs stream.

Well, I found a couple of things that look like incorrect configurations.

First, the extension is named _syslog but loads xm_gelf, so I was not sure whether this was supposed to be syslog or GELF.

I think it should be:

<Extension gelf>
    Module      xm_gelf
</Extension>

Next are the outputs.

First of all, you have both GELF TCP and GELF UDP going to the same GL port. You may want to use different ports.
Next, your OutputType is configured incorrectly. It should look like this:

OutputType  GELF_TCP 
OutputType  GELF_UDP

Here is the documentation for that.

I noticed that some of your extensions use Module xm_csv; for that data you may want to use a Raw/Plaintext TCP or UDP input instead of GELF UDP.

I must admit you have a lot going on in that NXLog configuration file.

EDIT: I made a mistake above when I wrote "going to the same GL port"; I meant to say the same input. If you're going to use GELF_TCP and GELF_UDP, then you should have two different inputs, one GELF TCP and one GELF UDP.

EDIT2: I've been going over your configurations and noticed something else that doesn't make sense; perhaps you can enlighten me.

In your input you have admin as the tls_key_file, BUT you don't have tls_cert_file configured?

Those two settings normally hold the full paths to the certificate files on your Graylog server, and then you need to change tls_enable = false to tls_enable = true. I'm kind of stumped how this even worked. Here is an example of my GELF TCP/TLS input.


Despite the multitude of oddities you have uncovered, NXlog and the input are functioning as desired presently, outside the RST issue.

OutputType GELF is synonymous with OutputType GELF_UDP; OutputType GELF_TCP has to be stated explicitly. That appears to have been the source of the errors and RSTs: the om_tcp output was sending UDP-formatted GELF to a GELF TCP input. That single change has addressed the issue on this input.
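
For anyone landing here later, a rough sketch of why the mismatch would produce this kind of decode error. It assumes (not confirmed in this thread, but consistent with the binary payload in the error above) that the UDP flavour of GELF compresses each payload, while a GELF TCP input expects uncompressed JSON frames separated by a null byte. The function names and the sample message are made up for illustration; this is not how Graylog or NXLog are actually implemented.

```python
import json
import zlib


def gelf_tcp_frame(message: dict) -> bytes:
    """Uncompressed JSON followed by a null byte, the framing a GELF TCP input expects."""
    return json.dumps(message).encode("utf-8") + b"\x00"


def gelf_udp_payload(message: dict) -> bytes:
    """Compressed JSON, the assumed shape of a (non-chunked) GELF UDP datagram."""
    return zlib.compress(json.dumps(message).encode("utf-8"))


def parse_like_tcp_input(stream: bytes) -> list:
    """Split a byte stream on null bytes and try to parse every frame as JSON."""
    results = []
    for frame in stream.split(b"\x00"):
        if not frame:
            continue
        # Invalid UTF-8 sequences become U+FFFD (code 65533), as seen in the error above.
        text = frame.decode("utf-8", errors="replace")
        try:
            results.append(json.loads(text))
        except json.JSONDecodeError as exc:
            results.append(f"decode error: {exc.msg}")
    return results


# Hypothetical sample message.
msg = {"version": "1.1", "host": "dc-101", "short_message": "test"}

good = parse_like_tcp_input(gelf_tcp_frame(msg))   # parses back to the message
bad = parse_like_tcp_input(gelf_udp_payload(msg))  # compressed bytes fail to parse

print(good)
print(bad)
```

The compressed bytes can also contain null bytes by chance, so a TCP input may cut one message into several junk frames, each of which fails separately.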

Thanks for your input, a little RTFM was clearly in order.

Cheers.


Glad I could help; I was trying to cover everything I saw. If you could mark this as resolved, that would be great for future searches :smiley:.
