Issue with sending data to Graylog

Problem description:
Hello. A few days ago I upgraded Graylog from 3.3.15 to 4.2.7, and the Elasticsearch cluster behind it from 6.8.8 to 7.10.1. Everything seemed perfect, without any error messages; search in Graylog works, and the Prometheus metrics work too.
But after I closed the Graylog web page, within minutes I received an alert from the Elasticsearch cluster that new data (documents) was missing in ES. When I opened the Graylog web page again, everything looked OK, because I saw the data in the search dashboard. After closing the Graylog web page, I received the same alert from ES again.
I investigated it and found that Graylog doesn't send new documents to ES while the Graylog page is closed. When I opened the Graylog search page, data was appended to the ES cluster again.

I checked the following Graylog metrics to verify whether data flows in and out:

  • gl_input_throughput
  • gl_output_throughput
  • gl_journal_append_1_sec_rate
  • gl_journal_read_1_sec_rate

All data is processed:
[screenshot: graylog-data]

However, the ES metric elasticsearch_indices_docs indicates that the Graylog data is not being sent to ES. Only after I open the Graylog web page do I see this metric change.
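A quick way to cross-check these values outside of Prometheus is to scrape the exporter directly. A minimal sketch, assuming the exporter serves the standard /metrics path on the port configured below, and that the hostname graylog is a placeholder for the server:

    # Pull the throughput and journal metrics straight from the exporter
    # (port 9833 per prometheus_exporter_bind_address in the config below).
    curl -s http://graylog:9833/metrics \
      | grep -E 'gl_(input|output)_throughput|gl_journal_(append|read)'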

Graylog configuration:

    # General
    node_id_file = /usr/share/graylog/data/journal/node-id
    root_username = admin
    root_email = EMAIL
    root_timezone = Europe/Prague
    plugin_dir = /usr/share/graylog/plugins-default
    http_bind_address = 0.0.0.0:9000
    http_external_uri = https://URL/
    http_enable_cors = true
    enabled_tls_protocols = TLSv1.1,TLSv1.2,TLSv1.3

    # Output & Input
    output_batch_size = 200
    output_flush_interval = 1
    output_fault_count_threshold = 6
    output_fault_penalty_seconds = 10
    processbuffer_processors = 6
    outputbuffer_processors = 6
    processor_wait_strategy = blocking
    ring_size = 65536
    inputbuffer_ring_size = 65536
    inputbuffer_processors = 2
    inputbuffer_wait_strategy = blocking
    message_journal_enabled = true
      # Do not change `message_journal_dir` location
    message_journal_dir = /usr/share/graylog/data/journal
    outputbuffer_processor_keep_alive_time = 5000
    outputbuffer_processor_threads_core_pool_size = 5
    outputbuffer_processor_threads_max_pool_size = 30
    message_journal_max_age = 12h
      # size is 75% of persistent volume (journal-graylog)
    message_journal_max_size = 15gb
    message_journal_flush_age = 1m
    message_journal_flush_interval = 100000

    # MongoDB
    mongodb_max_connections = 1000
    mongodb_threads_allowed_to_block_multiplier = 5

    # Elasticsearch
    rotation_strategy = count
    elasticsearch_max_docs_per_index = 10000000
    elasticsearch_shards = 12
    elasticsearch_index_optimization_jobs = 40
    elasticsearch_connect_timeout = 10s
    elasticsearch_socket_timeout = 60s
    elasticsearch_max_total_connections = 100
    elasticsearch_max_total_connections_per_route = 10
    allow_leading_wildcard_searches = true
    allow_highlighting = false
    elasticsearch_version = 7
    elasticsearch_mute_deprecation_warnings = true

    # Email transport
    transport_email_enabled = true
    transport_email_hostname = aspmx.l.google.com
    transport_email_port = 25
    transport_email_use_auth = false
    transport_email_use_tls = true
    transport_email_use_ssl = false
    transport_email_auth_username =
    transport_email_auth_password =
    transport_email_subject_prefix = [graylog]
    transport_email_from_email = EMAIL
    content_packs_dir = /usr/share/graylog/data/contentpacks
    content_packs_auto_load = grok-patterns.json

    # Prometheus
    prometheus_exporter_enabled = true
    prometheus_exporter_bind_address = 0.0.0.0:9833

    # Others
    proxied_requests_thread_pool_size = 32

Additional information:

  • Graylog version: 4.2.7-1
  • ES cluster version: 7.10.1
  • Running in Kubernetes

Thanks in advance for a fast response.

Hello @Tomas

I need to ask a couple of questions.

Can you show this alert?

Are you using a reverse proxy? If not, I would comment that line out, since this setting is for HTTPS and I don't see certificates. If you do use one, are the certificates from your CA, or did you create self-signed ones?

Can you show the logs for this issue?

How many nodes do you have in this cluster, and what does the Elasticsearch YAML configuration file look like?

I don't see this section configured either:

elasticsearch_hosts = http://node1:9200,http://node2:9200,http://node3:9200

Do you have 1 Graylog server and 1 MongoDB? Are they on their own nodes?

To be honest, this could be any of a number of configuration issues.

First, I would look at all your log files (Graylog, MongoDB & Elasticsearch); there has to be something in there that can give you an idea of what is happening and why.

Hello @gsmith

  • Alert from Alertmanager:
    - alert: ElasticsearchNoNewDocuments
      expr: increase(elasticsearch_indices_docs{es_data_node="true", namespace="graylog"}[10m]) < 1
      for: 0m
      labels:
        severity: '3'
        team: infrastructure
      annotations:
        summary: 'Namespace {{ $labels.namespace }}: Elasticsearch no new documents (instance {{ $labels.name }})'
        description: 'Namespace {{ $labels.namespace }}: No new documents for 10 min! Possible data loss as new documents are not ingested'
  • We don't use a reverse proxy for Graylog, but an Ingress controller in Kubernetes with a Let's Encrypt certificate. (The URL is added dynamically, because we have multiple Graylog instances.)
  • I don't know how I can show you the missing documents. I don't have logs for this, because Graylog and the ES cluster run correctly without error messages. I see this issue only via Prometheus metrics: the metric that shows the total documents per node does not increase normally, only after I open the Graylog web page. (For a way to check this directly against ES, see the sketch after the container spec below.)
  • The ES cluster setup is 3 master nodes and 3 data nodes. We run ES in Kubernetes, and the ES cluster is managed by the ECK operator. The following YAML is how we deploy the ES cluster to k8s; some parts are generated from a config file.
---
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch
  labels:
    app: elasticsearch
    group: elasticsearch
spec:
  version: 7.10.1
  secureSettings:
  - secretName: elasticsearch-snapshot
  http:
    tls:
      certificate:
        secretName: elasticsearch-es-http-certs-internal
  nodeSets:
  # Nodeset configuration for each defined item in list
  {% for nodeset in elasticsearch_cluster['nodeSets'] %}
  - name: {{ nodeset['name'] }} # Render nodeSet name
    count: {{ nodeset['replicas'] }} # Render number of replicas for this particular nodeSet
    config:
      cluster.remote.connect: false
      # Render different options based of elasticsearch node type
      {% if nodeset['node_type'] == 'master' %} 
      node.master: true
      node.data: false
      node.ingest: false
      {% endif %}
      {% if nodeset['node_type'] == 'data' %}
      node.master: false
      node.data: true
      node.ingest: true
      {% endif %}

    podTemplate:
      spec:
        initContainers:
        - name: sysctl
          securityContext:
            privileged: true
          command: ['sh', '-c', 'sysctl -w vm.max_map_count=1966080']
        - name: install-plugins
          command:
          - sh
          - -c
          - |
            bin/elasticsearch-plugin install --batch repository-gcs
        containers:
        - name: elasticsearch
          resources: {{ nodeset['k8s_resources'] }} # Render resource requests and limits
          env:
          - name: ES_JAVA_OPTS
            value: {{ nodeset['heap_size'] |json_string }} # Render heap size, which should be 50% of available memory
  • For elasticsearch_hosts we use an environment variable in the Graylog container:
      containers:
        - name: graylog-server
          image: graylog/graylog:4.2.7-1
          imagePullPolicy: "IfNotPresent"
          command:
            - /entrypoint.sh
          env:
            - name: GRAYLOG_SERVER_JAVA_OPTS
              value: "-Djava.net.preferIPv4Stack=true -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow -Xms{{ graylog_memory_requests|default(1024)|json }}m -Xmx{{ graylog_memory_requests|default(1024)|json }}m"
            - name: GRAYLOG_PASSWORD_SECRET
              valueFrom:
                secretKeyRef:
                  name: graylog-secrets
                  key: admin-password
            - name: GRAYLOG_ROOT_PASSWORD_SHA2
              valueFrom:
                secretKeyRef:
                  name: graylog-secrets
                  key: admin-password-sha2
            - name: ELASTICSEARCH_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: elasticsearch-es-elastic-user
                  key: elastic
            - name: GRAYLOG_ELASTICSEARCH_HOSTS
              value: https://elastic:$(ELASTICSEARCH_PASSWORD)@elasticsearch-es-http:9200
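
As mentioned above, one way to inspect the missing documents without going through the exporter is to ask ES for the per-index document counts directly. A sketch under assumptions: the default graylog_* index prefix is in use, and the command runs from a pod that has curl and can reach the service (-k because the ECK-issued certificate is internal):

    # Watch whether the raw per-index doc counts really stop growing
    # while the Graylog web UI is closed.
    curl -sk -u "elastic:$ELASTICSEARCH_PASSWORD" \
      "https://elasticsearch-es-http:9200/_cat/indices/graylog_*?v&h=index,docs.count"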

elasticsearch-es-http:9200 is a Service in k8s, and these Services are created by the ECK operator.

kubectl get service
NAME                                       TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)                         AGE
elastic-webhook-server                     ClusterIP      x.x.x.x        <none>          443/TCP                         49d
elasticsearch-es-data                      ClusterIP      None           <none>          9200/TCP                        15d
elasticsearch-es-http                      ClusterIP      x.x.x.x        <none>          9200/TCP                        15d
elasticsearch-es-internal-http             ClusterIP      x.x.x.x        <none>          9200/TCP                        15d
elasticsearch-es-master                    ClusterIP      None           <none>          9200/TCP                        15d
elasticsearch-es-transport                 ClusterIP      None           <none>          9300/TCP                        15d
elasticsearch-metrics                      ClusterIP      x.x.x.x        <none>          9114/TCP                        49d
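
To rule out connectivity problems between Graylog and that Service, a basic health check can be run against the same endpoint Graylog uses (a sketch, reusing the elastic credentials from the container spec above):

    # Query cluster health through the elasticsearch-es-http Service.
    curl -sk -u "elastic:$ELASTICSEARCH_PASSWORD" \
      "https://elasticsearch-es-http:9200/_cluster/health?pretty"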

In our company we have 3 Graylog instances. Each has the same config, but only one was upgraded to the newest version. The Graylog instances with the old version (3.3.10) do not have this issue. The setup is 3 MongoDB servers (version 4.2.1) as primary, secondary and arbiter, 3 Graylog servers (a Graylog HA cluster), and an ES cluster with 3 data nodes and 3 master nodes.

Hello,

Thanks for the additional info.

My apologies. I was referring to the service logs on the system; I'm really not familiar with this Alertmanager.

example:

root# docker logs -f "graylog_container_id"
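
Since your deployment runs in Kubernetes rather than plain Docker, the equivalent would be something like this (pod name and namespace are illustrative):

    kubectl logs -f graylog-0 -n graylog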

I understand now that you just have one Elasticsearch host configured.

    - name: GRAYLOG_ELASTICSEARCH_HOSTS
      value: https://elastic:$(ELASTICSEARCH_PASSWORD)@elasticsearch-es-http:9200

What I'm confused about is that you showed your Graylog configuration file above, but you're using ENV variables for Elasticsearch. What I was referring to is this below.

Docker

 - GRAYLOG_ELASTICSEARCH_HOSTS=http://node1:9200,http://node2:9200,http://node3:9200

For a better understanding, I was referring to this section:

Graylog_Elasticsearch_Hosts

As you know, there have been some major changes from 3.3 to 4.2; you can find out more here.

As for your statement:

I haven't seen that before, so I'm assuming this could be a configuration issue. Have you investigated the permissions of the user logged into the Graylog web UI?
It may have something to do with this configuration from your Graylog container, but I’m not 100% sure.

The reason I mention it is that you have different environment variables than I have.

  graylog:
    image: graylog/graylog:4.2-jre11
    network_mode: bridge
    dns:
      - 192.168.2.15
      - 192.168.2.16
    # journal and config directories in local NFS share for persistence
    volumes:
      - graylog_bin:/usr/share/graylog/bin
      - graylog_data:/usr/share/graylog/data/config
    environment:
      # Container time zone
      - TZ=America/Chicago
      # CHANGE ME (must be at least 16 characters)!
      - GRAYLOG_PASSWORD_SECRET=pJod1TRZAckHmqZuyb2YWIjWgMtnwZf6Q79HW2nonDhN
      # Password: admin
      - GRAYLOG_ROOT_PASSWORD_SHA2=ef92b778bafe771ec066599118813d4473e94f
      - GRAYLOG_HTTP_BIND_ADDRESS=0.0.0.0:9000
      - GRAYLOG_HTTP_EXTERNAL_URI=http://192.168.1.28:9000/
      - GRAYLOG_ROOT_TIMEZONE=America/Chicago
      - GRAYLOG_ELASTICSEARCH_HOSTS=http://node1:9200,http://node2:9200,http://node3:9200
      - GRAYLOG_ROOT_EMAIL=greg.smith@enseva.com
      - GRAYLOG_HTTP_PUBLISH_URI=http://192.168.1.28:9000/
      - GRAYLOG_TRANSPORT_EMAIL_PROTOCOL=smtp
      - GRAYLOG_HTTP_ENABLE_CORS=true
      - GRAYLOG_TRANSPORT_EMAIL_WEB_INTERFACE_URL=http://192.168.1.28:9000/
      - GRAYLOG_TRANSPORT_EMAIL_HOSTNAME=192.168.1.28
      - GRAYLOG_TRANSPORT_EMAIL_ENABLED=true
      - GRAYLOG_TRANSPORT_EMAIL_PORT=25
      - GRAYLOG_TRANSPORT_EMAIL_USE_AUTH=false
      - GRAYLOG_TRANSPORT_EMAIL_USE_TLS=false
      - GRAYLOG_TRANSPORT_EMAIL_USE_SSL=false
      - GRAYLOG_TRANSPORT_FROM_EMAIL=root@localhost
      - GRAYLOG_TRANSPORT_SUBJECT_PREFIX=[graylog]
      - GRAYLOG_REPORT_DISABLE_SANDBOX=true

Since you're using an Ingress controller and you have three Graylog nodes, I would assume you set them all as master nodes, or do you have one master and two that are not?

For example, the server.conf file states:

If you are running more than one instances of Graylog server you have to select one of these
instances as master. The master will perform some periodical tasks that non-masters won’t perform.

is_master = true

So I assume the other two Graylog nodes are like this? Just a thought.

Node-02 is_master = false
Node-03 is_master = false

And again, my apologies; your environment is unfamiliar to me. This is the first time I have heard of or seen Graylog logs not being indexed after you close your web UI (i.e., log off). So I assume it's a permission issue or some type of configuration issue; I'm unable to get a clear picture of where this issue could be.

If I understand your statement correctly, you didn't upgrade all the Graylog nodes in your cluster? Or did I read that incorrectly?

We found what was wrong and why the ES metric for the document count per node was not updated. In Elasticsearch 7.x, shards that receive no searches for a while become "search idle" and skip background refreshes, so newly indexed documents only show up in the index stats once a search arrives, which is exactly what opening the Graylog web page triggers. More information about it here: Tune for indexing speed | Elasticsearch Guide [8.1] | Elastic
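
For anyone hitting the same behavior: the change that guide describes can be sketched as setting an explicit refresh interval on the Graylog indices, which makes refreshes (and therefore the doc-count stats) run on a schedule even when nobody is searching. The index pattern and the 30s value are illustrative, and a per-index setting like this is lost when Graylog rotates to a new index, so an index template is the more durable place for it:

    # Force periodic refreshes so newly indexed documents become visible
    # in the index stats without a search request.
    curl -sk -u "elastic:$ELASTICSEARCH_PASSWORD" -X PUT \
      "https://elasticsearch-es-http:9200/graylog_*/_settings" \
      -H 'Content-Type: application/json' \
      -d '{"index": {"refresh_interval": "30s"}}'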
Can you close this topic?
Thanks for the help.
