Stability problem


(Can) #1

My cluster (3x ES and 3x GL, with MongoDB) crashed over the weekend, although the only workload was a single Filebeat reporting a few logs from one machine.
Restarting the whole cluster did not solve the “Elasticsearch cluster is red. Shards: 26 active, 0 initializing, 0 relocating, 2 unassigned” problem.
The web interfaces work but are extremely slow, and I often get timeouts. We are on the same network, and my connection to other machines on that network is flawless.
This makes me think twice about going to production, where several hundred machines will send logs and the support team will follow reports and dashboards.
Any suggestions?


(Jochen) #2

Anything in the logs of your Graylog and Elasticsearch nodes? Is there enough free disk space on all machines?
:arrow_right: http://docs.graylog.org/en/2.3/pages/configuration/file_location.html
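
A minimal sketch of the disk space check, to be run on every Graylog and Elasticsearch machine. The function name `full_mounts` and the 85 percent threshold are assumptions for illustration (85 percent matches the default low disk watermark in Elasticsearch, but your configuration may differ). It reads `df -P` output on stdin and prints the mount points at or above the threshold:

```shell
# Print mount points whose usage is at or above a threshold (in percent).
# Reads `df -P` style output on stdin; mount points containing spaces are not handled.
full_mounts() {
    threshold="${1:-85}"
    awk -v t="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)            # strip the percent sign from the Capacity column
        if (use + 0 >= t) print $6, $5
    }'
}

# Usage:
#   df -P | full_mounts 85
```

No output means no file system has reached the threshold.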


(Can) #3

Since I just deleted all active indices, there is nothing in either log, but I will do a restart and send the logs.


(Jochen) #4

And before you deleted everything? It’s obviously more interesting to check the logs from when the errors occurred…


(Can) #5

Sure, I did check the logs.
Here they are:
2017-08-02T17:29:50.932+02:00 INFO [GracefulShutdown] Graceful shutdown initiated.
2017-08-02T17:29:50.932+02:00 INFO [GracefulShutdown] Node status: [Halting [LB:DEAD]]. Waiting <3sec> for possible load balancers to recognize state change.
2017-08-02T17:29:54.934+02:00 INFO [InputSetupService] Attempting to close input <org.graylog.plugins.beats.BeatsInput.596df5c31a39ed672cacc515> [Beats].
2017-08-02T17:29:54.937+02:00 INFO [GelfTcpTransport] Channel disconnected!
2017-08-02T17:29:54.943+02:00 INFO [InputSetupService] Input <org.graylog.plugins.beats.BeatsInput.596df5c31a39ed672cacc515> closed. Took [8ms]
2017-08-02T17:29:54.947+02:00 INFO [Buffers] Waiting until all buffers are empty.
2017-08-02T17:29:54.949+02:00 INFO [Buffers] All buffers are empty. Continuing.
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutting down periodical [org.graylog2.periodical.AlertScannerThread].
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutdown of periodical [org.graylog2.periodical.AlertScannerThread] complete, took <0ms>.
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutting down periodical [org.graylog2.periodical.BatchedElasticSearchOutputFlushThread].
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutdown of periodical [org.graylog2.periodical.BatchedElasticSearchOutputFlushThread] complete, took <0ms>.
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutting down periodical [org.graylog2.periodical.ClusterHealthCheckThread].
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutdown of periodical [org.graylog2.periodical.ClusterHealthCheckThread] complete, took <0ms>.
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutting down periodical [org.graylog2.periodical.IndexerClusterCheckerThread].
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutdown of periodical [org.graylog2.periodical.IndexerClusterCheckerThread] complete, took <0ms>.
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutting down periodical [org.graylog2.periodical.IndexRetentionThread].
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutdown of periodical [org.graylog2.periodical.IndexRetentionThread] complete, took <0ms>.
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutting down periodical [org.graylog2.periodical.IndexRotationThread].
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutdown of periodical [org.graylog2.periodical.IndexRotationThread] complete, took <0ms>.
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutting down periodical [org.graylog2.periodical.VersionCheckThread].
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutdown of periodical [org.graylog2.periodical.VersionCheckThread] complete, took <0ms>.
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutting down periodical [org.graylog2.periodical.ThrottleStateUpdaterThread].
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutdown of periodical [org.graylog2.periodical.ThrottleStateUpdaterThread] complete, took <0ms>.
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutting down periodical [org.graylog2.events.ClusterEventPeriodical].
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutdown of periodical [org.graylog2.events.ClusterEventPeriodical] complete, took <0ms>.
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutting down periodical [org.graylog2.events.ClusterEventCleanupPeriodical].
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutdown of periodical [org.graylog2.events.ClusterEventCleanupPeriodical] complete, took <0ms>.
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutting down periodical [org.graylog2.periodical.IndexRangesCleanupPeriodical].
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutdown of periodical [org.graylog2.periodical.IndexRangesCleanupPeriodical] complete, took <0ms>.
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutting down periodical [org.graylog.plugins.usagestatistics.UsageStatsNodePeriodical].
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutdown of periodical [org.graylog.plugins.usagestatistics.UsageStatsNodePeriodical] complete, took <0ms>.
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutting down periodical [org.graylog.plugins.usagestatistics.UsageStatsClusterPeriodical].
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutdown of periodical [org.graylog.plugins.usagestatistics.UsageStatsClusterPeriodical] complete, took <0ms>.
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutting down periodical [org.graylog.plugins.collector.periodical.PurgeExpiredCollectorsThread].
2017-08-02T17:29:54.950+02:00 INFO [PeriodicalsService] Shutdown of periodical [org.graylog.plugins.collector.periodical.PurgeExpiredCollectorsThread] complete, took <0ms>.
2017-08-02T17:29:54.950+02:00 INFO [GracefulShutdown] Goodbye.
2017-08-02T17:29:54.951+02:00 INFO [JerseyService] Shutting down HTTP listener at http://GL1_IP:9000/api/
2017-08-02T17:29:54.951+02:00 INFO [JournalReader] Stopping.
2017-08-02T17:29:54.951+02:00 INFO [node] [GL_1] stopping …
2017-08-02T17:29:54.955+02:00 INFO [LogManager] Shutting down.
2017-08-02T17:29:54.964+02:00 INFO [OutputSetupService] Stopping output org.graylog2.outputs.GelfOutput
2017-08-02T17:29:54.999+02:00 INFO [LogManager] Shutdown complete.
2017-08-02T17:29:55.004+02:00 INFO [node] [GL_1] stopped
2017-08-02T17:29:55.005+02:00 INFO [node] [GL_1] closing …
2017-08-02T17:29:55.020+02:00 INFO [NetworkListener] Stopped listener bound to [GL1_IP:9000]
2017-08-02T17:29:55.022+02:00 INFO [node] [GL_1] closed
2017-08-02T17:29:55.022+02:00 INFO [ServiceManagerListener] Services are now stopped.
2017-08-02T17:30:07.226+02:00 INFO [CmdLineTool] Loaded plugin: ExeCommandAlarmCallBack 1.0.0 [ir.elenoon.ExeCommandAlarmCallBackPlugin]
2017-08-02T17:30:07.228+02:00 INFO [CmdLineTool] Loaded plugin: Elastic Beats Input 2.2.3 [org.graylog.plugins.beats.BeatsInputPlugin]
2017-08-02T17:30:07.229+02:00 INFO [CmdLineTool] Loaded plugin: Collector 2.2.3 [org.graylog.plugins.collector.CollectorPlugin]
2017-08-02T17:30:07.230+02:00 INFO [CmdLineTool] Loaded plugin: Enterprise Integration Plugin 2.2.3 [org.graylog.plugins.enterprise_integration.EnterpriseIntegrationPlugin]
2017-08-02T17:30:07.231+02:00 INFO [CmdLineTool] Loaded plugin: MapWidgetPlugin 2.2.3 [org.graylog.plugins.map.MapWidgetPlugin]
2017-08-02T17:30:07.239+02:00 INFO [CmdLineTool] Loaded plugin: Pipeline Processor Plugin 2.2.3 [org.graylog.plugins.pipelineprocessor.ProcessorPlugin]
2017-08-02T17:30:07.240+02:00 INFO [CmdLineTool] Loaded plugin: Anonymous Usage Statistics 2.2.3 [org.graylog.plugins.usagestatistics.UsageStatsPlugin]
2017-08-02T17:30:07.241+02:00 INFO [CmdLineTool] Loaded plugin: HipChat Alarmcallback Plugin 1.3.0-SNAPSHOT [org.graylog2.alarmcallbacks.hipchat.HipChatAlarmCallback]
2017-08-02T17:30:07.514+02:00 INFO [CmdLineTool] Running with JVM arguments: -Xms1g -Xmx1g -XX:NewRatio=1 -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow -Dlog4j.configurationFile=file:///etc/graylog/server/log4j2.xml -Djava.library.path=/usr/share/graylog-server/lib/sigar -Dgraylog2.installation_source=rpm
2017-08-02T17:30:07.787+02:00 INFO [Version] HV000001: Hibernate Validator null
2017-08-02T17:30:12.364+02:00 INFO [InputBufferImpl] Message journal is enabled.
2017-08-02T17:30:12.396+02:00 INFO [NodeId] Node ID: 858e6fe1-c997-4f6a-b3c7-c8e507f705b4
2017-08-02T17:30:12.725+02:00 INFO [LogManager] Loading logs.
2017-08-02T17:30:12.904+02:00 INFO [LogManager] Logs loading complete.
2017-08-02T17:30:12.906+02:00 INFO [KafkaJournal] Initialized Kafka based journal at /var/lib/graylog-server/journal
2017-08-02T17:30:12.939+02:00 INFO [InputBufferImpl] Initialized InputBufferImpl with ring size <65536> and wait strategy , running 2 parallel message handlers.
2017-08-02T17:30:13.023+02:00 INFO [cluster] Cluster created with settings {hosts=[GL1_IP:27017, GL2_IP:27017, GL3_IP:27017], mode=MULTIPLE, requiredClusterType=UNKNOWN, serverSelectionTimeout=‘30000 ms’, maxWaitQueueSize=5000}
2017-08-02T17:30:13.024+02:00 INFO [cluster] Adding discovered server GL1_IP:27017 to client view of cluster
2017-08-02T17:30:13.126+02:00 INFO [cluster] Adding discovered server GL2_IP:27017 to client view of cluster
2017-08-02T17:30:13.131+02:00 INFO [cluster] Adding discovered server GL3_IP:27017 to client view of cluster
2017-08-02T17:30:13.241+02:00 INFO [cluster] No server chosen by ReadPreferenceServerSelector{readPreference=primary} from cluster description ClusterDescription{type=UNKNOWN, connectionMode=MULTIPLE, serverDescriptions=[ServerDescription{address=GL1_IP:27017, type=UNKNOWN, state=CONNECTING}, ServerDescription{address=GL3_IP:27017, type=UNKNOWN, state=CONNECTING}, ServerDescription{address=GL2_IP:27017, type=UNKNOWN, state=CONNECTING}]}. Waiting for 30000 ms before timing out
2017-08-02T17:30:13.459+02:00 INFO [connection] Opened connection [connectionId{localValue:1, serverValue:17851}] to GL1_IP:27017
2017-08-02T17:30:13.495+02:00 INFO [cluster] Monitor thread successfully connected to server with description ServerDescription{address=GL1_IP:27017, type=REPLICA_SET_PRIMARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 4, 6]}, minWireVersion=0, maxWireVersion=5, maxDocumentSize=16777216, roundTripTimeNanos=2897343, setName=‘graylog’, canonicalAddress=GL1_IP:27017, hosts=[GL3_IP:27017, GL1_IP:27017, GL2_IP:27017], passives=[], arbiters=[], primary=‘GL1_IP:27017’, tagSet=TagSet{[]}, electionId=7fffffff000000000000000c, setVersion=1, lastWriteDate=Wed Aug 02 17:30:03 CEST 2017, lastUpdateTimeNanos=1151821913279768}
2017-08-02T17:30:13.498+02:00 INFO [connection] Opened connection [connectionId{localValue:2, serverValue:4278}] to GL2_IP:27017
2017-08-02T17:30:13.503+02:00 INFO [connection] Opened connection [connectionId{localValue:3, serverValue:4276}] to GL3_IP:27017
2017-08-02T17:30:13.508+02:00 INFO [cluster] Monitor thread successfully connected to server with description ServerDescription{address=GL3_IP:27017, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 4, 6]}, minWireVersion=0, maxWireVersion=5, maxDocumentSize=16777216, roundTripTimeNanos=4644639, setName=‘graylog’, canonicalAddress=GL3_IP:27017, hosts=[GL3_IP:27017, GL1_IP:27017, GL2_IP:27017], passives=[], arbiters=[], primary=‘GL1_IP:27017’, tagSet=TagSet{[]}, electionId=null, setVersion=1, lastWriteDate=Wed Aug 02 17:30:03 CEST 2017, lastUpdateTimeNanos=1151821947421947}
2017-08-02T17:30:13.512+02:00 INFO [cluster] Discovered cluster type of REPLICA_SET
2017-08-02T17:30:13.517+02:00 INFO [cluster] Monitor thread successfully connected to server with description ServerDescription{address=GL2_IP:27017, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 4, 6]}, minWireVersion=0, maxWireVersion=5, maxDocumentSize=16777216, roundTripTimeNanos=18091958, setName=‘graylog’, canonicalAddress=GL2_IP:27017, hosts=[GL3_IP:27017, GL1_IP:27017, GL2_IP:27017], passives=[], arbiters=[], primary=‘GL1_IP:27017’, tagSet=TagSet{[]}, electionId=null, setVersion=1, lastWriteDate=Wed Aug 02 17:30:03 CEST 2017, lastUpdateTimeNanos=1151821956509539}
2017-08-02T17:30:13.516+02:00 INFO [cluster] Setting max election id to 7fffffff000000000000000c from replica set primary GL1_IP:27017
2017-08-02T17:30:13.518+02:00 INFO [cluster] Setting max set version to 1 from replica set primary GL1_IP:27017
2017-08-02T17:30:13.519+02:00 INFO [cluster] Discovered replica set primary GL1_IP:27017
2017-08-02T17:30:13.552+02:00 INFO [connection] Opened connection [connectionId{localValue:4, serverValue:17852}] to GL1_IP:27017
2017-08-02T17:30:15.166+02:00 INFO [node] [GL_1] version[2.4.4], pid[26774], build[fcbb46d/2017-01-03T11:33:16Z]
2017-08-02T17:30:15.167+02:00 INFO [node] [GL_1] initializing …
2017-08-02T17:30:15.197+02:00 INFO [plugins] [GL_1] modules [], plugins [graylog-monitor], sites []
2017-08-02T17:30:19.784+02:00 INFO [node] [GL_1] initialized
2017-08-02T17:30:20.089+02:00 INFO [ProcessBuffer] Initialized ProcessBuffer with ring size <65536> and wait strategy .
2017-08-02T17:30:22.410+02:00 INFO [RulesEngineProvider] No static rules file loaded.
2017-08-02T17:30:22.814+02:00 WARN [GeoIpResolverEngine] GeoIP database file does not exist: /etc/graylog/server/GeoLite2-City.mmdb
2017-08-02T17:30:22.823+02:00 INFO [OutputBuffer] Initialized OutputBuffer with ring size <65536> and wait strategy .
2017-08-02T17:30:22.916+02:00 WARN [GeoIpResolverEngine] GeoIP database file does not exist: /etc/graylog/server/GeoLite2-City.mmdb
2017-08-02T17:30:23.055+02:00 WARN [GeoIpResolverEngine] GeoIP database file does not exist: /etc/graylog/server/GeoLite2-City.mmdb
2017-08-02T17:30:23.172+02:00 WARN [GeoIpResolverEngine] GeoIP database file does not exist: /etc/graylog/server/GeoLite2-City.mmdb
2017-08-02T17:30:23.293+02:00 WARN [GeoIpResolverEngine] GeoIP database file does not exist: /etc/graylog/server/GeoLite2-City.mmdb
2017-08-02T17:30:24.961+02:00 INFO [ServerBootstrap] Graylog server 2.2.3+7adc951 starting up
2017-08-02T17:30:24.979+02:00 INFO [ServerBootstrap] JRE: Oracle Corporation 1.8.0_131 on Linux 3.10.0-514.26.2.el7.x86_64
2017-08-02T17:30:24.979+02:00 INFO [ServerBootstrap] Deployment: rpm
2017-08-02T17:30:24.979+02:00 INFO [ServerBootstrap] OS: CentOS Linux 7 (Core) (centos)
2017-08-02T17:30:24.980+02:00 INFO [ServerBootstrap] Arch: amd64
2017-08-02T17:30:24.996+02:00 WARN [DeadEventLoggingListener] Received unhandled event of type <org.graylog2.plugin.lifecycles.Lifecycle> from event bus <AsyncEventBus{graylog-eventbus}>
2017-08-02T17:30:25.075+02:00 INFO [PeriodicalsService] Starting 26 periodicals …
2017-08-02T17:30:25.076+02:00 INFO [Periodicals] Starting [org.graylog2.periodical.ThroughputCalculator] periodical in [0s], polling every [1s].
2017-08-02T17:30:25.077+02:00 INFO [Periodicals] Starting [org.graylog2.periodical.AlertScannerThread] periodical in [10s], polling every [60s].
2017-08-02T17:30:25.079+02:00 INFO [Periodicals] Starting [org.graylog2.periodical.BatchedElasticSearchOutputFlushThread] periodical in [0s], polling every [1s].
2017-08-02T17:30:25.079+02:00 INFO [Periodicals] Starting [org.graylog2.periodical.ClusterHealthCheckThread] periodical in [120s], polling every [20s].
2017-08-02T17:30:25.081+02:00 INFO [Periodicals] Starting [org.graylog2.periodical.ContentPackLoaderPeriodical] periodical, running forever.
2017-08-02T17:30:25.135+02:00 INFO [node] [GL_1] starting …
2017-08-02T17:30:25.135+02:00 INFO [Periodicals] Starting [org.graylog2.periodical.GarbageCollectionWarningThread] periodical, running forever.
2017-08-02T17:30:25.168+02:00 INFO [Periodicals] Starting [org.graylog2.periodical.IndexerClusterCheckerThread] periodical in [0s], polling every [30s].
2017-08-02T17:30:25.170+02:00 INFO [connection] Opened connection [connectionId{localValue:5, serverValue:17857}] to GL1_IP:27017
2017-08-02T17:30:25.174+02:00 INFO [Periodicals] Starting [org.graylog2.periodical.IndexRetentionThread] periodical in [0s], polling every [300s].
2017-08-02T17:30:25.175+02:00 INFO [Periodicals] Starting [org.graylog2.periodical.IndexRotationThread] periodical in [0s], polling every [10s].
2017-08-02T17:30:25.175+02:00 INFO [Periodicals] Starting [org.graylog2.periodical.NodePingThread] periodical in [0s], polling every [1s].
2017-08-02T17:30:25.176+02:00 INFO [Periodicals] Starting [org.graylog2.periodical.VersionCheckThread] periodical in [300s], polling every [1800s].
2017-08-02T17:30:25.177+02:00 INFO [Periodicals] Starting [org.graylog2.periodical.ThrottleStateUpdaterThread] periodical in [1s], polling every [1s].
2017-08-02T17:30:25.177+02:00 INFO [Periodicals] Starting [org.graylog2.events.ClusterEventPeriodical] periodical in [0s], polling every [1s].
2017-08-02T17:30:25.184+02:00 INFO [Periodicals] Starting [org.graylog2.events.ClusterEventCleanupPeriodical] periodical in [0s], polling every [86400s].
2017-08-02T17:30:25.186+02:00 INFO [Periodicals] Starting [org.graylog2.periodical.ClusterIdGeneratorPeriodical] periodical, running forever.
2017-08-02T17:30:25.188+02:00 INFO [Periodicals] Starting [org.graylog2.periodical.IndexRangesMigrationPeriodical] periodical, running forever.
2017-08-02T17:30:25.190+02:00 INFO [Periodicals] Starting [org.graylog2.periodical.IndexRangesCleanupPeriodical] periodical in [15s], polling every [3600s].
2017-08-02T17:30:25.191+02:00 INFO [IndexRetentionThread] Elasticsearch cluster not available, skipping index retention checks.
2017-08-02T17:30:25.210+02:00 INFO [connection] Opened connection [connectionId{localValue:6, serverValue:17858}] to GL1_IP:27017
2017-08-02T17:30:25.262+02:00 INFO [connection] Opened connection [connectionId{localValue:9, serverValue:17861}] to GL1_IP:27017
2017-08-02T17:30:25.263+02:00 INFO [connection] Opened connection [connectionId{localValue:10, serverValue:17863}] to GL1_IP:27017
2017-08-02T17:30:25.263+02:00 INFO [connection] Opened connection [connectionId{localValue:8, serverValue:17860}] to GL1_IP:27017
2017-08-02T17:30:25.266+02:00 INFO [connection] Opened connection [connectionId{localValue:7, serverValue:17859}] to GL1_IP:27017
2017-08-02T17:30:25.376+02:00 INFO [PeriodicalsService] Not starting [org.graylog2.periodical.UserPermissionMigrationPeriodical] periodical. Not configured to run on this node.
2017-08-02T17:30:25.376+02:00 INFO [Periodicals] Starting [org.graylog2.periodical.AlarmCallbacksMigrationPeriodical] periodical, running forever.
2017-08-02T17:30:25.376+02:00 INFO [Periodicals] Starting [org.graylog2.periodical.ConfigurationManagementPeriodical] periodical, running forever.
2017-08-02T17:30:25.428+02:00 INFO [connection] Opened connection [connectionId{localValue:11, serverValue:17862}] to GL1_IP:27017
2017-08-02T17:30:25.484+02:00 INFO [Periodicals] Starting [org.graylog2.periodical.LdapGroupMappingMigration] periodical, running forever.
2017-08-02T17:30:25.488+02:00 INFO [Periodicals] Starting [org.graylog2.periodical.IndexFailuresPeriodical] periodical, running forever.
2017-08-02T17:30:25.533+02:00 INFO [Periodicals] Starting [org.graylog.plugins.usagestatistics.UsageStatsNodePeriodical] periodical in [300s], polling every [21600s].
2017-08-02T17:30:25.561+02:00 INFO [Periodicals] Starting [org.graylog.plugins.usagestatistics.UsageStatsClusterPeriodical] periodical in [300s], polling every [21600s].
2017-08-02T17:30:25.619+02:00 INFO [Periodicals] Starting [org.graylog.plugins.pipelineprocessor.periodical.LegacyDefaultStreamMigration] periodical, running forever.
2017-08-02T17:30:25.620+02:00 INFO [Periodicals] Starting [org.graylog.plugins.collector.periodical.PurgeExpiredCollectorsThread] periodical in [0s], polling every [3600s].
2017-08-02T17:30:25.686+02:00 INFO [LegacyDefaultStreamMigration] Legacy default stream has no connections, no migration needed.
2017-08-02T17:30:25.954+02:00 INFO [IndexerClusterCheckerThread] Indexer not fully initialized yet. Skipping periodic cluster check.
2017-08-02T17:30:26.291+02:00 INFO [V20161130141500_DefaultStreamRecalcIndexRanges] Cluster not connected yet, delaying migration until it is reachable.
2017-08-02T17:30:27.214+02:00 INFO [transport] [GL_1] publish_address {GL1_IP:9300}, bound_addresses {GL1_IP:9300}
2017-08-02T17:30:27.240+02:00 INFO [discovery] [GL_1] graylog/bET16gM3Rjes717lw_6HAA
2017-08-02T17:30:27.692+02:00 INFO [JerseyService] Enabling CORS for HTTP endpoint
2017-08-02T17:30:30.286+02:00 WARN [discovery] [GL_1] waited for 3s and no initial state was set by the discovery
2017-08-02T17:30:30.287+02:00 INFO [node] [GL_1] started
2017-08-02T17:30:30.638+02:00 INFO [service] [GL_1] detected_master {ES_2}{XSKdLrd1TuC8pFKL2U9K5A}{ES_2_IP}{ES_2_IP:9300}, added {{ES_3}{xQApt9oBTPW2tnhlONNPRQ}{ES_3_IP}{ES_3_IP:9300},{ES_1}{h2TnW64bTzSHc6VuVZmWDw}{ES_1_IP}{ES_1_IP:9300},{ES_2}{XSKdLrd1TuC8pFKL2U9K5A}{ES_2_IP}{ES_2_IP:9300},}, reason: zen-disco-receive(from master [{ES_2}{XSKdLrd1TuC8pFKL2U9K5A}{ES_2_IP}{ES_2_IP:9300}])
2017-08-02T17:30:30.691+02:00 WARN [IndexerSetupService] The Elasticsearch cluster state is RED which means shards are unassigned.
2017-08-02T17:30:30.691+02:00 INFO [IndexerSetupService] This usually indicates a crashed and corrupt cluster and needs to be investigated. Graylog will write into the local disk journal.
2017-08-02T17:30:30.699+02:00 INFO [IndexerSetupService] See http://docs.graylog.org/en/2.2/pages/configuration/elasticsearch.html for details.
2017-08-02T17:30:32.253+02:00 INFO [service] [GL_1] added {{GL_2}{EfMdWPL3S2CvxbSJpy8EAg}{GL2_IP}{GL2_IP:9300}{client=true, data=false, master=false},}, reason: zen-disco-receive(from master [{ES_2}{XSKdLrd1TuC8pFKL2U9K5A}{ES_2_IP}{ES_2_IP:9300}])
2017-08-02T17:30:33.057+02:00 INFO [service] [GL_1] added {{GL_3}{_gF43MCoQ5CE76HfIvr8ig}{GL3_IP}{GL3_IP:9300}{client=true, data=false, master=false},}, reason: zen-disco-receive(from master [{ES_2}{XSKdLrd1TuC8pFKL2U9K5A}{ES_2_IP}{ES_2_IP:9300}])
2017-08-02T17:30:40.194+02:00 INFO [IndexRangesCleanupPeriodical] Skipping index range cleanup because the Elasticsearch cluster is unreachable or unhealthy
2017-08-02T17:30:42.600+02:00 INFO [NetworkListener] Started listener bound to [GL1_IP:9000]
2017-08-02T17:30:42.602+02:00 INFO [HttpServer] [HttpServer] Started.
2017-08-02T17:30:42.602+02:00 INFO [JerseyService] Started REST API at http://GL1_IP:9000/api/
2017-08-02T17:30:42.603+02:00 INFO [JerseyService] Started Web Interface at http://GL1_IP:9000/
2017-08-02T17:30:42.604+02:00 INFO [ServiceManagerListener] Services are healthy
2017-08-02T17:30:42.605+02:00 INFO [InputSetupService] Triggering launching persisted inputs, node transitioned from Uninitialized [LB:DEAD] to Running [LB:ALIVE]
2017-08-02T17:30:42.605+02:00 INFO [ServerBootstrap] Services started, startup times in ms: {OutputSetupService [RUNNING]=42, BufferSynchronizerService [RUNNING]=171, KafkaJournal [RUNNING]=172, InputSetupService [RUNNING]=362, ConfigurationEtagService [RUNNING]=561, PeriodicalsService [RUNNING]=586, JournalReader [RUNNING]=618, StreamCacheService [RUNNING]=827, IndexerSetupService [RUNNING]=5625, JerseyService [RUNNING]=17528}
2017-08-02T17:30:42.608+02:00 INFO [ServerBootstrap] Graylog server up and running.
2017-08-02T17:30:42.626+02:00 INFO [InputStateListener] Input [Beats/596df5c31a39ed672cacc515] is now STARTING
2017-08-02T17:30:42.642+02:00 WARN [NettyTransport] receiveBufferSize (SO_RCVBUF) for input BeatsInput{title=FileBeat INPUT, type=org.graylog.plugins.beats.BeatsInput, nodeId=null} should be 1048576 but is 212992.
2017-08-02T17:30:42.646+02:00 INFO [InputStateListener] Input [Beats/596df5c31a39ed672cacc515] is now RUNNING


(Can) #6

So here is what I did:

curl -XGET 'ES_1_IP:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED

which gave me this:

 mail_2     1 p UNASSIGNED CLUSTER_RECOVERED
 mail_2     0 p UNASSIGNED CLUSTER_RECOVERED

Then, to delete them:

 curl -XDELETE 'ES_1_IP:9200/mail_2/'

Recommendation: since you have already added the capability to create and delete existing shards, you could also add these two commands as a capability.
So the test will continue with a single-system load.
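
Before deleting anything, it may help to list which indices are actually affected. A small sketch, assuming the same `_cat/shards` columns as in the command above; the function name `unassigned_primary_indices` is made up for illustration. It prints each index that has at least one unassigned primary shard, once:

```shell
# Reads `_cat/shards?h=index,shard,prirep,state,unassigned.reason` output on stdin
# and prints every index with an unassigned primary shard exactly once.
unassigned_primary_indices() {
    awk '$3 == "p" && $4 == "UNASSIGNED" { print $1 }' | sort -u
}

# Usage (ES_1_IP is the same placeholder as above):
#   curl -s 'ES_1_IP:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | unassigned_primary_indices
```

Deleting an index removes its data permanently, so the list is worth reviewing first.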


(Jochen) #7

These seem to be just the logs of one or more Graylog nodes.
Please include the logs of the Elasticsearch nodes as well.

In order to make this readable, please use proper formatting and make sure to mark the log source accordingly.

Example:

```
Logs Graylog node 1
```

```
Logs Graylog node 2
```

(Can) #8

Here is the ES_1 log

[2017-08-02 17:30:11,397][INFO ][node ] [ES_1] closed
[2017-08-02 17:30:15,163][INFO ][node ] [ES_1] version[2.4.5], pid[6332], build[c849dd1/2017-04-24T16:18:17Z]
[2017-08-02 17:30:15,164][INFO ][node ] [ES_1] initializing …
[2017-08-02 17:30:16,226][INFO ][plugins ] [ES_1] modules [reindex, lang-expression, lang-groovy], plugins [kopf, hq], sites [kopf, hq]
[2017-08-02 17:30:16,392][INFO ][env ] [ES_1] using [1] data paths, mounts [[/ (rootfs)]], net usable_space [919.3gb], net total_space [961.1gb], spins? [unknown], types [rootfs]
[2017-08-02 17:30:16,396][INFO ][env ] [ES_1] heap size [989.8mb], compressed ordinary object pointers [true]
[2017-08-02 17:30:19,491][INFO ][node ] [ES_1] initialized
[2017-08-02 17:30:19,491][INFO ][node ] [ES_1] starting …
[2017-08-02 17:30:19,845][INFO ][transport ] [ES_1] publish_address {ES_1_IP:9300}, bound_addresses {ES_1_IP:9300}
[2017-08-02 17:30:19,851][INFO ][discovery ] [ES_1] graylog/h2TnW64bTzSHc6VuVZmWDw
[2017-08-02 17:30:23,162][INFO ][cluster.service ] [ES_1] detected_master {ES_2}{XSKdLrd1TuC8pFKL2U9K5A}{ES_2_IP}{ES_2_IP:9300}, added {{ES_2}{XSKdLrd1TuC8pFKL2U9K5A}{ES_2_IP}{ES_2_IP:9300},}, reason: zen-disco-receive(from master [{ES_2}{XSKdLrd1TuC8pFKL2U9K5A}{ES_2_IP}{ES_2_IP:9300}])
[2017-08-02 17:30:23,206][INFO ][http ] [ES_1] publish_address {ES_1_IP:9200}, bound_addresses {ES_1_IP:9200}
[2017-08-02 17:30:23,207][INFO ][node ] [ES_1] started
[2017-08-02 17:30:24,005][INFO ][cluster.service ] [ES_1] added {{ES_3}{xQApt9oBTPW2tnhlONNPRQ}{ES_3_IP}{ES_3_IP:9300},}, reason: zen-disco-receive(from master [{ES_2}{XSKdLrd1TuC8pFKL2U9K5A}{ES_2_IP}{ES_2_IP:9300}])
[2017-08-02 17:30:30,618][INFO ][cluster.service ] [ES_1] added {{GL_1}{bET16gM3Rjes717lw_6HAA}{GL_1_IP}{GL_1_IP:9300}{client=true, data=false, master=false},}, reason: zen-disco-receive(from master [{ES_2}{XSKdLrd1TuC8pFKL2U9K5A}{ES_2_IP}{ES_2_IP:9300}])
[2017-08-02 17:30:32,248][INFO ][cluster.service ] [ES_1] added {{GL_2}{EfMdWPL3S2CvxbSJpy8EAg}{GL_2_IP}{GL_2_IP:9300}{client=true, data=false, master=false},}, reason: zen-disco-receive(from master [{ES_2}{XSKdLrd1TuC8pFKL2U9K5A}{ES_2_IP}{ES_2_IP:9300}])
[2017-08-02 17:30:33,055][INFO ][cluster.service ] [ES_1] added {{GL_3}{_gF43MCoQ5CE76HfIvr8ig}{GL_3_IP}{GL_3_IP:9300}{client=true, data=false, master=false},}, reason: zen-disco-receive(from master [{ES_2}{XSKdLrd1TuC8pFKL2U9K5A}{ES_2_IP}{ES_2_IP:9300}])


(Can) #9

Sorry, due to working hours I have to leave now, but I will keep track of this.


(Jochen) #10

These are the logs from today, not when the problems occurred…
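
One way to pull the relevant lines out of a Graylog server log is to filter by day and level. This is only a sketch: the function name `problems_on_day` is made up, the date in the usage line is a placeholder for the day the cluster actually went red, and the path is the usual location for package installations, which may differ on your system. Older entries may also sit in rotated files next to the current log.

```shell
# Print WARN and ERROR lines for one day from Graylog server log text on stdin.
# $1 is a date prefix such as 2017-07-29 (placeholder, use the day of the crash).
# Stack trace lines that follow an ERROR entry are not printed.
problems_on_day() {
    awk -v d="$1" 'index($1, d) == 1 && ($2 == "WARN" || $2 == "ERROR")'
}

# Usage, assuming the package default log location:
#   problems_on_day 2017-07-29 < /var/log/graylog-server/server.log
```

The same day should then be checked in the Elasticsearch logs of all three nodes.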


(Can) #11

So here I am again, with another problem: I suddenly lost all my indices.
I get this error when I click System / Indices:

Fetching index sets list failed: No default index set configured. This is a bug!

When I go to Overview, select Indexer failures, and click Show failures, I get:

Hurray! There are not any indexer failures.

But I still cannot see my indices.


(Jochen) #12

Have you manually edited the MongoDB database of Graylog?

Otherwise the default index set will be automatically created when starting Graylog 2.2.0 or later for the first time.


(Can) #13

The only manual thing I did with MongoDB was to start it with --repl graylog.
I did not change anything else in it. For the last few days the system was handling its index changes automatically.


(Jochen) #14

As requested before, please post the complete logs of your Graylog node(s).


(Can) #15

2017-08-10T07:52:39.249+02:00 ERROR [GelfTcpTransport] Exception caught
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_131]
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_131]
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_131]
at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_131]
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[?:1.8.0_131]
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288) ~[graylog.jar:?]
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1100) ~[graylog.jar:?]
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:366) ~[graylog.jar:?]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:118) [graylog.jar:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:651) [graylog.jar:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:574) [graylog.jar:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:488) [graylog.jar:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:450) [graylog.jar:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:873) [graylog.jar:?]
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) [graylog.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
2017-08-10T07:52:39.251+02:00 ERROR [GelfMessageJsonEncoder] JSON encoding error
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_131]
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_131]
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_131]
at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_131]
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[?:1.8.0_131]
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288) ~[graylog.jar:?]
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1100) ~[graylog.jar:?]
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:366) ~[graylog.jar:?]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:118) [graylog.jar:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:651) [graylog.jar:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:574) [graylog.jar:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:488) [graylog.jar:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:450) [graylog.jar:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:873) [graylog.jar:?]
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) [graylog.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]


(Can) #16

And this repeats endlessly.


(Can) #17

So, no solution. I restarted the GL cluster and saw that all settings (dashboards, streams, rules, inputs) were lost. Is there a disaster recovery procedure for such a situation?


(system) #18

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.