Elasticsearch service is running but the cluster is red on the web interface


#1

Hi, I am excited to join the community and learn about Graylog in detail. I have deployed a standalone Graylog2 server with all the services running on the same machine. I added a bunch of ASAs to send their logs to the server and ran out of space. I added disk space to the cloud server, but the Elasticsearch cluster is still red. I tried to follow the solution from a similar topic, but it did not help. I found the following script online, which I think helped a bit:

   curl -XPOST '192.168.10.10:9200/_cluster/reroute?pretty' -H 'Content-Type: application/json' -d'
   {
     "commands" : [
       {
         "allocate_stale_primary" : {
           "index" : "graylog_0",
           "shard" : 0,
           "node" : "graylog-node1",
           "accept_data_loss" : true
         }
       }
     ]
   }
   '

Output displayed on the CLI:

{
 "acknowledged" : true,
  "state" : {
    "version" : 4,
    "state_uuid" : "",
    "master_node" : "",
    "blocks" : { },
    "nodes" : {
      "" : {
        "name" : "graylog-node1",
        "ephemeral_id" : "",
        "transport_address" : "192.168.10.10:9300",
        "attributes" : { }
      }
    },
    "routing_table" : {
      "indices" : {
        "graylog_0" : {
          "shards" : {
            "0" : [
              {
                "state" : "INITIALIZING",
                "primary" : true,
                "node" : "",
                "relocating_node" : null,
                "shard" : 0,
                "index" : "graylog_0",
                "recovery_source" : {
                  "type" : "EXISTING_STORE"
                },
                "allocation_id" : {
                  "id" : ""
                },
                "unassigned_info" : {
                  "reason" : "CLUSTER_RECOVERED",
                  "at" : "2017-10-31T14:16:23.605Z",
                  "delayed" : false,
                  "allocation_status" : "no_valid_shard_copy"
                }
              }
            ]
          }
        }
      }
    },
    "routing_nodes" : {
      "unassigned" : [ ],
      "nodes" : {
        "" : [
          {
            "state" : "INITIALIZING",
            "primary" : true,
            "node" : "",
            "relocating_node" : null,
            "shard" : 0,
            "index" : "graylog_0",
            "recovery_source" : {
              "type" : "EXISTING_STORE"
            },
            "allocation_id" : {
              "id" : ""
            },
            "unassigned_info" : {
              "reason" : "CLUSTER_RECOVERED",
              "at" : "2017-10-31T14:16:23.605Z",
              "delayed" : false,
              "allocation_status" : "no_valid_shard_copy"
            }
          }
        ]
      }
    }
  }
}

I only have 1 node and 1 shard, 0 replicas. How can I resolve this? Thank you, I hope to resolve this soon and learn from my mistake :slight_smile:


(Matt) #2

What are the outputs of these commands?

curl -s -XGET http://192.168.10.10:9200/_cat/shards

curl -XGET 'http://192.168.10.10:9200/_cluster/health?pretty=true'

#3

curl -s -XGET http://:9200/_cat/shards ====> No output

curl -XGET 'http://:9200/_cluster/health?pretty=true' ====> as shown below

{
  "cluster_name" : "graylog",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 1,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 0.0
}

Thank you


#4

Sorry, I had a typo:

# curl -s -XGET http://<ip add>:9200/_cat/shards
graylog_0 0 p UNASSIGNED

(Matt) #5

I was going to say: it's weird that the health output shows 1 unassigned shard while the first command returned nothing. Okay, that part is ironed out. I'm no Elasticsearch expert, but let's check storage first. You said you added storage, but did you also expand your file system accordingly? Please share the output of this command and report back so we can make sure storage isn't your problem.

df -h

#6
Filesystem      Size  Used  Avail  Use%  Mounted on
/dev/sdc        559G   11G   526G    3%  /
devtmpfs         12G     0    12G    0%  /dev
tmpfs            12G     0    12G    0%  /dev/shm
tmpfs            12G   25M    12G    1%  /run
tmpfs            12G     0    12G    0%  /sys/fs/cgroup
/dev/sda1       488M   91M   362M   21%  /boot
tmpfs           2.4G     0   2.4G    0%  /run/user/0

(Matt) #7

All looks good there. Have you made any progress on assigning the shard?


#8

I tried assigning it, but nothing happened. Can you also tell me how we should do that? I tried to assign/allocate it using the command/script from the start of the thread.

I am still getting this error on the web interface.

Error Message:

Unable to perform search query. {"error":{"root_cause":[],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[]},"status":503}
Details:
{"error":{"root_cause":[],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[]},"status":503}
Search status code:
500
Search response:
cannot GET http://<ip add>:12900/search/universal/relative?query=gl2_source_input%3A3d25e&range=0&limit=150&sort=timestamp%3Adesc (500)

(Matt) #9

You could try something like this. However, this is more a question for the Elasticsearch crowd; I would recommend hitting up their community for more comprehensive help on shard reallocation. Another option: remove the index, if losing the data is acceptable, and start a new one. I know that's not always an option if the data is critical, but when I have run into an issue like this and could stand to lose the data, I just purged the current index and started a new one.

curl -XPOST -d '{ "commands" : [ {
  "allocate" : {
       "index" : ".marvel-2014.05.21", 
       "shard" : 0, 
       "node" : "SOME_NODE_HERE",
       "allow_primary":true 
     } 
  } ] }' http://localhost:9200/_cluster/reroute?pretty
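For the purge route, a minimal sketch (host and index name are copied from earlier in this thread, so adjust them to your setup; deleting an index permanently discards its data):

```shell
# Host and index as they appear in this thread; adjust for your setup.
ES='http://192.168.10.10:9200'
INDEX='graylog_0'

# Confirm the index name first:
#   curl -s "$ES/_cat/indices?v"

# Then delete it (irreversible):
#   curl -XDELETE "$ES/$INDEX?pretty"
echo "would send: DELETE $ES/$INDEX"
```

Graylog should recreate its write index on the next rotation; if it does not, rotating the active write index from System → Indices in the web interface usually takes care of it.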

#10

I get the following error when I run the above command:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "unknown_named_object_exception",
        "reason" : "Unknown AllocationCommand [allocate]",
        "line" : 2,
        "col" : 16
      }
    ],
    "type" : "parsing_exception",
    "reason" : "[cluster_reroute] failed to parse field [commands]",
    "line" : 2,
    "col" : 16,
    "caused_by" : {
      "type" : "unknown_named_object_exception",
      "reason" : "Unknown AllocationCommand [allocate]",
      "line" : 2,
      "col" : 16
    }
  },
  "status" : 400
}

(Matt) #11

Did you replace the “objects” with the values from your own setup? The correct index name, shard, node, etc.?


#12

Yes. I found the appropriate indices using the command:

curl '192.168.10.10:9200/_cat/indices?v'

and then used the command below.

curl -XPOST 'http://$ip address$:9200/_cluster/reroute?pretty' -d '{ "commands" : [ {
  "allocate" : {
       "index" : ".graylog_0", 
       "shard" : 0, 
       "node" : "graylog-node1",
       "allow_primary":true 
     } 
  } ] }' 

I don't understand why it is not assigning the shard back. Graylog is working and messages are coming in, but Elasticsearch is not indexing them :frowning:


(Jochen) #13

Are you 100% sure that “.graylog_0” (with the leading dot) is the correct index name? That looks strange to me.

Also, which version of Elasticsearch are you using? The HTTP API changed quite a bit between ES 2.x and 5.x.

And last but not least, if you don't care about the data in the “broken” index, you could simply delete it to speed things up.
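On the version point: in ES 5.x the reroute API no longer accepts the old 2.x `allocate` command; it was split into `allocate_replica`, `allocate_stale_primary`, and `allocate_empty_primary`, and the primary variants take `accept_data_loss` instead of `allow_primary`. A sketch of the 5.x form (host, index, and node name are copied from this thread, adjust as needed; note that `allocate_empty_primary` discards whatever data the shard held):

```shell
# ES 5.x reroute body; the 2.x "allocate" command no longer exists.
BODY='{"commands":[{"allocate_empty_primary":{"index":"graylog_0","shard":0,"node":"graylog-node1","accept_data_loss":true}}]}'

# Sanity-check the JSON locally before sending it:
echo "$BODY" | python3 -m json.tool > /dev/null && echo "body ok"

# Then send it (uncomment and point at your node):
#   curl -XPOST 'http://192.168.10.10:9200/_cluster/reroute?pretty' \
#        -H 'Content-Type: application/json' -d "$BODY"
```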


#14

you could try this:

curl -XPOST http://node_address:9200/_cluster/reroute?retry_failed

If you ran out of disk space and then added more, this should retry assigning the shard. See https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-reroute.html. Works for ES 5.


#15

@jochen

Thank you for your response. I am using Elasticsearch 5.x.

I also tried it without the leading dot, but it did not make a difference. It shows the same error:

curl -XPOST 'http://ip-add:9200/_cluster/reroute?pretty' -d '{ "commands" : [ {
  "allocate" : {
       "index" : "graylog_0",
       "shard" : 0,
       "node" : "graylog-node1",
       "allow_primary" : true
     }
  } ] }'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "unknown_named_object_exception",
        "reason" : "Unknown AllocationCommand [allocate]",
        "line" : 2,
        "col" : 16
      }
    ],
    "type" : "parsing_exception",
    "reason" : "[cluster_reroute] failed to parse field [commands]",
    "line" : 2,
    "col" : 16,
    "caused_by" : {
      "type" : "unknown_named_object_exception",
      "reason" : "Unknown AllocationCommand [allocate]",
      "line" : 2,
      "col" : 16
    }
  },
  "status" : 400
}

Deleting is the last resort; I want to keep that data. Should I add a separate Elasticsearch node to this setup so that it is independent and has fewer issues? There is not much information available on how to configure all the settings properly so that we don't run into these issues; it is more that you hit an error and only then come across the problem. If you have any resources for reading up on and understanding these things better, please share them.

Thanks!


#16

@jtkarvo

The command you gave did work, as it allocated a new shard. But the cluster is still red and displays 1 unassigned shard.

curl -XPOST http://ip-add:9200/_cluster/reroute?retry_failed
{
  "acknowledged" : true,
  "state" : {
    "version" : 5,
    "state_uuid" : "SdTHQvuDI0kgg",
    "master_node" : "J7IW3TdtsEew",
    "blocks" : { },
    "nodes" : {
      "J7InquWdtsEew" : {
        "name" : "graylog-node1",
        "ephemeral_id" : "wc4LJZXCjq0w",
        "transport_address" : "10.145.102.14:9300",
        "attributes" : { }
      }
    },
    "routing_table" : {
      "indices" : {
        "my_temp_index" : {
          "shards" : {
            "0" : [
              {
                "state" : "STARTED",
                "primary" : true,
                "node" : "J7In3TdtsEew",
                "relocating_node" : null,
                "shard" : 0,
                "index" : "my_temp_index",
                "allocation_id" : {
                  "id" : "Qa-ryfcMy-F2a9g"
                }
              }
            ]
          }
        },
        "graylog_0" : {
          "shards" : {
            "0" : [
              {
                "state" : "UNASSIGNED",
                "primary" : true,
                "node" : null,
                "relocating_node" : null,
                "shard" : 0,
                "index" : "graylog_0",
                "recovery_source" : {
                  "type" : "EXISTING_STORE"
                },
                "unassigned_info" : {
                  "reason" : "CLUSTER_RECOVERED",
                  "at" : "2017-11-02T14:32:28.921Z",
                  "delayed" : false,
                  "allocation_status" : "no_valid_shard_copy"
                }
              }
            ]
          }
        }
      }
    },
    "routing_nodes" : {
      "unassigned" : [
        {
          "state" : "UNASSIGNED",
          "primary" : true,
          "node" : null,
          "relocating_node" : null,
          "shard" : 0,
          "index" : "graylog_0",
          "recovery_source" : {
            "type" : "EXISTING_STORE"
          },
          "unassigned_info" : {
            "reason" : "CLUSTER_RECOVERED",
            "at" : "2017-11-02T14:32:28.921Z",
            "delayed" : false,
            "allocation_status" : "no_valid_shard_copy"
          }
        }
      ],
      "nodes" : {
        "J7InquWfQMEew" : [
          {
            "state" : "STARTED",
            "primary" : true,
            "node" : "J7InquWfTdtsEew",
            "relocating_node" : null,
            "shard" : 0,
            "index" : "my_temp_index",
            "allocation_id" : {
              "id" : "Qa-ryfcFTkg"
            }
          }
        ]
      }
    }
  }
}

There are two new errors now: “Journal utilization is too high” and “uncommitted messages deleted from the journal”.

Should I create a new node for Elasticsearch and connect it to Graylog? I think one standalone server is not sufficient for the amount of logs I am receiving now.


#17

Difficult to tell. With a cluster you can get redundancy, so that is a bonus. If you want that, I would set up 3 servers, all master-eligible nodes, and set the minimum master nodes to 2.
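As a sketch, that three-master setup corresponds to roughly the following in each node's elasticsearch.yml (the hostnames are placeholders; 2 is the quorum for 3 master-eligible nodes, i.e. (3 / 2) + 1):

```yaml
# elasticsearch.yml sketch for a 3-node ES 5.x cluster; hostnames are hypothetical
cluster.name: graylog
node.master: true
node.data: true
discovery.zen.ping.unicast.hosts: ["es-node1", "es-node2", "es-node3"]
# quorum of master-eligible nodes, prevents split brain
discovery.zen.minimum_master_nodes: 2
```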

But you might be able to ingest your volume just fine with a single server; it could be a lack of RAM, disks that are too slow, or something else. I can't tell without more information.

What you need to do is start observing Elasticsearch to see which resources are lacking. Set Graylog up first with an input that receives only a small number of messages per second that you know works properly. Then start slowly adding log volume to the system, and keep watching how Elasticsearch copes.
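For that observation step, the cat APIs give a quick per-node view of the main resources (the endpoint below is the one used earlier in this thread; adjust it to your node):

```shell
ES='http://192.168.10.10:9200'   # adjust to your Elasticsearch node

# Heap, RAM, CPU and load per node:
#   curl -s "$ES/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m"

# Disk usage as Elasticsearch sees it:
#   curl -s "$ES/_cat/allocation?v"

# Rejected operations point at overloaded thread pools:
#   curl -s "$ES/_cat/thread_pool?v"
```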


#18

Btw: ES shows the reason for the unassigned shard with the command

curl -XGET http://my_es_ip:9200/_cluster/allocation/explain?pretty

(if that does not work, leave the ?pretty out of it)


#19

True. I actually did that: I had about 5 devices sending in logs, was monitoring, and everything was fine. Then I added a couple of ASAs to send logs, and it did not even work for two days; it was too late before I could take any action. I have provided sufficient RAM and disk to the server, about 1 TB of space and 24 GB of RAM. I will have to look at all the components again and configure each of them properly so that I don't face these issues again. Apart from the Graylog documentation, is there any other documentation or website to read to understand these services and requirements better? If there is, please share it. Thank you for your response, jtkarvo :smile:


#20

The command was helpful; below is the output.

curl -XGET http://ip-add:9200/_cluster/allocation/explain?pretty
{
  "index" : "graylog_0",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2017-11-02T14:32:28.921Z",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt",
  "node_allocation_decisions" : [
    {
      "node_id" : "J7InquWfQdtsEew",
      "node_name" : "graylog-node1",
      "transport_address" : "ip-add:9300",
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "z7Fu8FDgTc-TCFRig",
        "store_exception" : {
          "type" : "corrupt_index_exception",
          "reason" : "Unexpected file read error while reading index. (resource=BufferedChecksumIndexInput(MMapIndexInput(path=\"/home/elasticsearch/data/nodes/0/indices/OFWa3HkuTq–E5pWM0A/0/index/segments_59\")))",
          "caused_by" : {
            "type" : "e_o_f_exception",
            "reason" : "read past EOF: MMapIndexInput(path=\"/home/elasticsearch/data/nodes/0/indices/OFWa3HkuTq–E5pZM0A/0/index/_7ana.si\")"
          }
        }
      }
    }
  ]
}

Is there anything we can do in this case, where the index is corrupt?

Thank you jtkarvo