Monday, November 23, 2015

(Rolling) restart of elasticsearch datanodes

elasticsearch 1.7.3





A planned restart of a data node should include disabling shard allocation first, to avoid unnecessary rebalancing and a prolonged recovery period when the node rejoins the cluster.

Example:
Disable allocation:
PUT /_cluster/settings
{
    "transient" : {
        "cluster.routing.allocation.enable" : "none"
    }
}

Should reply with:
{
  "persistent": {
  },
  "transient": {
    "cluster": {
      "routing": {
        "allocation": {
          "enable": "none"
        }
      }
    }
  },
  "acknowledged": true
}
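If you drive the API with curl instead of a console, the same call can be wrapped in a small helper. This is a sketch; the host and port in ES_HOST are assumptions, so point it at one of your own nodes.

```shell
# Assumed endpoint; adjust ES_HOST to one of your cluster's nodes.
ES_HOST="${ES_HOST:-localhost:9200}"

# Disable shard allocation cluster-wide before stopping the node.
disable_allocation() {
  curl -s -XPUT "http://$ES_HOST/_cluster/settings" -d '{
    "transient" : { "cluster.routing.allocation.enable" : "none" }
  }'
}

# Usage: disable_allocation
# The cluster should answer with the acknowledged settings document shown above.
```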


Stop the node, do whatever maintenance you need to, then start it again and wait for the cluster to report the rejoin in the logs:
[2015-11-23 01:18:32,623][INFO ][cluster.service          ] [servername] added {[servername2][2DwlAl3SAe-aijdas1336Ew][servername2][inet[/1.1.1.2:9300]],}, reason: zen-disco-receive(join from node[[servername2][2DwlAl3SAe-aijdas1336Ew][servername2][inet[/1.1.1.2:9300]]])
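Instead of tailing the logs, you can also poll the health API and watch number_of_data_nodes climb back to the expected count. A sketch, again assuming a reachable node at ES_HOST:

```shell
# Assumed endpoint; adjust ES_HOST to one of your cluster's nodes.
ES_HOST="${ES_HOST:-localhost:9200}"

# Pull number_of_data_nodes out of a health JSON document on stdin.
extract_data_nodes() {
  sed -n 's/.*"number_of_data_nodes"[: ]*\([0-9]*\).*/\1/p'
}

# Ask the cluster how many data nodes it currently sees.
data_node_count() {
  curl -s "http://$ES_HOST/_cluster/health" | extract_data_nodes
}

# Usage: loop until the restarted node is counted again, e.g. for 3 data nodes:
#   while [ "$(data_node_count)" -lt 3 ]; do sleep 5; done
```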

When to re-enable allocation depends on how busy the cluster is. Re-enabling causes additional load and stress, and I have sometimes delayed it for hours until a suitable occasion appeared. Note that while allocation is disabled, the documents held by the rejoined node are not visible to the cluster, nor does the rejoined node offload the rest in any way, since the data it holds is not considered present (of course).

The cluster might also, depending on shard and replica settings, lack redundancy while allocation is off; expect the overall status to be yellow during the transition.

To re-enable allocation:
PUT /_cluster/settings
{
    "transient" : {
        "cluster.routing.allocation.enable" : "all"
    }
}

Watch the status until it becomes green, then continue to the next node if needed. The time for the cluster to go green, that is, in this context, for all shard and replica criteria to be met, varies greatly with the capacity of each node, the overall load and, not least, the number of documents. E.g. a 3-node cluster with 200M documents should take somewhere around 10 minutes to recover, not more.

GET _cluster/health
{
  "cluster_name": "clustername",
  "active_primary_shards": 56,
  "active_shards": 112,
  "number_of_data_nodes": 3,
  "number_of_in_flight_fetch": 0,
  "number_of_nodes": 5,
  "unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "timed_out": false,
  "delayed_unassigned_shards": 0,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "status": "green"
}
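Rather than polling manually, the health endpoint can block until the desired status is reached via its wait_for_status parameter. A sketch; the endpoint and the 10-minute timeout are assumptions to adjust for your cluster.

```shell
# Assumed endpoint; adjust ES_HOST to one of your cluster's nodes.
ES_HOST="${ES_HOST:-localhost:9200}"

# Block until the cluster reports green (or the timeout expires);
# once it returns green it is safe to move on to the next node.
wait_for_green() {
  curl -s "http://$ES_HOST/_cluster/health?wait_for_status=green&timeout=10m"
}

# Usage: wait_for_green, then inspect "status" in the returned JSON.
```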

