SolrCloud: how to add a new collection to a running cluster

It's easy to find examples of how to start a SolrCloud cluster with a default collection, but harder to find ones that show how to add a new collection to an already running cluster. Here I'm going to describe the main steps.

Load the collection configuration in ZooKeeper

The collection configuration is simply the "conf" directory of a core in a "non Cloud" Solr setup. In a Cloud setup the files in the "conf" directory cannot live on a single node: they must be loaded into ZooKeeper so that they are distributed to all cluster nodes.

To do this, use the zkcli.sh script shipped with the Solr distribution in the example/scripts/cloud-scripts directory: it's a simplified version of the ZooKeeper script with nearly the same name (zkCli.sh).

./zkcli.sh --zkhost localhost:2181 --cmd upconfig --confdir /node1/solr/examplecoll/conf --confname examplecollcfg

Here /node1/solr/examplecoll/conf is the directory with all the core configuration files (as in a "non Cloud" Solr), while "examplecollcfg" is the name the configuration will have in ZooKeeper. Once the copy is done, ZooKeeper will contain a sort of directory (it's actually a path, under /configs) named "examplecollcfg" holding the files copied from /node1/solr/examplecoll/conf.
Of course you have to replace localhost:2181, the address ZooKeeper listens on in my setup, with the one used in yours.
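Before uploading, it can help to check that the directory really looks like a Solr configuration. What follows is only a minimal sketch: the function name upconfig_cmd is made up for this post, and the check rests on the assumption that a usable conf directory contains at least solrconfig.xml. The function only prints the zkcli.sh command shown above, it doesn't run it, and it assumes paths without spaces.

```shell
#!/bin/sh
# Hypothetical helper (not part of Solr): print the upconfig command only if
# the directory contains solrconfig.xml (assumed to be the minimum required).
upconfig_cmd() {
    zk_host="$1"      # e.g. localhost:2181
    conf_dir="$2"     # e.g. /node1/solr/examplecoll/conf
    conf_name="$3"    # e.g. examplecollcfg
    if [ ! -f "$conf_dir/solrconfig.xml" ]; then
        echo "no solrconfig.xml in $conf_dir" >&2
        return 1
    fi
    echo "./zkcli.sh --zkhost $zk_host --cmd upconfig --confdir $conf_dir --confname $conf_name"
}
```

Once the printed command looks right, you can paste it into the shell from the cloud-scripts directory.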

Bind the collection to its configuration

The second step is to bind the (to be created) collection name to its configuration. This command must be executed before creating the collection.

./zkcli.sh --zkhost localhost:2181 --cmd linkconfig --collection examplecoll --confname examplecollcfg

Here "examplecoll" is the name of the collection we're going to create.

You can skip this step if you specify the configuration name in the command that creates the collection, as shown in a later section.

Create the collection on the cluster leader nodes

The next command creates the cores representing the collection shards on the leader nodes.

curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=examplecoll&numShards=2&replicationFactor=1&maxShardsPerNode=2'

The pattern Solr uses to name the cores is as follows:

<collection>_shard<shardnumber>_replica<replicanumber>

and for the leaders <replicanumber> is 1.
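To make the pattern concrete, here is a small sketch that builds core names the same way. The function name core_name is mine, not something Solr provides.

```shell
#!/bin/sh
# Hypothetical helper: build a core name following the pattern
# <collection>_shard<shardnumber>_replica<replicanumber>.
core_name() {
    printf '%s_shard%s_replica%s\n' "$1" "$2" "$3"
}

# Names of the leader cores for a collection split into 2 shards:
#   core_name examplecoll 1 1
#   core_name examplecoll 2 1
```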

In detail:

numShards: number of shards to split the collection index into;
replicationFactor: number of copies (replicas, leader included) created for each shard when the collection is created. Together with numShards it determines how many cores, and therefore at most how many cluster nodes (Solr servers), the collection uses: numShards × replicationFactor. This is how a collection can use only 10 or 20 nodes in a cluster composed of 100 nodes;
maxShardsPerNode: maximum number of shard replicas (cores) of this collection that can be hosted on a single node.

In the example above replicationFactor is 1: this way the shards are created only on leader nodes and no replicas are associated with them. In the next section we'll see how to manually add replicas to the shards. I prefer to create replicas by hand because I have much more control over which nodes are used and I can choose the name of each core.
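The three parameters interact through simple arithmetic, sketched below. The function names are mine, and the rule they encode is an assumption based on my understanding of the CREATE command: the cores to create (numShards × replicationFactor) must not exceed what the live nodes can host (nodes × maxShardsPerNode).

```shell
#!/bin/sh
# Hypothetical helpers, not part of Solr.

# Number of cores the CREATE command asks for.
total_cores() {
    echo $(( $1 * $2 ))    # numShards * replicationFactor
}

# Succeeds when the requested cores fit on the given number of nodes.
fits_cluster() {
    num_shards="$1"; replication_factor="$2"; max_per_node="$3"; live_nodes="$4"
    needed=$(( num_shards * replication_factor ))
    capacity=$(( live_nodes * max_per_node ))
    [ "$needed" -le "$capacity" ]
}
```

With the values used above (numShards=2, replicationFactor=1, maxShardsPerNode=2) two cores are needed, so under this rule even a single node would be enough.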

The complete reference for this and other commands related to the Collections API is here:

https://cwiki.apache.org/confluence/display/solr/Collections+API

Bind the collection to its configuration + Create the collection on the cluster leader nodes

As I said before, it's possible to create a collection and bind it to its configuration in a single step by adding the collection.configName parameter to the command used to create the collection.

curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=examplecoll&numShards=2&replicationFactor=1&maxShardsPerNode=2&collection.configName=examplecollcfg'
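If you create collections often, the URL can be assembled from its parts. This is just a sketch with a made-up function name (create_collection_url). It only prints the URL and leaves the request itself to curl. The last argument is optional: when it's empty the collection.configName parameter is left out, which is the case where the linkconfig step is needed.

```shell
#!/bin/sh
# Hypothetical helper: print the Collections API CREATE URL.
create_collection_url() {
    solr_base="$1"       # e.g. http://localhost:8983/solr
    coll_name="$2"
    num_shards="$3"
    replication_factor="$4"
    max_per_node="$5"
    config_name="$6"     # optional
    url="$solr_base/admin/collections?action=CREATE&name=$coll_name"
    url="$url&numShards=$num_shards&replicationFactor=$replication_factor"
    url="$url&maxShardsPerNode=$max_per_node"
    if [ -n "$config_name" ]; then
        url="$url&collection.configName=$config_name"
    fi
    echo "$url"
}

# Usage (quote the URL, it contains '&'):
#   curl "$(create_collection_url http://localhost:8983/solr examplecoll 2 1 2 examplecollcfg)"
```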

Create the collection on the replicas

The cores that act as replicas of the collection shards have to be created by hand on the nodes that will host them.

Note: the command to create a collection (and all other commands belonging to the Collections API) can be executed on any cluster node. Commands related to single cores or replicas must be executed on the node hosting the core/replica.

Having created a two-shard collection (numShards=2), we're going to create a replica for each of the shards:

curl 'http://localhost:7500/solr/admin/cores?action=CREATE&name=examplecoll_shard1_replica2&collection=examplecoll&shard=shard1'

curl 'http://localhost:8900/solr/admin/cores?action=CREATE&name=examplecoll_shard2_replica2&collection=examplecoll&shard=shard2'

In my setup the Solr servers are hosted on the same physical machine, so I had to change the ports on which they listen: the nodes on which we're creating the replicas listen on ports 7500 and 8900.

The core name can be anything: in this example I used the same pattern Solr uses. The mandatory parameters are the name of the collection the cores belong to and the id of the shard that each core will replicate.
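With more than a couple of shards, typing one command per replica gets tedious. Below is a sketch that generates the URLs from a list of "port shard" pairs. The function names are mine, and the localhost host name reflects my single-machine setup. Nothing is sent to Solr: the functions only print the URLs.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Solr.

# Print the Core API CREATE URL for one replica, naming the core with
# the same pattern Solr uses.
replica_core_url() {
    port="$1"; collection="$2"; shard="$3"; replica="$4"
    echo "http://localhost:$port/solr/admin/cores?action=CREATE&name=${collection}_${shard}_replica${replica}&collection=$collection&shard=$shard"
}

# Read "port shard" pairs from standard input, one per line, and print
# one URL per pair.
replica_urls() {
    collection_name="$1"; replica_number="$2"
    while read -r node_port shard_id; do
        replica_core_url "$node_port" "$collection_name" "$shard_id" "$replica_number"
    done
}

# Usage:
#   printf '7500 shard1\n8900 shard2\n' | replica_urls examplecoll 2 |
#       while read -r url; do curl "$url"; done
```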

Some of these tasks can also be accomplished with the Collections API instead of the Core API I used here. This article shows how:

http://heliosearch.org/solrcloud-assigning-nodes-machines

