DronaBlog

Monday, August 10, 2020

Elastic Search Concepts - Cluster, Node, Index, Document, Shard and Replica

Are you looking for detailed information about the core concepts used in Elastic Search? Would you also like to know what a Document, a Shard, and a Replica are in Elastic Search? If so, you have reached the right place. In this article, we will go through the important concepts that are most commonly used in Elastic Search.


A. Elastic Search Cluster

A cluster is a collection of one or more nodes, identified by a unique name. If we do not provide a name, the cluster name defaults to elasticsearch. We can create a separate cluster for each environment, for example a development cluster, a QA cluster, and a production cluster. A cluster can have more than one node, but it is perfectly fine to run a cluster with just one node. The cluster provides indexing and search capabilities across all of its nodes, i.e. when we index or search data, we do not have to worry about which node the data is indexed on or searched from.


B. The node in Elastic Search

A node is a single server that is part of a cluster and stores data. Like a cluster, each node has a unique name. A node takes part in the indexing and search capabilities of its cluster. Note that node names, unlike index names, do not have to be lowercase. We can add as many nodes as we want; there is no fixed limit. If a cluster has more than one node, then each node holds a subset of the data.




C. Index

So, what is an index? Nodes hold indices, and an index is a collection of similar documents. For example, the documents can be customer information or product information. In short, we create an index for each kind of document. An index name must be in lowercase. The index name is used when indexing, searching, updating, and deleting the documents within that index. We can create any number of indices in a cluster.
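
The lowercase rule is one of several restrictions on index names. Below is a minimal Python sketch of a name check, assuming the restrictions as listed in the Elastic Search documentation for recent versions (lowercase only, a set of forbidden characters, no leading -, _ or +, not . or .., at most 255 bytes). The function name is_valid_index_name is hypothetical, and the exact rule set may differ between versions.

```python
# Characters assumed to be rejected anywhere in an index name.
FORBIDDEN_CHARS = set('\\/*?"<>| ,#:')


def is_valid_index_name(name: str) -> bool:
    """Return True if `name` passes the basic naming rules (a sketch)."""
    if not name or name in (".", ".."):
        return False
    if name != name.lower():              # must be lowercase
        return False
    if name[0] in "-_+":                  # must not start with -, _ or +
        return False
    if any(ch in FORBIDDEN_CHARS for ch in name):
        return False
    return len(name.encode("utf-8")) <= 255   # at most 255 bytes


print(is_valid_index_name("customer"))   # True
print(is_valid_index_name("Customer"))   # False
```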


D. Category or Type in Elastic Search

Inside each index we have a type, which is nothing but a category. We can create multiple categories such as Customer, Product, Vendor, Supplier, Broker, etc. Assume that our index name is customer; then we can create categories such as Individual, Organization, Self Proprietor, etc. Under each category we can have documents. A type has a name and is associated with a mapping, and we create a separate mapping for each type in an index. Here is an additional note about types: Elastic Search is built on Lucene, and Lucene has no concept of a type or category. The category is stored as _type in the document metadata, and when we search for documents of a particular type, Elastic Search applies a filter on this field. Note that mapping types were deprecated in Elastic Search 6.x and removed in later major versions, where each index holds a single kind of document.
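
To illustrate the idea that a type is just a filter on the _type metadata field, here is a small conceptual sketch in plain Python. It does not call Elastic Search; the documents and the search_by_type helper are hypothetical and only mimic the filtering behaviour described above.

```python
# Hypothetical documents in a "customer" index, each tagged with a _type value.
docs = [
    {"_type": "individual", "name": "Jane Doe"},
    {"_type": "organization", "name": "Acme Corp"},
    {"_type": "individual", "name": "John Roe"},
]


def search_by_type(documents, type_name):
    """Keep only the documents whose _type metadata matches type_name."""
    return [d for d in documents if d["_type"] == type_name]


individuals = search_by_type(docs, "individual")
print([d["name"] for d in individuals])  # ['Jane Doe', 'John Roe']
```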




E. Mapping in Elastic Search

A mapping describes the fields of a document and their data types, e.g. string, integer, date, geo, etc. It also contains details about how each field is indexed and stored. In many cases we do not have to create the mapping explicitly, because Elastic Search infers it from the documents we index; this is called dynamic mapping.
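
As an illustration of an explicit mapping, the sketch below builds a request body in Python. The field names are hypothetical. The field types (text, keyword, integer, date, geo_point) are real Elastic Search types, and the body uses the typeless format of version 7.x and later, which is an assumption about the version in use.

```python
import json

# Hypothetical field names for a "customer" index.
customer_mapping = {
    "mappings": {
        "properties": {
            "name": {"type": "text"},          # full-text searchable string
            "email": {"type": "keyword"},      # exact-match string
            "age": {"type": "integer"},
            "signup_date": {"type": "date"},
            "location": {"type": "geo_point"},
        }
    }
}

# Request body that could be sent with PUT /customer when creating the index.
mapping_body = json.dumps(customer_mapping, indent=2)
print(mapping_body)
```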


F. Document

A document is the basic unit of information in Elastic Search. A document contains fields as key/value pairs. A value can be of any data type, such as string, date, or integer, as defined in the mapping. A document could be a single customer, product, vendor, etc. The document is expressed in JSON format and physically resides in the index that we create. We can store as many documents as we need in a given index.
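
For example, a single customer document and the request used to index it might look like the following sketch. The customer fields and the index_request helper are hypothetical; the /{index}/_doc/{id} path follows the document API of Elastic Search 7.x and later, which is an assumption about the version in use.

```python
import json

# Hypothetical customer document; field names are illustrative only.
customer = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "age": 34,
    "signup_date": "2020-08-10",
}


def index_request(index: str, doc_id: str, doc: dict) -> tuple:
    """Build the HTTP method, path and JSON body for indexing one document."""
    return "PUT", f"/{index}/_doc/{doc_id}", json.dumps(doc)


method, path, body = index_request("customer", "1", customer)
print(method, path)   # PUT /customer/_doc/1
print(body)
```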


G. Shard

A shard is a portion of an index. We can divide an index into multiple pieces, i.e. shards, which is helpful when we have a large data set to store on physical disk. If the disk of a single node does not have enough capacity, we can divide the index into multiple shards and spread them across nodes. Each shard is a fully functional index in its own right. In versions before 7.0, an index is created with five primary shards by default (from 7.0 onward the default is one); however, we can configure as many shards as we need when creating the index. In short, shards are created to achieve scalability.
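
The shard count can be set explicitly when an index is created. Here is a minimal sketch that builds such a request body in Python; number_of_shards and number_of_replicas are real index settings, while the index_settings helper is hypothetical.

```python
import json


def index_settings(primary_shards: int, replicas: int) -> str:
    """Return a JSON body for creating an index with explicit shard counts."""
    if primary_shards < 1 or replicas < 0:
        raise ValueError("need at least one primary shard and zero or more replicas")
    return json.dumps({
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas,
        }
    })


# Body that could be sent with PUT /customer to create the index.
settings_body = index_settings(5, 1)
print(settings_body)
```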




H. Replica

A replica is a copy of a primary shard. A replica is never located on the same node as its primary shard, so that when one node goes down, the copy on another node is available for recovery. By default, an index is created with one replica of each primary shard. Assume that we have two nodes and an index with five primary shards: in that case we will have five primary shards and five replica shards spread across the two nodes. So replicas help to achieve high availability. An important thing to note about replicas: search queries can be executed on all replicas in parallel.
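
The two-node example can be worked through with a toy placement routine. This is only a sketch of the rule that copies of the same shard never share a node; it is not the allocation algorithm Elastic Search actually uses, and the allocate function and node names are hypothetical. Unlike this sketch, a real cluster with too few nodes leaves the extra replicas unassigned instead of raising an error.

```python
def allocate(primaries: int, replicas: int, nodes: list) -> dict:
    """Toy round-robin placement: copies of a shard never share a node."""
    if replicas + 1 > len(nodes):
        raise ValueError("not enough nodes to separate every copy")
    layout = {node: [] for node in nodes}
    for shard in range(primaries):
        for copy in range(replicas + 1):
            # Consecutive copies of one shard land on consecutive nodes.
            node = nodes[(shard + copy) % len(nodes)]
            role = "primary" if copy == 0 else "replica"
            layout[node].append((shard, role))
    return layout


layout = allocate(primaries=5, replicas=1, nodes=["node-1", "node-2"])
total = sum(len(shards) for shards in layout.values())
print(total)  # 10 (five primary shards plus five replica shards)
```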


