DronaBlog

Thursday, August 20, 2020

Elasticsearch - Types of Analyzers in Elasticsearch

Do you know how many types of analyzers are available in Elasticsearch? Are you looking for details about the analyzers that come with Elasticsearch? If so, you have reached the right place. In this article, we will discuss the types of analyzers that are most commonly used in Elasticsearch.




What is an Analyzer?

An analyzer is a package that contains three lower-level building blocks: character filters, a tokenizer, and token filters. Character filters preprocess the raw text, the tokenizer splits it into tokens, and token filters modify, add, or remove tokens.
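
To make the three building blocks concrete, here is a minimal Python sketch of the pipeline. It is only a conceptual model, not how Elasticsearch is implemented: the function names (`strip_html`, `whitespace_tokenizer`, `lowercase_filter`, `analyze`) are made up for this illustration.

```python
import re
from typing import Callable, List


def strip_html(text: str) -> str:
    """Character filter: remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def whitespace_tokenizer(text: str) -> List[str]:
    """Tokenizer: split the text into tokens on whitespace."""
    return text.split()


def lowercase_filter(tokens: List[str]) -> List[str]:
    """Token filter: convert every token to lowercase."""
    return [token.lower() for token in tokens]


def analyze(
    text: str,
    char_filters: List[Callable[[str], str]],
    tokenizer: Callable[[str], List[str]],
    token_filters: List[Callable[[List[str]], List[str]]],
) -> List[str]:
    """Run the stages in order: character filters, tokenizer, token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


result = analyze(
    "<b>Hello</b> World",
    char_filters=[strip_html],
    tokenizer=whitespace_tokenizer,
    token_filters=[lowercase_filter],
)
print(result)  # ['hello', 'world']
```

Each built-in analyzer described in this article can be thought of as a fixed choice for these three stages.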


Types of Analyzer

Here is a list of the analyzers that come with Elasticsearch:

  • Standard Analyzer
  • Simple Analyzer
  • Whitespace Analyzer
  • Stop Analyzer
  • Keyword Analyzer
  • Pattern Analyzer
  • Language Analyzers
  • Fingerprint Analyzer


Understanding Analyzers

  • Standard Analyzer

The standard analyzer divides the text into terms on word boundaries. Punctuation is removed and upper case characters are converted to lowercase. It also supports removing stop words, although stop word removal is disabled by default.

e.g 

Input: "This is a sample example, for STANDARD-Analyzer"

Output:[this, is, a, sample, example, for, standard, analyzer]
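
One way to check this on your own cluster is the `_analyze` API, which accepts an analyzer name and a text. The sketch below only builds the JSON request body in Python and does not contact a cluster; the helper name `build_analyze_request` is an assumption made for this example.

```python
import json


def build_analyze_request(analyzer: str, text: str) -> str:
    """Return the JSON body for a POST /_analyze request (nothing is sent)."""
    return json.dumps({"analyzer": analyzer, "text": text})


body = build_analyze_request(
    "standard", "This is a sample example, for STANDARD-Analyzer"
)
print("POST /_analyze")
print(body)
```

Sending this body to a running cluster should return a `tokens` list in which each entry carries a `token` field. The same request works for the other built-in analyzers by changing the analyzer name.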


  • Simple Analyzer

With the Simple Analyzer, the text is divided into separate terms whenever a non-letter character appears. A non-letter character can be a number, a hyphen, a space, and so on. Upper case characters are converted to lowercase.

Input: "My dog's name is Rocky-Hunter"

Output:[my, dog, s, name, is, rocky, hunter]
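
The behaviour can be approximated in a few lines of Python. This is a rough sketch using the `re` module, not the code Elasticsearch runs, and `simple_analyze` is a name chosen for this example.

```python
import re
from typing import List


def simple_analyze(text: str) -> List[str]:
    """Lowercase the text and keep only runs of letters as terms."""
    # [^\W\d_]+ matches one or more letters (word characters minus digits and "_").
    return re.findall(r"[^\W\d_]+", text.lower())


print(simple_analyze("My dog's name is Rocky-Hunter"))
# ['my', 'dog', 's', 'name', 'is', 'rocky', 'hunter']
```

Note that digits act as separators too, so a term such as "abc123def" would come out as two terms.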


  • Whitespace Analyzer

The input phrase is divided into terms based on whitespace. It does not lowercase terms.

Input: "Technology-World has articles on ElasticSearch and Artificial-Intelligence etc."

Output:[Technology-World, has, articles, on, ElasticSearch, and,  Artificial-Intelligence, etc.]
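
In Python terms this is close to calling `str.split()` with no arguments, as the following approximation shows. The function name `whitespace_analyze` is made up for this example.

```python
from typing import List


def whitespace_analyze(text: str) -> List[str]:
    """Split on runs of whitespace; case and punctuation are left untouched."""
    return text.split()


tokens = whitespace_analyze(
    "Technology-World has articles on ElasticSearch and Artificial-Intelligence etc."
)
print(tokens)
```

The hyphenated words and the trailing period in "etc." survive because only whitespace is treated as a separator.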


  • Stop Analyzer

The Stop Analyzer is a form of the Simple Analyzer: the text is divided into separate terms whenever a non-letter character is encountered, such as a number, a hyphen, or a space, and upper case characters are converted to lowercase. Additionally, it removes stop words. For the example below, assume that the stop word list includes the words 'the', 'is', and 'of'.

Input: "Gone with the wind is one of my favorite books."

Output:[gone, with, wind, one, my, favorite, books]
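
A small Python approximation makes the two steps visible: split and lowercase as the Simple Analyzer does, then drop the stop words. The stop word set below contains only the three words assumed in this example, not the default list shipped with Elasticsearch, and `stop_analyze` is a hypothetical name.

```python
import re
from typing import FrozenSet, List

# Stop words assumed for this example only.
EXAMPLE_STOP_WORDS = frozenset({"the", "is", "of"})


def stop_analyze(text: str, stop_words: FrozenSet[str] = EXAMPLE_STOP_WORDS) -> List[str]:
    """Lowercase, split on non-letters, then remove the stop words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [token for token in tokens if token not in stop_words]


print(stop_analyze("Gone with the wind is one of my favorite books."))
# ['gone', 'with', 'wind', 'one', 'my', 'favorite', 'books']
```

Because the terms are lowercased before the comparison, "The" and "the" are both removed.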




  • Keyword Analyzer

The input phrase is NOT divided into terms; rather, the output token is the same as the input phrase.

Input: "Mount Everest is one of the worlds natural wonders"

Output:[Mount Everest is one of the worlds natural wonders]
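
As a sketch, the whole analyzer reduces to wrapping the input in a one-element list. The function name `keyword_analyze` is made up for this illustration.

```python
from typing import List


def keyword_analyze(text: str) -> List[str]:
    """Return the entire input as a single, unmodified token."""
    return [text]


indexed = keyword_analyze("Mount Everest is one of the worlds natural wonders")
print(indexed)

# Only the exact same string produces the same token.
print(keyword_analyze("mount everest is one of the worlds natural wonders") == indexed)  # False
```

This is why the keyword analyzer suits values that should match exactly, such as identifiers or status codes.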


  • Pattern Analyzer

The pattern analyzer uses a regular expression to split the text into terms. The default regular expression is \W+, which matches all non-word characters. Keep in mind that the regular expression matches the term separators in the input phrase, not the terms themselves. Upper case characters are converted to lowercase. Stop word removal is also supported, but no stop words are removed by default.

Input: "My daughter's name is Rita and she is 7 years old"

Output:[my, daughter, s, name, is, rita, and, she, is, 7, years, old]
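
The idea that the pattern describes the separators can be sketched with Python's `re.split`. This is an approximation: Elasticsearch uses Java regular expressions, so the `re.ASCII` flag is used here on the assumption that it brings `\W` close to Java's default behaviour. The name `pattern_analyze` is hypothetical.

```python
import re
from typing import List


def pattern_analyze(text: str, pattern: str = r"\W+", lowercase: bool = True) -> List[str]:
    """Split the text wherever the pattern matches; the matches are discarded."""
    tokens = [token for token in re.split(pattern, text, flags=re.ASCII) if token]
    if lowercase:
        tokens = [token.lower() for token in tokens]
    return tokens


print(pattern_analyze("My daughter's name is Rita and she is 7 years old"))
# ['my', 'daughter', 's', 'name', 'is', 'rita', 'and', 'she', 'is', '7', 'years', 'old']

# A custom separator pattern: split on commas followed by optional spaces.
print(pattern_analyze("red, green,blue", pattern=r",\s*"))
# ['red', 'green', 'blue']
```

Unlike the Simple Analyzer, digits are word characters here, so "7" is kept as a term.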


  • Language Analyzers

Elasticsearch provides language-specific analyzers for languages such as English, French, and Hindi.

Here is a sample keyword list for the Hindi language analyzer. Words in such a list are protected from stemming.

e.g. "keywords": ["उदाहरण"]
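
To show where such a keyword list might live, the sketch below builds index settings as a Python dictionary. It is a reduced illustration, not the full filter chain of the built-in Hindi analyzer: the names `hindi_keywords`, `hindi_stemmer`, and `my_hindi` are made up, and only a `keyword_marker` filter and a `stemmer` filter are wired in.

```python
import json

index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Words listed here are marked as keywords and skipped by the stemmer.
                "hindi_keywords": {"type": "keyword_marker", "keywords": ["उदाहरण"]},
                "hindi_stemmer": {"type": "stemmer", "language": "hindi"},
            },
            "analyzer": {
                # Hypothetical custom analyzer; the keyword filter must run before the stemmer.
                "my_hindi": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "hindi_keywords", "hindi_stemmer"],
                }
            },
        }
    }
}

# This JSON could be used as the body when creating an index.
print(json.dumps(index_settings, ensure_ascii=False, indent=2))
```

If you only need to protect a few words, the built-in language analyzers also accept a `stem_exclusion` list, which avoids defining a custom analyzer.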


  • Fingerprint Analyzer

The fingerprint analyzer is used for duplicate detection. The input phrase is converted to lowercase and extended characters are normalized to their basic form. The terms are then sorted, duplicate terms are removed, and the result is concatenated into a single token. It also supports stop words.

Input: "á is a Spanish accents character"

Output:[a accents character is spanish]
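
The steps can be approximated in Python as follows. This is a sketch only: accent removal is done with Unicode decomposition from `unicodedata`, which is an assumption standing in for the folding Elasticsearch performs, and the names `fold_to_ascii` and `fingerprint_analyze` are made up for this example.

```python
import re
import unicodedata
from typing import List


def fold_to_ascii(text: str) -> str:
    """Strip accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fingerprint_analyze(text: str, separator: str = " ") -> List[str]:
    """Lowercase, fold accents, then sort, de-duplicate and join into one token."""
    tokens = re.findall(r"\w+", fold_to_ascii(text.lower()))
    return [separator.join(sorted(set(tokens)))]


print(fingerprint_analyze("á is a Spanish accents character"))
# ['a accents character is spanish']

# The same words in a different order give the same fingerprint.
print(
    fingerprint_analyze("Spanish accents, á character is a")
    == fingerprint_analyze("á is a Spanish accents character")
)  # True
```

Two records whose text produces the same fingerprint are good candidates for being duplicates, regardless of word order, case, or accents.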





