Do you know how many types of analyzers are available in Elasticsearch? Are you looking for details about the analyzers that ship with Elasticsearch? If so, you have reached the right place. In this article, we discuss the analyzers that are most commonly used in Elasticsearch.
What is an Analyzer?
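An analyzer is a package of three lower-level building blocks that together process text: zero or more character filters, exactly one tokenizer, and zero or more token filters. Character filters work on the raw text, the tokenizer splits it into tokens, and token filters then modify, add, or remove tokens.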
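To make the three stages concrete, here is a minimal pure-Python sketch of how they chain together. It is only an illustration of the pipeline idea, not Elasticsearch's implementation; the helpers `strip_html`, `whitespace_tokenizer`, and `lowercase` are hypothetical names invented for this example.

```python
import re
from typing import Callable, Iterable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def analyze(text: str,
            char_filters: Iterable[CharFilter],
            tokenizer: Tokenizer,
            token_filters: Iterable[TokenFilter]) -> List[str]:
    """Run text through character filters, one tokenizer, then token filters."""
    for char_filter in char_filters:    # stage 1: rewrite the raw characters
        text = char_filter(text)
    tokens = tokenizer(text)            # stage 2: split the text into tokens
    for token_filter in token_filters:  # stage 3: change, add, or drop tokens
        tokens = token_filter(tokens)
    return tokens


# Hypothetical building blocks, for illustration only.
def strip_html(text: str) -> str:
    return re.sub(r"<[^>]+>", "", text)


def whitespace_tokenizer(text: str) -> List[str]:
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    return [token.lower() for token in tokens]


print(analyze("<b>Quick</b> Brown Fox",
              [strip_html], whitespace_tokenizer, [lowercase]))
# ['quick', 'brown', 'fox']
```

Every analyzer described in this article can be read as one particular choice of these three stages.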
Types of Analyzer
Here is a list of the analyzers that come with Elasticsearch:
- Standard Analyzer
- Simple Analyzer
- Whitespace Analyzer
- Stop Analyzer
- Keyword Analyzer
- Pattern Analyzer
- Language Analyzers
- Fingerprint Analyzer
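If you want to check the examples in this article against a real cluster, the `_analyze` API accepts an analyzer name and a piece of text and returns the resulting tokens. The sketch below uses only the Python standard library; it assumes an unsecured node reachable at `http://localhost:9200`, which is an assumption for illustration, so adjust the URL and add authentication for your own setup. The function names are hypothetical.

```python
import json
import urllib.request
from typing import List


def build_analyze_request(analyzer: str, text: str,
                          base_url: str = "http://localhost:9200") -> urllib.request.Request:
    """Build (but do not send) a POST request for the _analyze API."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_analyze(analyzer: str, text: str,
                base_url: str = "http://localhost:9200") -> List[str]:
    """Send the request and return only the token strings from the response."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [entry["token"] for entry in payload["tokens"]]


# Requires a running node, so it is left commented out here:
# print(run_analyze("whitespace", "Technology-World has articles"))
```

The built-in analyzers are addressed by lowercase names such as `standard`, `simple`, `whitespace`, `stop`, `keyword`, `pattern`, and `fingerprint`, and language analyzers by the language name, for example `english`.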
Understanding Analyzers
- Standard Analyzer
The standard analyzer divides text into terms at word boundaries. Most punctuation is removed and uppercase characters are converted to lowercase. It also supports removing stop words, although stop word removal is disabled by default.
e.g.
Input: "This is a sample example, for STANDARD-Analyzer"
Output:[this, is, a, sample, example, for, standard, analyzer]
- Simple Analyzer
With the simple analyzer, the text is divided into separate terms whenever a non-letter character appears. Non-letter characters include digits, hyphens, spaces, and so on, and they are discarded. Uppercase characters are converted to lowercase.
Input: "My dog's name is Rocky-Hunter"
Output:[my, dog, s, name, is, rocky, hunter]
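The behavior of the simple analyzer can be approximated in a few lines of Python. This is a rough sketch rather than Elasticsearch's own code: the regular expression `[^\W\d_]+` is my approximation of "a run of letters", and the function name `simple_analyze` is hypothetical.

```python
import re
from typing import List


def simple_analyze(text: str) -> List[str]:
    """Keep runs of letters, treat everything else as a separator, lowercase."""
    # [^\W\d_]+ matches word characters that are neither digits nor underscores,
    # which is close to (but not exactly) "letters only".
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


print(simple_analyze("My dog's name is Rocky-Hunter"))
# ['my', 'dog', 's', 'name', 'is', 'rocky', 'hunter']
```

Note how the apostrophe and the hyphen both act as separators, which is why "dog's" becomes two terms.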
- Whitespace Analyzer
The input phrase is divided into terms based on whitespace. It does not lowercase terms.
Input: "Technology-World has articles on ElasticSearch and Artificial-Intelligence etc."
Output:[Technology-World, has, articles, on, ElasticSearch, and, Artificial-Intelligence, etc.]
- Stop Analyzer
The stop analyzer is a form of the simple analyzer: the text is divided into separate terms whenever a non-letter character is encountered. Non-letter characters include digits, hyphens, spaces, and so on. As in the simple analyzer, uppercase characters are converted to lowercase. Additionally, it removes stop words. Assume that the stop word list includes the words 'the', 'is', and 'of'.
Input: "Gone with the wind is one of my favorite books."
Output:[gone, with, wind, one, my, favorite, books]
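The same idea in a small Python sketch: split on non-letters, lowercase, then drop stop words. The stop word set here is the three-word list assumed in the example above, not Elasticsearch's default English list, and `stop_analyze` is a hypothetical name.

```python
import re
from typing import FrozenSet, List

# Assumed stop word list taken from the example, NOT Elasticsearch's default list.
STOP_WORDS: FrozenSet[str] = frozenset({"the", "is", "of"})


def stop_analyze(text: str, stop_words: FrozenSet[str] = STOP_WORDS) -> List[str]:
    """Simple-analyzer-style tokenization followed by stop word removal."""
    tokens = [token.lower() for token in re.findall(r"[^\W\d_]+", text)]
    return [token for token in tokens if token not in stop_words]


print(stop_analyze("Gone with the wind is one of my favorite books."))
# ['gone', 'with', 'wind', 'one', 'my', 'favorite', 'books']
```

Lowercasing happens before the stop word check, so "The" and "the" are both removed.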
- Keyword Analyzer
The input phrase is not divided into terms; the entire input is emitted unchanged as a single token.
Input: "Mount Everest is one of the world's natural wonders"
Output:[Mount Everest is one of the world's natural wonders]
- Pattern Analyzer
The pattern analyzer uses a regular expression to split the text into terms. The default regular expression is \W+, which matches any run of non-word characters. Remember that the regular expression matches the separators between terms, not the terms themselves. Uppercase characters are converted to lowercase. Stop word removal is supported but disabled by default.
Input: "My daughter's name is Rita and she is 7 years old"
Output:[my, daughter, s, name, is, rita, and, she, is, 7, years, old]
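Because the pattern describes the separators, splitting on it reproduces the behavior closely. The sketch below uses Python's `re` module, whereas Elasticsearch uses Java regular expressions, so the two can differ on edge cases; `pattern_analyze` is a hypothetical name.

```python
import re
from typing import List


def pattern_analyze(text: str, pattern: str = r"\W+", lowercase: bool = True) -> List[str]:
    """Split text wherever the separator pattern matches, then lowercase."""
    # re.split can yield empty strings at the edges; filter them out.
    tokens = [token for token in re.split(pattern, text) if token]
    return [token.lower() for token in tokens] if lowercase else tokens


print(pattern_analyze("My daughter's name is Rita and she is 7 years old"))
# ['my', 'daughter', 's', 'name', 'is', 'rita', 'and', 'she', 'is', '7', 'years', 'old']

# A custom separator pattern: split on commas and semicolons only.
print(pattern_analyze("Red,Green;Blue", pattern=r"[,;]"))
# ['red', 'green', 'blue']
```

Unlike the simple analyzer, digits count as word characters under \W+, so "7" survives as a term.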
- Language Analyzers
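Elasticsearch provides language-specific analyzers for languages such as English, French, and Hindi.
A language analyzer can be given a list of keywords that are protected from stemming. Here is a sample keyword list from the Hindi language analyzer (the word means "example").
e.g. "keywords": ["उदाहरण"]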
- Fingerprint Analyzer
The fingerprint analyzer is used for duplicate detection. The input phrase is converted to lowercase and extended characters are normalized to their plain equivalents. The terms are then sorted, duplicate terms are removed, and the result is concatenated into a single token. It also supports stop words, which are removed only if configured.
Input: "á is a Spanish accents character"
Output:[a accents character is spanish]
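The fingerprint steps (lowercase, fold extended characters, sort, deduplicate, join) can be sketched with the standard library. This is an approximation under my own assumptions: accent folding is done with Unicode NFKD decomposition, and tokens are found with `\w+` rather than Elasticsearch's tokenizer. The name `fingerprint` is hypothetical.

```python
import re
import unicodedata


def fingerprint(text: str, separator: str = " ") -> str:
    """Lowercase, strip accents, then emit sorted unique terms as one token."""
    # Decompose accented characters, then drop the combining marks (á -> a).
    decomposed = unicodedata.normalize("NFKD", text.lower())
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = re.findall(r"\w+", folded)
    # set() removes duplicates, sorted() makes the result order-independent.
    return separator.join(sorted(set(tokens)))


print(fingerprint("á is a Spanish accents character"))
# a accents character is spanish
```

Two inputs that contain the same words in a different order or with different accents produce the same fingerprint, which is what makes it useful for spotting duplicates.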