dreaife

Getting Started with Elasticsearch

2023-08-13

middle-side

elasticSearch

/

java

Getting Started with Elasticsearch#

Understanding ES#

The role of Elasticsearch#

Elasticsearch is a very powerful open-source search engine with many capabilities, which can help us quickly find the content we need from vast amounts of data.

For example:

Search code on GitHub
Search products on e-commerce sites
Search for answers on Baidu
Search for nearby taxis in ride-hailing apps

ELK Stack#

Elasticsearch, together with Kibana, Logstash, and Beats, forms the Elastic Stack (formerly the ELK Stack). It is widely used in log data analysis, real-time monitoring, and related fields.

Elasticsearch is the core of the Elastic Stack, responsible for storing, searching, and analyzing data.

Elasticsearch and Lucene#

The underlying implementation of Elasticsearch is based on Lucene.

Lucene is a Java-based search engine library, a top-level project of the Apache Software Foundation, developed by Doug Cutting in 1999.

Elasticsearch history:

In 2004, Shay Banon developed Compass based on Lucene
In 2010, Shay Banon rewrote Compass and named it Elasticsearch.

What is Elasticsearch?

An open-source distributed search engine that can be used to implement search, log statistics, analytics, system monitoring, and more

What is the Elastic Stack (ELK)?

A technology stack centered on Elasticsearch, including Beats, Logstash, Kibana, and Elasticsearch

What is Lucene?

An Apache open-source search engine library that provides the core APIs for search

Inverted Index#

An inverted index is defined in contrast to a forward index, the kind of index MySQL uses.

Forward Index#

If you create an index on the id in a table, queries based on id will go through the index, and the lookup is very fast.

But if you want to perform fuzzy searches on the title, you can only scan row by row, with the following process:

The user searches data with the condition that the title matches “%phone%”
Retrieve data row by row, for example data with id = 1
Check whether the title in the data matches the user’s search condition
If it matches, add it to the result set; otherwise discard it. Then move on to the next row and repeat from the retrieval step

Row-by-row scanning, i.e., full table scan, becomes slower as data volume grows. When data volume reaches millions, it becomes a disaster.

Inverted Index#

There are two very important concepts in inverted indexes:

Document: the data used for searching; each item is a document. For example, a webpage, a product description
Term: a meaningful word produced by tokenizing the document data or the user search data using some algorithm

An inverted index is built by further processing the forward index. The process is:

Tokenize each document’s data using an algorithm to obtain terms
Create a table where each row includes a term, the document id where the term resides, position, etc.
Because terms are unique, you can create an index on terms, such as a hash-table index

The search process for an inverted index (using the query for “Xiaomi phone” as an example):

The user enters the query “Xiaomi phone” to search.
Tokenize the user input to obtain terms: Xiaomi, phone.
Look up the terms in the inverted index to obtain document ids that contain the terms: 1, 2, 3.
Use the document ids to look up the actual documents in the forward index.

Although you first query the inverted index, then the forward index, both the terms and the document ids have indexes, so the query is very fast—no full table scans.
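
The build and search steps above can be sketched in a few lines of Java. This is only a toy model under stated assumptions: the whitespace tokenizer, the class name InvertedIndexDemo, and the sample documents are made up for illustration, and Lucene's real on-disk structures are far more elaborate.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy inverted index (illustrative only, not how Lucene stores data). */
class InvertedIndexDemo {
    // Forward index: document id -> original text
    private final Map<Integer, String> forward = new HashMap<>();
    // Inverted index: term -> ids of the documents that contain it
    private final Map<String, SortedSet<Integer>> inverted = new HashMap<>();

    // Hypothetical tokenizer: lower-case the text and split on whitespace
    static String[] tokenize(String text) {
        return text.toLowerCase().trim().split("\\s+");
    }

    void add(int id, String text) {
        forward.put(id, text);
        for (String term : tokenize(text)) {
            inverted.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
        }
    }

    // Ids of documents containing at least one term of the query, ascending
    List<Integer> searchIds(String query) {
        SortedSet<Integer> ids = new TreeSet<>();
        for (String term : tokenize(query)) {
            SortedSet<Integer> posting = inverted.get(term); // hash lookup, no scan
            if (posting != null) {
                ids.addAll(posting);
            }
        }
        return new ArrayList<>(ids);
    }

    // Resolve the ids back to documents through the forward index
    List<String> search(String query) {
        List<String> docs = new ArrayList<>();
        for (int id : searchIds(query)) {
            docs.add(forward.get(id));
        }
        return docs;
    }

    public static void main(String[] args) {
        InvertedIndexDemo index = new InvertedIndexDemo();
        index.add(1, "Xiaomi phone");
        index.add(2, "Huawei phone");
        index.add(3, "Xiaomi band");
        System.out.println(index.searchIds("Xiaomi phone")); // [1, 2, 3]
        System.out.println(index.search("band"));            // [Xiaomi band]
    }
}
```

Both lookups are map accesses keyed by term or by id, which is why no document ever has to be scanned at query time.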

Forward vs Inverted#

So why is one called forward index and the other inverted index?

Forward index is the traditional approach, indexed by id. But when querying by terms, you must first retrieve each document one by one, then check whether the document contains the needed terms. This is a process of finding terms from documents.
Inverted index is the opposite: first find the terms the user wants to search for, obtain the document ids containing those terms, then retrieve the documents by id. This is a process of finding documents from terms.

Forward index:

Advantages:
- You can create indexes on multiple fields
- Search and sort by indexed fields are very fast
Disadvantages:
- For non-indexed fields, or when querying by a subset of terms in an indexed field, you may need a full table scan

Inverted index:

Advantages:
- Very fast for term-based and fuzzy searches
Disadvantages:
- You can only index terms, not fields
- Cannot sort by fields

Some concepts in ES#

Elasticsearch has many unique concepts, somewhat different from MySQL, but with similarities.

Documents and Fields#

Elasticsearch stores data as documents. A document can be a database row of product data or an order record. Document data is serialized to JSON when stored in Elasticsearch.

JSON documents typically contain many fields, similar to columns in a database.

Index and Mapping#

An index is a collection of documents of the same type.

For example:

All user documents can be organized together as the user index
All product documents can be organized together as the product index
All order documents can be organized together as the order index

Therefore, an index can be treated as a table in a database.

A database table has constraints that define its structure, field names, types, and so on. Likewise, an index has a mapping: the field constraints for documents in the index, similar to a table's schema.

MySQL vs Elasticsearch#

MySQL	Elasticsearch	Notes
Table	Index	An index is a collection of documents, similar to a table in a database
Row	Document	A document is a row of data, JSON-formatted
Column	Field	A field in a JSON document, similar to a database column
Schema	Mapping	Mapping defines field types and constraints, like a table schema
SQL	DSL	DSL is Elasticsearch’s JSON-style request language for CRUD

Both have their strengths:

MySQL: strong for transactional operations, ensuring data safety and consistency
Elasticsearch: strong for searching, analyzing, and computing large-scale data

In enterprises, they are often used together:

Use MySQL for write operations requiring strong safety
Use Elasticsearch for search needs requiring high query performance
Then implement data synchronization between the two to ensure consistency

Installation#

Install Elasticsearch and Kibana#

# Create the shared network first so that Kibana can reach Elasticsearch by container name
docker network create es-net

docker run -d \
  --name es \
  -e "ES_JAVA_OPTS=-Xms512m -Xmx512m" \
  -e "discovery.type=single-node" \
  -v es-data:/usr/share/elasticsearch/data \
  -v es-plugins:/usr/share/elasticsearch/plugins \
  --privileged \
  --network es-net \
  -p 9200:9200 \
  -p 9300:9300 \
  elasticsearch:8.8.1

# If the ports are unreachable, disable SSL and password authentication.
# The following lines are settings for config/elasticsearch.yml, not shell commands:
xpack.security.enabled: false
xpack.security.http.ssl:
  enabled: false
  keystore.path: certs/http.p12

docker run -d \
  --name kibana \
  -e ELASTICSEARCH_HOSTS=http://es:9200 \
  --network=es-net \
  -p 5601:5601 \
  kibana:8.8.1

Install IK Analyzer#

docker exec -it es bash

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v8.8.1/elasticsearch-analysis-ik-8.8.1.zip

exit
# Restart the container (it was started with --name es)
docker restart es

IKAnalyzer.cfg.xml configuration content:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer extension configuration</comment>
        <!-- Users can configure their own extension dictionaries here *** add extension dictionary -->
        <entry key="ext_dict">ext.dic</entry>
        <!-- Users can configure their own extension stopword dictionary here *** add stopword dictionary -->
        <entry key="ext_stopwords">stopword.dic</entry>
</properties>

After editing the corresponding file, restart.

What is the role of the tokenizer?

Tokenize documents when creating the inverted index
Tokenize user input when searching

What modes does the IK tokenizer have?

ik_smart: Smart segmentation, coarse granularity
ik_max_word: Finest segmentation, fine granularity

How to extend terms for IK tokenizer? How to disable terms?

Use the IKAnalyzer.cfg.xml file in the config directory to add extension dictionaries and stopword dictionaries
Add extended terms or stopwords in the dictionaries

Index management#

An index is similar to a database table, and mapping is similar to the table structure.

To store data in ES, you must first create an “index” and a “mapping”.

Mapping properties#

Mapping constrains the documents in an index. Common mapping properties include:

type: field data type; common simple types include:
- Strings: text (tokenizable text), keyword (exact values, e.g., brand, country, IP address)
- Numeric: long, integer, short, byte, double, float
- Boolean: boolean
- Date: date
- Object: object
index: whether to create an index; default true
analyzer: which analyzer to use
properties: sub-fields of this field

CRUD for index management#

Create an index: PUT /index_name
Get an index: GET /index_name
Delete an index: DELETE /index_name
Add fields: PUT /index_name/_mapping

Create index and mapping#

Basic syntax:

Method: PUT
Path: /index_name (customizable)
Parameters: mapping

Format:

PUT /IndexName
{
  "mappings": {
    "properties": {
      "fieldName": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "fieldName2": {
        "type": "keyword",
        "index": false
      },
      "fieldName3": {
        "properties": {
          "subfield": {
            "type": "keyword"
          }
        }
      },
      // ... omitted
    }
  }
}

Query index#

Basic syntax:

Method: GET
Path: /IndexName
Parameters: none

Format:

GET /IndexName

Modify index#

Although the inverted index structure is not complex, if the data structure changes (for example, changing the tokenizer), you would need to recreate the inverted index. This is why an index’s mapping cannot be modified after creation.

Although you cannot modify existing fields in the mapping, you can add new fields to the mapping without affecting the inverted index.

Syntax:

PUT /IndexName/_mapping
{
  "properties": {
    "newFieldName": {
      "type": "integer"
    }
  }
}

Delete index#

Syntax:

Method: DELETE
Path: /IndexName
Parameters: none

Format:

DELETE /IndexName

Document operations#

What document operations exist?

Create a document: POST /{IndexName}/_doc/{id} { json document }
Get a document: GET /{IndexName}/_doc/{id}
Delete a document: DELETE /{IndexName}/_doc/{id}
Update a document:
- Full update: PUT /{IndexName}/_doc/{id} { json document }
- Partial update: POST /{IndexName}/_update/{id} { "doc": {field}}

Create a new document#

Syntax:

POST /{IndexName}/_doc/{id}
{
    "field1": "value1",
    "field2": "value2",
    "field3": {
        "subProperty1": "value3",
        "subProperty2": "value4"
    },
    // ...
}

Query a document#

Following REST conventions, creation uses POST, retrieval uses GET, but queries usually require conditions; here we include the document id.

Syntax:

GET /{IndexName}/_doc/{id}

Delete a document#

Deletion uses a DELETE request and you delete by id:

Syntax:

DELETE /{IndexName}/_doc/{id}

Update a document#

There are two ways to update:

Full update: essentially delete by id, then add
Partial update: modify specific fields in the document
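
To make the difference concrete, here is a minimal sketch that models an index as an in-memory map. The class UpdateSemanticsDemo, its method names, and the sample fields are hypothetical and exist only for this illustration; no Elasticsearch API is involved.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

/** Toy model of full vs. partial document updates (no Elasticsearch involved). */
class UpdateSemanticsDemo {
    // Hypothetical in-memory "index": document id -> document fields
    private final Map<String, Map<String, Object>> docs = new HashMap<>();

    // Full update: the stored document is replaced as a whole,
    // so fields missing from the new document disappear
    void fullUpdate(String id, Map<String, Object> doc) {
        docs.put(id, new HashMap<>(doc));
    }

    // Partial update: only the supplied fields are overwritten.
    // This sketch chooses to reject unknown ids.
    void partialUpdate(String id, Map<String, Object> changes) {
        Map<String, Object> existing = docs.get(id);
        if (existing == null) {
            throw new NoSuchElementException("No document with id " + id);
        }
        existing.putAll(changes);
    }

    Map<String, Object> get(String id) {
        return docs.get(id);
    }

    public static void main(String[] args) {
        UpdateSemanticsDemo index = new UpdateSemanticsDemo();
        Map<String, Object> doc = new HashMap<>();
        doc.put("name", "Demo Hotel");
        doc.put("price", 800);
        index.fullUpdate("61083", doc);

        Map<String, Object> changes = new HashMap<>();
        changes.put("price", 952);
        index.partialUpdate("61083", changes);
        System.out.println(index.get("61083")); // name is kept, price is now 952

        index.fullUpdate("61083", changes);
        System.out.println(index.get("61083")); // only price remains
    }
}
```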

In the RestClient API, full update and add use the same API; the difference is based on the ID:

If adding and the ID already exists, it is an update
If adding and the ID does not exist, it is an addition

We won’t go into detail here; we focus on partial updates.

Prepare the Request object. This time it’s an UpdateRequest
Prepare the parameters. The JSON document contains the fields to be updated
Update the document. Here we call client.update()

Unit test:

@Test
void testUpdateDocument() throws IOException {
    // 1. Prepare Request
    UpdateRequest request = new UpdateRequest("IndexName", "61083");
    // 2. Prepare request parameters: alternating field names and values
    request.doc(
        "price", "952",
        "starName", "四钻" // "four diamonds"
    );
    // 3. Send the request
    client.update(request, RequestOptions.DEFAULT);
}

Bulk import documents#

Case: use BulkRequest to bulk import data from the database into the index.

Steps:

Use MyBatis-Plus to query hotel data
Convert queried hotels (Hotel) to document type data (HotelDoc)
Use BulkRequest to batch add documents

Bulk processing with BulkRequest essentially groups multiple CRUD requests and sends them together. It provides an add method to add other requests:

IndexRequest: insert
UpdateRequest: update
DeleteRequest: delete

Unit test:

@Test
void testBulkRequest() throws IOException {
    // Bulk query hotel data
    List<Hotel> hotels = hotelService.list();

    // 1. Create Request
    BulkRequest request = new BulkRequest();
    // 2. Prepare parameters; add multiple insert requests
    for (Hotel hotel : hotels) {
        // 2.1 Convert to document type HotelDoc
        HotelDoc hotelDoc = new HotelDoc(hotel);
        // 2.2 Create a request to add a new document
        request.add(new IndexRequest("hotel")
                    .id(hotelDoc.getId().toString())
                    .source(JSON.toJSONString(hotelDoc), XContentType.JSON));
    }
    // 3. Send the request
    client.bulk(request, RequestOptions.DEFAULT);
}

DSL Querying documents#

Elasticsearch queries are still implemented using a JSON-style DSL.

DSL query categories#

Elasticsearch provides a JSON-based DSL (Domain Specific Language) to define queries. Common query types include:

Match all: query all data; usually used for testing. Example: match_all
Full-text search: tokenize user input via an analyzer, then match against the inverted index. Examples:
- match
- multi_match
Exact queries: search by exact terms for fields like keyword, numeric, date, boolean, etc. Examples:
- ids
- range
- term
Geo queries: geographic queries. Examples:
- geo_distance
- geo_bounding_box
Compound queries: combine multiple queries for more complex search logic. Examples:
- bool
- function_score

The query syntax is generally consistent:

GET /indexName/_search
{
  "query": {
    "queryType": {
      "queryField": "value"
    }
  }
}

Full-text search#

The basic flow for full-text search is:

Tokenize the user’s search content into terms
Use the terms to match in the inverted index and get document ids
Retrieve documents by id and return them

Common scenarios include:

E-commerce site search boxes
Baidu search box

Common full-text search queries include:

match query: single-field search

GET /indexName/_search
{
  "query": {
    "match": {
      "FIELD": "TEXT"
    }
  }
}

multi_match query: multi-field search; a document matches if any of the fields matches; the more fields involved, the slower the query

GET /indexName/_search
{
  "query": {
    "multi_match": {
      "query": "TEXT",
      "fields": ["FIELD1", "FIELD2"]
    }
  }
}

Exact queries#

Exact queries usually target keyword, numeric, date, boolean type fields, so they are not tokenized. Common examples:

term: exact value on a term; used for keyword, numeric, boolean, date fields

Because the field is not tokenized, the query value must also be a non-tokenized term. If the user input does not match exactly, results may not be found.

// term query
GET /indexName/_search
{
  "query": {
    "term": {
      "FIELD": {
        "value": "VALUE"
      }
    }
  }
}

range: range queries for numeric or date types

// range query
GET /indexName/_search
{
  "query": {
    "range": {
      "FIELD": {
        "gte": 10, // gte means greater than or equal; gt would be greater than
        "lte": 20 // lte means less than or equal; lt would be less than
      }
    }
  }
}

Geo queries#

Geographic queries are queries based on latitude and longitude.

Common scenarios include:

Travel sites: search for hotels near me
Ride-hailing: search for taxis near me
WeChat: search for nearby people

Bounding box queries

Bounding box queries select documents whose geo_point fields fall within a rectangle defined by two points (top_left and bottom_right).

// geo_bounding_box query
GET /indexName/_search
{
  "query": {
    "geo_bounding_box": {
      "FIELD": {
        "top_left": { // top-left point
          "lat": 31.1,
          "lon": 121.5
        },
        "bottom_right": { // bottom-right point
          "lat": 30.9,
          "lon": 121.7
        }
      }
    }
  }
}
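
The containment test behind a bounding box query can be sketched as a simple comparison of coordinates. The class BoundingBoxDemo below is hypothetical, and the sketch assumes the rectangle does not cross the antimeridian (180° longitude); it is not Elasticsearch's implementation.

```java
/** Point-in-rectangle check for a box given by its top-left and bottom-right corners. */
class BoundingBoxDemo {
    // Assumes topLeftLon <= bottomRightLon, i.e. the box does not cross the antimeridian
    static boolean contains(double topLeftLat, double topLeftLon,
                            double bottomRightLat, double bottomRightLon,
                            double lat, double lon) {
        boolean latInside = lat <= topLeftLat && lat >= bottomRightLat;
        boolean lonInside = lon >= topLeftLon && lon <= bottomRightLon;
        return latInside && lonInside;
    }

    public static void main(String[] args) {
        // Same corners as the geo_bounding_box example: (31.1, 121.5) and (30.9, 121.7)
        System.out.println(contains(31.1, 121.5, 30.9, 121.7, 31.0, 121.6)); // true
        System.out.println(contains(31.1, 121.5, 30.9, 121.7, 31.2, 121.6)); // false
    }
}
```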

Nearby (geo_distance) queries define a center point and a radius; all documents within the distance are returned.

// geo_distance query
GET /indexName/_search
{
  "query": {
    "geo_distance": {
      "distance": "15km", // radius
      "FIELD": "31.21,121.5" // center
    }
  }
}

Compound queries#

Compound queries combine other queries to implement more complex search logic. Two common forms:

function_score: score-based queries to control relevance
bool query: boolean combination of other queries

Relevance scoring#

When using a match query, documents are scored by their relevance (_score) and results are returned in descending order of score.

Historically, TF-IDF was used, with formulas such as:

TF(term frequency) = (number of occurrences of the term) / (total number of terms in the document)

IDF(inverse document frequency) = Log(total number of documents / number of documents containing the term)

score = sum of TF × IDF

In later versions, BM25 was introduced, with a formula like:

Score(Q,d) = sum over i of log(1 + (N - n + 0.5) / (n + 0.5)) × (f_i / (f_i + k1 × (1 - b + b × dl / avgdl)))

TF-IDF has a drawback: a term's contribution to the score keeps growing as its frequency increases. In BM25 the contribution of a single term levels off toward a ceiling, giving a smoother curve.
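
The saturation effect is easy to see numerically. The sketch below evaluates both formulas for a single term while the document length stays fixed. The class ScoringDemo and its method names are hypothetical, and the values k1 = 1.2 and b = 0.75 are commonly cited defaults assumed here rather than taken from this article.

```java
/** Single-term TF-IDF and BM25 scores, following the formulas above. */
class ScoringDemo {
    // TF-IDF: (term count / document length) * log(total docs / docs containing the term)
    static double tfIdf(double termCount, double docLength, double totalDocs, double docsWithTerm) {
        double tf = termCount / docLength;
        double idf = Math.log(totalDocs / docsWithTerm);
        return tf * idf;
    }

    // IDF part of BM25; this is also the ceiling of the single-term BM25 score
    static double bm25Idf(double totalDocs, double docsWithTerm) {
        return Math.log(1 + (totalDocs - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
    }

    static double bm25(double termCount, double docLength, double avgDocLength,
                       double totalDocs, double docsWithTerm, double k1, double b) {
        double norm = k1 * (1 - b + b * docLength / avgDocLength);
        return bm25Idf(totalDocs, docsWithTerm) * termCount / (termCount + norm);
    }

    public static void main(String[] args) {
        double k1 = 1.2, b = 0.75; // assumed defaults
        for (int f : new int[] {1, 10, 100}) {
            // 1000 documents, 10 of them contain the term, document length 100 = average length
            System.out.printf("f=%d  tf-idf=%.4f  bm25=%.4f%n",
                    f, tfIdf(f, 100, 1000, 10), bm25(f, 100, 100, 1000, 10, k1, b));
        }
        System.out.printf("bm25 ceiling (idf)=%.4f%n", bm25Idf(1000, 10));
    }
}
```

With the document length held constant, the TF-IDF value scales linearly with the term count, while the BM25 value approaches the IDF term and never exceeds it.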

Function score queries#

Controlling relevance with function_score matters when the product needs to override the natural ranking, for example to promote paid results as in Baidu's bid-based ranking.

A function_score query contains four parts:

Original query: the query condition; search and assign the original score (query score) based on BM25
Filter: documents that meet the filter condition will be re-scored
Score functions: for documents meeting the filter, apply the function score; four types:
- weight: the function result is a constant
- field_value_factor: use a field’s value as the function result
- random_score: use a random value as the function result
- script_score: a custom scoring function
Boost mode: how to combine function score with the original query score; options include:
- multiply
- replace
- sum, avg, max, min, etc.

The flow:

Query documents with the original condition and compute the initial score (query score)
Filter documents
For documents that pass the filter, compute the function score
Combine the query score and function score according to the boost_mode to obtain the final relevance score

GET /hotel/_search
{
  "query": {
    "function_score": {
      "query": {  .... }, // original query
      "functions": [ // scoring functions
        {
          "filter": { // condition to match
            "term": {
              "brand": "如家"
            }
          },
          "weight": 2 // scoring weight
        }
      ],
      "boost_mode": "sum" // how to combine
    }
  }
}

What are the three elements defined by a function_score query?

Filter: which documents should be scored
Score function: how to calculate the function score
Boost mode: how to combine function score with the query score
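
The interplay of filter, score function, and boost mode can be sketched as plain arithmetic. The class FunctionScoreDemo below is a simplified, hypothetical model that follows the flow described in this section with a single weight function; how Elasticsearch itself treats documents that match no function may differ from this sketch.

```java
/** Simplified model of function_score with one weight function. */
class FunctionScoreDemo {
    enum BoostMode { MULTIPLY, REPLACE, SUM, AVG, MAX, MIN }

    // Combine the original query score with the function score
    static double combine(double queryScore, double functionScore, BoostMode mode) {
        switch (mode) {
            case MULTIPLY: return queryScore * functionScore;
            case REPLACE:  return functionScore;
            case SUM:      return queryScore + functionScore;
            case AVG:      return (queryScore + functionScore) / 2;
            case MAX:      return Math.max(queryScore, functionScore);
            case MIN:      return Math.min(queryScore, functionScore);
            default:       throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }

    // Assumption of this sketch: documents that fail the filter keep their query score
    static double score(double queryScore, boolean matchesFilter, double weight, BoostMode mode) {
        if (!matchesFilter) {
            return queryScore;
        }
        return combine(queryScore, weight, mode);
    }

    public static void main(String[] args) {
        // A document with query score 3.0 that matches the filter, weight 2
        System.out.println(score(3.0, true, 2.0, BoostMode.SUM));      // 5.0
        System.out.println(score(3.0, true, 2.0, BoostMode.MULTIPLY)); // 6.0
        System.out.println(score(3.0, false, 2.0, BoostMode.SUM));     // 3.0
    }
}
```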

Bool query#

Bool query combines one or more sub-queries. Each sub-query is a sub-clause. Sub-clauses can be combined as:

must: must match each sub-query (AND)
should: optionally match sub-queries (OR)
must_not: must not match; does not participate in scoring (NOT)
filter: must match; does not participate in scoring

Note that the more fields participate in scoring, the worse the query performance. For multi-criteria searches, consider:

Keyword search in the search box uses a full-text query with must (participates in scoring)
Other filters use filter (do not participate in scoring)

GET /hotel/_search
{
  "query": {
    "bool": {
      "must": [
        {"term": {"city": "上海" }}
      ],
      "should": [
        {"term": {"brand": "皇冠假日" }},
        {"term": {"brand": "华美达" }}
      ],
      "must_not": [
        { "range": { "price": { "lte": 500 } } }
      ],
      "filter": [
        { "range": {"score": { "gte": 45 } } }
      ]
    }
  }
}

Processing search results#

Search results can be processed or displayed according to user preferences.

Sorting#

By default, Elasticsearch sorts by relevance score (_score), but you can sort in custom ways. Sortable field types include: keyword, numeric, geo_point, date, etc.

Plain field sorting

The syntax for sorting by keyword, numeric, and date types is basically the same.

GET /indexName/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "FIELD": "desc"  // sort field, sort direction asc or desc
    }
  ]
}

The sort criteria are an array, so you can specify multiple sort conditions. They are applied in the order declared; if the first condition is equal, then the second, and so on.
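
The "first condition, then second on ties" behavior corresponds to a chained comparator. The sketch below illustrates the ordering rule in plain Java; the Hotel class, its fields, and the sample data are hypothetical and only mirror a sort by score descending, then price ascending.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Multi-condition sorting: score descending, then price ascending on ties. */
class MultiSortDemo {
    static final class Hotel {
        final String name;
        final int score;
        final int price;

        Hotel(String name, int score, int price) {
            this.name = name;
            this.score = score;
            this.price = price;
        }
    }

    static List<String> sortedNames(List<Hotel> hotels) {
        List<Hotel> copy = new ArrayList<>(hotels);
        copy.sort(Comparator.comparingInt((Hotel h) -> h.score).reversed() // first condition
                .thenComparingInt(h -> h.price));                          // used only on ties
        List<String> names = new ArrayList<>();
        for (Hotel h : copy) {
            names.add(h.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Hotel> hotels = new ArrayList<>();
        hotels.add(new Hotel("A", 45, 300));
        hotels.add(new Hotel("B", 48, 500));
        hotels.add(new Hotel("C", 48, 200));
        System.out.println(sortedNames(hotels)); // [C, B, A]
    }
}
```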

Geo distance sorting

Geo distance sorting is a bit different.

GET /indexName/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "_geo_distance" : {
          "FIELD" : "latitude, longitude", // geo_point field name, target coordinates
          "order" : "asc", // sort order
          "unit" : "km" // distance unit
      }
    }
  ]
}

This query means:

Specify a coordinate as the target point
For every document, compute the distance between the coordinate in the specified field (which must be geo_point) and the target point
Sort by distance
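
These three steps can be sketched with the haversine great-circle formula. This is an illustration under stated assumptions: the class GeoDistanceSortDemo is hypothetical, the Earth is treated as a sphere of radius 6371 km, and the distances Elasticsearch computes internally may differ slightly.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sort coordinates by great-circle distance to a target point. */
class GeoDistanceSortDemo {
    static final double EARTH_RADIUS_KM = 6371.0; // assumed mean Earth radius

    // Haversine distance in kilometers between two (lat, lon) points given in degrees
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // Each point is {lat, lon}; returns a copy sorted by ascending distance to the target
    static List<double[]> sortByDistance(List<double[]> points, double lat, double lon) {
        List<double[]> copy = new ArrayList<>(points);
        copy.sort(Comparator.comparingDouble((double[] p) -> haversineKm(lat, lon, p[0], p[1])));
        return copy;
    }

    public static void main(String[] args) {
        List<double[]> points = new ArrayList<>();
        points.add(new double[] {31.30, 121.60});
        points.add(new double[] {31.22, 121.50});
        for (double[] p : sortByDistance(points, 31.21, 121.5)) {
            System.out.printf("%.2f,%.2f -> %.2f km%n",
                    p[0], p[1], haversineKm(31.21, 121.5, p[0], p[1]));
        }
    }
}
```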

Pagination#

By default, Elasticsearch returns only the top 10 results. To fetch more, adjust from and size:

from: which document index to start from
size: how many documents to return

Similar to MySQL’s LIMIT ?, ?

The basic pagination syntax:

GET /hotel/_search
{
  "query": {
    "match_all": {}
  },
  "from": 0, // starting offset; default 0
  "size": 10, // number of documents to retrieve
  "sort": [
    {"price": "asc"}
  ]
}

Deep pagination over large result sets strains memory and CPU, so by default Elasticsearch rejects requests where from + size exceeds 10000 (the index.max_result_window setting).
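
As a small worked example, the offset for a given page and the window check can be computed as below. The class PageWindowDemo and its method names are hypothetical; the limit of 10000 is the default mentioned above.

```java
/** Page-number arithmetic for from + size pagination. */
class PageWindowDemo {
    static final int MAX_RESULT_WINDOW = 10_000; // default limit on from + size

    // Offset of the first document on a 1-based page
    static long from(int page, int size) {
        if (page < 1 || size < 1) {
            throw new IllegalArgumentException("page and size must be at least 1");
        }
        return (long) (page - 1) * size;
    }

    // True if the request stays within the default window
    static boolean withinWindow(int page, int size) {
        return from(page, size) + size <= MAX_RESULT_WINDOW;
    }

    public static void main(String[] args) {
        System.out.println(from(3, 10));            // 20
        System.out.println(withinWindow(1000, 10)); // true:  9990 + 10 = 10000
        System.out.println(withinWindow(1001, 10)); // false: 10000 + 10 = 10010
    }
}
```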

For deep pagination, ES offers two approaches:

search_after: requires sorting; fetches the next page starting from the sort values of the last hit on the previous page. This is the officially recommended approach.
scroll: creates a snapshot of the sorted results and keeps it in memory. Official guidance is not to use it for new developments.

Common pagination approaches and their pros/cons:

from + size:
- Pros: supports random page navigation
- Cons: deep pagination limit (from + size) is 10000 by default
- Use case: search pages with random access (Baidu, JD, Google, Taobao)
search_after:
- Pros: no hard limit (per-query size should not exceed 10000)
- Cons: only forward paging; no random access
- Use case: pages that do not require random access
scroll:
- Pros: no hard limit (per-query size should not exceed 10000)
- Cons: extra memory consumption, and results are not real-time
- Use case: retrieving large datasets, migrations
- Not recommended since ES 7.1; use search_after instead.
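
The idea behind search_after (continue from the last sort value instead of skipping an offset) can be sketched over a sorted in-memory list. The class SearchAfterDemo is hypothetical, it assumes unique sort values (a real query would add a tiebreaker field), and the linear scan is only for brevity; it does not reflect how Elasticsearch locates the starting position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Keyset-style paging: each page starts after the last value of the previous page. */
class SearchAfterDemo {
    // sortedPrices must be ascending; pass after = null for the first page
    static List<Integer> nextPage(List<Integer> sortedPrices, Integer after, int size) {
        List<Integer> page = new ArrayList<>();
        for (int price : sortedPrices) {
            if (after != null && price <= after) {
                continue; // already returned on an earlier page
            }
            page.add(price);
            if (page.size() == size) {
                break;
            }
        }
        return page;
    }

    public static void main(String[] args) {
        List<Integer> prices = Arrays.asList(100, 150, 200, 250, 300);
        List<Integer> first = nextPage(prices, null, 2);
        System.out.println(first); // [100, 150]
        // Continue from the last sort value of the previous page
        Integer last = first.get(first.size() - 1);
        System.out.println(nextPage(prices, last, 2)); // [200, 250]
    }
}
```

Because each request only needs the last sort value, paging can continue forward indefinitely, but jumping straight to an arbitrary page is not possible.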

Highlighting#

When we search Baidu or JD, keywords appear highlighted in red; this is highlighting.

Highlighting is implemented in two steps:

Add tags around all keywords in the document, e.g., <em> tags

Apply CSS styling to the tags on the page
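
The first step, wrapping keywords in tags, can be sketched with a regular expression. The class HighlightDemo and the sample text are hypothetical; Elasticsearch's highlighter works on analyzed terms and is considerably more involved than this case-insensitive substring match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Wrap every occurrence of a keyword in pre/post tags. */
class HighlightDemo {
    static String highlight(String text, String keyword, String preTag, String postTag) {
        // Quote the keyword so that regex metacharacters are matched literally
        Pattern pattern = Pattern.compile(Pattern.quote(keyword), Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            // Keep the original casing of the matched text
            String wrapped = preTag + matcher.group() + postTag;
            matcher.appendReplacement(result, Matcher.quoteReplacement(wrapped));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("Home Inn Shanghai", "inn", "<em>", "</em>"));
        // Home <em>Inn</em> Shanghai
    }
}
```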

Highlight syntax:

GET /hotel/_search
{
  "query": {
    "match": {
      "FIELD": "TEXT" // query, highlighting must be used with full-text search
    }
  },
  "highlight": {
    "fields": { // specify fields to highlight
      "FIELD": {
        "pre_tags": "<em>",  // tag before highlighted text
        "post_tags": "</em>" // tag after highlighted text
      }
    }
  }
}

Notes:

Highlighting applies to keywords, so the search must be a keyword (full-text) query, not a range query

By default, highlighted fields must match the fields specified in the search; otherwise, highlighting will not occur

To highlight non-search fields, set required_field_match=false

Querying documents with RestClient#
Document queries use the same RestHighLevelClient object as the document operations above. Index-level operations go through client.indices(), while document operations are called on the client directly.
Document operations follow these basic steps:

Initialize RestHighLevelClient

Create an XxxRequest, where Xxx can be Index, Get, Update, Delete, or Bulk

Prepare parameters (for Index, Update, Bulk)

Send the request. Call RestHighLevelClient#xxx() where xxx is index, get, update, delete, bulk

Parse the results (Get requires parsing)

Quick start#
@Test
void testMatchAll() throws IOException {
    // 1. Prepare Request
    SearchRequest request = new SearchRequest("hotel");
    // 2. Prepare DSL
    request.source()
        .query(QueryBuilders.matchAllQuery());
    // 3. Send the request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);

    // 4. Parse the response
    handleResponse(response);
}

private void handleResponse(SearchResponse response) {
    // 4. Parse the response
    SearchHits searchHits = response.getHits();
    // 4.1. Get total hits
    long total = searchHits.getTotalHits().value;
    System.out.println("Total hits: " + total);
    // 4.2. Documents array
    SearchHit[] hits = searchHits.getHits();
    // 4.3. Iterate
    for (SearchHit hit : hits) {
        // Get document source
        String json = hit.getSourceAsString();
        // Deserialize
        HotelDoc hotelDoc = JSON.parseObject(json, HotelDoc.class);
        System.out.println("hotelDoc = " + hotelDoc);
    }
}

Step 1: Create a SearchRequest, specifying the index name

Step 2: Use request.source() to build the DSL, which can include queries, pagination, sorting, highlighting, etc.

query(): represents the query condition; use QueryBuilders.matchAllQuery() to build a match_all DSL; QueryBuilders includes match, term, function_score, bool, and other queries

Step 3: Use client.search() to send the request and obtain the response

Elasticsearch returns a JSON string with the following structure:

hits: the matched results

total: total number of hits; its value field holds the actual count

max_score: the highest relevance score among the results

hits: array of documents; each document is a JSON object

_source: the original document data, also a JSON object

Therefore, parsing the response means parsing the JSON string layer by layer:

SearchHits: obtained via response.getHits(); this is the outermost hits in the JSON, representing matched results

SearchHits#getTotalHits().value: obtain total count

SearchHits#getHits(): get the SearchHit array, i.e., the documents array

SearchHit#getSourceAsString(): obtain the _source from the document result, i.e., the original JSON document

match query#
Full-text match and multi_match queries have APIs similar to that of match_all; the difference lies in the query portion.
Therefore, the Java code differences are mainly in the parameters of request.source().query(), still using the methods provided by QueryBuilders
@Test
void testMatch() throws IOException {
    // 1. Prepare Request
    SearchRequest request = new SearchRequest("hotel");
    // 2. Prepare DSL
    request.source()
        .query(QueryBuilders.matchQuery("all", "如家"));
    // 3. Send request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. Parse the response
    handleResponse(response);
}

Exact queries#
Exact queries are mainly about:

term: exact match on a term

range: range query

Compared with other queries, the difference is in the query condition; the rest of the code remains the same.
// term query
QueryBuilders.termQuery("city", "杭州");

// range query
QueryBuilders.rangeQuery("price").gte(100).lte(150);

Bool queries#
Bool queries combine other queries with must, must_not, filter, etc.
You can see that the API differences lie in how the query is constructed via QueryBuilders; the result parsing and other code remain unchanged.
@Test
void testBool() throws IOException {
    // 1. Prepare Request
    SearchRequest request = new SearchRequest("hotel");
    // 2. Prepare DSL
    // 2.1 Prepare BooleanQuery
    BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
    // 2.2 Add term
    boolQuery.must(QueryBuilders.termQuery("city", "杭州"));
    // 2.3 Add range
    boolQuery.filter(QueryBuilders.rangeQuery("price").lte(250));

    request.source().query(boolQuery);
    // 3. Send request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. Parse the response
    handleResponse(response);
}

Sorting, pagination#
Sorting and pagination for search results are set at the same level as the query, so you also use request.source() to configure them.
@Test
void testPageAndSort() throws IOException {
    // Page number and page size
    int page = 1, size = 5;

    // 1. Prepare Request
    SearchRequest request = new SearchRequest("hotel");
    // 2. Prepare DSL
    // 2.1 query
    request.source().query(QueryBuilders.matchAllQuery());
    // 2.2 sort
    request.source().sort("price", SortOrder.ASC);
    // 2.3 pagination from, size (use the size variable rather than a hard-coded 5)
    request.source().from((page - 1) * size).size(size);
    // 3. Send request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. Parse the response
    handleResponse(response);
}

Highlighting#
Highlighting code differs from prior code in two ways:

The DSL for the query includes highlighting conditions at the same level as the query

The results parsing must also parse the highlighted results

Step 1: Obtain the source with hit.getSourceAsString(); this is non-highlighted JSON; deserialize to HotelDoc

Step 2: Get highlighted results with hit.getHighlightFields(); returns a Map whose key is the highlight field name and value is a HighlightField

Step 3: From the map, get the HighlightField by its name

Step 4: Get fragments from the HighlightField and convert to strings to obtain the highlighted text

Step 5: Replace the non-highlighted text in HotelDoc with the highlighted text

@Test
void testHighlight() throws IOException {
    // 1. Prepare Request
    SearchRequest request = new SearchRequest("hotel");
    // 2. Prepare DSL
    // 2.1 query
    request.source().query(QueryBuilders.matchQuery("all", "如家"));
    // 2.2 Highlight
    request.source().highlighter(new HighlightBuilder().field("name").requireFieldMatch(false));
    // 3. Send request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. Parse the response
    handleResponse(response);
}

private void handleResponse(SearchResponse response) {
    // 4. Parse the response
    SearchHits searchHits = response.getHits();
    // 4.1 Get total
    long total = searchHits.getTotalHits().value;
    System.out.println("Total hits: " + total);
    // 4.2 Documents array
    SearchHit[] hits = searchHits.getHits();
    // 4.3 Iterate
    for (SearchHit hit : hits) {
        // Get document source
        String json = hit.getSourceAsString();
        // Deserialize
        HotelDoc hotelDoc = JSON.parseObject(json, HotelDoc.class);
        // Get highlighted results
        Map<String, HighlightField> highlightFields = hit.getHighlightFields();
        if (!CollectionUtils.isEmpty(highlightFields)) {
            // Get highlight result by field name
            HighlightField highlightField = highlightFields.get("name");
            if (highlightField != null) {
                // Get highlighted value
                String name = highlightField.getFragments()[0].string();
                // Overwrite non-highlighted result
                hotelDoc.setName(name);
            }
        }
        System.out.println("hotelDoc = " + hotelDoc);
    }
}

Heima Travel Case#
Next, we will practice the knowledge learned earlier through the Heima Travel case.
We implement four parts:

Hotel search with pagination

Hotel result filtering

Nearby hotels

Hotel bidding ranking

Hotel search and pagination#
Case requirement: implement Heima Travel’s hotel search feature, including keyword search and pagination
Define entity classes#
There are two: one for the request parameters from the frontend, and one for the response returned by the service.
// Request
package cn.itcast.hotel.pojo;
import lombok.Data;

@Data
public class RequestParams {
    private String key;
    private Integer page;
    private Integer size;
    private String sortBy;
}

// Response
import lombok.Data;
import java.util.List;

@Data
public class PageResult {
    private Long total;
    private List<HotelDoc> hotels;

    public PageResult() {
    }

    public PageResult(Long total, List<HotelDoc> hotels) {
        this.total = total;
        this.hotels = hotels;
    }
}
Define controller#
Define a HotelController with a query interface that meets the following requirements:

Request method: POST

Path: /hotel/list

Request parameter: an object of type RequestParams

Return value: PageResult, containing two fields

Long total: total count

List<HotelDoc> hotels: hotel data

@RestController
@RequestMapping("/hotel")
public class HotelController {

    @Autowired
    private IHotelService hotelService;

    // Search hotel data
    @PostMapping("/list")
    public PageResult search(@RequestBody RequestParams params) {
        return hotelService.search(params);
    }
}
Implement search logic#
We rely on RestHighLevelClient, and we need to register it as a Spring bean in the application.
In the HotelDemoApplication under cn.itcast.hotel, declare this bean:
// Bean declaration in HotelDemoApplication
@Bean
public RestHighLevelClient client() {
    return new RestHighLevelClient(RestClient.builder(
            HttpHost.create("http://127.0.0.1:9200")
    ));
}

// Service
@Override
public PageResult search(RequestParams params) {
    try {
        // 1. Prepare the request
        SearchRequest request = new SearchRequest("hotel");
        // 2. Prepare the DSL
        // 2.1 Query: match everything when no keyword is given
        String key = params.getKey();
        if (key == null || "".equals(key)) {
            request.source().query(QueryBuilders.matchAllQuery());
        } else {
            request.source().query(QueryBuilders.matchQuery("all", key));
        }

        // 2.2 Pagination
        int page = params.getPage();
        int size = params.getSize();
        request.source().from((page - 1) * size).size(size);

        // 3. Send the request
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        // 4. Parse the response
        return handleResponse(response);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

// Result parsing
private PageResult handleResponse(SearchResponse response) {
    // 4. Parse the response
    SearchHits searchHits = response.getHits();
    // 4.1 Get the total hit count
    long total = searchHits.getTotalHits().value;
    // 4.2 Documents array
    SearchHit[] hits = searchHits.getHits();
    // 4.3 Iterate
    List<HotelDoc> hotels = new ArrayList<>();
    for (SearchHit hit : hits) {
        // Get the document source
        String json = hit.getSourceAsString();
        // Deserialize
        HotelDoc hotelDoc = JSON.parseObject(json, HotelDoc.class);
        // Add to the collection
        hotels.add(hotelDoc);
    }
    // 4.4 Wrap and return
    return new PageResult(total, hotels);
}
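The service above unboxes params.getPage() and params.getSize() directly, so a request that omits either field would fail with a NullPointerException. A minimal sketch of a null-safe offset calculation follows. The class name PageMath and both default values are assumptions made for illustration, not part of the original project.

```java
/** Hypothetical helper: turns a 1-based page number into Elasticsearch's zero-based "from" offset. */
class PageMath {
    static final int DEFAULT_PAGE = 1;   // assumption: fall back to the first page
    static final int DEFAULT_SIZE = 10;  // assumption: fall back to 10 hits per page

    /** Page size to send to Elasticsearch, replacing null or non-positive values with the default. */
    static int size(Integer size) {
        return (size == null || size < 1) ? DEFAULT_SIZE : size;
    }

    /** Offset of the first hit on the requested page: (page - 1) * size. */
    static int from(Integer page, Integer size) {
        int p = (page == null || page < 1) ? DEFAULT_PAGE : page;
        return (p - 1) * size(size);
    }

    public static void main(String[] args) {
        System.out.println("from = " + from(3, 5) + ", size = " + size(5));
        System.out.println("from = " + from(null, null) + ", size = " + size(null));
    }
}
```

In the search method, the two results would be passed to request.source().from(...).size(...).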
Hotel results filtering#
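Requirement: add filters for brand, city, star rating, and price. RequestParams needs matching fields (city, brand, starName, minPrice, maxPrice), which the code below reads through their getters.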
In the HotelService’s search method, there is only one place to modify: the query condition inside request.source().query(…). Previously it was a match query by keywords; now we need to add filter conditions, including:

Brand filter: keyword type, using term

Star filter: keyword type, using term

Price filter: numeric type, using range

City filter: keyword type, using term

Multiple conditions should be combined with a boolean query:

The keyword search goes into must to participate in scoring

Other filters go into filter to not participate in scoring

private void buildBasicQuery(RequestParams params, SearchRequest request) {
    // 1. Build the bool query
    BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
    // 2. Keyword search
    String key = params.getKey();
    if (key == null || "".equals(key)) {
        boolQuery.must(QueryBuilders.matchAllQuery());
    } else {
        boolQuery.must(QueryBuilders.matchQuery("all", key));
    }
    // 3. City condition
    if (params.getCity() != null && !params.getCity().equals("")) {
        boolQuery.filter(QueryBuilders.termQuery("city", params.getCity()));
    }
    // 4. Brand condition
    if (params.getBrand() != null && !params.getBrand().equals("")) {
        boolQuery.filter(QueryBuilders.termQuery("brand", params.getBrand()));
    }
    // 5. Star condition
    if (params.getStarName() != null && !params.getStarName().equals("")) {
        boolQuery.filter(QueryBuilders.termQuery("starName", params.getStarName()));
    }
    // 6. Price
    if (params.getMinPrice() != null && params.getMaxPrice() != null) {
        boolQuery.filter(QueryBuilders
                .rangeQuery("price")
                .gte(params.getMinPrice())
                .lte(params.getMaxPrice())
        );
    }
    // 7. Put into source
    request.source().query(boolQuery);
}
My Nearby Hotels#
Sort nearby hotels by distance based on location coordinates. The approach:

Extend RequestParams to accept a location field

In the search method, if location has a value, add geo_distance sorting

GET /indexName/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "price": "asc"
    },
    {
      "_geo_distance" : {
        "FIELD" : "latitude, longitude",
        "order" : "asc",
        "unit" : "km"
      }
    }
  ]
}
In the search method, add sorting:
// 2.3 Sorting
String location = params.getLocation();
if (location != null && !location.equals("")) {
    request.source().sort(SortBuilders
            .geoDistanceSort("location", new GeoPoint(location))
            .order(SortOrder.ASC)
            .unit(DistanceUnit.KILOMETERS)
    );
}
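To get a feel for what a distance sort computes, here is a plain-Java sketch that orders "lat,lon" strings by great-circle distance from an origin, using the haversine formula. It runs without Elasticsearch and only illustrates the idea. The class and method names are hypothetical, and the coordinates in main are made-up sample values.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustration only: sorts "lat,lon" points by haversine distance from an origin. */
class GeoSortSketch {
    static final double EARTH_RADIUS_KM = 6371.0; // mean Earth radius

    /** Great-circle distance in kilometers between two points given in degrees. */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /** Parses a "lat,lon" string into {lat, lon}. */
    static double[] parse(String location) {
        String[] parts = location.split(",");
        return new double[]{Double.parseDouble(parts[0].trim()), Double.parseDouble(parts[1].trim())};
    }

    /** Returns a copy of points ordered from nearest to farthest from origin. */
    static List<String> sortByDistance(String origin, List<String> points) {
        double[] o = parse(origin);
        List<String> sorted = new ArrayList<>(points);
        sorted.sort(Comparator.comparingDouble((String p) -> {
            double[] c = parse(p);
            return haversineKm(o[0], o[1], c[0], c[1]);
        }));
        return sorted;
    }

    public static void main(String[] args) {
        // Sample coordinates, invented for the example
        System.out.println(sortByDistance("31.03,121.61", List.of("39.90,116.40", "31.04,121.62")));
    }
}
```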
Hotel bidding ranking#
Requirement: Let a specified hotel rank at the top of the results, with an advertising tag.
A function_score query can change a document's score, and a higher score means a higher rank. A function_score query has three parts:

Filter conditions: which docs get scored

Scoring function: how to compute the function score

Boost mode: how the function score and the query score are combined

Here the need is to rank a specified hotel higher, so we add a tag to these hotels and use that in a filter to boost scoring.
We can place the previously written boolean query as the original query condition in the query, then add filter, scoring function, and boost mode:
// 2. Scoring control
FunctionScoreQueryBuilder functionScoreQuery =
        QueryBuilders.functionScoreQuery(
                // Original query, which produces the relevance score
                boolQuery,
                // Array of function_score elements
                new FunctionScoreQueryBuilder.FilterFunctionBuilder[]{
                        // One function_score element
                        new FunctionScoreQueryBuilder.FilterFunctionBuilder(
                                // Filter condition
                                QueryBuilders.termQuery("isAD", true),
                                // Scoring function
                                ScoreFunctionBuilders.weightFactorFunction(10)
                        )
                });
request.source().query(functionScoreQuery);
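The code above does not set a boost mode, so the default, multiply, applies: documents that match the isAD filter have their query score multiplied by the weight, and all other documents keep their score. The arithmetic can be sketched in plain Java as follows. The class name and the sample scores are assumptions for illustration.

```java
/** Illustration only: score arithmetic of a weight function combined with boost_mode = multiply. */
class FunctionScoreSketch {
    /**
     * @param queryScore relevance score from the original query
     * @param matchesFilter whether the document matches the function's filter (here: isAD = true)
     * @param weight the weight configured for the function
     */
    static double finalScore(double queryScore, boolean matchesFilter, double weight) {
        // A document that matches no function gets a neutral function score of 1
        double functionScore = matchesFilter ? weight : 1.0;
        return queryScore * functionScore;
    }

    public static void main(String[] args) {
        // Sample scores, invented for the example
        System.out.println("ad hotel:      " + finalScore(2.0, true, 10));
        System.out.println("regular hotel: " + finalScore(5.0, false, 10));
    }
}
```

With a weight of 10, an advertised hotel with a lower relevance score can still end up ahead of a more relevant regular hotel, which is the intended effect here.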

Data Aggregation#

Aggregations make it very convenient to perform statistics, analysis, and computations on data. For example:

Which brand of phones is the most popular?

What are the average, maximum, and minimum prices of these phones?

How are these phones selling each month?

These aggregations are much easier and faster than SQL, and can achieve near real-time search results.
Types of aggregations#
There are three common kinds of aggregations:

Bucket aggregations: group documents

Terms aggregation: group by field values, e.g., by brand or by country

Date Histogram: group by date intervals, e.g., weekly or monthly

Metric aggregations: compute values like max, min, average

Avg: average

Max: maximum

Min: minimum

Stats: compute max, min, avg, sum, etc.

Pipeline aggregations: base aggregations on the results of other aggregations

Note: The fields participating in aggregations must be keyword, date, numeric, or boolean types.

Implementing aggregations with DSL#
Now we want to count how many hotel brands exist in all data, i.e., group by brand name. This means performing a Bucket aggregation on the hotel brand name.
Bucket aggregation syntax#
GET /hotel/_search
{
  "size": 0,  // set size to 0 to exclude documents; only return aggregations
  "aggs": { // define aggregations
    "brandAgg": { // give the aggregation a name
      "terms": { // aggregation type: group by brand value
        "field": "brand", // field participating in aggregation
        "size": 20 // number of aggregation results
      }
    }
  }
}
Sorting aggregation results#
By default, a Bucket aggregation counts documents in each bucket as _count and sorts by _count in descending order. We can specify the order to customize sorting:
GET /hotel/_search
{
  "size": 0,
  "aggs": {
    "brandAgg": {
      "terms": {
        "field": "brand",
        "order": {
          "_count": "asc" // sort by _count in ascending order
        },
        "size": 20
      }
    }
  }
}
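Conceptually, a terms aggregation groups values, counts each group, sorts by count, and keeps the first size buckets. The plain-Java sketch below mimics the default descending order with Java streams. It is an analogy rather than how Elasticsearch computes buckets internally. The class name is hypothetical, and breaking ties by key is a choice made for this sketch.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Illustration only: group, count, sort by count descending, keep the top "size" buckets. */
class TermsAggSketch {
    static List<Map.Entry<String, Long>> terms(List<String> values, int size) {
        // Bucket step: one bucket per distinct value, with its document count
        Map<String, Long> counts = values.stream()
                .collect(Collectors.groupingBy(v -> v, Collectors.counting()));
        // Order step: count descending, then key ascending for ties (sketch assumption)
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(size)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample brand values, invented for the example
        System.out.println(terms(List.of("A", "B", "A", "C", "B", "A"), 2));
    }
}
```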
Limiting the aggregation scope#
By default, Bucket aggregations run over all documents in the index, but in real scenarios users provide search criteria, so aggregations should be limited to the search results.
You can restrict the documents to be aggregated by adding a query condition:
GET /hotel/_search
{
  "query": {
    "range": {
      "price": {
        "lte": 200 // aggregate only documents with price <= 200
      }
    }
  },
  "size": 0,
  "aggs": {
    "brandAgg": {
      "terms": {
        "field": "brand",
        "size": 20
      }
    }
  }
}
Metric aggregation syntax#
Now we want to perform calculations within each bucket, such as the min, max, and average user scores per brand.
This uses Metric aggregations, e.g., stats, to obtain min, max, avg, etc.
GET /hotel/_search
{
  "size": 0,
  "aggs": {
    "brandAgg": {
      "terms": {
        "field": "brand",
        "size": 20
      },
      "aggs": { // a sub-aggregation for each brand
        "score_stats": { // aggregation name
          "stats": { // type of aggregation to compute
            "field": "score" // aggregation field
          }
        }
      }
    }
  }
}
Here, the score_stats aggregation is nested inside the brandAgg aggregation, since we want to compute it for each bucket.
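The nesting of a stats aggregation inside a terms aggregation has a close analogy in plain Java: group by brand first, then summarize the scores inside each group. The sketch below uses DoubleSummaryStatistics, which reports count, min, max, average, and sum. The Hotel class and the sample values are hypothetical and exist only for this example.

```java
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Illustration only: per-brand score statistics, analogous to terms + stats. */
class StatsAggSketch {
    /** Hypothetical minimal document with the two fields the aggregation uses. */
    static final class Hotel {
        final String brand;
        final double score;

        Hotel(String brand, double score) {
            this.brand = brand;
            this.score = score;
        }
    }

    static Map<String, DoubleSummaryStatistics> statsByBrand(List<Hotel> hotels) {
        // Outer grouping plays the role of brandAgg, the downstream collector that of score_stats
        return hotels.stream().collect(Collectors.groupingBy(
                (Hotel h) -> h.brand,
                TreeMap::new,
                Collectors.summarizingDouble((Hotel h) -> h.score)));
    }

    public static void main(String[] args) {
        // Sample data, invented for the example
        List<Hotel> hotels = List.of(new Hotel("A", 40), new Hotel("A", 50), new Hotel("B", 45));
        statsByBrand(hotels).forEach((brand, stats) -> System.out.println(brand + " -> " + stats));
    }
}
```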
Aggregations are defined at the same level as the query; the query’s role is to:

Limit the documents that participate in the aggregation

Three essential elements of an aggregation:

Aggregation name

Aggregation type

Aggregation field

Configurable properties include:

size: specify the number of aggregation results

order: specify the order of the aggregation results

field: specify the aggregation field

RestAPI implementation of aggregations#
Aggregation conditions are at the same level as the query, so you specify them via request.source().
A bucket aggregation groups the documents in the search results by brand, city, or star rating, which tells us which brands, cities, and star ratings occur in those results.
Because the aggregation runs on the search results, it is a scoped aggregation: its scope is defined by the same query conditions as the document search.
@Override
public Map<String, List<String>> filters(RequestParams params) {
    try {
        // 1. Prepare the request
        SearchRequest request = new SearchRequest("hotel");
        // 2. Prepare the DSL
        // 2.1 Query
        buildBasicQuery(params, request);
        // 2.2 Set size
        request.source().size(0);
        // 2.3 Aggregations
        buildAggregation(request);
        // 3. Send the request
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        // 4. Parse the results
        Map<String, List<String>> result = new HashMap<>();
        Aggregations aggregations = response.getAggregations();
        // 4.1 Get brand results
        List<String> brandList = getAggByName(aggregations, "brandAgg");
        result.put("Brand", brandList);
        // 4.2 Get city results
        List<String> cityList = getAggByName(aggregations, "cityAgg");
        result.put("City", cityList);
        // 4.3 Get star results
        List<String> starList = getAggByName(aggregations, "starAgg");
        result.put("Star", starList);

        return result;
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

private void buildAggregation(SearchRequest request) {
    request.source().aggregation(AggregationBuilders
            .terms("brandAgg")
            .field("brand")
            .size(100)
    );
    request.source().aggregation(AggregationBuilders
            .terms("cityAgg")
            .field("city")
            .size(100)
    );
    request.source().aggregation(AggregationBuilders
            .terms("starAgg")
            .field("starName")
            .size(100)
    );
}

private List<String> getAggByName(Aggregations aggregations, String aggName) {
    // 4.1 Get the aggregation by name
    Terms brandTerms = aggregations.get(aggName);
    // 4.2 Get buckets
    List<? extends Terms.Bucket> buckets = brandTerms.getBuckets();
    // 4.3 Iterate
    List<String> brandList = new ArrayList<>();
    for (Terms.Bucket bucket : buckets) {
        // 4.4 Get key
        String key = bucket.getKeyAsString();
        brandList.add(key);
    }
    return brandList;
}

Auto-completion#
When users type characters in the search box, we should suggest items related to the input; this is auto-complete, which suggests complete terms from partial input.
Pinyin-based tokenizer#
To complete suggestions from typed pinyin letters, documents must be tokenized into pinyin. There is an Elasticsearch pinyin tokenizer plugin on GitHub.
docker exec -it es bash

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v7.12.1/elasticsearch-analysis-pinyin-7.12.1.zip

exit
# Restart the container (use the same container name as in docker exec above)
docker restart es
Custom analyzers#
The default pinyin tokenizer converts each Chinese character to pinyin separately, whereas we want one pinyin form per term. So we need to configure the pinyin filter and build a custom analyzer around it.
An analyzer in Elasticsearch consists of three parts:

character filters: preprocess text before tokenization (e.g., removing or replacing characters)

tokenizer: splits text into terms. Examples: keyword (no tokenization) and ik_smart

token filters: further process the tokens produced by the tokenizer, such as case conversion, synonyms, or pinyin conversion

Tokenization proceeds through these three components for documents:
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": { // Custom analyzer
        "my_analyzer": {  // Analyzer name
          "tokenizer": "ik_max_word",
          "filter": "py"
        }
      },
      "filter": { // Custom token filter
        "py": { // Filter name
          "type": "pinyin", // Filter type
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
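The three analyzer stages described above can be pictured as a small pipeline. The plain-Java sketch below strips markup as a character filter, splits on whitespace as a stand-in tokenizer, and lowercases as a token filter. None of this is the IK or pinyin implementation. The class name and the specific choice for each stage are assumptions made to show the order of the stages.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

/** Illustration only: character filter -> tokenizer -> token filter, in that order. */
class AnalyzerSketch {
    static List<String> analyze(String text) {
        // 1. Character filter: replace simple HTML-like tags before tokenizing
        String filtered = text.replaceAll("<[^>]+>", " ");
        // 2. Tokenizer: split on whitespace (stand-in for ik_max_word or ik_smart)
        String[] tokens = filtered.trim().split("\\s+");
        // 3. Token filter: lowercase each token (stand-in for pinyin or synonym filters)
        return Arrays.stream(tokens)
                .filter(t -> !t.isEmpty())
                .map(t -> t.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(analyze("<b>Home</b> Inn HOTEL"));
    }
}
```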
Auto-complete query#
Elasticsearch provides the Completion Suggester to implement auto-completion. This query matches terms that start with the user input and returns them. To improve efficiency, there are constraints on the field types used for completion:

The field participating in completion queries must be of type completion.

The content is typically an array of terms used for completion.

Implementation of auto-completion:
@Override
public List<String> getSuggestions(String prefix) {
    try {
        // 1. Prepare the request
        SearchRequest request = new SearchRequest("hotel");
        // 2. Prepare the DSL
        request.source().suggest(new SuggestBuilder().addSuggestion(
                "suggestions",
                SuggestBuilders.completionSuggestion("suggestion")
                        .prefix(prefix)
                        .skipDuplicates(true)
                        .size(10)
        ));
        // 3. Send the request
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        // 4. Parse the results
        Suggest suggest = response.getSuggest();
        // 4.1 Get suggestions by name
        CompletionSuggestion suggestions = suggest.getSuggestion("suggestions");
        // 4.2 Get options
        List<CompletionSuggestion.Entry.Option> options = suggestions.getOptions();
        // 4.3 Iterate
        List<String> list = new ArrayList<>(options.size());
        for (CompletionSuggestion.Entry.Option option : options) {
            String text = option.getText().toString();
            list.add(text);
        }
        return list;
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
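As a mental model for what a prefix completion returns, the sketch below looks up all terms that start with a given prefix in a sorted set. Elasticsearch uses its own specialized in-memory structure for completion fields, so this is only an analogy. The class name and the sample terms are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Illustration only: prefix lookup over a sorted set of completion terms. */
class PrefixSuggestSketch {
    static List<String> suggest(NavigableSet<String> terms, String prefix, int size) {
        List<String> out = new ArrayList<>();
        // In sorted order, all terms sharing the prefix sit next to each other,
        // starting at the first term that is >= the prefix
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix) || out.size() >= size) {
                break;
            }
            out.add(term);
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample terms, invented for the example
        NavigableSet<String> terms = new TreeSet<>(List.of("rujia", "hanting", "ruyi", "shanghai"));
        System.out.println(suggest(terms, "ru", 10));
    }
}
```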

Data synchronization#
Elasticsearch hotel data comes from a MySQL database, so when MySQL data changes, Elasticsearch must be updated as well. This is the data synchronization between Elasticsearch and MySQL.
There are three common approaches:

Synchronous invocation

hotel-demo exposes an API to modify Elasticsearch data

The hotel management service calls the hotel-demo API after performing DB operations

Asynchronous notification

The hotel-admin service emits MQ messages after MySQL insert/update/delete

The hotel-demo listens for MQ messages and updates Elasticsearch accordingly

Binlog listening

Enable MySQL binlog

All insert, update, delete operations are logged in binlog

hotel-demo listens to binlog changes via Canal and updates Elasticsearch in real time

Approach 1: Synchronous invocation

Pros: simple and direct to implement

Cons: tight coupling between services

Approach 2: Asynchronous notification

Pros: low coupling, moderate implementation difficulty

Cons: depends on MQ reliability

Approach 3: Binlog listening

Pros: completely decouples services

Cons: enabling binlog adds DB overhead; implementation is complex

Implementing data synchronization#
Use the pre-course material’s hotel-admin project as the hotel management service. When hotel data is added, deleted, or updated, Elasticsearch data should be updated accordingly.

Start and test hotel data CRUD

Declare exchanges, queues, RoutingKeys

In hotel-admin’s add/delete/update operations, publish messages

In hotel-demo, implement message listening and update Elasticsearch data

Start and test data synchronization

Declare exchanges and queues#
MQ structure: one topic exchange, with an insert queue and a delete queue bound to it through their own routing keys.
Add dependencies
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-amqp</artifactId>
</dependency>
Define configuration class to declare the beans
import cn.itcast.hotel.constants.MqConstants;
import org.springframework.amqp.core.Binding;
import org.springframework.amqp.core.BindingBuilder;
import org.springframework.amqp.core.Queue;
import org.springframework.amqp.core.TopicExchange;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class MqConfig {
    @Bean
    public TopicExchange topicExchange() {
        return new TopicExchange(MqConstants.HOTEL_EXCHANGE, true, false);
    }

    @Bean
    public Queue insertQueue() {
        return new Queue(MqConstants.HOTEL_INSERT_QUEUE, true);
    }

    @Bean
    public Queue deleteQueue() {
        return new Queue(MqConstants.HOTEL_DELETE_QUEUE, true);
    }

    @Bean
    public Binding insertQueueBinding() {
        return BindingBuilder.bind(insertQueue()).to(topicExchange()).with(MqConstants.HOTEL_INSERT_KEY);
    }

    @Bean
    public Binding deleteQueueBinding() {
        return BindingBuilder.bind(deleteQueue()).to(topicExchange()).with(MqConstants.HOTEL_DELETE_KEY);
    }
}
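The configuration class refers to MqConstants, which is not shown in this post. A minimal sketch of such a constants class could look like the following. The constant names come from the code above, but every string value here is an assumption chosen for illustration, so use whatever names your own project defines.

```java
/**
 * Hypothetical constants class for the exchange, queues, and routing keys.
 * In the real project it would be declared public in package cn.itcast.hotel.constants.
 */
final class MqConstants {
    // All values below are assumed names, not taken from the original project
    static final String HOTEL_EXCHANGE = "hotel.topic";
    static final String HOTEL_INSERT_QUEUE = "hotel.insert.queue";
    static final String HOTEL_DELETE_QUEUE = "hotel.delete.queue";
    static final String HOTEL_INSERT_KEY = "hotel.insert";
    static final String HOTEL_DELETE_KEY = "hotel.delete";

    private MqConstants() {
        // Constants holder, not meant to be instantiated
    }

    public static void main(String[] args) {
        System.out.println(HOTEL_EXCHANGE + " -> " + HOTEL_INSERT_QUEUE + " via " + HOTEL_INSERT_KEY);
        System.out.println(HOTEL_EXCHANGE + " -> " + HOTEL_DELETE_QUEUE + " via " + HOTEL_DELETE_KEY);
    }
}
```

Keeping the names in one place means the sender in hotel-admin and the listener in hotel-demo cannot drift apart on a queue or routing key name.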
Sending MQ messages#
In hotel-admin, the add, update, and delete operations each publish an MQ message:
@PostMapping
public void saveHotel(@RequestBody Hotel hotel) {
    hotelService.save(hotel);

    // Notify listeners that a hotel was added
    rabbitTemplate.convertAndSend(MqConstants.HOTEL_EXCHANGE, MqConstants.HOTEL_INSERT_KEY, hotel.getId());
}

@PutMapping()
public void updateById(@RequestBody Hotel hotel) {
    if (hotel.getId() == null) {
        throw new InvalidParameterException("id must not be null");
    }
    hotelService.updateById(hotel);

    // Updates reuse the insert routing key: the listener re-indexes the whole document
    rabbitTemplate.convertAndSend(MqConstants.HOTEL_EXCHANGE, MqConstants.HOTEL_INSERT_KEY, hotel.getId());
}

@DeleteMapping("/{id}")
public void deleteById(@PathVariable("id") Long id) {
    hotelService.removeById(id);

    // Notify listeners that a hotel was deleted
    rabbitTemplate.convertAndSend(MqConstants.HOTEL_EXCHANGE, MqConstants.HOTEL_DELETE_KEY, id);
}
Receiving MQ messages#
Create a listener
In hotel-demo under the cn.itcast.hotel.mq package, add a class:
@Component
public class HotelListener {

    @Autowired
    private IHotelService hotelService;

    /**
     * Listen for hotel add or update operations
     * @param id hotel id
     */
    @RabbitListener(queues = MqConstants.HOTEL_INSERT_QUEUE)
    public void listenHotelInsertOrUpdate(Long id) {
        hotelService.insertById(id);
    }

    /**
     * Listen for hotel deletion
     * @param id hotel id
     */
    @RabbitListener(queues = MqConstants.HOTEL_DELETE_QUEUE)
    public void listenHotelDelete(Long id) {
        hotelService.deleteById(id);
    }
}
Implementing the business logic:
@Override
public void deleteById(Long id) {
    try {
        // 1. Prepare the request
        DeleteRequest request = new DeleteRequest("hotel", id.toString());
        // 2. Send the request
        client.delete(request, RequestOptions.DEFAULT);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

@Override
public void insertById(Long id) {
    try {
        // 0. Query hotel data by id
        Hotel hotel = getById(id);
        // Convert to the document type
        HotelDoc hotelDoc = new HotelDoc(hotel);

        // 1. Prepare the request object
        IndexRequest request = new IndexRequest("hotel").id(hotel.getId().toString());
        // 2. Prepare the JSON document
        request.source(JSON.toJSONString(hotelDoc), XContentType.JSON);
        // 3. Send the request
        client.index(request, RequestOptions.DEFAULT);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

Clusters#
Running Elasticsearch on a single machine inevitably faces two issues: handling massive data and single point of failure.

Mass data storage: shard the index into several pieces and store across multiple nodes

Single point of failure: back up shards on different nodes (replicas)

ES cluster concepts:

Cluster: A set of nodes that share the same cluster name

Node: A single Elasticsearch instance in the cluster

Shard: An index can be partitioned into parts; in a cluster, different shards can reside on different nodes

Primary shard: the original copy of a shard, as opposed to its replica shards

Replica shard: Each primary shard can have one or more replicas, which hold the same data as the primary

Data backups provide high availability but the more replicas you have, the more nodes you need, which increases cost
To balance availability and cost, you can:

Shard data to different nodes

Then back up each shard on the other nodes, achieving mutual backup

This can significantly reduce the number of service nodes required
Creating an ES cluster#
Using docker-compose:
version: '2.2'
services:
  es01:
    image: elasticsearch:7.12.1
    container_name: es01
    environment:
      - node.name=es01
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es02,es03
      - cluster.initial_master_nodes=es01,es02,es03
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    volumes:
      - data01:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
    networks:
      - elastic
  es02:
    image: elasticsearch:7.12.1
    container_name: es02
    environment:
      - node.name=es02
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es01,es03
      - cluster.initial_master_nodes=es01,es02,es03
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    volumes:
      - data02:/usr/share/elasticsearch/data
    ports:
      - 9201:9200
    networks:
      - elastic
  es03:
    image: elasticsearch:7.12.1
    container_name: es03
    environment:
      - node.name=es03
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es01,es02
      - cluster.initial_master_nodes=es01,es02,es03
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    volumes:
      - data03:/usr/share/elasticsearch/data
    networks:
      - elastic
    ports:
      - 9202:9200
volumes:
  data01:
    driver: local
  data02:
    driver: local
  data03:
    driver: local

networks:
  elastic:
    driver: bridge
If the cluster fails to start under WSL, raise the vm.max_map_count kernel setting, which Elasticsearch requires to be at least 262144:
wsl -d docker-desktop
echo 262144 > /proc/sys/vm/max_map_count
Monitor the ES cluster with Cerebro
Cluster split-brain#
Cluster role separation#
In Elasticsearch, cluster nodes can take on different roles: master-eligible, data, ingest, and coordinating.
By default, any node in the cluster assumes all four roles.
In real deployments, roles should be separated:

Master node: high CPU requirements, but relatively low memory requirements

Data node: high CPU and memory requirements

Coordinating node: high network bandwidth and CPU requirements

Role separation allows you to allocate different hardware to different nodes and avoid cross-service interference.
Split-brain#
Split-brain occurs when a network partition cuts some nodes off from the master: the isolated nodes consider the master dead and elect a new one.
When the network recovers, the cluster has two masters whose cluster states have diverged. This is split-brain.
The solution is to require a strict majority of votes to elect a master: a candidate needs at least floor(N / 2) + 1 votes, where N is the number of master-eligible nodes. An odd number of master-eligible nodes is therefore preferable. In older versions this threshold is configured with discovery.zen.minimum_master_nodes. From ES 7.0 on, the cluster manages it automatically, so split-brain is rarely an issue.
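The majority rule is easy to check with a few lines of arithmetic. The sketch below assumes the usual strict-majority quorum of floor(N / 2) + 1 votes for N master-eligible nodes. The class and method names are hypothetical.

```java
/** Illustration only: strict-majority quorum for master election. */
class QuorumSketch {
    /** Votes needed to elect a master among the given number of master-eligible nodes. */
    static int quorum(int masterEligibleNodes) {
        return masterEligibleNodes / 2 + 1;
    }

    /** Whether a partition that can reach "reachable" eligible nodes is able to elect a master. */
    static boolean canElect(int reachable, int masterEligibleNodes) {
        return reachable >= quorum(masterEligibleNodes);
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 5; n++) {
            System.out.println(n + " eligible nodes -> quorum " + quorum(n)
                    + ", tolerated failures " + (n - quorum(n)));
        }
    }
}
```

In a three-node cluster split into partitions of one and two nodes, only the two-node side can elect a master, so two masters can never coexist. A fourth node raises the quorum to three without tolerating any more failures than three nodes do, which is why odd counts are preferred.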
What is the role of master-eligible nodes?

Participate in master election

Master nodes manage the cluster state, shard information, and requests to create/delete indices

What is the role of data nodes?

CRUD operations on data

What is the role of coordinating nodes?

Route requests to other nodes

Merge results from different nodes and return to the user

Cluster distributed storage#
When adding a new document, it should be stored on different shards to balance data. How does the coordinating node decide which shard to store data on?
Sharding principle
Elasticsearch uses a hash function to determine which shard a document should be stored on:
shard = hash(_routing) % number_of_shards
Notes:

_routing defaults to the document id

The algorithm depends on the number of shards; once an index is created, the shard count cannot be changed
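The routing formula can be tried out in plain Java. The sketch below uses String.hashCode() as a stand-in hash, which is an assumption made to keep the example self-contained, since Elasticsearch uses its own hash function internally. The class name is hypothetical.

```java
/** Illustration only: shard = hash(_routing) % number_of_shards, with a stand-in hash. */
class ShardRouterSketch {
    static int shardFor(String routing, int numberOfShards) {
        // floorMod keeps the result in [0, numberOfShards) even for negative hash codes
        return Math.floorMod(routing.hashCode(), numberOfShards);
    }

    public static void main(String[] args) {
        for (String id : new String[]{"1", "2", "3"}) {
            System.out.println("id " + id + ": shard " + shardFor(id, 3)
                    + " of 3, shard " + shardFor(id, 4) + " of 4");
        }
    }
}
```

The same document id can land on a different shard once the shard count changes, so documents written earlier would no longer be found where the formula points. That is why the count is fixed at index creation.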

Cluster distributed querying#
Elasticsearch queries operate in two phases:

scatter phase: the coordinating node distributes the request to each shard

gather phase: the coordinating node collects results from data nodes and returns the final results to the user
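The two phases can be sketched in plain Java as follows: every shard returns its own best from + size scores, and the coordinating node merges them, sorts again, and cuts out the requested page. Scores stand in for whole documents here, and the class name and sample values are assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Illustration only: scatter to shards, then gather and merge on the coordinating node. */
class ScatterGatherSketch {
    static List<Double> search(List<List<Double>> shardScores, int from, int size) {
        int window = from + size;
        List<Double> merged = new ArrayList<>();
        // Scatter phase: each shard sorts locally and returns its top (from + size) hits
        for (List<Double> shard : shardScores) {
            merged.addAll(shard.stream()
                    .sorted(Comparator.reverseOrder())
                    .limit(window)
                    .collect(Collectors.toList()));
        }
        // Gather phase: merge all shard results, sort globally, and return the requested page
        return merged.stream()
                .sorted(Comparator.reverseOrder())
                .skip(from)
                .limit(size)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample scores per shard, invented for the example
        List<List<Double>> shards = List.of(List.of(9.0, 1.0, 5.0), List.of(8.0, 7.0), List.of(6.0));
        System.out.println(search(shards, 2, 2));
    }
}
```

Each shard has to return from + size hits rather than just size, because the coordinating node cannot know in advance which shard holds the hits of a later page.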

Cluster failover#
The cluster’s master node monitors node status. If a node fails, the master will immediately relocate the failed node’s shards to other nodes to ensure data safety; this is failover.

node1 is the master; the other two nodes are followers

node1 fails; a new master is elected, for example node2

node2 checks the cluster state and finds that shard-1 and shard-0 have lost a copy

The shards that lived on node1 are rebuilt on node2 and node3 from the surviving copies


Getting Started with Elasticsearch

https://dreaife.tokyo/en/posts/elasticsearch-basics/

Author

dreaife

Published at

2023-08-13

License

CC BY-NC-SA 4.0
