Elasticsearch Analyzers - 白红宇

Published: 2019-06-28

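1. What is an analyzer

An analyzer splits text into terms and normalizes them, which improves recall.

Given a sentence, it breaks the sentence into individual words and normalizes each word (for example, converting tenses and converting between singular and plural forms).

Recall: how many of the relevant results a search can actually find. Normalization increases the number of results a query is able to match.

character filter: preprocesses the text before it is tokenized. The most common examples are stripping HTML tags (<span>hello</span> --> hello) and replacing characters (& --> and, so I&you --> I and you).

tokenizer: splits the text into tokens, e.g. hello you and me --> hello, you, and, me

token filter: transforms the tokens, e.g. lowercasing (Tom --> tom), stop-word removal (a/the/an --> removed), synonyms (mother --> mom, small --> little), and stemming (dogs --> dog, liked --> like)

The analyzer matters because a piece of text goes through all of this processing, and only the processed result is used to build the inverted index.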
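To make the link between normalization, the inverted index, and recall concrete, here is a minimal Python sketch. It is a toy model, not how Elasticsearch is implemented: the `NORMALIZE` table, the stop-word set, and the function names are all made up for illustration, and a real analyzer would use a proper stemmer.

```python
from collections import defaultdict

# Toy normalization table (hypothetical); a real analyzer would use a stemmer.
NORMALIZE = {"dogs": "dog", "liked": "like"}
STOP_WORDS = {"a", "an", "the"}


def analyze(text):
    """Tokenize on whitespace, lowercase, drop stop words, normalize word forms."""
    tokens = []
    for raw in text.split():
        token = raw.lower()
        if token in STOP_WORDS:
            continue
        tokens.append(NORMALIZE.get(token, token))
    return tokens


def build_inverted_index(docs):
    """Map each analyzed term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Analyze the query the same way and return ids matching any query term."""
    hits = set()
    for term in analyze(query):
        hits.update(index.get(term, []))
    return sorted(hits)


docs = {1: "Tom liked the dogs", 2: "a dog in the house"}
index = build_inverted_index(docs)
print(search(index, "dog"))  # both documents match once "dogs" is normalized
```

Without the normalization step, a query for dog would only find document 2, which is the recall gain described above.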

2. Built-in analyzers

Sample text: Set the shape to semi-transparent by calling set_trans(5)

standard analyzer (the default): set, the, shape, to, semi, transparent, by, calling, set_trans, 5

simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans

whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)

language analyzer (language-specific, e.g. english for English): set, shape, semi, transpar, call, set_tran, 5
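The difference between the whitespace and simple analyzers can be approximated in a few lines of Python. This is only a rough imitation under the assumption that the input is ASCII English; the function names are hypothetical and the real analyzers handle Unicode text.

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"


def whitespace_tokens(text):
    """Split on whitespace only; case and punctuation are left untouched."""
    return text.split()


def simple_tokens(text):
    """Split on anything that is not an ASCII letter, then lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


print(whitespace_tokens(TEXT))
print(simple_tokens(TEXT))
```

Note how the simple variant drops the digit 5 and splits set_trans at the underscore, while the whitespace variant keeps set_trans(5) as a single token.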

3. Testing an analyzer

GET /_analyze
{
  "analyzer": "standard",
  "text": "Text to analyze"
}

GET /_analyze
{
  "analyzer": "english",
  "text": "Text to analyze"
}
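The same request can be issued from a script. The sketch below uses only the Python standard library and assumes an unsecured cluster at `http://localhost:9200`; that URL and the helper names are assumptions for illustration. It builds the request without sending it, so it can be inspected without a running cluster.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local, unsecured cluster


def build_analyze_request(analyzer, text, index=None):
    """Build (but do not send) a POST request for the _analyze API."""
    path = f"/{index}/_analyze" if index else "/_analyze"
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        ES_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_strings(response_json):
    """Extract the token texts from a parsed _analyze response."""
    return [t["token"] for t in response_json["tokens"]]


req = build_analyze_request("standard", "Text to analyze")
print(req.get_method(), req.full_url)
# To actually send it (requires a running cluster):
#   with urllib.request.urlopen(req) as resp:
#       print(token_strings(json.load(resp)))
```

Passing `index="my_index"` targets an index-level endpoint, which is what is needed to test analyzers defined in that index's settings.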

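1. The default analyzer

The default analyzer is standard, which is made up of:

standard tokenizer: splits text on word boundaries

standard token filter: does nothing

lowercase token filter: converts all letters to lowercase

stop token filter (disabled by default): removes stop words such as a, the, it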

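2. Changing analyzer settings

Enable the English stop-word token filter. Here es_std is a name of our own choosing, and "stopwords": "_english_" turns on removal of English stop words:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}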

Test:

GET /my_index/_analyze
{
  "analyzer": "standard",
  "text": "a dog is in the house"
}

GET /my_index/_analyze
{
  "analyzer": "es_std",
  "text": "a dog is in the house"
}
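What to expect from the two test requests can be approximated offline. The following Python sketch is a rough stand-in for the standard analyzer, not its actual implementation: it tokenizes with a `\w+` regular expression rather than Unicode word-boundary rules, and the stop-word set is my recollection of the default English list behind `_english_`, so treat its exact contents as an assumption.

```python
import re

# Assumed contents of the _english_ stop-word list (Lucene's default English set).
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
    "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
    "their", "then", "there", "these", "they", "this", "to", "was", "will", "with",
}


def standard_like(text, stopwords=None):
    """Rough stand-in for the standard analyzer: word tokens, lowercased,
    with an optional stop filter (disabled when stopwords is None)."""
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    if stopwords:
        tokens = [t for t in tokens if t not in stopwords]
    return tokens


text = "a dog is in the house"
print(standard_like(text))                      # stop filter disabled
print(standard_like(text, ENGLISH_STOP_WORDS))  # stop filter enabled
```

With the stop filter disabled every word survives; with it enabled only dog and house remain, which mirrors the difference between standard and es_std.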

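3. Building a custom analyzer

The request below defines a mapping char filter named &_to_and, a stop token filter named my_stopwords, and a custom analyzer named my_analyzer. The analyzer first applies html_strip (a built-in char filter that removes HTML) and &_to_and (which converts & to and), then tokenizes with the standard tokenizer, and finally applies the built-in lowercase filter followed by the custom my_stopwords filter:

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": ["&=> and"]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["the", "a"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "&_to_and"],
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stopwords"]
        }
      }
    }
  }
}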

Analyze text with the custom analyzer:

GET /my_index/_analyze
{
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!",
  "analyzer": "my_analyzer"
}
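The order in which my_analyzer applies its stages can be traced with a small Python imitation. This is a sketch under stated assumptions rather than the real implementation: the tag-stripping regex is much cruder than html_strip, the tokenizer is a plain `\w+` match, and it assumes the mapping replaces & with the bare string and (my understanding is that Elasticsearch trims whitespace around both sides of `=>`, but verify that against a real cluster).

```python
import re


def html_strip(text):
    """Crude stand-in for the html_strip char filter: drop anything tag-like."""
    return re.sub(r"<[^>]+>", "", text)


def amp_to_and(text):
    """Stand-in for the &_to_and mapping char filter (assumes the replacement
    carries no surrounding spaces)."""
    return text.replace("&", "and")


def my_analyzer(text, stopwords=("the", "a")):
    """Char filters, then a word tokenizer, then lowercase and stop filters."""
    for char_filter in (html_strip, amp_to_and):
        text = char_filter(text)
    tokens = re.findall(r"\w+", text)
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in stopwords]


print(my_analyzer("tom&jerry are a friend in the house, <a>, HAHA!!"))
```

Under these assumptions tom&jerry becomes the single token tomandjerry, because the char filter runs before tokenization and inserts no spaces.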

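Use the custom analyzer in a field mapping (this path includes a mapping type, my_type, as in Elasticsearch 6.x and earlier; typeless versions use PUT /my_index/_mapping):

PUT /my_index/_mapping/my_type
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}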

Reposted from: https://www.cnblogs.com/kesimin/p/9559968.html
