1. What is an analyzer?
An analyzer splits text into terms and applies normalization (to improve recall).
Given a sentence, it breaks the sentence into individual terms, then normalizes each term (tense conversion, singular/plural conversion, etc.).
Recall: increasing the number of results that a search can match.
character filter: pre-processes the text before tokenization, most commonly stripping HTML tags (<span>hello</span> --> hello) or mapping characters (& --> and, so I&you --> I and you)
tokenizer: splits the text into tokens, e.g. hello you and me --> hello, you, and, me
token filter: lowercase, stop words, synonyms, stemming, e.g. dogs --> dog, liked --> like, Tom --> tom, a/the/an --> removed, mother --> mom, small --> little
The analyzer matters: a piece of text passes through all of these stages, and only the final processed result is used to build the inverted index.
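The three stages above can be sketched locally. This is a minimal toy pipeline in Python, not Elasticsearch's actual implementation; all helper names (char_filter, tokenize, token_filters, analyze) and the tiny normalization table are made up for illustration.

```python
import re

def char_filter(text):
    # Pre-process before tokenization: strip HTML tags, map "&" to "and".
    text = re.sub(r"<[^>]+>", "", text)   # <span>hello</span> -> hello
    return text.replace("&", " and ")     # I&you -> I and you

def tokenize(text):
    # Split the text into individual tokens on whitespace.
    return text.split()

def token_filters(tokens):
    # Lowercase, drop stop words, apply a tiny normalization table.
    stopwords = {"a", "the", "an"}
    normalize = {"dogs": "dog", "liked": "like", "mother": "mom"}
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in stopwords:
            continue
        out.append(normalize.get(tok, tok))
    return out

def analyze(text):
    # Full pipeline: char filter -> tokenizer -> token filters.
    return token_filters(tokenize(char_filter(text)))

print(analyze("The <span>dogs</span> liked I&you"))
# -> ['dog', 'like', 'i', 'and', 'you']
```

The output is what would be stored in the inverted index for that input.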
2. Built-in analyzers
Example sentence: Set the shape to semi-transparent by calling set_trans(5)
standard analyzer: set, the, shape, to, semi, transparent, by, calling, set_trans, 5 (this is the default)
simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans
whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
language analyzer (language-specific analyzers, e.g. english for English): set, shape, semi, transpar, call, set_tran, 5
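The differences between the first three analyzers can be approximated with plain Python string handling. This is a rough sketch for intuition only; Elasticsearch's real tokenizers follow Unicode word-boundary rules, not these simple regexes.

```python
import re

sentence = "Set the shape to semi-transparent by calling set_trans(5)"

# standard-like: split on word characters (keeps set_trans whole), lowercase
standard_like = [t.lower() for t in re.findall(r"\w+", sentence)]

# simple-like: split on anything that is not a letter, lowercase
simple_like = [t.lower() for t in re.findall(r"[A-Za-z]+", sentence)]

# whitespace: split only on whitespace, no lowercasing
whitespace_like = sentence.split()

print(standard_like)
print(simple_like)
print(whitespace_like)
```

Note how simple splits set_trans into set and trans, while whitespace keeps semi-transparent and set_trans(5) intact.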
3. Testing an analyzer
GET /_analyze
{
"analyzer": "standard",
"text": "Text to analyze"
}
GET /_analyze
{
"analyzer": "english",
"text": "Text to analyze"
}
1. The default analyzer
standard, which consists of:
standard tokenizer: splits on word boundaries
standard token filter: does nothing
lowercase token filter: converts all letters to lowercase
stop token filter (disabled by default): removes stop words such as a, the, it
2. Modifying analyzer settings
Enable the english stop-word token filter:
PUT /my_index
{
"settings": {
"analysis": {
"analyzer": {
"es_std": { //自己起的名字
"type": "standard",
"stopwords": "_english_" //启用英语移除停用词
}
}
}
}
}
Test:
GET /my_index/_analyze
{
"analyzer": "standard",
"text": "a dog is in the house"
}
GET /my_index/_analyze
{
"analyzer": "es_std",
"text":"a dog is in the house"
}
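The difference between the two requests above can be simulated locally. This sketch uses a small subset of the _english_ stop-word list (the real Lucene list is longer); the helper name std_analyze is made up for illustration.

```python
# Subset of the English stop-word list, for illustration only.
english_stopwords = {"a", "an", "the", "is", "in", "and", "or", "of", "to"}

def std_analyze(text, stopwords=()):
    # Tokenize on whitespace, lowercase, then drop any stop words.
    tokens = [t.lower() for t in text.split()]
    return [t for t in tokens if t not in stopwords]

text = "a dog is in the house"
print(std_analyze(text))                     # standard: keeps every token
print(std_analyze(text, english_stopwords))  # es_std: stop words removed
```

The first call keeps all six tokens; the second keeps only dog and house, which is what es_std should index.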
3. Defining a custom analyzer
PUT /my_index
{
"settings": {
"analysis": {
"char_filter": { //定义字符转换名称
"&_to_and": {
"type": "mapping",//映射
"mappings": ["&=> and"]
}
},
"filter": {
"my_stopwords": { //定义移除停用词名称
"type": "stop",//停用词
"stopwords": ["the", "a"]
}
},
"analyzer": {
"my_analyzer": { //自定义需要使用的分词器名称
"type": "custom", //自定义
"char_filter": ["html_strip", "&_to_and"], //html_strip(应该是内置)表示移除html,&_to_and表示&转换成and
"tokenizer": "standard", //基础默认的分词器
"filter": ["lowercase", "my_stopwords"] //lowercase表示内置的大小写转换,my_stopwords这个是自定义的移除停用词
}
}
}
}
}
// analyze text with the custom analyzer
GET /my_index/_analyze
{
"text": "tom&jerry are a friend in the house, <a>, HAHA!!",
"analyzer": "my_analyzer"
}
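What should that request return? The stages of my_analyzer can be simulated locally. This is a sketch, not the real Elasticsearch output; the regexes only approximate html_strip and the standard tokenizer.

```python
import re

def my_analyzer(text):
    # char filters: html_strip, then &_to_and
    text = re.sub(r"<[^>]*>", "", text)   # strip HTML tags like <a>
    text = text.replace("&", " and ")     # tom&jerry -> tom and jerry
    # standard-ish tokenizer: split on word characters
    tokens = re.findall(r"\w+", text)
    # token filters: lowercase, then my_stopwords
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in {"the", "a"}]

print(my_analyzer("tom&jerry are a friend in the house, <a>, HAHA!!"))
# -> ['tom', 'and', 'jerry', 'are', 'friend', 'in', 'house', 'haha']
```

The <a> tag and punctuation are gone, & became and, everything is lowercase, and the and a were removed as stop words.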
// apply the custom analyzer to a field mapping (this uses the pre-7.x mapping-type API)
PUT /my_index/_mapping/my_type
{
"properties": {
"content": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}