Decoration Company Profile_Catering Industry Management System_Free Information Promotion Platform_SEO Salary

2024/12/22 15:34:27 Source: https://blog.csdn.net/qq_26112725/article/details/144632366

1. Requirement:

Suppose we have the raw text "电阻@CAL-CHIP@55@55w@±1%@330Ω@1/8W@55℃@xiaolong(电脑)@3000HF". With Elasticsearch's built-in analyzer, the tokenization result is as follows:

{"tokens": [{"token": "电","start_offset": 0,"end_offset": 1,"type": "<IDEOGRAPHIC>","position": 0},{"token": "阻","start_offset": 1,"end_offset": 2,"type": "<IDEOGRAPHIC>","position": 1},{"token": "cal","start_offset": 3,"end_offset": 6,"type": "<ALPHANUM>","position": 2},{"token": "chip","start_offset": 7,"end_offset": 11,"type": "<ALPHANUM>","position": 3},{"token": "55","start_offset": 12,"end_offset": 14,"type": "<NUM>","position": 4},{"token": "55w","start_offset": 15,"end_offset": 18,"type": "<ALPHANUM>","position": 5},{"token": "1","start_offset": 20,"end_offset": 21,"type": "<NUM>","position": 6},{"token": "330ω","start_offset": 23,"end_offset": 27,"type": "<ALPHANUM>","position": 7},{"token": "1","start_offset": 28,"end_offset": 29,"type": "<NUM>","position": 8},{"token": "8w","start_offset": 30,"end_offset": 32,"type": "<ALPHANUM>","position": 9},{"token": "55","start_offset": 33,"end_offset": 35,"type": "<NUM>","position": 10},{"token": "xiaolong","start_offset": 37,"end_offset": 45,"type": "<ALPHANUM>","position": 11},{"token": "电","start_offset": 46,"end_offset": 47,"type": "<IDEOGRAPHIC>","position": 12},{"token": "脑","start_offset": 47,"end_offset": 48,"type": "<IDEOGRAPHIC>","position": 13},{"token": "3000hf","start_offset": 50,"end_offset": 56,"type": "<ALPHANUM>","position": 14}]
}

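As you can see, the text is split at symbols such as @, -, ±, % and /, and Chinese text is split character by character. What I need is to keep this behavior unchanged, except that the ℃ symbol must not be dropped. In other words, the ideal result looks like this: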

 ........
{"token": "8w","start_offset": 30,"end_offset": 32,"type": "<ALPHANUM>","position": 9},{"token": "55℃","start_offset": 33,"end_offset": 36,"type": "<NUM>","position": 10},{"token": "xiaolong","start_offset": 37,"end_offset": 45,"type": "<ALPHANUM>","position": 11},
.....

2. Implementing it with a custom analyzer:

{"settings": {"analysis": {"analyzer": {"custom_analyzer": {"type": "custom","tokenizer": "combined_tokenizer","char_filter": ["chinese_space_char_filter"]}},"tokenizer": {"combined_tokenizer": {"type": "pattern","pattern": "-|@|,|!|\\?|=|/|±| |\\(|\\)"}},"char_filter": {"chinese_space_char_filter": {"type": "pattern_replace","pattern": "([\\u4e00-\\u9fa5])","replacement": " $1 "}}}},"mappings": {"properties": {"parameterSplicing": {"type": "text","analyzer": "custom_analyzer","index": true,"store": false}}}}

With this analyzer in place, analyzing the following text:

{"analyzer": "custom_analyzer","text": "电阻@CAL-CHIP@55@55w@±1%@330Ω@1/8W@55℃@xiaolong(小龙牌电脑)@3000HF"
}

Tokenization result:
{"tokens": [{"token": "电","start_offset": 0,"end_offset": 0,"type": "word","position": 0},{"token": "阻","start_offset": 1,"end_offset": 1,"type": "word","position": 1},{"token": "CAL","start_offset": 3,"end_offset": 6,"type": "word","position": 2},{"token": "CHIP","start_offset": 7,"end_offset": 11,"type": "word","position": 3},{"token": "55","start_offset": 12,"end_offset": 14,"type": "word","position": 4},{"token": "55w","start_offset": 15,"end_offset": 18,"type": "word","position": 5},{"token": "1%","start_offset": 20,"end_offset": 22,"type": "word","position": 6},{"token": "330Ω","start_offset": 23,"end_offset": 27,"type": "word","position": 7},{"token": "1","start_offset": 28,"end_offset": 29,"type": "word","position": 8},{"token": "8W","start_offset": 30,"end_offset": 32,"type": "word","position": 9},{"token": "55℃","start_offset": 33,"end_offset": 36,"type": "word","position": 10},{"token": "xiaolong","start_offset": 37,"end_offset": 45,"type": "word","position": 11},{"token": "小","start_offset": 46,"end_offset": 46,"type": "word","position": 12},{"token": "龙","start_offset": 47,"end_offset": 47,"type": "word","position": 13},{"token": "牌","start_offset": 48,"end_offset": 48,"type": "word","position": 14},{"token": "电","start_offset": 49,"end_offset": 49,"type": "word","position": 15},{"token": "脑","start_offset": 50,"end_offset": 50,"type": "word","position": 16},{"token": "3000HF","start_offset": 53,"end_offset": 59,"type": "word","position": 17}]
}

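As you can see, 55℃ is preserved as a single token. Compared with the built-in analyzer's output in section 1, two other things differ: tokens are no longer lowercased (CAL rather than cal), and % is kept (1%), because this analyzer has no lowercase token filter and % is not among its delimiters.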

3. Explanation:

1. The settings section

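  • analysis node: the root of all analysis-related configuration. It defines the analysis components: analyzers (analyzer), tokenizers (tokenizer), and character filters (char_filter).
    • analyzer node (custom analyzer definition)
      • custom_analyzer: the name of the custom analyzer. Its type is custom, which means its behavior is assembled from separately chosen components (a tokenizer, character filters, and so on).
      • Components: it uses the tokenizer combined_tokenizer and the character filter chinese_space_char_filter. When text passes through this analyzer, chinese_space_char_filter first rewrites the raw text, and combined_tokenizer then splits the rewritten text into tokens (both steps are described below).
    • tokenizer node (tokenizer definition)
      • combined_tokenizer: a tokenizer of type pattern, meaning it splits text using a regular expression.
      • pattern setting: the pattern is a regular expression that lists the delimiters as alternatives joined by | ("or"): -, @, the comma, !, ?, =, /, ±, the space, ( and ). Because ?, ( and ) are regex metacharacters, they must be backslash-escaped (written \\?, \\( and \\) inside a JSON string) to match literally. Wherever one of these delimiters occurs, the tokenizer splits the text and discards the delimiter. For example, "电阻@CAL-CHIP@0805@±1%@330Ω@1/8W@55℃" is split at every @, -, ± and /. ℃ is not in the delimiter list, so 55℃ stays intact.
    • char_filter node (character filter definition)
      • chinese_space_char_filter: a character filter of type pattern_replace that rewrites the Chinese characters in the text.
      • pattern and replacement settings: the pattern ([\\u4e00-\\u9fa5]) uses a Unicode range to match a single Chinese character (the common CJK ideographs) and captures it in a group. The replacement is " $1 ", that is, the captured character with one space added before and after it. Because character filters run before the tokenizer, every Chinese character reaches the tokenizer surrounded by spaces, and the space is one of the delimiters of combined_tokenizer.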
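
In a regular expression, ?, ( and ) are metacharacters, so a delimiter list like the one above has to escape them before it can be used as a split pattern. The sketch below builds such an alternation in Python; the sample string is made up for illustration. Python's re module is only a stand-in here: Elasticsearch uses Java regular expressions, and the exact escaped string that re.escape produces differs between Python versions, so treat it as a way to check the idea rather than as text to paste into the index settings.

```python
import re

# Delimiter characters taken from the tokenizer pattern discussed above.
delimiters = ["-", "@", ",", "!", "?", "=", "/", "±", " ", "(", ")"]

# re.escape backslash-escapes regex metacharacters such as ?, ( and ),
# so each entry matches only the literal character.
pattern = "|".join(re.escape(d) for d in delimiters)

# Split a sample string (hypothetical input) and drop empty pieces,
# which appear when two delimiters are adjacent or end the string.
parts = [p for p in re.split(pattern, "1/8W@55℃@xiaolong(test)?") if p]
print(parts)
```

Since ℃ is not in the delimiter list, it stays attached to the preceding digits.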



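One point deserves emphasis: Chinese text is split character by character because the character filter adds a space before and after every Chinese character before tokenization, and the tokenizer is configured to split on spaces. The combination yields one token per Chinese character.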
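
The two-stage behavior described above can be approximated outside Elasticsearch. The following is a minimal sketch in Python, assuming the delimiter set -, @, comma, !, ?, =, /, ±, space, ( and ); the function names char_filter, tokenize and analyze are hypothetical helpers, not Elasticsearch APIs, and the sketch reproduces only the token text, not offsets or positions.

```python
import re

# Stand-in for the pattern_replace char filter: matches one Chinese character.
CJK_CHAR = re.compile(r"([\u4e00-\u9fa5])")
# Stand-in for the pattern tokenizer: every match is treated as a delimiter.
DELIMITERS = re.compile(r"-|@|,|!|\?|=|/|±| |\(|\)")


def char_filter(text):
    """Wrap every Chinese character in spaces (replacement " $1 ")."""
    return CJK_CHAR.sub(r" \1 ", text)


def tokenize(text):
    """Split on delimiters and keep only the non-empty pieces."""
    return [piece for piece in DELIMITERS.split(text) if piece]


def analyze(text):
    """Run the char filter first, then the tokenizer."""
    return tokenize(char_filter(text))


tokens = analyze(
    "电阻@CAL-CHIP@55@55w@±1%@330Ω@1/8W@55℃@xiaolong(小龙牌电脑)@3000HF"
)
print(tokens)
```

Calling tokenize directly, without char_filter, leaves 电阻 as a single token, which shows that the character-by-character split comes from the inserted spaces and not from the tokenizer itself.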
