solr - SOLR and accented characters

Question

I have an index of occupations (identifier + occupation name):

<field name="occ_id" type="int" indexed="true" stored="true" required="true" />
<field name="occ_tx_name" type="text_es" indexed="true" stored="true" multiValued="false" />


<!-- Spanish -->
<fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
  <analyzer> 
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_es.txt" format="snowball" />
    <filter class="solr.SpanishLightStemFilterFactory"/>
  </analyzer>
</fieldType>

This is a real query for three identifiers (1, 195 and 129):

curl -X GET "http://192.168.1.11:8983/solr/cyp_occupations/select?indent=on&q=occ_id:1+occ_id:195+occ_id:129&wt=json"
{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"occ_id:1 occ_id:195 occ_id:129",
      "indent":"on",
      "wt":"json"}},
  "response":{"numFound":3,"start":0,"docs":[
      {
        "occ_id":1,
        "occ_tx_name":"Abogado",
        "_version_":1565225103805906944},
      {
        "occ_id":129,
        "occ_tx_name":"Informático",
        "_version_":1565225103843655680},
      {
        "occ_id":195,
        "occ_tx_name":"Osteópata",
        "_version_":1565225103858335746}]
  }}

Two of them contain accented characters and one does not. So let's search by occ_tx_name without the accents:

curl -X GET "http://192.168.1.11:8983/solr/cyp_occupations/select?indent=on&q=occ_tx_name:abogado&wt=json"
{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"occ_tx_name:abogado",
      "indent":"on",
      "wt":"json"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "occ_id":1,
        "occ_tx_name":"Abogado",
        "_version_":1565225103805906944}]
  }}

curl -X GET "http://192.168.1.11:8983/solr/cyp_occupations/select?indent=on&q=occ_tx_name:informatico&wt=json"
{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"occ_tx_name:informatico",
      "indent":"on",
      "wt":"json"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "occ_id":129,
        "occ_tx_name":"Informático",
        "_version_":1565225103843655680}]
  }}


curl -X GET "http://192.168.1.11:8983/solr/cyp_occupations/select?indent=on&q=occ_tx_name:osteopata&wt=json"
{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"occ_tx_name:osteopata",
      "indent":"on",
      "wt":"json"}},
  "response":{"numFound":0,"start":0,"docs":[]
  }}

It really annoys me that the last search, "osteopata", fails while "informatico" succeeds. The source data for the index is a simple MySQL table:

-- -----------------------------------------------------
-- Table `mydb`.`occ_occupation`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `mydb`.`occ_occupation` (
  `occ_id` INT UNSIGNED NOT NULL,
  `occ_tx_name` VARCHAR(255) NOT NULL,
  PRIMARY KEY (`occ_id`))
ENGINE = InnoDB;

The table's collation is "utf8mb4_general_ci". The index is built with the DataImportHandler. This is the definition:

<dataConfig>
    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://192.168.1.11:3306/mydb"
        user="mydb" password="mydb" />
    <document name="occupations">
        <entity name="occupation" pk="occ_id"
            query="SELECT occ.occ_id, occ.occ_tx_name FROM occ_occupation occ WHERE occ.sta_bo_deleted = false">
            <field column="occ_id" name="occ_id" />
            <field column="occ_tx_name" name="occ_tx_name" />
        </entity>
    </document>
</dataConfig>

I need some clues to track down the problem. Can anyone help me? Thanks in advance.

score 1 · Accepted Answer

Just add solr.ASCIIFoldingFilterFactory to the filter chain of your analyzer or, even better, create a new fieldType:

<!-- Spanish -->
<fieldType name="text_es_ascii_folding" class="solr.TextField" positionIncrementGap="100">
  <analyzer> 
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_es.txt" format="snowball" />
    <filter class="solr.SpanishLightStemFilterFactory"/>
  </analyzer>
</fieldType>

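This filter converts alphabetic, numeric, and symbolic Unicode characters that are not in the Basic Latin Unicode block (the first 128 code points, i.e. the ASCII range) to their ASCII equivalents, if one exists.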

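This should let your searches match even when the accented characters are missing. The downside is that words like "cañon" and "canon" are now equivalent and both hit the same documents, IIRC.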
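To make the trade-off concrete, here is a minimal Python sketch of what accent folding does to a term. It is only an approximation built on Unicode decomposition, not Solr's implementation: Solr's filter works from its own mapping table, so results can differ for characters this approach does not handle. The function name `fold_accents` is made up for this illustration.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Approximate accent folding: decompose, then drop combining marks.

    Illustrative stand-in only; this is not how Solr implements its filter.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for word in ("Informático", "Osteópata", "cañon", "canon"):
        # Lowercasing mirrors the LowerCaseFilterFactory step in the chain.
        print(word, "->", fold_accents(word).lower())
```

After folding, "cañon" and "canon" produce the same term, which is exactly the loss of distinction mentioned above.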

score 0 · Accepted Answer

OK, I found the source of the problem. I opened my SQL load script with vi in hex mode.

This is the hex content of "Agrónomo" in the INSERT statement: 41 67 72 6f cc 81 6e 6f 6d 6f.

6f cc 81!!!! That is "o" followed by COMBINING ACUTE ACCENT (U+0301) in UTF-8!!!!

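So that's the problem... it should have been "c3 b3", the single precomposed "ó". I copied and pasted the text from a web page, so the characters in that source were the problem.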
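For anyone who wants to reproduce the check without a hex editor, the following Python sketch prints the UTF-8 bytes of both forms and shows that NFC normalization turns the decomposed sequence into the precomposed one. The helper name `utf8_hex` is invented for this example, and normalizing the load script before import is only one possible way to clean the data.

```python
import unicodedata


def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


# Decomposed form, as found in the load script: "o" followed by U+0301.
decomposed = "Agro\u0301nomo"
# NFC composes "o" + U+0301 into the single code point U+00F3.
composed = unicodedata.normalize("NFC", decomposed)

if __name__ == "__main__":
    print(utf8_hex(decomposed))    # 41 67 72 6f cc 81 6e 6f 6d 6f
    print(utf8_hex(composed))      # 41 67 72 c3 b3 6e 6f 6d 6f
    print(decomposed == composed)  # False: same glyphs, different code points
```

The two strings render identically but compare as different, which is why a term typed with a plain keyboard did not match the indexed value.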

Thanks to both of you, because I now know a bit more about the internals of SOLR.

Regards.

score 0 · Accepted Answer

I don't think MySQL or your JVM settings have anything to do with this. I suspect that one works and the other doesn't because of the SpanishLightStemFilterFactory.

The correct way to get matches regardless of diacritics is to use the following:

  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>

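Put it before the tokenizer in both the index and query analyzer chains, and any diacritics should be converted to their ASCII version. That will make it work consistently.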
