parsing - Elasticsearch: the right strategy for indexing the content of html files

Question

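Hello Elasticsearch experts!

I have a use case and I am not sure what the best approach is.

I have html files that I need to index. This part is easy, since I can configure my custom analyzer and create the index.

I do have one special requirement, though: during indexing I need to extract some data into a dedicated field.

This is an extract from the html, which has thousands of lines like these: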

<td>....</td>
<td>...
<p>Great item to truck</p></td>...
<a href="javascript:selectItem('1.a.b.c.1.d.f.11')">1.a.b.c.1.d.f.11</a> ...

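Lots of junk, even inline CSS.

My limitation:

I have no way to change the html.

My challenges:

Index the text of the html files while stripping html tags, CSS, and noise.
I need autocomplete on the text that is part of the link, for example 1.a.b.c.1.d.f.11.

So when the user starts typing 1.a.b.c.1.d.f.11, I must be able to autocomplete it.

Should I create an analyzer that strips everything except the content of the tag? If so, how do I do that?

I would appreciate any comments or hints on what you think is the right approach with Elasticsearch.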

score 3 · Accepted Answer

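Solution 1:

I suggest developing a small application that parses the html file contents and indexes only the data you are interested in. In other words, strip all html tags and unnecessary data before indexing.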
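
As a rough illustration of such a small application, here is a minimal sketch using only Python's standard `html.parser` module. It collects the visible text, skips `<style>` and `<script>` content, and gathers the text of `<a>` elements separately so it can go into its own field. The class name `ItemPageParser`, the helper `extract`, and the field names `content` and `item_codes` are hypothetical, chosen only for this example.

```python
from html.parser import HTMLParser


class ItemPageParser(HTMLParser):
    """Collects visible text and <a> link text, skipping <style>/<script>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []   # visible text outside links
        self.link_codes = []   # text found inside <a> elements
        self._skip_depth = 0   # > 0 while inside <style> or <script>
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip_depth += 1
        elif tag == "a":
            self._in_link = True

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._skip_depth:
            return
        cleaned = data.strip()
        if not cleaned:
            return
        if self._in_link:
            self.link_codes.append(cleaned)
        else:
            self.text_parts.append(cleaned)


def extract(html):
    """Return a dict with hypothetical field names ready to be indexed."""
    parser = ItemPageParser()
    parser.feed(html)
    parser.close()  # flush any trailing text
    return {
        "content": " ".join(parser.text_parts),
        "item_codes": parser.link_codes,
    }


sample = (
    "<td>....</td><td>...<p>Great item to truck</p></td>..."
    "<a href=\"javascript:selectItem('1.a.b.c.1.d.f.11')\">1.a.b.c.1.d.f.11</a> ..."
)
doc = extract(sample)
```

The resulting dictionary could then be sent to Elasticsearch with whatever client you use, with `item_codes` mapped to a field that uses an autocomplete analyzer.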

Solution 2:

You can use the [html_strip] char filter to strip out all html tags:

GET /_analyze?tokenizer=keyword&token_filters=lowercase&char_filters=html_strip&text=<td>....</td><td>...<p>Great item to truck</p></td>...<a href="javascript:selectItem('1.a.b.c.1.d.f.11')">1.a.b.c.1.d.f.11</a> ...
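
Putting raw html into a query string means dealing with URL encoding. As an alternative, the sketch below builds the same analysis request as a JSON body. It assumes a version of Elasticsearch whose `_analyze` endpoint accepts a body with the keys `tokenizer`, `char_filter`, `filter`, and `text`; please check this against the version you run. The helper name `build_analyze_request` is hypothetical.

```python
import json


def build_analyze_request(text):
    """Build a JSON-serializable body for the _analyze endpoint.

    Assumption: the target Elasticsearch version accepts these keys
    in the request body rather than as query-string parameters.
    """
    return {
        "tokenizer": "keyword",
        "char_filter": ["html_strip"],
        "filter": ["lowercase"],
        "text": text,
    }


html = (
    "<td>....</td><td>...<p>Great item to truck</p></td>..."
    "<a href=\"javascript:selectItem('1.a.b.c.1.d.f.11')\">1.a.b.c.1.d.f.11</a> ..."
)
body = json.dumps(build_analyze_request(html))
```

The string in `body` can be sent with any HTTP client as the payload of a request to `/_analyze`.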

score 0 · Accepted Answer

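Solution 1

Now, if you want to remove the html completely before indexing and storing the content, you can use the mapper attachments plugin. When you define the mapping, you can set content_type to "html".

The mapper attachments plugin is useful for many things, especially when you handle multiple document types, but most notably I believe it is sufficient to use it just to strip out the html tags (which you cannot do with the html_strip char filter, since that filter only changes the analyzed tokens and not the stored content).

Just a warning: none of the html tags will be stored. So if you do need those tags, I suggest defining another field to store the original content. Another note: you cannot specify multi-fields for a mapper attachments document, so you need to store it outside the mapper attachments document. See the working example below.

You need to produce this mapping: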

{
  "html5-es" : {
    "aliases" : { },
    "mappings" : {
      "document" : {
        "properties" : {
          "delete" : {
            "type" : "boolean"
          },
          "file" : {
            "type" : "attachment",
            "fields" : {
              "content" : {
                "type" : "string",
                "store" : true,
                "term_vector" : "with_positions_offsets",
                "analyzer" : "autocomplete"
              },
              "author" : {
                "type" : "string",
                "store" : true,
                "term_vector" : "with_positions_offsets"
              },
              "title" : {
                "type" : "string",
                "store" : true,
                "term_vector" : "with_positions_offsets",
                "analyzer" : "autocomplete"
              },
              "name" : {
                "type" : "string"
              },
              "date" : {
                "type" : "date",
                "format" : "strict_date_optional_time||epoch_millis"
              },
              "keywords" : {
                "type" : "string"
              },
              "content_type" : {
                "type" : "string"
              },
              "content_length" : {
                "type" : "integer"
              },
              "language" : {
                "type" : "string"
              }
            }
          },
          "hash_id" : {
            "type" : "string"
          },
          "path" : {
            "type" : "string"
          },
          "raw_content" : {
            "type" : "string",
            "store" : true,
            "term_vector" : "with_positions_offsets",
            "analyzer" : "raw"
          },
          "title" : {
            "type" : "string"
          }
        }
      }
    },
    "settings" : { /* insert your own settings here */ },
    "warmers" : { }
  }
}

Then, in NEST, I assemble the content as follows:

Attachment attachment = new Attachment();
attachment.Content = Convert.ToBase64String(File.ReadAllBytes("path/to/document"));
attachment.ContentType = "html";

Document document = new Document();
document.File = attachment;
document.RawContent = InsertRawContentFromString(originalText);

I tested this in Sense. The result is as follows:

"file": {
    "_content": "PGh0bWwgeG1sbnM6TWFkQ2FwPSJodHRwOi8vd3d3Lm1hZGNhcHNvZnR3YXJlLmNvbS9TY2hlbWFzL01hZENhcC54c2QiPg0KICA8aGVhZCAvPg0KICA8Ym9keT4NCiAgICA8aDE+VG9waWMxMDwvaDE+DQogICAgPHA+RGVsZXRlIHRoaXMgdGV4dCBhbmQgcmVwbGFjZSBpdCB3aXRoIHlvdXIgb3duIGNvbnRlbnQuIENoZWNrIHlvdXIgbWFpbGJveC48L3A+DQogICAgPHA+wqA8L3A+DQogICAgPHA+YXNkZjwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD4xMDwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD5MYXZlbmRlci48L3A+DQogICAgPHA+wqA8L3A+DQogICAgPHA+MTAvNiAxMjowMzwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD41IDA5PC9wPg0KICAgIDxwPsKgPC9wPg0KICAgIDxwPjExIDQ3PC9wPg0KICAgIDxwPsKgPC9wPg0KICAgIDxwPkhhbGxvd2VlbiBpcyBpbiBPY3RvYmVyLjwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD5qb2c8L3A+DQogIDwvYm9keT4NCjwvaHRtbD4=",
    "_content_length": 0,
    "_content_type": "html",
    "_date": "0001-01-01T00:00:00",
    "_title": "Topic10"
},
"delete": false,
"raw_content": "<h1>Topic10</h1><p>Delete this text and replace it with your own content. Check your mailbox.</p><p> </p><p>asdf</p><p> </p><p>10</p><p> </p><p>Lavender.</p><p> </p><p>10/6 12:03</p><p> </p><p>5 09</p><p> </p><p>11 47</p><p> </p><p>Halloween is in October.</p><p> </p><p>jog</p>"
},
"highlight": {
"file.content": [
    "\n    <em>Topic10</em>\n\n    Delete this text and replace it with your own content. Check your mailbox.\n\n     \n\n    asdf\n\n     \n\n    10\n\n     \n\n    Lavender.\n\n     \n\n    10/6 12:03\n\n     \n\n    5 09\n\n     \n\n    11 47\n\n     \n\n    Halloween is in October.\n\n     \n\n    jog\n\n  "
    ]
}
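
The document shape visible in the result above (`file._content`, `file._content_type`, and `raw_content`) can also be assembled without NEST. The following is a minimal sketch, not the answer's own code: the helper name `build_document` is hypothetical, and it assumes the attachment field accepts an object with `_content` and `_content_type`, as the output suggests.

```python
import base64
import json


def build_document(html_text):
    """Assemble a document matching the field names seen in the result.

    The attachment content is base64-encoded; raw_content keeps the
    original html so the tags are still available if needed.
    """
    encoded = base64.b64encode(html_text.encode("utf-8")).decode("ascii")
    return {
        "file": {"_content": encoded, "_content_type": "html"},
        "raw_content": html_text,
    }


doc = build_document("<h1>Topic10</h1><p>jog</p>")
payload = json.dumps(doc)  # body to send in the index request
```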

Solution 2

You need to build an NGram analyzer to index your content and use the standard analyzer for searching.

      "analyzer" : {
        "standard" : {
          "type" : "standard"
        },
        "autocomplete" : {
          "filter" : [ "standard", "lowercase" ],
          "char_filter" : [ "html_strip" ],
          "type" : "custom",
          "tokenizer" : "ngram"
        }
      }

For example:

Input: "brown"

NGram analyzer:

[b], [br], [bro], [brow], [brown]
[r], [ro], [row], [rown]
[o], [ow], [own]
[w], [wn]
[n]

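So when you run an autocomplete search, it will match any of these indexed fragments. It is important, however, to search only with the standard analyzer (returning one page of results), so that the query text itself is not split into fragments that would match any of these random pieces.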
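
To make the fragment list concrete, here is a small plain-Python sketch that enumerates the same n-grams for a single word. It only illustrates the idea and is not how Elasticsearch implements it. In Elasticsearch the tokenizer's `min_gram` and `max_gram` settings bound the fragment sizes, so the exact list you get depends on your configuration. The function name `ngrams` is hypothetical.

```python
def ngrams(text, min_gram=1, max_gram=None):
    """Return every substring of text whose length is in [min_gram, max_gram]."""
    if max_gram is None:
        max_gram = len(text)
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            end = start + size
            if end > len(text):
                break  # longer fragments would run past the end
            grams.append(text[start:end])
    return grams


fragments = ngrams("brown")
# A typed prefix such as "bro" is one of the indexed fragments,
# which is why a partially typed query can match the document.
prefix_matches = "bro" in fragments
```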
