solr - How to extract meta tags from HTML files and index them in Solr with Tika

Question

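I am trying to extract the meta tags from HTML files and index them in Solr using its Tika integration. I cannot get Tika to extract these meta tags, and they do not show up in Solr.

My HTML file looks like this.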

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="product_id" content="11"/>
<meta name="assetid" content="10001"/>
<meta name="title" content="title of the article"/>
<meta name="type" content="0xyzb"/>
<meta name="category" content="article category"/>
<meta name="first" content="details of the article"/>

<h4>title of the article</h4>
<p class="link"><a href="#link">How cite the Article</a></p>
<p class="list">
  <span class="listterm">Length: </span>13 to 15 feet<br>
  <span class="listterm">Height to Top of Head: </span>up to 18 feet<br>
  <span class="listterm">Weight: </span>1,200 to 4,300 pounds<br>
  <span class="listterm">Diet: </span>leaves and branches of trees<br>
  <span class="listterm">Number of Young: </span>1<br>
  <span class="listterm">Home: </span>Sahara<br>
</p>
</p>

My data-config.xml file looks like this:

<dataConfig>
<dataSource name="bin" type="BinFileDataSource" />
    <document>   
    <entity name="f" dataSource="null" rootEntity="false"
        processor="FileListEntityProcessor"
        baseDir="/path/to/html/files/" 
        fileName=".*html|xml" onError="skip"
        recursive="false">

        <field column="fileAbsolutePath" name="path" />
        <field column="fileSize" name="size"/>
        <field column="file" name="filename"/>

        <entity name="tika-test" dataSource="bin" processor="TikaEntityProcessor" 
        url="${f.fileAbsolutePath}" format="text" onError="skip">

        <field column="product_id" name="product_id" meta="true"/>
        <field column="assetid" name="assetid" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="type" name="type" meta="true"/>
        <field column="first" name="first" meta="true"/>
        <field column="category" name="category" meta="true"/>      
        </entity>
    </entity>
</document>
</dataConfig>

In my schema.xml file I added the following fields.

<field name="product_id" type="string" indexed="true" stored="true"/>
<field name="assetid" type="string" indexed="true" stored="true" />
<field name="title" type="string" indexed="true" stored="true"/>
<field name="type" type="string" indexed="true" stored="true"/>
<field name="category" type="string" indexed="true" stored="true"/>
<field name="first" type="text_general" indexed="true" stored="true"/>

In my solrconfig.xml file I added the following code.

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler" />
<lst name="defaults">
  <str name="config">/path/to/data-config.xml</str>
</lst>

Does anyone know how to extract these meta tags from HTML files and index them in Solr with Tika? Any help would be appreciated.

score 1 · Accepted Answer

I don't think meta="true" means what you think it means. It usually refers to things that are about the file rather than the content. So, content-type, etc. Possibly http-equiv will get mapped as well.

Other than that, you need to extract the actual content. You can do that by using format="xml" and then adding an inner entity with XPathEntityProcessor and mapping the paths there. Even then you are limited because, AFAIK, DIH uses DefaultHtmlMapper, which is extremely restrictive in what it lets through: it skips most 'class' and 'id' attributes and even elements like 'div'. You can read the list of allowed elements and attributes yourself in the source code.

Frankly, your easier path is to have a SolrJ client and manage Tika yourself. Then you can set it to use IdentityHtmlMapper which does not muck about with HTML.
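
To illustrate the client-side route, here is a minimal sketch that pulls the name/content pairs out of the question's HTML before anything is sent to Solr. It is not the SolrJ + Tika setup described above: it swaps in Python's standard html.parser for Tika, and the names MetaTagCollector and extract_meta are made up for this example.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed here by HTMLParser.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        # http-equiv metas have no "name" attribute, so they are skipped.
        if name and content is not None:
            self.meta[name] = content


def extract_meta(html_text):
    """Return a dict mapping meta tag names to their content values."""
    collector = MetaTagCollector()
    collector.feed(html_text)
    collector.close()
    return collector.meta


# A shortened copy of the HTML from the question.
sample = """
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="product_id" content="11"/>
<meta name="assetid" content="10001"/>
<meta name="title" content="title of the article"/>
<h4>title of the article</h4>
"""

solr_doc = extract_meta(sample)
print(solr_doc)
```

The resulting dict uses the same keys as the fields declared in the question's schema.xml, so it could be sent to Solr as an ordinary document through whichever client you prefer.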

score 1 · Accepted Answer

Which version of Solr are you using? If you are on Solr 4.0 or later, Tika is already embedded in it. Tika talks to Solr through Solr Cell's ExtractingRequestHandler class, which is configured in solrconfig.xml as follows:

      <!-- Solr Cell Update Request Handler

       http://wiki.apache.org/solr/ExtractingRequestHandler 

    -->
  <requestHandler name="/update/extract" 
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>

      <!-- capture link hrefs but ignore div attributes -->
      <str name="captureAttr">true</str>
      <str name="fmap.a">links</str>
      <str name="fmap.div">ignored_</str>
    </lst>
  </requestHandler>

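By default in Solr, as you can see in the configuration above, any field extracted from the HTML document that is not declared in schema.xml is prefixed with "ignored_", i.e. it is mapped to the "ignored_*" dynamic field in schema.xml. The default schema.xml looks like this: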

       <!-- some trie-coded dynamic fields for faster range queries -->
   <dynamicField name="*_ti" type="tint"    indexed="true"  stored="true"/>
   <dynamicField name="*_tl" type="tlong"   indexed="true"  stored="true"/>
   <dynamicField name="*_tf" type="tfloat"  indexed="true"  stored="true"/>
   <dynamicField name="*_td" type="tdouble" indexed="true"  stored="true"/>
   <dynamicField name="*_tdt" type="tdate"  indexed="true"  stored="true"/>

   <dynamicField name="*_pi"  type="pint"    indexed="true"  stored="true"/>
   <dynamicField name="*_c"   type="currency" indexed="true"  stored="true"/>

   <dynamicField name="ignored_*" type="ignored" multiValued="true"/>
   <dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>

   <dynamicField name="random_*" type="random" />

   <!-- uncomment the following to ignore any fields that don't already match an existing 
        field name or dynamic field, rather than reporting them as an error. 
        alternately, change the type="ignored" to some other type e.g. "text" if you want 
        unknown fields indexed and/or stored by default --> 
   <!--dynamicField name="*" type="ignored" multiValued="true" /-->

 </fields>

Here is how the "ignored" type is handled:

<!-- since fields of this type are by default not stored or indexed,
     any data added to them will be ignored outright.  --> 
<fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />

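So the metadata extracted by Tika is put into "ignored" fields by Solr Cell by default, which is why it is neither indexed nor stored. To index and store the metadata, either change the prefix to uprefix=attr_ or create specific fields or dynamic fields for the metadata you know about, and handle them as needed.

So here is the corrected solrconfig.xml: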

  <!-- Solr Cell Update Request Handler

       http://wiki.apache.org/solr/ExtractingRequestHandler 

    -->
  <requestHandler name="/update/extract" 
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="uprefix">attr_</str>

      <!-- capture link hrefs but ignore div attributes -->
      <str name="captureAttr">true</str>
      <str name="fmap.a">links</str>
      <str name="fmap.div">ignored_</str>
    </lst>
  </requestHandler>
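
The same uprefix value can also be passed per request instead of being fixed in the handler defaults. The sketch below only builds the request URL for /update/extract; the core name "mycore", the document id and the helper names are placeholders invented for this example, and post_html is defined but not called because it needs a running Solr.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(solr_base, core, doc_id, uprefix="attr_"):
    """Build an /update/extract URL that maps unknown fields to uprefix*."""
    params = {
        "literal.id": doc_id,   # unique key for the extracted document
        "uprefix": uprefix,     # prefix for fields not declared in the schema
        "lowernames": "true",   # normalise extracted field names
        "commit": "true",
    }
    return f"{solr_base}/{core}/update/extract?{urlencode(params)}"


def post_html(url, path):
    """POST the raw HTML file as the request body (requires a live Solr)."""
    with open(path, "rb") as handle:
        request = Request(url, data=handle.read(),
                          headers={"Content-Type": "text/html"})
    with urlopen(request) as response:
        return response.status


# "mycore" and "article-10001" are hypothetical values.
url = build_extract_url("http://localhost:8983/solr", "mycore", "article-10001")
print(url)
```

Parameters given on the request override the handler's defaults, so this is a convenient way to try attr_ before editing solrconfig.xml.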

score 0 · Accepted Answer

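This is an older question, but here is my answer.

I recently asked a similar question (no replies or comments after several days), which I then worked out myself and which is relevant to this question.
Solr has changed a lot over the years, and the existing documentation on this topic (where it exists) is both confusing and sometimes wrong.
Although long, this reply provides a solution to the problem, with examples and documentation.

In short, my now-deleted StackOverflow question was "Extracting custom (e.g. <my_id></my_id>) tagged text from HTML using Apache Solr". Ancillary to that task is how to index an HTML page, including custom HTML elements and attributes.

The short answer is that while indexing "standard" HTML elements (a; div; h1; h2; li; meta; p; title; ... https://www.w3.org/TR/2005/WD-xhtml2-20050527/elements.html) is straightforward, it is difficult to include custom tag sets without either strictly using well-formed XML files and Solr's update facility (see, e.g., https://lucene.apache.org/solr/guide/6_6/uploading-data-with-index-handlers.html#uploading-data-with-index-handlers), or using Apache Tika's captureAttr parameter via the ExtractingRequestHandler (described below), or other tools such as Apache Nutch.

Standard HTML elements such as <title>Solr HTML Indexing Tests</title> are easy to index; however, non-standard elements such as <my_id>bt-ic8eew2u</my_id> are ignored.

While you could apply an XML-based solution such as <field name="my_id">bt-ic8eew2u</field>, I prefer a simple HTML-based solution; hence the HTML metadata approach.

Environment: Arch Linux (x86_64) command line; Apache Solr 8.7.0; Solr Admin UI (http://localhost:8983/solr/#/gettingstarted/query) in Firefox 83.0

Test file (solr_test9.html):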

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-us">
<head>
  <meta charset="UTF-8" />
  <title>Solr HTML Indexing Tests</title>
  <meta name="date_created" content="2019-11-01" />
  <meta name="source_url" content="/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html" />
  <!-- <my_id>bt-ic8eew2u</my_id> -->
  <meta name="doc_id" content="bt-ic8eeW2U" />
  <meta name="date_pub" content="2020-11-16" />
</head>

<body>
<h1>Apples</h1>
<p>I like apples.</p>

<h2>Bananas</h2>
<p>I also like bananas.</p>

<p><div id="div1">This text is located in div element 1.</div></p>
<p><div id="div2">This text is located in div element 2.</div></p>

<br/>
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
<br/>

<p>Suspendisse efficitur pulvinar elementum.</p>

<p>My website is <a href="https://buriedtruth.com/">BuriedTruth.com</a>.</p>

<h1>Nova Scotia</h1>
<p>Nova Scotia is a province on the east coast of Canada.</p>

<h2>Capital of Nova Scotia</h2>
<p>Halifax is the capital of N.S.</p>
<p>Halifax is also N.S.'s largest city.</p>

<h1>British Columbia</h1>
<h2>Capital of British Columbia</h2>
<p>Victoria is the capital of B.C.</p>
<p>Vancouver is the largest city in B.C., however.</p>

<p>Non-terminated sentence (missing period)</p>

<meta name="date_current" content="2020-11-17" />
<!-- Comments like these are not indexed. -->
<p>Current date: 2020-11-17</p>

</body>
</html>

solrconfig.xml

Here are the relevant additions to my solrconfig.xml file.

  <!-- SOLR CELL PLUGINS: -->
  <lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />

  <!-- https://lucene.472066.n3.nabble.com/Prons-an-Cons-of-Startup-Lazy-a-Handler-td4059111.html -->
  <requestHandler name="/update/extract"
    class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>
      <str name="capture">div</str>
      <str name="fmap.div">div</str>
      <str name="capture">h1</str>
      <str name="fmap.h1">h1</str>
      <str name="capture">h2</str>
      <str name="fmap.h2">h2_t</str>
      <str name="capture">p</str>
      <!-- <str name="fmap.p">p_t</str> -->
      <str name="fmap.p">p</str>
      <!-- COMMENT: note that the entries above refer to standard -->
      <!-- HTML elements.  As long as you have <meta/> (metadata) -->
      <!-- entries ("doc_id", "date_pub" ...) in your schema then -->
      <!-- Solr will automatically pick them up when indexing ... -->
      <!-- (hence no need to include those, here!).               -->
    </lst>
  </requestHandler>

  <!-- https://doc.lucidworks.com/fusion-server/5.2/reference/solr-reference-guide/7.7.2/update-request-processors.html -->
  <!-- The update.autoCreateFields property can be turned to false to disable schemaless mode -->
  <updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:true}"
           processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.DistributedUpdateProcessorFactory"/>
    <!-- ======================================== -->
    <!-- https://lucene.apache.org/solr/7_4_0/solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html -->
    <processor class="solr.RegexReplaceProcessorFactory">
      <str name="fieldName">content</str>
      <str name="fieldName">title</str>
      <str name="fieldName">p</str>
      <!-- Case-sensitive; only one pattern/replacement pair is allowed per -->
      <!-- processor, so use as many copies of this processor as needed. -->
      <str name="pattern">\s+</str>
      <str name="replacement"> </str>
      <bool name="literalReplacement">true</bool>
    </processor>

    <!-- Solr bug? URLs parse as "rect https..."  Managed-schema (Admin UI): defined p as text_general -->
    <!-- but did not parse. Looking at content | title: text_general copied to string, so added  -->
    <!-- copyfield of p (text_general) as p_str ... regex below now works! -->
    <!-- https://stackoverflow.com/questions/22178700/solr-extractingrequesthandler-extracting-rect-in-links/64882751#64882751 -->
      <processor class="solr.RegexReplaceProcessorFactory">
      <str name="fieldName">content</str>
      <str name="fieldName">title</str>
      <str name="fieldName">p</str>
      <!-- Case-sensitive; only one pattern/replacement pair is allowed per -->
      <!-- processor, so use as many copies of this processor as needed. -->
      <str name="pattern">rect http</str>
      <str name="replacement">http</str>
      <bool name="literalReplacement">true</bool>
    </processor>
    <!-- ======================================== -->
    <!-- This needs to be last (may need to clear documents and re-index to see changes, e.g. Solr Admin UI): -->
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>
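
To make the effect of the two RegexReplaceProcessorFactory entries above concrete, here is a small illustration of the same two substitutions applied in order. It mimics the patterns with Python's re module and is not Solr's implementation; the function names and the sample string are made up for this example.

```python
import re


def regex_replace(value, pattern, replacement):
    """Replace every match of pattern with replacement, taken literally."""
    # A callable replacement is never scanned for backreferences, which
    # mirrors what literalReplacement=true asks for in the config above.
    return re.sub(pattern, lambda _match: replacement, value)


def clean(value):
    """Apply the two substitutions in the same order as the chain."""
    value = regex_replace(value, r"\s+", " ")          # collapse whitespace runs
    value = regex_replace(value, r"rect http", "http")  # drop the stray "rect"
    return value


raw = "My website is\n   rect https://buriedtruth.com/  BuriedTruth.com ."
print(clean(raw))
```

Order matters here: the whitespace pass runs first, so "rect" and "http" are separated by exactly one space by the time the second pattern is applied.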

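Managed schema (schema.xml):

I edited the Solr schema through the Admin UI. Basically, for any HTML metadata you want to index, add a similarly named field (with an appropriate type, e.g. text_general | string | pdate | ...).

For example, to capture the "doc_id" and "date_pub" metadata, I created the following (respective) schema entries: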

<field name="doc_id" type="string" uninvertible="true" indexed="true" stored="true"/>
<field name="date_pub" type="pdate" uninvertible="true" indexed="true" stored="true"/>
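
As an alternative to clicking through the Admin UI, fields like these can be added with Solr's Schema API by POSTing an add-field command as JSON to the collection's /schema endpoint. The sketch below only builds that JSON body; the helper name add_field_command is invented for this example, and the uninvertible attribute shown above is left out for brevity.

```python
import json


def add_field_command(fields):
    """Build a Schema API 'add-field' body for (name, type) pairs."""
    return json.dumps({
        "add-field": [
            {"name": name, "type": field_type, "indexed": True, "stored": True}
            for name, field_type in fields
        ]
    })


# The two fields from this answer; send the body with
# Content-Type: application/json to
# http://localhost:8983/solr/gettingstarted/schema
payload = add_field_command([("doc_id", "string"), ("date_pub", "pdate")])
print(payload)
```

This only works while the collection uses the managed schema; with a hand-edited schema.xml the fields have to be added in the file itself.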

Indexing

Here is how I indexed that HTML test file:

[victoria@victoria solr-8.7.0]$ date; pwd; ls -l; echo; ls -l server/solr/gettingstarted/conf/

Tue Nov 17 02:18:12 PM PST 2020

/mnt/Vancouver/apps/solr/solr-8.7.0
total 1792
drwxr-xr-x  3 victoria victoria   4096 Nov 17 13:26 bin
-rw-r--r--  1 victoria victoria 946955 Oct 28 02:40 CHANGES.txt
drwxr-xr-x 12 victoria victoria   4096 Oct 29 07:09 contrib
drwxr-xr-x  4 victoria victoria   4096 Nov 15 12:33 dist
drwxr-xr-x  3 victoria victoria   4096 Nov 15 12:33 docs
drwxr-xr-x  6 victoria victoria   4096 Oct 28 02:40 example
drwxr-xr-x  2 victoria victoria  36864 Oct 28 02:40 licenses
-rw-r--r--  1 victoria victoria  12646 Oct 28 02:21 LICENSE.txt
-rw-r--r--  1 victoria victoria 766662 Oct 28 02:40 LUCENE_CHANGES.txt
-rw-r--r--  1 victoria victoria  27540 Oct 28 02:21 NOTICE.txt
-rw-r--r--  1 victoria victoria   7490 Oct 28 02:40 README.txt
drwxr-xr-x 11 victoria victoria   4096 Nov 15 12:40 server

total 208
drwxr-xr-x 2 victoria victoria  4096 Oct 28 02:21 lang
-rw-r--r-- 1 victoria victoria 33888 Nov 17 13:20 managed-schema
-rw-r--r-- 1 victoria victoria   873 Oct 28 02:21 protwords.txt
-rw-r--r-- 1 victoria victoria 33788 Nov 17 11:36 schema.xml.2020-11-17.13:01
-rw-r--r-- 1 victoria victoria 59248 Nov 17 13:16 solrconfig.xml
-rw-r--r-- 1 victoria victoria 59151 Nov 17 12:59 solrconfig.xml.2020-11-17.13:01
-rw-r--r-- 1 victoria victoria   781 Oct 28 02:21 stopwords.txt
-rw-r--r-- 1 victoria victoria  1124 Oct 28 02:21 synonyms.txt

[victoria@victoria solr-8.7.0]$ solr restart; sleep 1; post -c gettingstarted /mnt/Vancouver/programming/datasci/solr/test/solr_test9.html

Sending stop command to Solr running on port 8983 ... waiting up to 180 seconds to allow Jetty process 3511453 to stop gracefully.
Waiting up to 180 seconds to see Solr running on port 8983 [|]  
Started Solr server on port 8983 (pid=3572520). Happy searching!

/usr/lib/jvm/java-8-openjdk/jre//bin/java -classpath /mnt/Vancouver/apps/solr/solr-8.7.0/dist/solr-core-8.7.0.jar -Dauto=yes -Dc=gettingstarted -Ddata=files org.apache.solr.util.SimplePostTool /mnt/Vancouver/programming/datasci/solr/test/solr_test9.html
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file solr_test9.html (text/html) to [base]/extract
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update...
Time spent: 0:00:00.755

[victoria@victoria solr-8.7.0]$

...and here is the result (Solr Admin UI: http://localhost:8983/solr/#/gettingstarted/query):

http://localhost:8983/solr/gettingstarted/select?q=*%3A*

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"*:*",
      "_":"1605651674401"}},
  "response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
      {
        "id":"/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html",
        "stream_size":[1428],
        "x_parsed_by":["org.apache.tika.parser.DefaultParser",
          "org.apache.tika.parser.html.HtmlParser"],
        "stream_content_type":["text/html"],
        "date_created":"2019-11-01T00:00:00Z",
        "date_current":["2020-11-17"],
        "resourcename":["/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html"],
        "title":["Solr HTML Indexing Tests"],
        "date_pub":"2020-11-16T00:00:00Z",
        "doc_id":"bt-ic8eeW2U",
        "source_url":"/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html",
        "dc_title":["Solr HTML Indexing Tests"],
        "content_encoding":["UTF-8"],
        "content_type":["application/xhtml+xml; charset=UTF-8"],
        "content":[" en-us stream_size 1428 X-Parsed-By org.apache.tika.parser.DefaultParser X-Parsed-By org.apache.tika.parser.html.HtmlParser stream_content_type text/html date_created 2019-11-01 resourceName /mnt/Vancouver/programming/datasci/solr/test/solr_test9.html date_pub 2020-11-16 doc_id bt-ic8eeW2U source_url /mnt/Vancouver/programming/datasci/solr/test/solr_test9.html dc:title Solr HTML Indexing Tests Content-Encoding UTF-8 Content-Language en-us Content-Type application/xhtml+xml; charset=UTF-8 Solr HTML Indexing Tests Lorem ipsum dolor sit amet, consectetur adipiscing elit. "],
        "div":[" div1 This text is located in div element 1. div2 This text is located in div element 2."],
        "p":[" I like apples. I also like bananas. Suspendisse efficitur pulvinar elementum. My website is https://buriedtruth.com/ BuriedTruth.com . Nova Scotia is a province on the east coast of Canada. Halifax is the capital of N.S. Halifax is also N.S.'s largest city. Victoria is the capital of B.C. Vancouver is the largest city in B.C., however. Non-terminated sentence (missing period) Current date: 2020-11-17"],
        "h1":[" Apples Nova Scotia British Columbia"],
        "h2_t":" Bananas Capital of Nova Scotia Capital of British Columbia",
        "_version_":1683647678197530624}]
  }}

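Update: peculiarities of switching from managed-schema to schema.xml

Although unrelated to the original question, the following relates to my answer (above), specifically to the peculiarities of switching from Solr's managed-schema to the classic (user-managed) schema.xml. It is included here to give a complete solution.

First, add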

<schemaFactory class="ClassicIndexSchemaFactory"/>

to your solrconfig.xml file.

Then edit this:

<updateRequestProcessorChain
  name="add-unknown-fields-to-the-schema"
  default="${update.autoCreateFields:true}"
  processor="uuid,remove-blank,field-name-mutating,parse-boolean,
             parse-long,parse-double,parse-date,add-schema-fields">

...to this:

<updateRequestProcessorChain
  processor="uuid,remove-blank,field-name-mutating,parse-boolean,
             parse-long,parse-double,parse-date">

i.e., delete

  name="add-unknown-fields-to-the-schema"
  default="${update.autoCreateFields:true}"
  add-schema-fields

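Rename managed-schema to schema.xml, then restart Solr or reload the core for the changes to take effect.

To extend my example (above) further, here is a sample <updateRequestProcessorChain /> and its output, applied to the HTML code I provided (above).

solrconfig.xml (partial):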

<updateRequestProcessorChain
  processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date">
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content</str>
    <str name="fieldName">title</str>
    <str name="fieldName">p</str>
    <!-- Case-sensitive; only one pattern/replacement pair is allowed per -->
    <!-- processor, so use as many copies of this processor as needed. -->
    <str name="pattern">\s+</str>
    <str name="replacement"> </str>
    <bool name="literalReplacement">true</bool>
  </processor>

  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content</str>
    <str name="fieldName">title</str>
    <str name="fieldName">p</str>
    <!-- Case-sensitive; only one pattern/replacement pair is allowed per -->
    <!-- processor, so use as many copies of this processor as needed. -->
    <str name="pattern">rect http</str>
    <str name="replacement">http</str>
    <bool name="literalReplacement">true</bool>
  </processor>

  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content</str>
    <str name="fieldName">title</str>
    <str name="pattern">[sS]olr</str>
    <str name="replacement">APPLE</str>
    <bool name="literalReplacement">true</bool>
  </processor>

  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content</str>
    <str name="fieldName">title</str>
    <str name="pattern">HTML</str>
    <str name="replacement">BANANA</str>
    <bool name="literalReplacement">true</bool>
  </processor>

  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Output

Notes:
- Changes to "title" (Solr >> APPLE; HTML >> BANANA)
- "rect" removed from the URL in "p" (discussed here: Solr ExtractingRequestHandler extracting "rect" in links)

{
  "responseHeader":{
    "status":0,
    "QTime":32,
    "params":{
      "q":"*:*",
      "_":"1605767164812"}},
  "response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
      {
        "id":"/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html",
        "stream_size":[1628],
        "x_parsed_by":["org.apache.tika.parser.DefaultParser",
          "org.apache.tika.parser.html.HtmlParser"],
        "stream_content_type":["text/html"],
        "date_created":"2020-11-11T21:36:38Z",
        "date_current":["2020-11-17"],
        "resourcename":["/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html"],
        "title":["APPLE BANANA Indexing Tests"],
        "date_pub":"2020-11-16T21:37:18Z",
        "doc_id":"bt-ic8eeW2U",
        "source_url":"/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html",
        "dc_title":["Solr HTML Indexing Tests"],
        "content_encoding":["UTF-8"],
        "content_type":["application/xhtml+xml; charset=UTF-8"],
        "content":[" en-us stream_size 1628 X-Parsed-By org.apache.tika.parser.DefaultParser X-Parsed-By org.apache.tika.parser.html.HtmlParser stream_content_type text/html date_created 2020-11-11T21:36:38Z resourceName /mnt/Vancouver/programming/datasci/APPLE/test/APPLE_test9.html date_pub 2020-11-16T21:37:18Z doc_id bt-ic8eeW2U source_url /mnt/Vancouver/programming/datasci/APPLE/test/APPLE_test9.html dc:title APPLE BANANA Indexing Tests Content-Encoding UTF-8 Content-Language en-us Content-Type application/xhtml+xml; charset=UTF-8 APPLE BANANA Indexing Tests Lorem ipsum dolor sit amet, consectetur adipiscing elit. "],
        "div":[" div1 This text is located in div element 1. div2 This text is located in div element 2. apple This text is located in the \"apple\" (class) div element. banana This text is located in the \"banana\" (class) div element."],
        "p":[" I like apples. I also like bananas. Suspendisse efficitur pulvinar elementum. My website is https://buriedtruth.com/ BuriedTruth.com . Nova Scotia is a province on the east coast of Canada. Halifax is the capital of N.S. Halifax is also N.S.'s largest city. Victoria is the capital of B.C. Vancouver is the largest city in B.C., however. Non-terminated sentence (missing period) Current date: 2020-11-17"],
        "h1":[" Apples Nova Scotia British Columbia"],
        "h2_t":" Bananas Capital of Nova Scotia Capital of British Columbia",
        "_version_":1683814668971278336}]
  }}
