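Although this is an older question, I am answering because I recently asked a similar question (which drew no replies or comments after several days); I have since worked out a solution, and it is relevant to this question.
Solr has changed a lot over the years, and the existing documentation on this topic (where it exists) is confusing and sometimes wrong.
Although long, this reply provides a solution to the question, with an example and documentation.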
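In brief, my (now deleted) StackOverflow question was "Extracting custom (e.g. <my_id></my_id>) tagged text from HTML using Apache Solr." Ancillary to that task is how to index an HTML page, including custom HTML elements and attributes.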
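The short answer is that while it is easy to index "standard" HTML elements (a; div; h1; h2; li; meta; p; title; ... https://www.w3.org/TR/2005/WD-xhtml2-20050527/elements.html), it is difficult to include custom tag sets without either strictly using well-formed XML files and the update facility in Solr (see, e.g., https://lucene.apache.org/solr/guide/6_6/uploading-data-with-index-handlers.html#uploading-data-with-index-handlers), or using the captureAttr parameter of the ExtractingRequestHandler (Solr's wrapper around Apache Tika; described below), or using other tools such as Apache Nutch.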
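Standard HTML elements such as <title>Solr HTML Indexing Tests</title> are easily indexed; however, nonstandard elements like <my_id>bt-ic8eew2u</my_id> are ignored.
While you could apply an XML-based solution such as <field name="my_id">bt-ic8eew2u</field>, I prefer a simple HTML-based solution: hence, the HTML metadata approach.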
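To make the metadata approach concrete, here is a minimal Python sketch of the idea it relies on: values carried in <meta name="..." content="..."/> tags are plain name/content pairs that a standard HTML parser exposes, whereas a custom element hidden in a comment is not. This is not Tika and not what Solr runs internally; the names MetaCollector, extract_meta and SAMPLE are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags (illustration only)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed here by HTMLParser.
        if tag != "meta":
            return
        attributes = dict(attrs)
        # <meta charset="UTF-8" /> has no name/content pair and is skipped.
        if "name" in attributes and "content" in attributes:
            self.meta[attributes["name"]] = attributes["content"]


def extract_meta(html_text):
    """Return a dict of meta name -> content found in html_text."""
    collector = MetaCollector()
    collector.feed(html_text)
    return collector.meta


# A trimmed copy of the <head> of the test file shown in this answer.
SAMPLE = """
<head>
<meta charset="UTF-8" />
<title>Solr HTML Indexing Tests</title>
<!-- <my_id>bt-ic8eew2u</my_id> -->
<meta name="doc_id" content="bt-ic8eeW2U" />
<meta name="date_pub" content="2020-11-16" />
</head>
"""

if __name__ == "__main__":
    print(extract_meta(SAMPLE))
```

The commented-out <my_id> element never reaches the result, while doc_id and date_pub do, which is the behaviour the rest of this answer builds on.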
Environment: Arch Linux (x86_64) command line; Apache Solr 8.7.0; Solr Admin UI (http://localhost:8983/solr/#/gettingstarted/query) in Firefox 83.0
Test file (solr_test9.html):
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-us">
<head>
<meta charset="UTF-8" />
<title>Solr HTML Indexing Tests</title>
<meta name="date_created" content="2019-11-01" />
<meta name="source_url" content="/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html" />
<!-- <my_id>bt-ic8eew2u</my_id> -->
<meta name="doc_id" content="bt-ic8eeW2U" />
<meta name="date_pub" content="2020-11-16" />
</head>
<body>
<h1>Apples</h1>
<p>I like apples.</p>
<h2>Bananas</h2>
<p>I also like bananas.</p>
<p><div id="div1">This text is located in div element 1.</div></p>
<p><div id="div2">This text is located in div element 2.</div></p>
<br/>
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
<br/>
<p>Suspendisse efficitur pulvinar elementum.</p>
<p>My website is <a href="https://buriedtruth.com/">BuriedTruth.com</a>.</p>
<h1>Nova Scotia</h1>
<p>Nova Scotia is a province on the east coast of Canada.</p>
<h2>Capital of Nova Scotia</h2>
<p>Halifax is the capital of N.S.</p>
<p>Halifax is also N.S.'s largest city.</p>
<h1>British Columbia</h1>
<h2>Capital of British Columbia</h2>
<p>Victoria is the capital of B.C.</p>
<p>Vancouver is the largest city in B.C., however.</p>
<p>Non-terminated sentence (missing period)</p>
<meta name="date_current" content="2020-11-17" />
<!-- Comments like these are not indexed. -->
<p>Current date: 2020-11-17</p>
</body>
</html>
solrconfig.xml
Here are the relevant additions to my solrconfig.xml file.
<!-- SOLR CELL PLUGINS: -->
<lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />
<!-- https://lucene.472066.n3.nabble.com/Prons-an-Cons-of-Startup-Lazy-a-Handler-td4059111.html -->
<requestHandler name="/update/extract"
class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<str name="capture">div</str>
<str name="fmap.div">div</str>
<str name="capture">h1</str>
<str name="fmap.h1">h1</str>
<str name="capture">h2</str>
<str name="fmap.h2">h2_t</str>
<str name="capture">p</str>
<!-- <str name="fmap.p">p_t</str> -->
<str name="fmap.p">p</str>
<!-- NOTE: the entries above refer to standard HTML elements. -->
<!-- As long as your schema has fields matching the <meta/> -->
<!-- (metadata) names ("doc_id", "date_pub", ...), Solr picks -->
<!-- them up automatically when indexing, so there is no need -->
<!-- to list them here. -->
</lst>
</requestHandler>
<!-- https://doc.lucidworks.com/fusion-server/5.2/reference/solr-reference-guide/7.7.2/update-request-processors.html -->
<!-- The update.autoCreateFields property can be turned to false to disable schemaless mode -->
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:true}"
processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.DistributedUpdateProcessorFactory"/>
<!-- ======================================== -->
<!-- https://lucene.apache.org/solr/7_4_0/solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html -->
<processor class="solr.RegexReplaceProcessorFactory">
<str name="fieldName">content</str>
<str name="fieldName">title</str>
<str name="fieldName">p</str>
<!-- Case-sensitive. Only one pattern/replacement pair is allowed per -->
<!-- processor, so add as many copies of this processor as needed. -->
<str name="pattern">\s+</str>
<str name="replacement"> </str>
<bool name="literalReplacement">true</bool>
</processor>
<!-- Solr bug? URLs parse as "rect https..." Managed-schema (Admin UI): defined p as text_general -->
<!-- but did not parse. Looking at content | title: text_general copied to string, so added -->
<!-- copyfield of p (text_general) as p_str ... regex below now works! -->
<!-- https://stackoverflow.com/questions/22178700/solr-extractingrequesthandler-extracting-rect-in-links/64882751#64882751 -->
<processor class="solr.RegexReplaceProcessorFactory">
<str name="fieldName">content</str>
<str name="fieldName">title</str>
<str name="fieldName">p</str>
<!-- Case-sensitive. Only one pattern/replacement pair is allowed per -->
<!-- processor, so add as many copies of this processor as needed. -->
<str name="pattern">rect http</str>
<str name="replacement">http</str>
<bool name="literalReplacement">true</bool>
</processor>
<!-- ======================================== -->
<!-- This needs to be last (may need to clear documents and re-index to see changes, e.g. Solr Admin UI): -->
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
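As a rough way to see what the two RegexReplaceProcessorFactory entries above do to a field value, here is a small Python sketch that applies the same two patterns in the same order. It only approximates the behaviour (Solr uses Java regular expressions, and the input string below is a made-up example of the "rect https..." artifact mentioned in the comments); REPLACEMENTS and clean are hypothetical names.

```python
import re

# (pattern, replacement) pairs copied from the two processors above.
REPLACEMENTS = [
    (r"\s+", " "),            # collapse runs of whitespace to one space
    (r"rect http", "http"),   # drop the stray "rect " prefix before URLs
]


def clean(text):
    """Apply each replacement in order, treating the replacement literally."""
    for pattern, replacement in REPLACEMENTS:
        # A function replacement is never interpreted for backreferences,
        # which is similar in spirit to literalReplacement=true.
        text = re.sub(pattern, lambda _match, r=replacement: r, text)
    return text


if __name__ == "__main__":
    raw = "My website is\n   rect https://buriedtruth.com/   BuriedTruth.com"
    print(clean(raw))
```

Order matters here: collapsing whitespace first means a "rect" and "http" separated by a newline or several spaces still match the second pattern.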
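Managed schema (schema.xml):
I edited the Solr schema via the Admin UI. Basically, for any HTML metadata that you want to index, add a similarly named field of the appropriate type (e.g., text_general | string | pdate | ...).
For example, to capture the "doc_id" and "date_pub" metadata, I created the following (respective) schema entries: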
<field name="doc_id" type="string" uninvertible="true" indexed="true" stored="true"/>
<field name="date_pub" type="pdate" uninvertible="true" indexed="true" stored="true"/>
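I added these fields through the Admin UI, but the same two definitions could probably also be sent to Solr's Schema API as an "add-field" command. The sketch below only builds the request; the URL assumes the gettingstarted collection used in this answer, the helper names are hypothetical, and I have not run this exact request against my own setup.

```python
import json
import urllib.request

# Assumed endpoint: Schema API of the collection used in this answer.
SCHEMA_URL = "http://localhost:8983/solr/gettingstarted/schema"


def field(name, field_type):
    """Mirror the attributes of the <field/> entries shown above."""
    return {
        "name": name,
        "type": field_type,
        "uninvertible": True,
        "indexed": True,
        "stored": True,
    }


def build_add_field_request(fields, url=SCHEMA_URL):
    """Build (but do not send) a POST request carrying an add-field command."""
    body = json.dumps({"add-field": fields}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


REQUEST = build_add_field_request(
    [field("doc_id", "string"), field("date_pub", "pdate")]
)

if __name__ == "__main__":
    # To send it to a running Solr: urllib.request.urlopen(REQUEST)
    print(REQUEST.full_url)
    print(REQUEST.data.decode("utf-8"))
```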
Indexing
Here is how I indexed that HTML test file:
[victoria@victoria solr-8.7.0]$ date; pwd; ls -l; echo; ls -l server/solr/gettingstarted/conf/
Tue Nov 17 02:18:12 PM PST 2020
/mnt/Vancouver/apps/solr/solr-8.7.0
total 1792
drwxr-xr-x 3 victoria victoria 4096 Nov 17 13:26 bin
-rw-r--r-- 1 victoria victoria 946955 Oct 28 02:40 CHANGES.txt
drwxr-xr-x 12 victoria victoria 4096 Oct 29 07:09 contrib
drwxr-xr-x 4 victoria victoria 4096 Nov 15 12:33 dist
drwxr-xr-x 3 victoria victoria 4096 Nov 15 12:33 docs
drwxr-xr-x 6 victoria victoria 4096 Oct 28 02:40 example
drwxr-xr-x 2 victoria victoria 36864 Oct 28 02:40 licenses
-rw-r--r-- 1 victoria victoria 12646 Oct 28 02:21 LICENSE.txt
-rw-r--r-- 1 victoria victoria 766662 Oct 28 02:40 LUCENE_CHANGES.txt
-rw-r--r-- 1 victoria victoria 27540 Oct 28 02:21 NOTICE.txt
-rw-r--r-- 1 victoria victoria 7490 Oct 28 02:40 README.txt
drwxr-xr-x 11 victoria victoria 4096 Nov 15 12:40 server
total 208
drwxr-xr-x 2 victoria victoria 4096 Oct 28 02:21 lang
-rw-r--r-- 1 victoria victoria 33888 Nov 17 13:20 managed-schema
-rw-r--r-- 1 victoria victoria 873 Oct 28 02:21 protwords.txt
-rw-r--r-- 1 victoria victoria 33788 Nov 17 11:36 schema.xml.2020-11-17.13:01
-rw-r--r-- 1 victoria victoria 59248 Nov 17 13:16 solrconfig.xml
-rw-r--r-- 1 victoria victoria 59151 Nov 17 12:59 solrconfig.xml.2020-11-17.13:01
-rw-r--r-- 1 victoria victoria 781 Oct 28 02:21 stopwords.txt
-rw-r--r-- 1 victoria victoria 1124 Oct 28 02:21 synonyms.txt
[victoria@victoria solr-8.7.0]$ solr restart; sleep 1; post -c gettingstarted /mnt/Vancouver/programming/datasci/solr/test/solr_test9.html
Sending stop command to Solr running on port 8983 ... waiting up to 180 seconds to allow Jetty process 3511453 to stop gracefully.
Waiting up to 180 seconds to see Solr running on port 8983 [|]
Started Solr server on port 8983 (pid=3572520). Happy searching!
/usr/lib/jvm/java-8-openjdk/jre//bin/java -classpath /mnt/Vancouver/apps/solr/solr-8.7.0/dist/solr-core-8.7.0.jar -Dauto=yes -Dc=gettingstarted -Ddata=files org.apache.solr.util.SimplePostTool /mnt/Vancouver/programming/datasci/solr/test/solr_test9.html
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file solr_test9.html (text/html) to [base]/extract
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update...
Time spent: 0:00:00.755
[victoria@victoria solr-8.7.0]$
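The post tool above sends the file to [base]/extract and relies on the defaults configured in solrconfig.xml. If you would rather pass the Solr Cell parameters explicitly on the request, a URL along these lines should be equivalent; this sketch only builds the URL (the file itself would be the POST body), and build_extract_url is a hypothetical helper. The literal.id value is an assumption chosen to match the "id" seen in my results.

```python
from urllib.parse import urlencode, parse_qsl, urlsplit

# Assumed base URL: the gettingstarted collection used in this answer.
BASE_URL = "http://localhost:8983/solr/gettingstarted"


def build_extract_url(file_path, base=BASE_URL):
    """Build an /update/extract URL mirroring the handler defaults above."""
    params = [
        ("literal.id", file_path),   # assumption: use the path as document id
        ("lowernames", "true"),
        ("uprefix", "ignored_"),
        ("capture", "div"), ("fmap.div", "div"),
        ("capture", "h1"), ("fmap.h1", "h1"),
        ("capture", "h2"), ("fmap.h2", "h2_t"),
        ("capture", "p"), ("fmap.p", "p"),
        ("commit", "true"),
    ]
    # A list of tuples keeps repeated keys (capture) and their order.
    return f"{base}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    print(build_extract_url(
        "/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html"
    ))
```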
... and here is the result (Solr Admin UI: http://localhost:8983/solr/#/gettingstarted/query):
http://localhost:8983/solr/gettingstarted/select?q=*%3A*
{
"responseHeader":{
"status":0,
"QTime":0,
"params":{
"q":"*:*",
"_":"1605651674401"}},
"response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
{
"id":"/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html",
"stream_size":[1428],
"x_parsed_by":["org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.html.HtmlParser"],
"stream_content_type":["text/html"],
"date_created":"2019-11-01T00:00:00Z",
"date_current":["2020-11-17"],
"resourcename":["/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html"],
"title":["Solr HTML Indexing Tests"],
"date_pub":"2020-11-16T00:00:00Z",
"doc_id":"bt-ic8eeW2U",
"source_url":"/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html",
"dc_title":["Solr HTML Indexing Tests"],
"content_encoding":["UTF-8"],
"content_type":["application/xhtml+xml; charset=UTF-8"],
"content":[" en-us stream_size 1428 X-Parsed-By org.apache.tika.parser.DefaultParser X-Parsed-By org.apache.tika.parser.html.HtmlParser stream_content_type text/html date_created 2019-11-01 resourceName /mnt/Vancouver/programming/datasci/solr/test/solr_test9.html date_pub 2020-11-16 doc_id bt-ic8eeW2U source_url /mnt/Vancouver/programming/datasci/solr/test/solr_test9.html dc:title Solr HTML Indexing Tests Content-Encoding UTF-8 Content-Language en-us Content-Type application/xhtml+xml; charset=UTF-8 Solr HTML Indexing Tests Lorem ipsum dolor sit amet, consectetur adipiscing elit. "],
"div":[" div1 This text is located in div element 1. div2 This text is located in div element 2."],
"p":[" I like apples. I also like bananas. Suspendisse efficitur pulvinar elementum. My website is https://buriedtruth.com/ BuriedTruth.com . Nova Scotia is a province on the east coast of Canada. Halifax is the capital of N.S. Halifax is also N.S.'s largest city. Victoria is the capital of B.C. Vancouver is the largest city in B.C., however. Non-terminated sentence (missing period) Current date: 2020-11-17"],
"h1":[" Apples Nova Scotia British Columbia"],
"h2_t":" Bananas Capital of Nova Scotia Capital of British Columbia",
"_version_":1683647678197530624}]
}}
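Note that in the response above some fields come back as single values (doc_id, date_pub) while others come back as lists (date_current, title). If you consume the JSON programmatically, a small helper that tolerates both shapes is handy. This is a sketch over a trimmed copy of the response; first_value and summarize are hypothetical names.

```python
import json

# Trimmed copy of the response shown above (only a few fields kept).
RESPONSE_EXCERPT = """
{"response": {"numFound": 1, "start": 0, "docs": [
  {"id": "/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html",
   "date_created": "2019-11-01T00:00:00Z",
   "date_current": ["2020-11-17"],
   "title": ["Solr HTML Indexing Tests"],
   "date_pub": "2020-11-16T00:00:00Z",
   "doc_id": "bt-ic8eeW2U"}]}}
"""


def first_value(doc, field_name):
    """Return a field's value, unwrapping single-element lists."""
    value = doc.get(field_name)
    if isinstance(value, list):
        return value[0] if value else None
    return value


def summarize(response_text, field_names):
    """Return one dict per document with the requested fields flattened."""
    docs = json.loads(response_text)["response"]["docs"]
    return [{name: first_value(doc, name) for name in field_names} for doc in docs]


if __name__ == "__main__":
    print(summarize(RESPONSE_EXCERPT, ["doc_id", "date_pub", "date_current", "title"]))
```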
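Update: managed-schema >> schema.xml idiosyncrasies
Although tangential to the original question, the following is germane to my answer (above): specifically, the idiosyncrasies of switching from the Solr managed-schema to the classic (user-managed) schema.xml. It is included here to provide a complete solution.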
First, add
<schemaFactory class="ClassicIndexSchemaFactory"/>
to your solrconfig.xml file.
Then edit this:
<updateRequestProcessorChain
name="add-unknown-fields-to-the-schema"
default="${update.autoCreateFields:true}"
processor="uuid,remove-blank,field-name-mutating,parse-boolean,
parse-long,parse-double,parse-date,add-schema-fields">
... to this:
<updateRequestProcessorChain
processor="uuid,remove-blank,field-name-mutating,parse-boolean,
parse-long,parse-double,parse-date">
That is, delete:
name="add-unknown-fields-to-the-schema"
default="${update.autoCreateFields:true}"
add-schema-fields
Rename managed-schema to schema.xml, then restart Solr or reload the core for the changes to take effect.
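The rename step can be scripted if you prefer. Here is a cautious Python sketch that keeps a backup copy before renaming and refuses to overwrite an existing schema.xml. The backup file name and the function name are hypothetical, and the conf path in the comment is simply the one from my own directory listing; you still need to restart Solr or reload the core afterwards.

```python
import shutil
import tempfile
from pathlib import Path


def switch_to_classic_schema(conf_dir):
    """Back up managed-schema, then rename it to schema.xml."""
    conf = Path(conf_dir)
    managed = conf / "managed-schema"
    classic = conf / "schema.xml"
    if not managed.exists():
        raise FileNotFoundError(managed)
    if classic.exists():
        # Never clobber an existing classic schema.
        raise FileExistsError(classic)
    backup = conf / "managed-schema.bak"  # hypothetical backup name
    shutil.copy2(managed, backup)
    managed.rename(classic)
    return classic


if __name__ == "__main__":
    # Demonstration on a throwaway directory rather than a real core, e.g.
    # server/solr/gettingstarted/conf in my installation.
    with tempfile.TemporaryDirectory() as demo_dir:
        (Path(demo_dir) / "managed-schema").write_text("<schema/>")
        print(switch_to_classic_schema(demo_dir).name)
```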
To extend my example (above) further, here is a sample <updateRequestProcessorChain /> and its output, applied to the HTML code that I also provided (above).
solrconfig.xml (partial):
<updateRequestProcessorChain
processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date">
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.DistributedUpdateProcessorFactory"/>
<processor class="solr.RegexReplaceProcessorFactory">
<str name="fieldName">content</str>
<str name="fieldName">title</str>
<str name="fieldName">p</str>
<!-- Case-sensitive. Only one pattern/replacement pair is allowed per -->
<!-- processor, so add as many copies of this processor as needed. -->
<str name="pattern">\s+</str>
<str name="replacement"> </str>
<bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
<str name="fieldName">content</str>
<str name="fieldName">title</str>
<str name="fieldName">p</str>
<!-- Case-sensitive. Only one pattern/replacement pair is allowed per -->
<!-- processor, so add as many copies of this processor as needed. -->
<str name="pattern">rect http</str>
<str name="replacement">http</str>
<bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
<str name="fieldName">content</str>
<str name="fieldName">title</str>
<str name="pattern">[sS]olr</str>
<str name="replacement">APPLE</str>
<bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
<str name="fieldName">content</str>
<str name="fieldName">title</str>
<str name="pattern">HTML</str>
<str name="replacement">BANANA</str>
<bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
Output
{
"responseHeader":{
"status":0,
"QTime":32,
"params":{
"q":"*:*",
"_":"1605767164812"}},
"response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
{
"id":"/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html",
"stream_size":[1628],
"x_parsed_by":["org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.html.HtmlParser"],
"stream_content_type":["text/html"],
"date_created":"2020-11-11T21:36:38Z",
"date_current":["2020-11-17"],
"resourcename":["/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html"],
"title":["APPLE BANANA Indexing Tests"],
"date_pub":"2020-11-16T21:37:18Z",
"doc_id":"bt-ic8eeW2U",
"source_url":"/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html",
"dc_title":["Solr HTML Indexing Tests"],
"content_encoding":["UTF-8"],
"content_type":["application/xhtml+xml; charset=UTF-8"],
"content":[" en-us stream_size 1628 X-Parsed-By org.apache.tika.parser.DefaultParser X-Parsed-By org.apache.tika.parser.html.HtmlParser stream_content_type text/html date_created 2020-11-11T21:36:38Z resourceName /mnt/Vancouver/programming/datasci/APPLE/test/APPLE_test9.html date_pub 2020-11-16T21:37:18Z doc_id bt-ic8eeW2U source_url /mnt/Vancouver/programming/datasci/APPLE/test/APPLE_test9.html dc:title APPLE BANANA Indexing Tests Content-Encoding UTF-8 Content-Language en-us Content-Type application/xhtml+xml; charset=UTF-8 APPLE BANANA Indexing Tests Lorem ipsum dolor sit amet, consectetur adipiscing elit. "],
"div":[" div1 This text is located in div element 1. div2 This text is located in div element 2. apple This text is located in the \"apple\" (class) div element. banana This text is located in the \"banana\" (class) div element."],
"p":[" I like apples. I also like bananas. Suspendisse efficitur pulvinar elementum. My website is https://buriedtruth.com/ BuriedTruth.com . Nova Scotia is a province on the east coast of Canada. Halifax is the capital of N.S. Halifax is also N.S.'s largest city. Victoria is the capital of B.C. Vancouver is the largest city in B.C., however. Non-terminated sentence (missing period) Current date: 2020-11-17"],
"h1":[" Apples Nova Scotia British Columbia"],
"h2_t":" Bananas Capital of Nova Scotia Capital of British Columbia",
"_version_":1683814668971278336}]
}}
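The output above shows two things worth spelling out: the replacements touch only the fields named in each processor (title changes, dc_title does not), and the [sS]olr pattern also rewrites "solr" inside file paths in content. The Python sketch below reproduces just the last two processors on plain strings to illustrate that scoping; it is an approximation (real Solr fields may be multi-valued), and PROCESSORS and run_chain are hypothetical names.

```python
import re

# (field names, pattern, replacement) copied from the last two processors.
PROCESSORS = [
    (("content", "title"), r"[sS]olr", "APPLE"),
    (("content", "title"), r"HTML", "BANANA"),
]


def run_chain(doc):
    """Apply each processor only to the fields it names; return a new dict."""
    result = dict(doc)
    for field_names, pattern, replacement in PROCESSORS:
        for name in field_names:
            if name in result:
                result[name] = re.sub(
                    pattern, lambda _match, r=replacement: r, result[name]
                )
    return result


DOC = {
    "title": "Solr HTML Indexing Tests",
    "dc_title": "Solr HTML Indexing Tests",
    "content": "source_url /mnt/Vancouver/programming/datasci/solr/test/solr_test9.html",
}

if __name__ == "__main__":
    for name, value in run_chain(DOC).items():
        print(f"{name}: {value}")
```

Because the HTML pattern is case-sensitive, the lowercase ".html" extension in the path is left alone, which matches what the content field shows.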