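I need to insert CSS ids into an HTML document to mark its paragraphs and sentences. HTML can be formatted in many different ways, so it is hard to find one consistent way to parse it. For example, some poorly written HTML uses <table>, some uses <P>, some uses <div>, and so on. Some pages use a combination.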
Input:
<p> This is a sentence, with stuff. Mr. John doe was walking down the street. Mrs. Daisy knows how to drive but does not drive. The car is fast, but is an ugly color. This is an example of a paragraph. </P>
<br>
<div> However, sometimes, paragraphs on HTML pages are not tagged as with a consistent format. This makes it hard to identify paragraphs and sentences. I need a solution to tag them with CSS id's</div>
Output:
<p><span id="paragraph1"> <span id="sentence1">This is a sentence, with stuff.</span><span id="sentence2"> Mr. John doe was walking down the street. </span><span id="sentence3"> Mrs. Daisy knows how to drive but does not drive. </span> <span id="sentence4"> The car is fast, but is an ugly color.</span> <span id="sentence5"> This is an example of a paragraph.</span> </span> </P>
<br>
<div><span id="paragraph2"> <span id="sentence6">However, sometimes, paragraphs on HTML pages are not tagged as with a consistent format.</span><span id="sentence7"> This makes it hard to identify paragraphs and sentences.</span> <span id="sentence8"> I need a solution to tag them with CSS id's</span></span></div>
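1) What solution can I use to identify paragraphs in HTML and tag them?
2) OpenNLP is well suited to identifying sentences, but I did not see an HTML stripper in it.
I was thinking I could use Tika to strip the HTML and feed the result into OpenNLP to identify sentences, but then I lose all the formatting and have no way of knowing where to put the tags back in the original HTML.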
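One way around the "lost formatting" problem is to avoid stripping the HTML at all: walk the markup, pass only the text between tags to the sentence detector, and re-emit every tag unchanged. The sketch below illustrates that idea in Python using only the standard library. It is not a definitive implementation. The names `BLOCK_TAGS`, `ABBREVIATIONS`, `split_sentences`, `SentenceTagger` and `tag_html` are all made up for this example, and the regex-based splitter is a naive stand-in for a real detector such as OpenNLP. It also assumes the input contains no script or style elements, that block tags are properly closed, and that a sentence does not span an inline tag such as <b>.

```python
import re
from html import escape
from html.parser import HTMLParser

# Assumption: these tags are treated as paragraph containers. Extend as needed.
BLOCK_TAGS = {"p", "div", "td", "li"}
# Assumption: a tiny abbreviation list; a real detector handles far more cases.
ABBREVIATIONS = {"Mr.", "Mrs.", "Ms.", "Dr."}


def split_sentences(text):
    """Naive stand-in for a real sentence detector such as OpenNLP."""
    pieces, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        words = text[start:match.end()].split()
        if words and words[-1] in ABBREVIATIONS:
            continue  # "Mr. John" is not a sentence boundary
        pieces.append(text[start:match.end()])
        start = match.end()
    if start < len(text):
        pieces.append(text[start:])
    return pieces


class SentenceTagger(HTMLParser):
    """Re-emits the HTML it reads, adding paragraph and sentence spans."""

    def __init__(self):
        super().__init__()  # character references arrive already decoded
        self.out = []
        self.depth = 0       # nesting depth of open block tags
        self.paragraphs = 0
        self.sentences = 0

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # original tag text
        if tag in BLOCK_TAGS:
            self.depth += 1
            if self.depth == 1:  # only the outermost block opens a paragraph
                self.paragraphs += 1
                self.out.append(f'<span id="paragraph{self.paragraphs}">')

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.out.append("</span>")
        self.out.append(f"</{tag}>")  # note: tag name is lowercased

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_data(self, data):
        if self.depth == 0:  # text outside any paragraph is left untagged
            self.out.append(escape(data, quote=False))
            return
        for piece in split_sentences(data):
            core = piece.strip()
            if not core:  # whitespace only: keep it, do not number it
                self.out.append(piece)
                continue
            self.sentences += 1
            lead = piece[:len(piece) - len(piece.lstrip())]
            trail = piece[len(piece.rstrip()):]
            self.out.append(
                f'{lead}<span id="sentence{self.sentences}">'
                f'{escape(core, quote=False)}</span>{trail}'
            )


def tag_html(html_text):
    parser = SentenceTagger()
    parser.feed(html_text)
    parser.close()  # flush any trailing text
    return "".join(parser.out)


if __name__ == "__main__":
    print(tag_html("<p> Mr. John doe was walking. The car is fast. </P><br><div>One more.</div>"))
```

Because the text is never separated from its markup, there is nothing to map back afterwards. If the Tika-then-OpenNLP route is still preferred, the same principle applies: work with character offsets rather than extracted strings. OpenNLP's `SentenceDetectorME` has a `sentPosDetect` method that returns sentence spans as offsets, which could be translated back to positions in the original HTML if the stripping step records where each text run came from.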