python - How to perform a tag-agnostic text string search in an HTML file?

Question

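I'm using LanguageTool (LT) with the --xmlfilter option enabled to spell-check HTML files. This forces LanguageTool to strip all tags before running the spell check.

It also means that all reported character positions are off, because LT never "sees" the tags.

For example, if I check the following HTML fragment: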

<p>This is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>

LanguageTool will treat it as a plain-text sentence:

    This is kind of a stupid question.

and return the following message:

<error category="Grammar" categoryid="GRAMMAR" context="                This is kind of a stupid question.    " contextoffset="24" errorlength="9" fromx="8" fromy="8" locqualityissuetype="grammar" msg="Don't include 'a' after a classification term. Use simply 'kind of'." offset="24" replacements="kind of" ruleId="KIND_OF_A" shortmsg="Grammatical problem" subId="1" tox="17" toy="8"/>

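(In this particular example, LT has flagged "kind of a".)

Since the search string may be interrupted by tags and may occur more than once, I can't do a simple index search.

What would be the most efficient Python solution to reliably locate any given text string in an HTML file? (LT returns an approximate character position, which may be off by 10-30% depending on the number of tags and on the words before and after the flagged word or words.)

In other words, I need a search that ignores all tags but still includes them in the character position count.

In this particular example, I would have to find "kind of a" and locate the position of the letter k in: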

kin<b>d</b> o<i>f</i> a

score 1 · Accepted Answer

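This may not be the fastest way to go, but pyparsing will recognize HTML tags in most of their forms. The code below inverts the typical scan: it creates a scanner that matches any single character, then configures that scanner to skip over HTML opening and closing tags as well as common HTML '&xxx;' entities. pyparsing's scanString method returns a generator that yields the matched tokens together with the start and end location of each match, so it is easy to build a list that maps every character outside of a tag to its original location. From there, the rest is pretty much just ''.join and indexing into the list. See the comments in the code below: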

test = "<p>This &nbsp;is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>"

from pyparsing import Word, printables, anyOpenTag, anyCloseTag, commonHTMLEntity

non_tag_text = Word(printables+' ',  exact=1).leaveWhitespace()
non_tag_text.ignore(anyOpenTag | anyCloseTag | commonHTMLEntity)

# use scanString to get all characters outside of tags, and build list
# of (char,loc) tuples
char_locs = [(t[0], loc) for t,loc,endloc in non_tag_text.scanString(test)]

# imagine a world without HTML tags...
untagged = ''.join(ch for ch, loc in char_locs)

# look for our string in the untagged text, then index into the char,loc list
# to find the original location
search_str = 'kind of a'
orig_loc = char_locs[untagged.find(search_str)][1]

# print the test string, and mark where we found the matching text
print(test)
print(' '*orig_loc + '^')

"""
Should look like this:

<p>This &nbsp;is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>
                 ^
"""

score 1 · Accepted Answer

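The --xmlfilter option is deprecated because of issues like this. The proper solution is to remove the tags yourself but keep the positions, so that you have a mapping to correct the results that come back from LT. When using LT from Java, this is supported by AnnotatedText, but the algorithm should be simple enough to port. (Full disclosure: I'm the maintainer of LT.)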
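
For what it's worth, here is a rough standard-library sketch of that "strip the tags yourself but keep a mapping" idea. It is not a port of AnnotatedText; the names strip_with_map and to_original_span are made up for illustration, and the regexes are a simplifying assumption that only copes with reasonably well-formed markup (no comments, scripts, or ">" inside attribute values). It also assumes the offset and length you feed in are relative to the same plain text this function produces, which may not match LT's own offsets exactly.

```python
import html
import re

# Assumption: simple regexes stand in for a real HTML tokenizer.
_TAG = re.compile(r"<[^>]*>")
_ENTITY = re.compile(r"&#?\w+;")


def strip_with_map(markup):
    """Return (plain_text, spans).

    spans[i] is the (start, end) slice of `markup` that produced
    plain_text[i], so positions in the plain text can be mapped back.
    """
    chars, spans = [], []
    i = 0
    while i < len(markup):
        tag = _TAG.match(markup, i)
        if tag:
            # Tags contribute nothing to the plain text.
            i = tag.end()
            continue
        ent = _ENTITY.match(markup, i)
        if ent:
            # Decode the entity; every decoded character maps to the
            # whole entity in the original markup.
            for ch in html.unescape(ent.group()):
                chars.append(ch)
                spans.append((ent.start(), ent.end()))
            i = ent.end()
            continue
        chars.append(markup[i])
        spans.append((i, i + 1))
        i += 1
    return "".join(chars), spans


def to_original_span(spans, offset, length):
    """Map a plain-text (offset, length) to a (start, end) slice of the markup."""
    if length <= 0 or offset < 0 or offset + length > len(spans):
        raise ValueError("offset/length outside the plain text")
    return spans[offset][0], spans[offset + length - 1][1]


snippet = "<p>This is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>"
plain, spans = strip_with_map(snippet)

# Here the offset comes from str.find; in practice it would come from
# whatever checker was run on `plain`.
offset = plain.find("kind of a")
start, end = to_original_span(spans, offset, len("kind of a"))
print(plain)               # This is kind of a stupid question.
print(snippet[start:end])  # kin<b>d</b> o<i>f</i> a
```

Storing a (start, end) pair per character rather than a single index keeps the end of the match correct even when the last matched character came from a multi-character entity such as &nbsp;.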
