php - Find words, but not inside links

Question

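I need a regular expression that finds one or more target words in HTML (so inside tags), but not inside anchor or script tags. I have been trying for a long time and came up with this: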

(?!<(script|a).*?>)(\btype 2 diabetes\b)(?!<\/(a|script)>)

Assume the target to replace in this case is "type 2 diabetes".

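I thought this would be a common problem, but every reference I found is about text that is directly part of an anchor, not about text that must not be anywhere inside an anchor or script tag, including inside other tags nested within them.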

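I tested the expression above against the test data below using http://regexpal.com/ and http://gskinner.com/RegExr/. Try as I might, I cannot exclude the occurrences inside the anchor or script tags without also excluding the occurrences that sit between sets of anchor or script tags.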

In the test data below, only the "type 2 Diabetes" inside

<p></p>

should be matched.

<a href="https://www.testsite.org.uk">
<div><img alt="logo" src="/images/logo.png" height="115" width="200" /></div>
<h2>Healthy Living for People with type 2 Diabetes</h2>
</a>
<p>type 2 Diabetes</p>
<a id="logo" href="https://www.help-diabetes.org.uk">
<div><img alt="logo" src="/images/logo.png" height="115" width="200" /></div>
<h2>Healthy Living for People with type 2 Diabetes</h2>
</a>

score 0 · Accepted Answer

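To replace the target words wherever they occur while skipping a and script tags, you have to try to match those tags (and their content) before the target words. Example: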

$subject = <<<LOD
<a href="https://www.testsite.org.uk">
<div><img alt="logo" src="/images/logo.png" height="115" width="200" /></div>
<h2>Healthy Living for People with type 2 Diabetes</h2>
</a>
<p>type 2 Diabetes</p>
<a id="logo" href="https://www.help-diabetes.org.uk">
<div><img alt="logo" src="/images/logo.png" height="115" width="200" /></div>
<h2>Healthy Living for People with type 2 Diabetes</h2>
</a>
LOD;

$targets = array('type 2 diabetes', 'scarlet fever', 'bubonic plague');

$pattern = '~<(a|script)\b.+?</\1>|\b(?>' . implode('|', $targets) . ')\b~si';

$result = preg_replace_callback($pattern,
    function ($m) { return (isset($m[1])) ? $m[0] : '!!!rabbit!!!'; },
    $subject);

echo htmlspecialchars($result);

The callback returns the a or script element unchanged when the first capture group is set, and the replacement string otherwise.
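For readers comparing regex engines, the same "match what you want to skip first" alternation can be sketched with Python's `re` module. This is only an illustration under a few assumptions: the sample markup is shortened, the `<script>` line is added here to exercise that branch, and the helper name `keep_or_replace` is made up for the example. A plain non-capturing group stands in for the atomic group, and an unmatched group shows up as `None` instead of being absent from the match array.

```python
import re

subject = """<a href="https://www.testsite.org.uk">
<h2>Healthy Living for People with type 2 Diabetes</h2>
</a>
<p>type 2 Diabetes</p>
<script>var label = "type 2 diabetes";</script>
"""

targets = ["type 2 diabetes", "scarlet fever", "bubonic plague"]

# Branch 1 swallows a whole <a>...</a> or <script>...</script> element.
# Branch 2 matches a target phrase anywhere else.
pattern = re.compile(
    r"<(a|script)\b.+?</\1>|\b(?:%s)\b" % "|".join(map(re.escape, targets)),
    re.S | re.I,
)


def keep_or_replace(m):
    # Group 1 is set only when branch 1 matched: return the element untouched.
    return m.group(0) if m.group(1) is not None else "!!!rabbit!!!"


result = pattern.sub(keep_or_replace, subject)
print(result)
```

Because the first branch consumes the entire element, the engine never gets a chance to try the target branch at any position inside it. That is what the lookaround attempt in the question could not express.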

Note that if you want a specific replacement for each target word, you can use an associative array:

$corr = array( 'type 2 diabetes' => 'marmot',
               'scarlet fever'   => 'nutria',
               'bubonic plague'  => 'weasel'  );

$pattern = '~<(a|script)\b.+?</\1>|\b(?>'
         . implode('|', array_keys($corr)) . ')\b~si';

$result = preg_replace_callback($pattern,
    function ($m) use ($corr) {
        return (isset($m[1])) ? $m[0] : $corr[strtolower($m[0])];
    },
    $subject);

Keep in mind that the best way to handle HTML is to use the DOM.

score 0 · Accepted Answer

Don't use regular expressions for this problem. Use an HTML parser. Here is a solution in Python using BeautifulSoup:

import re

from bs4 import BeautifulSoup  # BeautifulSoup 4; the old `BeautifulSoup` module is version 3

with open('Path/to/file', 'r') as content_file:
    content = content_file.read()

soup = BeautifulSoup(content, 'html.parser')

# Text nodes have no tag name of their own, so check their ancestors instead.
matches = [el for el in soup.find_all(string=re.compile(r'type 2 diabetes', re.I))
           if el.find_parent(['a', 'script']) is None]

# now you can modify the matched elements

with open('Path/to/file.modified', 'w') as output_file:
    output_file.write(str(soup))
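
The snippet above stops at a comment where the modification would go. As a rough sketch of that step without any third-party dependency, the standard library's `html.parser` can re-emit the document while replacing target phrases only outside a and script elements. The class name `TargetReplacer`, the function `replace_outside_links`, and the sample markup are assumptions made for this example, not part of the original answer. It also has limits: end tags are re-emitted in lower case, and a phrase that is split by an entity or by an inline tag will not be found.

```python
import re
from html.parser import HTMLParser

SKIP_TAGS = {"a", "script"}  # text inside these elements is left alone


class TargetReplacer(HTMLParser):
    """Re-emit HTML, replacing target phrases outside SKIP_TAGS."""

    def __init__(self, targets, replacement):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        alternatives = "|".join(re.escape(t) for t in targets)
        self.pattern = re.compile(r"\b(?:%s)\b" % alternatives, re.IGNORECASE)
        self.replacement = replacement
        self.skip_depth = 0  # greater than 0 while inside <a> or <script>
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())  # original tag text

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <img ... />

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.skip_depth == 0:
            # A function avoids backslash handling in the replacement text.
            data = self.pattern.sub(lambda m: self.replacement, data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def replace_outside_links(html, targets, replacement):
    parser = TargetReplacer(targets, replacement)
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


sample = (
    '<a href="#"><h2>Living with type 2 Diabetes</h2></a>\n'
    "<p>type 2 Diabetes &amp; more</p>\n"
    '<script>var s = "type 2 diabetes";</script>'
)
print(replace_outside_links(sample, ["type 2 diabetes"], "!!!rabbit!!!"))
```

Tracking a depth counter instead of a boolean means text nested several levels below an anchor (for example inside an h2 inside an a) is still treated as being inside the link.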
