regex - How to remove `
Translated from: https://stackoverflow.com/questions/69845981 2021-11-04T21:21:07.130

Viewed 63 times

Question

score 0 · Accepted Answer

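Assumptions:

OP does not have access to an HTML-centric tool
remove the <a href="file:...">...some_text...</a> wrapper, leaving just ...some_text...
only apply this to file: entries
file: entries in the input data have no embedded line breaks/linefeeds in the middle of the entry

Sample data showing multiple file: entries interspersed with some other (nonsensical) entries: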

$ cat sample.html
<p><a href="https:/google.com">some text</a><a href="file://any" >keep this text</a>, <a href="http://example.com/abc">example.com/abc</a>, more text</p><a href="file://anyother" >keep this text,too</a>, last test</p>

One sed idea for removing the wrappers from all file: entries:

sed -E 's|<a[^<>]+file:[^>]+>([^<]+)</a>|\1|g' "${infile}"

NOTE: some of the [^..] entries may be overkill, but the key objective is to short-circuit sed's default greedy matching.

This leaves:

<p><a href="https:/google.com">some text</a>keep this text, <a href="http://example.com/abc">example.com/abc</a>, more text</p>keep this text,too, last test</p>
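The note about greedy matching can be illustrated with a small comparison. This is only a sketch: the one-line sample and the variable names are made up for illustration, and the "greedy" pattern is a deliberately naive variant, not something proposed in the answer.

```shell
# Hypothetical one-line sample: two file: links around one http: link.
sample='<a href="file://a">one</a>, <a href="http://x">two</a>, <a href="file://b">three</a>'

# Naive pattern: ".*" is greedy, so a single match stretches from the
# first "<a" to the final "</a>" and swallows the http: link on the way.
greedy=$(printf '%s\n' "$sample" | sed -E 's|<a.*file:.*>(.*)</a>|\1|g')

# Pattern from the answer: [^<>] and [^<] cannot cross a tag boundary,
# so every match stays inside one <a ...>...</a> pair.
bounded=$(printf '%s\n' "$sample" | sed -E 's|<a[^<>]+file:[^>]+>([^<]+)</a>|\1|g')

printf 'greedy:  %s\n' "$greedy"
printf 'bounded: %s\n' "$bounded"
```

With the bounded pattern the http: link survives untouched and only the two file: wrappers are removed.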

score 0 · Accepted Answer

How about a perl solution, to cover the case where an <a> tag spans multiple lines:

perl -0777 -i -pe 's#<a.+?href="?file.+?>(.+?)</a>#$1#gs' file.xhtml

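The -0777 option tells perl to slurp the whole file.
The -i option enables in-place editing.
The s switch at the end of the s operator makes a dot match any character, including a newline.
The regex .+? is the non-greedy version of .+, which enables the shortest match.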

score 0 · Accepted Answer

One way:

sed -E 's,<a[^>]*?href="file://[^>]*>([^<]*)</a>,\1,g'

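<a[^>]*?href="file://[^>]*> matches <a, then any number of non-> characters, then href="file://, then any number of non-> characters, then >. The ? after the first * is meant as non-greedy, but POSIX ERE does not define lazy quantifiers; the pattern works either way because [^>] cannot run past the end of the tag.
([^<]*) matches and captures any number of non-< characters
</a> matches the closing tag literally

Everything matched is replaced with the capture, \1, and the trailing g makes the substitution apply to every occurrence on each line.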

Example:

$ cat data
<p><a class="a" href="file://any" id="b">keep this text</a>, <a id="file:ex" href="http://example.com/abc">example.com/abc</a>, more text</p>
<p><a href="file://any" class="f">keep this text</a>, <a href="http://example.com/abc">example.com/abc</a>, more text</p>

$ sed -E 's,<a[^>]*?href="file://[^>]*>([^<]*)</a>,\1,g' < data
<p>keep this text, <a id="file:ex" href="http://example.com/abc">example.com/abc</a>, more text</p>
<p>keep this text, <a href="http://example.com/abc">example.com/abc</a>, more text</p>
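The second link in the first data line carries id="file:ex", which shows why anchoring on href="file:// matters. As a rough sketch, the comparison below runs that line through the anchored pattern (written with a plain * instead of *?, on the assumption that this is more portable across sed implementations) and through a looser pattern that accepts file: anywhere inside the tag. The variable names are made up for illustration.

```shell
# First line of the example data: the http: link has "file:" in its id.
line='<p><a class="a" href="file://any" id="b">keep this text</a>, <a id="file:ex" href="http://example.com/abc">example.com/abc</a>, more text</p>'

# Anchored on href="file://: only the real file: link is unwrapped.
anchored=$(printf '%s\n' "$line" | sed -E 's,<a[^>]*href="file://[^>]*>([^<]*)</a>,\1,g')

# Looser pattern: any "file:" inside the tag counts, so the http: link
# with id="file:ex" is unwrapped as well.
loose=$(printf '%s\n' "$line" | sed -E 's,<a[^<>]+file:[^>]+>([^<]+)</a>,\1,g')

printf 'anchored: %s\n' "$anchored"
printf 'loose:    %s\n' "$loose"
```

If attribute values in the real data can contain file: outside of href, the anchored form is the safer choice.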
