regex - Can sed regex simulate lookbehind and lookahead?

Question

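I am trying to write a sed script that will capture all "naked" URLs in a text file and replace them with <a href=[URL]>[URL]</a>. By "naked" I mean a URL that is not already wrapped inside an anchor tag.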

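My initial thought was that I should match URLs that are not preceded by " or > and are not followed by < or ". However, I am having trouble expressing "not preceded or followed by", because as far as I know sed has no lookahead or lookbehind.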

Sample input:

[Beginning of File]http://foo.bar arbitrary text
http://test.com other text
<a href="http://foobar.com">http://foobar.com</a>
Nearing end of file!!! http://yahoo.com[End of File]

Sample desired output:

[Beginning of File]<a href="http://foo.bar">http://foo.bar</a> arbitrary text
<a href="http://test.com">http://test.com</a> other text
<a href="http://foobar.com">http://foobar.com</a>
Nearing end of file!!! <a href="http://yahoo.com">http://yahoo.com</a>[End of File]

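Observe that the third line is unmodified, because its URL is already inside <a href>. On the other hand, both the first and second lines are modified. Finally, observe that all non-URL text is unmodified.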

Ultimately, I am trying to do something like:

sed s/[^>"](http:\/\/[^\s]\+)/<a href="\1">\1<\/a>/g 2-7-2013

I started by verifying that the following correctly matches and removes a URL:

sed 's/http:\/\/[^\s]\+//g'

Then I tried this, but it fails to match URLs that start at the very beginning of the file/input:

sed 's/[^\>"]http:\/\/[^\s]\+//g'

Is there a way to work around this in sed, either by simulating lookbehind/lookahead or by explicitly matching the beginning and end of the file?

score 4 · Accepted Answer

sed is an excellent tool for simple substitutions on a single line. For any other text manipulation problem, just use awk.

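Check the definition I use in the BEGIN section below for the regexp that matches URLs. It works for your sample, but I don't know whether it captures all possible URL formats. Even if it doesn't, it may be adequate for your needs.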

$ cat file
[Beginning of File]http://foo.bar arbitrary text
http://test.com other text
<a href="http://foobar.com">http://foobar.com</a>
Nearing end of file!!! http://yahoo.com[End of File]
$
$ awk -f tst.awk file
[Beginning of File]<a href="http://foo.bar">http://foo.bar</a> arbitrary text
<a href="http://test.com">http://test.com</a> other text
<a href="http://foobar.com">http://foobar.com</a>
Nearing end of file!!! <a href="http://yahoo.com">http://yahoo.com</a>[End of File]
$
$ cat tst.awk
BEGIN{ urlRe="http:[/][/][[:alnum:]._]+" }
{
    head = ""
    tail = $0
    while ( match(tail,urlRe) ) {
       url  = substr(tail,RSTART,RLENGTH)
       href = "href=\"" url "\""

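       # RSTART > 6 guards against index() returning 0 (no href) when the url starts at column 6.
       if (RSTART > 6 && index(tail,href) == (RSTART - 6) ) {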
          # this url is inside href="url" so skip processing it and the next url match.
          count = 2
       }

       if (! (count && count--)) {
          url = "<a " href ">" url "</a>"
       }

       head = head substr(tail,1,RSTART-1) url
       tail = substr(tail,RSTART+RLENGTH)
    }

    print head tail
}
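
The href test in the script depends on the fact that href=" is six characters long, so a URL sitting inside href="..." starts exactly six columns after the attribute does. The following is only a small sketch of that single check, not part of the original answer: check_href is a hypothetical helper name, the ASCII class [A-Za-z0-9._] stands in for [[:alnum:]._] on the assumption that some awk builds lack POSIX character classes, and the RSTART > 6 guard is an addition to cover the case where index() returns 0.

```shell
# check_href (hypothetical helper): report where the first URL in a line starts,
# how long it is, and whether it sits directly inside href="...".
check_href() {
  printf '%s\n' "$1" | awk '
    BEGIN { urlRe = "http:[/][/][A-Za-z0-9._]+" }   # ASCII stand-in for [[:alnum:]._]
    {
      if (match($0, urlRe)) {
        url  = substr($0, RSTART, RLENGTH)
        href = "href=\"" url "\""
        # href=" is 6 characters, so an enclosing attribute starts at RSTART - 6.
        # RSTART > 6 rules out the 0 that index() returns when href is absent.
        inside = (RSTART > 6 && index($0, href) == RSTART - 6) ? "inside href" : "naked"
        print RSTART, RLENGTH, inside
      }
    }'
}

check_href 'x <a href="http://foobar.com">http://foobar.com</a>'
check_href 'see http://test.com now'
```

The first call prints 12 17 inside href and the second prints 5 15 naked, which is the distinction the script uses to decide whether to wrap a URL.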

score 2 · Accepted Answer

The obvious problem with your command is

You did not escape the parenthesis "("

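This is a quirk of sed regular expressions. Unlike Perl regex, many symbols are "literal" by default in sed's basic syntax, and you must escape them to make them "functional". Try: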

s/\([^>"]\?\)\(http:\/\/[^\s]\+\)/\1<a href="\2">\2<\/a>/g
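
One way to approximate the missing lookbehind, sketched here under some assumptions rather than offered as a definitive solution, is to capture the preceding context as either the start of the line or one character that is not " or >, and then put that capture back in the replacement. The sketch assumes a sed that accepts -E (GNU and BSD sed both do), borrows the URL character class [[:alnum:]._] from the awk answer, and uses linkify as a hypothetical function name. No lookahead is needed in this form, because the URL class itself cannot run into < or ".

```shell
# linkify (hypothetical helper): wrap bare http:// URLs in anchor tags.
# Reads stdin, writes stdout.
# (^|[^">]) stands in for a lookbehind: it matches either the start of the
# line or one character that is neither " nor >, and \1 puts that text back.
linkify() {
  sed -E 's#(^|[^">])(http://[[:alnum:]._]+)#\1<a href="\2">\2</a>#g'
}

# Sample input from the question.
linkify <<'EOF'
[Beginning of File]http://foo.bar arbitrary text
http://test.com other text
<a href="http://foobar.com">http://foobar.com</a>
Nearing end of file!!! http://yahoo.com[End of File]
EOF
```

Because the context character is consumed by the match, this only inspects one character of context. An href written with single quotes or with no quotes at all would still be wrapped, and URLs containing paths or query strings would be cut short by the narrow character class.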
