regex - Sed to surround logo with schema markup

Question

I am trying to replace:

   <td id="logo_divider"><a href="http://www.the-site.com"><img src=
   "/ART/logo.140.gif" width="140" height="84" alt="logo" border=
   "0" id="logo" name="logo" /></a></td>

with:

   <td id="logo_divider"><span itemscope itemtype="http://schema.org/Organization"><a itemprop="url" href="http://www.the-site.com"><img itemprop="logo" src=
   "/ART/logo.140.gif" width="140" height="84" alt="logo" border=
   "0" id="logo" name="logo" /></a></span></td>

The sed command I wrote:

sed -E s#\(\<td id=\"logo_divider\"\>\)\(\<a \)\(href=\"http://www\.the-site\.com\"\>\<img \)\(src=\n\"/ART/logo\.140\.gif\".*?\n.*?\>\)#\1\<span itemscope itemtype=\"http://schema\.org/Organization\"\>\2itemprop=\"url\"\3itemprop=\"logo\"\4\</span\>\5#g default.ctp

There are two problems. The first is that the command fails:

sed: 1: "s#(<td": unterminated substitute pattern
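
The error message quotes only `s#(<td` as the script, which is consistent with the shell splitting the unquoted script at its first unescaped space. A minimal sketch of that word splitting, using Python's `shlex` as a stand-in for the shell; the shortened command and the `X` replacement are illustrative assumptions, not the full command above:

```python
import shlex

# Shortened, hypothetical stand-in for the unquoted command from the question.
command = r'sed -E s#\(\<td id=\"logo_divider\"\>\)#X#g default.ctp'

# shlex.split follows POSIX shell quoting rules closely enough for this case:
# a backslash escapes the next character, and an unescaped space ends the word.
argv = shlex.split(command)
for i, arg in enumerate(argv):
    print(i, repr(arg))

# The word right after -E is all that sed treats as its script.
script = argv[2]
```

Wrapping the whole script in single quotes would hand it to sed as one argument. Whether the pattern would then match is a separate question, since sed reads its input one line at a time and `.*?` is not a lazy quantifier in POSIX extended regular expressions.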

The second is that, even if it succeeded, the match needs to be robust to newlines. A more robust solution would first remove any newlines between:

<td id="logo_divider">

and:

</td>

and then perform the substitution on the cleaned file. Something like:

sed -E s#\n##g | ...
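
The two-step idea (join the lines inside the cell, then substitute) can be sketched in Python with the standard `re` module. This is only an illustration under stated assumptions: the function names `join_td_lines` and `add_schema_markup` are hypothetical, the cell is assumed to appear exactly as in the snippet above, and line breaks are assumed to fall where removing them does not glue two attributes together. Regular expressions remain fragile for HTML in general.

```python
import re

SPAN_OPEN = '<span itemscope itemtype="http://schema.org/Organization">'

def join_td_lines(html: str) -> str:
    """Remove newlines, and the indentation after them, inside the logo cell only."""
    cell = re.compile(r'<td id="logo_divider">.*?</td>', re.DOTALL)
    return cell.sub(lambda m: re.sub(r'\n[ \t]*', '', m.group(0)), html)

def add_schema_markup(html: str) -> str:
    """Wrap the logo link in a schema.org span and tag the link and the image."""
    html = join_td_lines(html)
    pattern = re.compile(
        r'(<td id="logo_divider">)(<a )'
        r'(href="http://www\.the-site\.com"><img )(.*?</a>)(</td>)'
    )
    replacement = (r'\1' + SPAN_OPEN +
                   r'\2itemprop="url" \3itemprop="logo" \4</span>\5')
    return pattern.sub(replacement, html)

sample = (
    '   <td id="logo_divider"><a href="http://www.the-site.com"><img src=\n'
    '   "/ART/logo.140.gif" width="140" height="84" alt="logo" border=\n'
    '   "0" id="logo" name="logo" /></a></td>\n'
)
print(add_schema_markup(sample))
```

Unlike the target markup in the question, the result keeps the cell on a single line, because the newlines are removed before the substitution runs.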

score 3 · Accepted Answer

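As chepner said, use the right tool for the job.

If you have Python available, I recommend Beautiful Soup. It is relatively simple to get what you want (this is rough and ready, but assuming you have the source above in somefile.html, you'll get the idea):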

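from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on what is installed.
with open("./somefile.html") as f:
    soup = BeautifulSoup(f, "html.parser")

td = soup.find('td', id='logo_divider')
anchor = td.find('a')
anchor['itemprop'] = 'url'
anchor.find('img')['itemprop'] = 'logo'  # the target markup also tags the image
span = soup.new_tag('span')
span['itemscope'] = ''  # serialized as itemscope="", equivalent to the bare boolean attribute
span['itemtype'] = 'http://schema.org/Organization'
# replace_with() returns the element it removed, so the anchor can be re-attached inside the span.
spanchild = anchor.replace_with(span)
span.append(spanchild)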
