html - How do I extract the title and list item contents from HTML into a comma-separated list?

Question

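Unfortunately my regex-fu is lacking. I'm reading "Mastering Regular Expressions" and working through some online tutorials, but I'm getting nowhere, so I'm hoping that a practical example for my situation will help me get started.

The input file looks roughly like this: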

<html>
 <head>
  <title>My Title</title>
 </head>
<body>
 <p>Various random text...</p>
 <ul>
  <li>One</li>
  <li><a href="example.com">Two</a></li>
  <li>Three</li>
 </ul>
 <p>Various random text...</p>
 </body>
</html>

My end goal is to output:

My Title,One,<a href="example.com">Two</a>,Three

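That is, comma-separated values: the title, followed by the contents of the li tags.

As a first step I tried to delete everything before the title tag, including the tag itself. I decided to use sed (I'm running GNU sed version 4.2 on Windows) and tried the following.

I figured I needed to match "everything", newlines included, up to and including the title tag, and replace it with nothing.

A dot matches every character, and there is also the newline \n, so I put both in a class and repeated it with *. That gives [.\n]* followed by the title tag, replaced with nothing.

So: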

type file.html | sed "s/[.\n]*<title>//"

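But this doesn't work: it only removes the string <title> itself, not the content before it.

Where am I going wrong? I'd like to understand.

Any advice is appreciated. Thanks in advance.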

score 1 · Accepted Answer

Using sed (and tr, and sed...):

sed -n -e '/<title>\|<li>/{s/^[ ]*<[^>]*>//;s/<[^>]*>[ ]*$//p}' input | \
    tr '\n' , | sed 's/,$/\n/'
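To see what each stage of that pipeline contributes, here is a rough sketch that runs it on the sample from the question and keeps the intermediate result. It assumes GNU sed (the `\|` alternation and `\n` in the replacement are GNU extensions); the variable names `input`, `values` and `csv` are only for illustration.

```shell
# Assumes GNU sed, as in the question.
input='<html>
 <head>
  <title>My Title</title>
 </head>
<body>
 <p>Various random text...</p>
 <ul>
  <li>One</li>
  <li><a href="example.com">Two</a></li>
  <li>Three</li>
 </ul>
 <p>Various random text...</p>
 </body>
</html>'

# Stage 1: on lines containing <title> or <li>, strip the leading tag (with its
# indentation) and the trailing tag, then print one value per line.
values=$(printf '%s\n' "$input" |
  sed -n -e '/<title>\|<li>/{s/^[ ]*<[^>]*>//;s/<[^>]*>[ ]*$//p}')

# Stages 2 and 3: turn every newline into a comma, then swap the final comma
# for a newline.
csv=$(printf '%s\n' "$values" | tr '\n' , | sed 's/,$/\n/')

printf '%s\n' "$values"
printf '%s\n' "$csv"
```

Note that stage 1 relies on each title or li element sitting on a line of its own, as in the sample; an element spread over several lines would not be picked up this way.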

Using a single sed expression:

sed ':a;N;$!ba;s/\n//g;        # loop, read-in all file, remove newlines 
     s/.*<title>//;            # remove everything up to, including <title>
     s/title>.*<ul>/title>/;   # remove everything between </title> and <ul>
     s!</ul>.*!!;              # remove everything after </ul>, inclusive
     s!</li>\|</title>!,!g;    # substitute closing tags with commas
     s/<li>//g;                # remove <li> tags
     s/,[ ]*$//                # delete the trailing comma
     ' input
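As for why the attempt in the question only removed the tag, two things seem to be at play, and the small sketch below tries to show both. First, sed normally holds one line at a time in the pattern space, so there is no newline there to match. Second, inside brackets a dot is a literal dot, not "any character". The `html`, `attempt` and `slurped` variables are made up for the demo, and GNU sed is assumed.

```shell
# Assumes GNU sed, as in the question.
html='<html>
 <head>
  <title>My Title</title>
 </head>'

# The attempt from the question: sed sees one line at a time, so the pattern
# space never contains a newline, and inside brackets "." is a literal dot.
# [.\n]* therefore matches the empty string just before <title>, and only the
# tag itself is removed.
attempt=$(printf '%s\n' "$html" | sed 's/[.\n]*<title>//')

# Slurping the whole input first (:a;N;$!ba) puts the newlines into the
# pattern space, where a plain ".*" can run across them.
slurped=$(printf '%s\n' "$html" | sed ':a;N;$!ba;s/.*<title>//')

printf '%s\n' "$attempt"
printf '%s\n' "$slurped"
```

The first result still has the `<html>` and `<head>` lines in front of the title; the second starts at `My Title`, which is why the single-expression version above begins with the `:a;N;$!ba` loop.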

score 0 · Accepted Answer

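Ruby solution

There are many ways to do what you want, some more elegant than others. Here is a quick-and-dirty way to get the expected result with a single Ruby one-liner.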

ruby -ne 'BEGIN { output = "" }
          output << $1 + ?, if %r{<(?:title|li)>(.*)</\1?}
          END { puts output.sub(/,$/, "") }' /tmp/foo.html

This script prints the result in the format described in the original question. For example, with the sample text provided it prints:

My Title,One,<a href="example.com">Two</a>,Three
