html - Windows - How to grep (or findstr) an HTML file and show only the first matching expression

Question

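Using grep or findstr, I want to get the correct IMDB number for a specific movie when searching by its real title.

For example, the movie "Das Boot" is listed on IMDB under the movie number tt0082096.

In practice, I am trying to grep (or findstr) through an HTML file generated by a search engine.

The generated HTML file contains several sections like this one: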

<div id="statbox"> 
  <span class="uschr2">1. </span> <a href="http://www.imdb.com/title/tt0082096/" class="dublaulink">Das Boot (1981) - IMDb</a> <br>
  <div id="descbox"> 
  www.imdb.com/title/tt0082096/ - Im Cache - Ähnliche Seiten <BR>
  </div>

The string I am looking for is the one that contains the movie URL. In this case, it is:

http://www.imdb.com/title/tt0082096/

The string has the following format:

http://www.imdb.com/title/tt???????/

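where '?' stands for a digit 0...9

My question is: how can I make grep or findstr return only the first occurrence of the matching string itself, rather than the complete line containing the match?

Thank you very much for your help! Best regards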

score 3 · Accepted Answer

Windows findstr returns the complete line. You can use GNU sed to avoid this:

sed -rn "\#http://www.imdb.com/title/tt#s#.*href=\"(.*)\"\s.*#\1#p" file
http://www.imdb.com/title/tt0082096/
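
The question asks for the first occurrence only, so here is a minimal sketch of how the sed approach could be made to stop after the first hit. It assumes GNU sed and a POSIX-style shell (for example Cygwin or Git Bash on Windows). The file name sample.html and the second search result (tt0000000) are made up for illustration and are not from the question.

```shell
# Hypothetical sample input modeled on the snippet in the question.
# The second entry (tt0000000) is a made-up placeholder.
cat > sample.html <<'EOF'
<span class="uschr2">1. </span> <a href="http://www.imdb.com/title/tt0082096/" class="dublaulink">Das Boot (1981) - IMDb</a> <br>
<span class="uschr2">2. </span> <a href="http://www.imdb.com/title/tt0000000/" class="dublaulink">Placeholder entry</a> <br>
EOF

# On the first line that contains an IMDb title URL:
#   s#...#\1#p  reduces the line to the captured URL and prints it
#   q           quits, so later lines are never examined
first_url=$(sed -rn '\#http://www\.imdb\.com/title/tt[0-9]{7}/#{s#.*(http://www\.imdb\.com/title/tt[0-9]{7}/).*#\1#p;q}' sample.html)
echo "$first_url"
```

Matching the URL pattern directly, rather than everything between the quotes after href=, keeps the capture from depending on which attributes follow the link.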

Alternatively, you can use grep -o:

  -o, --only-matching       show only the part of a line matching PATTERN
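
As a rough sketch of how -o could be combined with the URL format from the question, the command below prints only the matched URLs and keeps the first one with head. It assumes GNU grep and head are available in a POSIX-style shell. The file name sample.html and the second entry (tt0000000) are hypothetical.

```shell
# Hypothetical sample input modeled on the snippet in the question.
# The second entry (tt0000000) is a made-up placeholder.
cat > sample.html <<'EOF'
<span class="uschr2">1. </span> <a href="http://www.imdb.com/title/tt0082096/" class="dublaulink">Das Boot (1981) - IMDb</a> <br>
<span class="uschr2">2. </span> <a href="http://www.imdb.com/title/tt0000000/" class="dublaulink">Placeholder entry</a> <br>
EOF

# -o prints only the matching part of each line, -E enables extended regex
# so that {7} means "exactly seven digits"; head -n 1 keeps the first match.
first_url=$(grep -oE 'http://www\.imdb\.com/title/tt[0-9]{7}/' sample.html | head -n 1)
echo "$first_url"
```

Note that grep -m 1 alone would stop after the first matching line, but with -o it would still print every match found on that line, which is why head -n 1 is used here.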

score 2 · Accepted Answer

With grep, you can do it like this:

grep -oP '(?<=href=\")[^"]+(?=\")' html.file

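This is not an ideal way to parse HTML files. However, if it is a one-off job, you can probably get away with it. (?<=href=\") is a lookbehind: it requires href=" immediately before the match without including it in the output. If the command above returns too many results, you can probably narrow it down by adding something that is unique to the URL line.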
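
One possible way to narrow the lookbehind version, sketched under assumptions: the pattern between the lookarounds is restricted to the IMDb URL format from the question, and head keeps the first result. This needs a GNU grep built with PCRE support (-P) and a POSIX-style shell. The file name sample.html and the extra link to example.com are hypothetical.

```shell
# Hypothetical sample input: one unrelated link plus the line from the question.
cat > sample.html <<'EOF'
<a href="http://example.com/other" class="dublaulink">Unrelated link</a> <br>
<span class="uschr2">1. </span> <a href="http://www.imdb.com/title/tt0082096/" class="dublaulink">Das Boot (1981) - IMDb</a> <br>
EOF

# (?<=href=") and (?=") anchor the match between the quotes of an href
# attribute without printing them; the middle part only accepts IMDb title URLs.
first_url=$(grep -oP '(?<=href=")http://www\.imdb\.com/title/tt\d{7}/(?=")' sample.html | head -n 1)
echo "$first_url"
```

The unrelated link is skipped because its URL does not fit the tt plus seven digits format, even though it also sits inside an href attribute.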
