html - Stripping HTML content with SED

Question

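I am working on a task for which SED is the prescribed tool. The task is to strip the content of any web page file (*.htm or *.html) and write the desired data to a new file.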

Everything before and including the <body> tag is to be removed.
Everything from the </body> tag onward, including the tag itself, is to be removed.

Below is an example in which the <div> tags and everything between them are to be kept:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>SED Challange</title>
</head>
<body style="background-color:black;"><div style="width:100%; height:150px; margin-top:150px; text-align:center">
<img src="pic.png" width="50" height="50" alt="Pic alt text" />
</div></body></html>

However, I am having trouble removing everything before <body>:

sed 's/.*body.*>//' ./index.html > ./index.html.nobody

Instead of the desired result, the two separate lines containing <body> and </body> are removed!

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>SED Challange</title>
</head>

<img src="pic.png" width="50" height="50" alt="Pic alt text" />

I don't understand why that is. I would appreciate any feedback.

Edit:

Thanks to SLePort, here is my complete script:

#!/bin/bash

# Search location, given as the first argument.
target="$1"

# Recursive, case-insensitive search for files ending in .htm or .html.
# The parentheses make -type f apply to both name tests; -print0 together
# with read -d '' keeps file names containing spaces intact.
find "$target" -type f \( -iname '*.htm' -o -iname '*.html' \) -print0 |
while IFS= read -r -d '' h
do
    hp=$(realpath "$h") # Absolute path of the file (hit path).
    echo "Stripping performed on $hp" # Report which file was found.
    nobody="${hp}_nobody" # Output file: the input path with "_nobody" appended.

    # Delete everything from line 1 through the line containing </head>,
    # remove the <body> and </body> tags,
    # remove the closing </html> tag,
    # delete blank lines,
    # and write the result to the _nobody file.
    sed '1,/<\/head>/d;s/<\/*body[^>]*>//g;s/<\/html>//;/^$/d' "$h" > "$nobody"
done

score 0 · Accepted Answer

This sed command should work with the given code:

sed '1,/<\/head>/d;s/<\/*body[^>]*>//g;s/<\/html>//' ./index.html > ./index.html.nobody

It removes:

the lines from line 1 through the line containing the </head> tag
the <body> and </body> tags
the closing </html> tag
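
To make the difference concrete, here is a minimal sketch that runs both commands against the sample page from the question. The temporary directory and the variable names (workdir, greedy, stripped) are only illustrative and not part of the original post. It assumes a sed that accepts semicolon-separated commands, such as GNU sed.

```shell
#!/bin/bash
# Recreate the sample page from the question in a temporary directory.
workdir=$(mktemp -d)
cat > "$workdir/index.html" <<'EOF'
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>SED Challange</title>
</head>
<body style="background-color:black;"><div style="width:100%; height:150px; margin-top:150px; text-align:center">
<img src="pic.png" width="50" height="50" alt="Pic alt text" />
</div></body></html>
EOF

# Original attempt: sed processes one line at a time and .* is greedy, so on
# every line that contains "body" the pattern matches from the start of the
# line to its last ">", which empties the whole line. Lines without "body"
# (the DOCTYPE, <html>, <head> ...) are never touched.
greedy=$(sed 's/.*body.*>//' "$workdir/index.html")

# Command from the answer: delete lines 1 through </head>, then remove only
# the <body ...>, </body> and </html> tags themselves ([^>]* stops at the
# first ">", so the <div> on the same line survives).
stripped=$(sed '1,/<\/head>/d;s/<\/*body[^>]*>//g;s/<\/html>//' "$workdir/index.html")

printf '%s\n' "--- original attempt ---" "$greedy"
printf '%s\n' "--- answer's command ---" "$stripped"

rm -r "$workdir"
```

The second command relies on </head> sitting on its own line before <body>, as it does in the sample. A page laid out differently may need a different address range.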

Note, however, that sed is not meant for parsing HTML files. Use an XML parser instead (e.g. xmllint, XMLStarlet, ...).
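
As a rough illustration of the parser route, the sketch below asks xmllint for the child nodes of <body> through an XPath expression. It is an assumption-laden example rather than a tested replacement for the sed command: it assumes xmllint (from libxml2) is installed, the shortened sample page and the variable names are made up for the demonstration, and the exact serialization of the output (line breaks, self-closing tags) can differ between libxml2 versions.

```shell
#!/bin/bash
# Build a small sample page (a shortened, hypothetical variant of the
# question's file) so the sketch is self-contained.
workdir=$(mktemp -d)
cat > "$workdir/index.html" <<'EOF'
<html>
<head>
<title>SED Challange</title>
</head>
<body style="background-color:black;"><div style="text-align:center">
<img src="pic.png" width="50" height="50" alt="Pic alt text" />
</div></body></html>
EOF

body=""
if command -v xmllint >/dev/null 2>&1; then
    # --html   parse the input with libxml2's lenient HTML parser
    # --xpath  evaluate the expression and print the matching nodes;
    #          //body/node() selects everything inside <body>, not the tag itself
    # Parser warnings go to stderr and are discarded here.
    body=$(xmllint --html --xpath '//body/node()' "$workdir/index.html" 2>/dev/null)
    printf '%s\n' "$body" > "$workdir/index.html.nobody"
    cat "$workdir/index.html.nobody"
else
    echo "xmllint is not installed; skipping" >&2
fi

rm -r "$workdir"
```

Because the selection works on the parsed document tree, it does not depend on how the tags are spread across lines, which is the main weakness of the line-oriented sed approach.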
