linux - Search for a word with a shell script and export the 35 characters after it?

Question

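I have a file, input.txt, that contains a lot of weird characters, HTML tags, and useful material. I want to write the 35 characters that follow the word description to a new file, output.txt, excluding weird characters such as $$#$#@$#@***$# and without any HTML tags. Please help. Thanks in advance.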

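My end goal is to find the word description and print the 35 characters after it, with no HTML tags or weird characters included. Is that possible? For example, given this: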

<description>&lt;p&gt;&lt;img class="float_right"
 src="http://static3.businessinsider.com/image/502ab0036bb3f7147b00000f-400-300/dnu.jpg"
 border="0" alt="dnu" width="400" height="300" /&gt;&lt;/p&gt;&lt;p&gt;The lawn
 was filled with &lt;a class="hidden_link"
 href="http://www.businessinsider.com/blackboard/goldman-sachs"&gt;Goldman
 Sachs&lt;/a&gt; Group Inc. partners dressed in pink looking out on a pink sunset.

I want to start from: The lawn was filled with (skip the tags again and continue) Group Inc. partners (35 characters. Done!), then stop and search for the next description!

score 1 · Accepted Answer

You can use XPath to select all of the text inside an HTML node. In your case, this should work:

xpath -q -e '//description//text()' input.txt

The query //description//text() works as follows:

//description: drill down through the HTML document until a node named description is found
//text(): drill down through every other node inside that node and select its text

Given your data, this outputs:

The lawn was filled with 
Goldman Sachs
 Group Inc. partners dressed in pink looking out on a pink sunset.
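
The output above is still split across several lines and is longer than the 35 characters the question asks for. As a minimal sketch (not part of the original answer), the extracted text could be piped through a small helper that joins the lines, collapses runs of spaces and tabs, and keeps the first N characters. The function name first_chars is hypothetical, and the sketch assumes plain ASCII text, since some awk implementations count bytes rather than characters in substr.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): read text on stdin,
# join all lines into one, collapse runs of spaces/tabs to a single space,
# drop the leading space, and print the first N characters (default 35).
first_chars() {
    awk -v n="${1:-35}" '
        { text = text " " $0 }              # join lines, separated by a space
        END {
            gsub(/[ \t]+/, " ", text)       # squeeze whitespace runs
            sub(/^ /, "", text)             # trim the leading separator
            print substr(text, 1, n)        # keep only the first n characters
        }'
}
```

Assuming the xpath command from the answer is available, it might be combined as: xpath -q -e '//description//text()' input.txt | first_chars 35 > output.txt. Note that this treats everything the query returns as one stream, so with several description nodes it would yield only the first 35 characters overall; handling each description separately would need a loop over the nodes.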
