html - Easiest way to strip HTML/XML from single-line output

Question

I have output from grep that I'm trying to clean up. It looks like this:

<words>Http://www.path.com/words</words>

I tried using...

sed 's/<.*>//'

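...to strip the tags, but that just wipes out the whole line. I'm not sure why, since every '<' is closed with a '>' before the content starts.

What's the easiest way to do this?

Thanks!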

score 8 · Accepted Answer

Try this as your sed expression:

sed 's/<.*>\(.*\)<\/.*>/\1/'

A quick breakdown of the expression:

<.*>   - Match the opening tag
\(.*\) - Match and save the text between the tags
<\/.*> - Match the closing tag, escaping the / character because / is also the delimiter of the s command
\1     - Output the first saved match
         (the text matched between \( and \))
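To see why the original attempt eats the whole line, it can help to run the expressions side by side. In sed, .* is greedy, so <.*> stretches from the first '<' to the last '>' on the line, which here is the entire line. The sketch below is only an illustration: the variable names (line, attempt, fixed, alt) are made up, and the third command, which uses [^>]*, is an alternative that is not part of the answer above.

```shell
#!/bin/sh
# Sample line from the question.
line='<words>Http://www.path.com/words</words>'

# Original attempt: the greedy .* runs from the first '<' to the LAST '>',
# so the whole line is matched and replaced with nothing.
attempt=$(printf '%s\n' "$line" | sed 's/<.*>//')

# Accepted expression: keep only the text captured between the two tags.
fixed=$(printf '%s\n' "$line" | sed 's/<.*>\(.*\)<\/.*>/\1/')

# Alternative (an assumption, not from the answer): [^>]* cannot cross a '>',
# so each tag is matched on its own; the g flag removes every tag on the line.
alt=$(printf '%s\n' "$line" | sed 's/<[^>]*>//g')

printf 'attempt: [%s]\n' "$attempt"
printf 'fixed:   [%s]\n' "$fixed"
printf 'alt:     [%s]\n' "$alt"
```

The attempt line prints empty brackets, while the other two print the bare URL. The [^>]* form also copes with several tags on one line, but like the other one-liners it only handles tags that open and close on the same line.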

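More on back-references

A question came up in the comments that should probably be addressed for completeness.

\( and \) are sed's back-reference markers. They save a portion of the matched expression for later use.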

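For example, suppose we have the input string:

This has (parens) in it. Also, we can use parens like thisparens with back-references.

We develop an expression:

sed 's/.*(\(.*\)).*\1\(.*\)\1.*/\1 \2/'

This gives us (with two spaces, because the second saved match, " like this", starts with a space):

parens  like this

What on earth is going on here? Let's break the expression down to find out.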

Expression breakdown:

s/     - Start of the sed substitute command; the match expression follows.
.*     - Match any run of characters at the start (including none).
(      - Match a literal left parenthesis character.
\(.*\) - Match any run of characters and save it as back-reference \1. Here it captures everything between the opening and closing parenthesis in the input, i.e. `parens`.
)      - Match a literal right parenthesis character.
.*     - Again, match any run of characters.
\1     - Match the text saved in the first back-reference. In our sample this is `parens`.
\(.*\) - Match any run of characters and save it as back-reference \2.
\1     - Match the first back-reference (`parens`) again.
.*     - Match the rest of the line.
/      - End of the match expression. Signals the transition to the output expression.
\1 \2  - Print our two back-references, separated by a space.
/      - End of the output expression.

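As we can see, the back-reference captured between the parentheses ( and ) is substituted back into the match expression, which lets it match the string parens again later in the line.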
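The walkthrough can be checked with a short script. This sketch assumes the expression is meant to read s/.*(\(.*\)).*\1\(.*\)\1.*/\1 \2/ (single-quoted so the shell leaves the parentheses and backslashes alone) and that the sample sentence uses plain ASCII parentheses; the variable names input and result are made up for illustration.

```shell
#!/bin/sh
# Sample sentence: "parens" appears once inside parentheses and twice afterwards.
input='This has (parens) in it. Also, we can use parens like thisparens with back-references.'

# \1 is used inside the pattern (to re-match "parens") and again in the
# replacement; \2 holds whatever sits between the two later occurrences.
result=$(printf '%s\n' "$input" | sed 's/.*(\(.*\)).*\1\(.*\)\1.*/\1 \2/')

# Brackets make the spacing visible.
printf '[%s]\n' "$result"
```

With this input the script prints [parens  like this]. The second capture is " like this", leading space included, so the output contains two spaces between parens and like.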
