html - Regex: everything except tags

Question

I'm trying to get all of the strings/text except the HTML tags. For example:

<html><head><title>test</title></head><body><p>hi there</p></body></html>
 -->
"test hi there"

First, I tried to write a regex that finds all HTML tags: (<.*?>). Then I tried to invert it with ((?!<.*?>).)* but that expression doesn't work :( Can someone help me?

score 1 · Accepted Answer

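Instead of matching everything except the tags, you should match only the tags and remove them from the string, so that what remains is your result.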

var str = "<html><head><title>test</title></head><body><p>hi there</p></body></html>";
var res = str.replace(/(<[^>]+>)+/g, " ");

You may also need to trim the result and apply .replace(/\s+/g, " ") to it to get the expected output.
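Putting the two steps together might look like the sketch below. The helper name `stripTags` is made up for illustration, and the approach still assumes the input contains no `>` inside attribute values, comments, or scripts.

```javascript
// Sketch: remove tags, then tidy whitespace.
// `stripTags` is a hypothetical helper name, not part of any library.
function stripTags(html) {
  return html
    .replace(/(<[^>]+>)+/g, " ") // each run of adjacent tags becomes one space
    .replace(/\s+/g, " ")        // collapse any whitespace run to a single space
    .trim();                     // drop the leading and trailing space
}

var str = "<html><head><title>test</title></head><body><p>hi there</p></body></html>";
console.log(stripTags(str)); // "test hi there"
```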

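By the way, trying to match all of HTML's syntax with a regex is a bad idea. Instead, you might want to use a DOM parser and read the textContent of the resulting document.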
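A minimal sketch of the DOM approach, assuming a browser environment where `DOMParser` is available (it is not built into every JavaScript runtime). The helper name `getTextWithDom` is hypothetical.

```javascript
// Sketch for browsers: parse the markup and read the text back out.
// `getTextWithDom` is a hypothetical helper name.
function getTextWithDom(html) {
  var doc = new DOMParser().parseFromString(html, "text/html");
  // textContent is null on the Document node itself, so read it from
  // the root element; it concatenates the text of all descendants.
  return doc.documentElement.textContent;
}
```

Note that textContent joins text nodes without inserting separators, so for the markup in the question this should give "testhi there" rather than "test hi there". If only the body text matters, `doc.body.textContent` is the narrower choice.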

score 1 · Accepted Answer

Here is the regex pattern you want:

>([^<]*)<

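With a regex match you get an array of strings. If you join every even string together (the captured groups, as shown below), you get what you want. For more information, see this.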

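// JavaScript version; assumes String.prototype.matchAll and Array.prototype.flat are available

function getHtmlText(html) {
  // Flatten the matches into [fullMatch, group, fullMatch, group, ...]
  var arr = Array.from(html.matchAll(/>([^<]*)</g)).flat();
  var str = "";

  // Every second entry (index 1, 3, 5, ...) is the text captured between two tags
  for (var i = 1; i < arr.length; i += 2) {
    str += arr[i];
  }

  return str;
}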

Or use the textContent property of a DOM element. See this.

Hope it helps, m93a :D

score 0 · Accepted Answer

Use the expression below and replace every tag it matches with the empty string "":

(\<[A-Za-z =":/.]+\>)|(\</[A-Za-z]+\>)

If the HTML is

<B>Bold 
<P>This is a sample text</P>
</B>
<A HREF="http://www.google.com">Click Here</A>

then replacing every match of the expression above with the empty string produces the following result

Bold 
This is a sample text

Click Here
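As a rough JavaScript sketch of that replacement: the helper name `stripSimpleTags` is made up for illustration, and the pattern is written without the backslashes before `<` and `>`, which JavaScript does not need. Because the character class only allows letters, spaces, and `=":/.`, tags containing digits or hyphens (for example `<h1>`) are left untouched.

```javascript
// Sketch: apply the answer's pattern globally and replace matches with "".
// `stripSimpleTags` is a hypothetical helper name.
function stripSimpleTags(html) {
  var tagPattern = /(<[A-Za-z =":\/.]+>)|(<\/[A-Za-z]+>)/g;
  return html.replace(tagPattern, "");
}

var sample =
  "<B>Bold \n" +
  "<P>This is a sample text</P>\n" +
  "</B>\n" +
  "<A HREF=\"http://www.google.com\">Click Here</A>";

console.log(stripSimpleTags(sample));

// Limitation: a tag name with a digit is not matched by this pattern
console.log(stripSimpleTags("<h1>Title</h1>")); // "<h1>Title</h1>"
```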
