html - Regular expression to extract text from HTML

Question

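I want to extract all the text (displayed or not) from a general HTML page.

I want to remove

any HTML tags
any javascript
any CSS styles

Is there a regular expression (one or more) that will achieve that?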

score 19 · Accepted Answer


Remove JavaScript and CSS:

<(script|style).*?</\1>

Remove tags:

<.*?>
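As a rough sketch of how these two patterns could be applied in Python (the helper name `strip_markup` is made up for illustration): the first pattern needs a dot-matches-newline flag to span multi-line script and style blocks, and a case-insensitive flag to catch `<SCRIPT>`. It still inherits every limitation of regex-based stripping.

```python
import re

# Pattern from the answer: a script or style element, opening tag through closing tag.
SCRIPT_STYLE = re.compile(r"<(script|style).*?</\1>", re.DOTALL | re.IGNORECASE)
# Pattern from the answer: any remaining tag.
TAG = re.compile(r"<.*?>", re.DOTALL)


def strip_markup(html: str) -> str:
    """Hypothetical helper: drop script/style blocks first, then all other tags."""
    without_code = SCRIPT_STYLE.sub("", html)
    return TAG.sub("", without_code)


sample = "<p>Hello</p><script>\nvar x = 1;\n</script><style>p {}</style><b>world</b>"
print(strip_markup(sample))
```

The order matters: if tags were removed first, the JavaScript and CSS source would be left behind as plain text.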

Answered 2008-10-08T01:53:36.973

score 14 · Accepted Answer

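You can't really parse HTML with regular expressions. It's too complex. REs won't handle <![CDATA[ sections correctly at all. Further, some kinds of common HTML things like &lt;text> will work in a browser as proper text, but might baffle a naive RE.

You'll be happier and more successful with a proper HTML parser. Python folks often use Beautiful Soup to parse HTML and strip out tags and scripts.

Also, browsers, by design, tolerate malformed HTML. So you'll often find yourself trying to parse HTML that is clearly improper, but happens to work fine in a browser.

You might be able to parse bad HTML with REs. All it requires is patience and hard work. But it's often simpler to use someone else's parser.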
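To illustrate the parser route without a third-party dependency, here is a minimal sketch that uses Python's standard-library `html.parser` instead of Beautiful Soup. The class name `TextExtractor` and the function `extract_text` are invented for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text, ignoring script and style content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a script or style element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample = "<p>1 &lt; 2<script>if (a < b) { run(); }</script> is <b>true</b></p>"
print(extract_text(sample))
```

The parser decodes `&lt;` into a literal `<` and treats the `<` inside the script as script content, two cases that tend to trip up a simple tag-stripping regex.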

score 6 · Accepted Answer

I needed a regex solution (in PHP) that would return plain text just as well as (or better than) PHPSimpleDOM, only much faster. Here is the solution I came up with:

function plaintext($html)
{
    // remove comments and any content found in the comment area (strip_tags only removes the actual tags).
    $plaintext = preg_replace('#<!--.*?-->#s', '', $html);

    // put a space between list items (strip_tags just removes the tags).
    $plaintext = preg_replace('#</li>#', ' </li>', $plaintext);

    // remove all script and style tags
    $plaintext = preg_replace('#<(script|style)\b[^>]*>(.*?)</(script|style)>#is', "", $plaintext);

    // remove br tags (missed by strip_tags)
    $plaintext = preg_replace("#<br[^>]*?>#", " ", $plaintext);

    // remove all remaining html
    $plaintext = strip_tags($plaintext);

    return $plaintext;
}

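When I tested this on some complicated sites (forums seem to contain some of the tougher HTML to parse), this method returned the same result as PHPSimpleDOM plaintext, only much faster. It also handled list items (li tags) properly, where PHPSimpleDOM did not.

As for speed:

SimpleDom: 0.03248 sec.
RegEx: 0.00087 sec.

37 times faster!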

score 4 · Accepted Answer

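It is daunting to consider doing this with regular expressions. Have you considered XSLT? The XPath expression to extract all of the text nodes in an XHTML document, minus script and style content, would be: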

//body//text()[not(ancestor::script)][not(ancestor::style)]
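For readers without an XSLT processor at hand, the following sketch approximates what that XPath expression selects, using only Python's standard library. As far as I know, `xml.etree.ElementTree` supports only a limited XPath subset without `text()` or the `ancestor` axis, so the tree is walked by hand. The helper name `text_nodes` is hypothetical, and the sample assumes well-formed XHTML with no XML namespace declared.

```python
import xml.etree.ElementTree as ET

SKIPPED = {"script", "style"}


def text_nodes(element):
    """Yield text under `element`, skipping script and style subtrees."""
    if element.tag in SKIPPED:
        return
    if element.text:
        yield element.text
    for child in element:
        yield from text_nodes(child)
        # Text following a child element belongs to the parent, so it is
        # kept even when the child itself was skipped.
        if child.tail:
            yield child.tail


xhtml = (
    "<html><head><title>t</title></head>"
    "<body><p>Hello <b>big</b> world</p>"
    "<script>var x = 1;</script><style>p {}</style></body></html>"
)
body = ET.fromstring(xhtml).find("body")
print("".join(text_nodes(body)))
```

Real XHTML documents usually declare the XHTML namespace, in which case the tag names would carry a namespace prefix and the `SKIPPED` set and `find("body")` call would need adjusting.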

score 2 · Accepted Answer

Using Perl syntax for defining the regexes, a start might be:

!<body.*?>(.*)</body>!smi

Then apply the following replacements to the result of that group:

!<script.*?</script>!!smi
!<[^>]+/[ \t]*>!!smi
!</?([a-z]+).*?>!!smi
/<!--.*?-->//smi

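This of course won't format things nicely as a text file, but it strips out all the HTML (mostly; there are cases where it might not work quite right). A better idea is to use an XML parser in whatever language you are using to parse the HTML properly and extract the text from that.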
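The steps above can be sketched in Python as a direct port of the Perl patterns, with `/smi` mapped to `re.DOTALL | re.MULTILINE | re.IGNORECASE`. The function name `body_text` is made up, and falling back to the whole input when no body element is found is an assumption of this sketch, not part of the answer.

```python
import re

FLAGS = re.DOTALL | re.MULTILINE | re.IGNORECASE  # Perl's /smi

BODY = re.compile(r"<body.*?>(.*)</body>", FLAGS)
SUBSTITUTIONS = [
    re.compile(r"<script.*?</script>", FLAGS),  # script blocks
    re.compile(r"<[^>]+/[ \t]*>", FLAGS),       # self-closing tags
    re.compile(r"</?([a-z]+).*?>", FLAGS),      # opening and closing tags
    re.compile(r"<!--.*?-->", FLAGS),           # comments
]


def body_text(html: str) -> str:
    match = BODY.search(html)
    # Assumption: use the whole input if there is no <body> element.
    content = match.group(1) if match else html
    for pattern in SUBSTITUTIONS:
        content = pattern.sub("", content)
    return content


sample = """<html><body class="x">
<p>Hi<br /> there</p><!-- note -->
<script>
var x = 1;
</script>
</body></html>"""
print(repr(body_text(sample)))
```

As the answer warns, the leftover whitespace is not tidied up; a caller would still need to collapse blank lines.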

score 2 · Accepted Answer

The simplest way for simple HTML (example in Python):

import re

text = "<p>This is my> <strong>example</strong>HTML,<br /> containing tags</p>"
# Split the input into tags and text runs, then keep only the runs that are not tags.
" ".join([t.strip() for t in re.findall(r"<[^>]+>|[^<]+", text) if "<" not in t])

This returns:

'This is my> example HTML, containing tags'

score 2 · Accepted Answer

Here is a function that removes even the most complex HTML tags.

function strip_html_tags( $text ) 
{

$text = preg_replace(
    array(
        // Remove invisible content
        '@<head[^>]*?>.*?</head>@siu',
        '@<style[^>]*?>.*?</style>@siu',
        '@<script[^>]*?.*?</script>@siu',
        '@<object[^>]*?.*?</object>@siu',
        '@<embed[^>]*?.*?</embed>@siu',
        '@<applet[^>]*?.*?</applet>@siu',
        '@<noframes[^>]*?.*?</noframes>@siu',
        '@<noscript[^>]*?.*?</noscript>@siu',
        '@<noembed[^>]*?.*?</noembed>@siu',

        // Add line breaks before & after blocks
        '@<((br)|(hr))@iu',
        '@</?((address)|(blockquote)|(center)|(del))@iu',
        '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
        '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
        '@</?((table)|(th)|(td)|(caption))@iu',
        '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
        '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
        '@</?((frameset)|(frame)|(iframe))@iu',
    ),
    array(
        ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
        "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
        "\n\$0", "\n\$0",
    ),
    $text );

// Remove all remaining tags and comments and return.
return strip_tags( $text );
}
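The same idea (drop invisible elements, put a line break before block-level tags, then strip whatever tags remain) might look like this in Python. This is a reduced sketch with a shorter tag list than the PHP function, and `strip_html_keep_breaks` is a hypothetical name.

```python
import re

# Elements whose content is never shown as page text.
INVISIBLE = re.compile(
    r"<(head|style|script|noscript)\b[^>]*>.*?</\1>", re.DOTALL | re.IGNORECASE
)
# A subset of the block-level tags listed in the PHP version.
BLOCK_TAG = re.compile(
    r"</?(?:br|hr|div|p|h[1-6]|li|ul|ol|table|td|th|pre)\b", re.IGNORECASE
)
ANY_TAG = re.compile(r"<[^>]+>")


def strip_html_keep_breaks(text: str) -> str:
    text = INVISIBLE.sub(" ", text)
    # Prefix each block-level tag with a newline; the tag itself is
    # removed by the final substitution.
    text = BLOCK_TAG.sub(lambda m: "\n" + m.group(0), text)
    return ANY_TAG.sub("", text)


sample = "<head><title>t</title></head><div>one</div><p>two<br>three</p>"
print(repr(strip_html_keep_breaks(sample)))
```

Inserting the newline before stripping is what keeps `two` and `three` on separate lines instead of fusing them into one word.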

score 1 · Accepted Answer

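If you're using PHP, try Simple HTML DOM, available on SourceForge.

Otherwise, Google html2text, and you'll find a variety of implementations for different languages that basically use a series of regular expressions to strip out all the markup. Be careful here, because tags without endings can sometimes be left in, as well as special characters such as & (which is &amp;).

Also, watch out for comments and JavaScript. I've found them particularly annoying to deal with using regular expressions, which is why I generally prefer to let a free parser do all the work for me.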
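The leftover-entity problem mentioned above can be handled in Python with the standard library's `html.unescape`. The sketch below uses a deliberately naive tag pattern, and the helper name `strip_and_decode` is invented for illustration.

```python
import html
import re

TAG = re.compile(r"<[^>]+>")


def strip_and_decode(markup: str) -> str:
    """Hypothetical helper: remove tags, then turn entities into characters."""
    without_tags = TAG.sub("", markup)
    return html.unescape(without_tags)


print(strip_and_decode("<p>Fish &amp; chips</p>"))
```

Decoding after stripping is a deliberate choice: if entities were decoded first, escaped text such as `&lt;b&gt;` would turn into `<b>` and then be removed as if it were a tag.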

score 1 · Accepted Answer

I believe you can just do

document.body.innerText

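It will return the content of all text nodes in the document, visible or not.

[edit (olliej): sigh, never mind, this only works in Safari and IE, and I can't be bothered downloading a Firefox nightly to see if it exists in trunk :-/ ]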

score 1 · Accepted Answer

Can't you just use the WebBrowser control available with C#?

        System.Windows.Forms.WebBrowser wc = new System.Windows.Forms.WebBrowser();
        wc.DocumentText = "<html><body>blah blah<b>foo</b></body></html>";
        System.Windows.Forms.HtmlDocument h = wc.Document;
        Console.WriteLine(h.Body.InnerText);

score 1 · Accepted Answer

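// Assumes `html` holds the contents of your HTML file as a string,
// and that System.Text.RegularExpressions is imported.
string decode = System.Web.HttpUtility.HtmlDecode(html);
Regex objRegExp = new Regex("<(.|\n)+?>");
string replace = objRegExp.Replace(decode, "");
// `k` is not defined in the original snippet; it stands for any extra string you want removed.
replace = replace.Replace(k, string.Empty);
replace = replace.Trim("\t\r\n ".ToCharArray());

Then take a label and set "label.Text = replace;" to see the output on the label.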
