python - How to strip all HTML, CSS, and JS code or tags from an HTML page in Python

Question

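Possible duplicates:
BeautifulSoup Grab Visible Webpage Text
Web scraping with Python

Suppose I have a very complex HTML page that mixes the usual HTML tags with CSS and JS. We may hit every worst case.

All I want is to strip all of the tags/code above and get back just the "text".

Put simply: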

<html><body>Text</body></html>

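This may also contain JS, CSS, and so on.

I am trying to use BeautifulSoup, but it does not remove the JS from the code. Now I am considering a regex, but I don't know how to go about it.

Edit 1

Here is my attempt on a simple Bootstrap HTML page...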

from bs4 import BeautifulSoup as bs
import requests

# MY_URL is a placeholder for the URL of the page being scraped
bs(requests.get(MY_URL).text).get_text()

Returned text:

html
Home
Le styles
body {
        padding-top: 10%;
        padding-left: 30%;
      }
HTML5 shim, for IE6-8 support of HTML5 elements
[if lt IE 9]>
      <script src="http://htm...html5.js"></script>
    <![endif]
Home | Under Construction
Sample Page 1
The app
might
face some ........
Firefox
. Ple..
/container
var _gaq = _gaq || [];

  _gaq.push(['_trackPageview']);

  (function() {
    var ga = do...............
  })();

score 1 · Accepted Answer

Django uses this function to strip tags from text:

import re

def strip_tags(value):
    """Returns the given HTML with all tags stripped."""
    return re.sub(r'<[^>]*?>', '', force_unicode(value))

(You don't need the force_unicode part.)
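Note that the regex above removes only the tags themselves, so the contents of <script> and <style> elements stay in the output, which is the problem described in the question. Below is a minimal sketch of one way to handle that with the standard library alone: a standalone strip_tags without force_unicode, plus a pre-pass that drops comments and script/style blocks first. The helper name strip_code_and_tags, the extra patterns, and the sample string are assumptions for illustration and are not part of Django. Regexes are fragile on real-world HTML (for example, a "</script>" inside a JS string literal would end the match early), so treat this as a rough sketch rather than a robust parser.

```python
import re

# Matches any single tag, as in the function quoted above.
TAG_RE = re.compile(r'<[^>]*?>')
# Assumption: script/style blocks are well-formed and not nested.
BLOCK_RE = re.compile(r'<(script|style)\b[^>]*>.*?</\1\s*>',
                      re.IGNORECASE | re.DOTALL)
# HTML comments, including IE conditional comments.
COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)


def strip_tags(value):
    """Return the given HTML with all tags stripped (element contents are kept)."""
    return TAG_RE.sub('', value)


def strip_code_and_tags(value):
    """Hypothetical helper: drop comments and script/style blocks, then tags."""
    value = COMMENT_RE.sub('', value)
    value = BLOCK_RE.sub('', value)
    text = strip_tags(value)
    # Collapse the runs of whitespace left behind by the removed markup.
    return ' '.join(text.split())


sample = (
    '<html><head><style>body { padding-top: 10%; }</style></head>'
    '<body><p>Text</p><script>var _gaq = _gaq || [];</script></body></html>'
)
print(strip_tags(sample))           # tags gone, but CSS and JS source remain
print(strip_code_and_tags(sample))  # only the visible text remains
```

On the sample string, strip_tags still returns the CSS rule and the JS statement around "Text", while strip_code_and_tags returns just "Text".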
