html - Remove all spacing from an HTML string

Question

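I'm trying to write code that removes all whitespace characters from a page's HTML and then counts the 3 most frequent alphanumeric characters on the page. My question is twofold.

1) The method I'm using for splitting doesn't seem to work, and I'm not sure why. As far as I know, splitting and then joining should remove all whitespace from the HTML source, but it doesn't (see the first returned value in the Amazon example below).

2) I'm not very familiar with the most_common operation, and when I test my code on "http://amazon.com" I get the following output: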

The top 3 occuring alphanumeric characters in the html of http://amazon.com 
:  [(u' ', 258), (u'a', 126), (u'e', 126)]

What does the u in the values returned by most_common(3) mean?

My current code:

import requests
import collections


url = raw_input("please enter the url of the desired website (include http://): ")

response = requests.get(url)
responseString = response.text

print responseString

topThreeAlphaString = " ".join(filter(None, responseString.split()))

lineNumber = 0

for line in topThreeAlphaString:
    line = line.strip()
    lineNumber += 1

topThreeAlpha = collections.Counter(topThreeAlphaString).most_common(3)

print "The top 3 occuring alphanumeric characters in the html of", url,": ", topThreeAlpha

score 0 · Accepted Answer

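To deal with the whitespace, join the split pieces with an empty string: " ".join(...) puts a single space back between every token, which is why the space character tops your count. You'll also want to use an instance of HTMLParser.HTMLParser and its unescape method to decode any HTML character entities (such as &nbsp;) that are lying around. To count the characters, collections.Counter is the right tool.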

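import requests
from collections import Counter
from HTMLParser import HTMLParser  # Python 2 module name

response = requests.get('http://www.example.com')
responseString = response.text

parser = HTMLParser()
# Decode HTML entities, split on any whitespace, and rejoin with no separator
# so that no whitespace characters remain before counting.
c = Counter(''.join(parser.unescape(responseString).split()))

# The three most common characters as (character, count) pairs.
print(c.most_common()[:3])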
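
As for the u in the output: in Python 2 it marks a unicode string literal (response.text is unicode), and it does not change the character itself. For readers on Python 3, where every str is Unicode and the prefix no longer appears, the same idea might be sketched as follows. This is only an illustration under some assumptions: the helper name top_alnum_chars and the sample string are made up for this example, html.unescape from the standard library stands in for HTMLParser.unescape, and the filter keeps only alphanumeric characters, which is stricter than the answer above (that version also counts punctuation and markup characters).

```python
from collections import Counter
from html import unescape


def top_alnum_chars(html_text, n=3):
    """Return the n most common alphanumeric characters in html_text.

    Hypothetical helper for illustration; not part of any library.
    """
    # Decode entities such as &nbsp; or &amp; into the characters they stand for,
    # so the letters inside the entity names are not counted.
    decoded = unescape(html_text)
    # Keep only letters and digits; this drops whitespace and punctuation alike.
    return Counter(ch for ch in decoded if ch.isalnum()).most_common(n)


# Made-up sample input; in practice this could be response.text from requests.
sample = "aaaa bbb&nbsp;cc\n d"
print(top_alnum_chars(sample))
```

Note that str.isalnum is Unicode-aware in Python 3, so accented letters and non-Latin digits are counted as well. If only ASCII letters and digits are wanted, the condition could be tightened to `ch.isascii() and ch.isalnum()` (Python 3.7+).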
