python - Bash/Python: open a URL and print the top 10 words

Question

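I need to use a pipeline (plus any other Python scripts required) to extract the 10 most frequently used words from a text; the output is a block of all-uppercase words separated by spaces. The pipeline needs to take text from any external file: I've managed to get it working on .txt files, but I also need to be able to give it a URL and have it do the same thing.

I have the following code: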

alias words="tr a-z A-Z | tr -cs A-Z ' ' | tr ' ' '\012' | sort -n | uniq -c | sort -r | head -n 10 | awk '{printf \"%s \", \$2}END{print \"\"}'"

Running cat hamlet.txt | words gives me:

TO THE AND A  'TIS THAT OR OF IS

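To make it more complicated, I need to exclude any "function" words: these are "non-lexical" words such as "a", "the", "of", "is", any pronouns (I, you, he), and any prepositions (there, at, from).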
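One way to handle the function-word exclusion is to do the counting in Python instead of in the shell pipeline. The following is only a minimal sketch in Python 3 under stated assumptions: FUNCTION_WORDS is a small hypothetical sample rather than a complete list, and top_words is a name made up for this illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately incomplete sample of function words; extend as needed.
FUNCTION_WORDS = {
    "A", "AN", "THE", "OF", "IS", "AND", "OR", "TO",
    "I", "YOU", "HE", "SHE", "IT", "WE", "THEY",
    "THERE", "AT", "FROM", "IN", "ON",
}

def top_words(text, n=10):
    """Return the n most frequent upper-cased words that are not function words."""
    # A word is a run of letters, optionally with one inner apostrophe (DON'T).
    words = re.findall(r"[A-Z]+(?:'[A-Z]+)?", text.upper())
    counts = Counter(w for w in words if w not in FUNCTION_WORDS)
    return [word for word, _ in counts.most_common(n)]

sample = "the cat and the dog saw the cat; the cat ran from the dog"
print(" ".join(top_words(sample, 2)))
```

To use it at the end of a pipe, the last two lines could be swapped for print(" ".join(top_words(sys.stdin.read()))) after importing sys.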

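I need to be able to type htmlstrip http://www.google.com.au | words and have it print out the same kind of result as above.

For opening the URL: the Python script I'm trying to figure out (let's call it htmlstrip) removes any tags from the text, leaving only the "human-readable" text. It should be able to open any given URL, but I can't work out how to get that working. What I have so far: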

import re
import urllib2
filename = raw_input('File name: ')
filehandle = open(filename)
html = filehandle.read()

f = urllib2.urlopen('http://') #???
print f.read()

text = [ ]
inTag = False


for ch in html:
    if ch == '<':
        inTag = True
    if not inTag:
        text.append(ch)
    if ch == '>':
        inTag = False

print ''.join(text)

I know this is both incomplete and probably incorrect; any guidance would be much appreciated.

score 0 · Accepted Answer

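Update: Sorry, I just read the comment about pure Python with no additional modules. Yes, in that case I think re would be the best approach.

Maybe it would be easier and more correct to use pycURL rather than stripping the tags with re?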

from StringIO import StringIO    
import pycurl

url = 'http://www.google.com/'

storage = StringIO()
c = pycurl.Curl()
c.setopt(c.URL, url)
c.setopt(c.WRITEFUNCTION, storage.write)
c.perform()
c.close()
content = storage.getvalue()
print content
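The snippet above only downloads the page; the tags still have to be removed. If extra modules are not an option, something similar can be put together from the standard library alone. The following is a rough sketch, assuming Python 3 (where urllib2 became urllib.request); TextExtractor, html_to_text, fetch and main are names invented for this example, and the UTF-8 fallback is an assumption too.

```python
import sys
from html.parser import HTMLParser
from urllib.request import urlopen

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)

def html_to_text(html):
    """Return the human-readable text of an HTML string, whitespace-normalised."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())

def fetch(url):
    """Download a page and decode it, assuming UTF-8 when no charset is declared."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")

def main(argv):
    if len(argv) != 2:
        sys.stderr.write("Usage: htmlstrip URL\n")
        return 1
    print(html_to_text(fetch(argv[1])))
    return 0
```

Saved as an executable htmlstrip script with sys.exit(main(sys.argv)) at the bottom, this should fit the htmlstrip URL | words usage from the question.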

score 0 · Accepted Answer

Use re.sub for this:

import re

text = re.sub(r"<[^>]+>", " ", html)  # [^>]+ keeps each match inside a single tag

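For special cases such as scripts, you can include a regex like this one (non-greedy, and it needs the re.DOTALL flag to span multiple lines):

<script.*?>.*?</script>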
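To make the regex approach a bit more concrete, here is a small sketch in Python 3. The function name strip_tags and the two pattern names are made up for this example. Note that a greedy pattern such as <.+> runs from the first "<" to the last ">" on a line and swallows the text in between, which is why the sketch uses [^>]+ and non-greedy quantifiers. Regexes are not a full HTML parser, so treat this as good enough for simple pages only.

```python
import re

# Remove whole <script>/<style> elements first, contents included.
# re.DOTALL lets '.' cross newlines; .*? stops at the first closing tag.
SCRIPT_OR_STYLE = re.compile(
    r"<(script|style)\b.*?>.*?</\1\s*>", re.DOTALL | re.IGNORECASE
)
# Then remove any remaining tag; [^>]+ cannot run past the tag's own '>'.
TAG = re.compile(r"<[^>]+>")

def strip_tags(html):
    """Return html with scripts, styles and tags replaced by single spaces."""
    text = SCRIPT_OR_STYLE.sub(" ", html)
    text = TAG.sub(" ", text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return " ".join(text.split())

print(strip_tags("<p>To be, <b>or not</b> to be</p><script>var a = 1;</script>"))
```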

score 0 · Accepted Answer

You can use scrape.py together with a regular expression, like this:

#!/usr/bin/env python

from scrape import s
import sys, re

if len(sys.argv) < 2:
    print "Usage: words.py url"
    sys.exit(0)

s.go(sys.argv[1]) # fetch content
text = s.doc.text # extract readable text
text = re.sub("\W+", " ", text) # remove all non-word characters and repeating whitespace
print text

Then just run: ./words.py http://whatever.com
