python - Retrieve the main paragraph from Python Wikipedia page output

Question

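Is there any way to extract

Albert Einstein (/ˈælbərt ˈaɪnstaɪn/; German: ˈalbɐt ˈaɪnʃtaɪn; 14 March 1879 to 18 April 1955) was a German-born theoretical physicist who developed the general theory of relativity, effecting a revolution in physics. ............. over 150 non-scientific works.[6][8] His great intelligence and originality have made the word "Einstein" synonymous with genius.[9]

(the entire main paragraph of the output, which is visible if the code is run)

automatically from the output of the following code? It should work even when the output comes from a different Wikipedia page: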

import urllib2
import re, sys
from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def stripHTMLTags(html):
    html = re.sub(r'<{1}br{1}>', '\n', html)
    s = MLStripper()
    s.feed(html)
    text = s.get_data()
    if "External links" in text:
        text, sep, tail = text.partition('External links')
    if "External Links" in text:
        text, sep, tail = text.partition('External Links')
    text = text.replace("See also","\n\n See Also - \n")
    text = text.replace("*","- ")
    text = text.replace(".", ". ")
    text = text.replace("  "," ")
    text = text.replace("""   /
 / ""","")
    return text

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
page = infile.read()
print stripHTMLTags(page)

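Please excuse my poor formatting and code (and possibly indentation); I'm currently on a 3" display and haven't had a chance to check my own code :P.

Thanks also to those who helped me get this working :)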

score 3 · Accepted Answer

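I strongly advise against HTML scraping of any website.

It is painful to do, breaks easily, and many site owners don't like it.

Use this (python-wikitools) to interact with the Wikipedia API, which is your best option in the long run.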

score 1 · Accepted Answer

The following API request returns a plain-text extract of the page: https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Albert%20Einstein&explaintext
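For illustration, here is a minimal Python 3 sketch of calling that endpoint and pulling the text out of the reply. It adds a few parameters that are not in the URL above and are my assumptions about what you want: `exintro` (only the lead section), `format=json`, and `redirects`. It also assumes the default JSON layout, where pages sit under `query.pages` keyed by page id. The helper names and the User-Agent string are made up for this example.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title, intro_only=True):
    """Build a TextExtracts query URL for one page title."""
    params = {
        "action": "query",
        "prop": "extracts",
        "titles": title,
        "explaintext": 1,   # plain text instead of limited HTML
        "format": "json",   # assumption: we want a JSON reply
        "redirects": 1,     # assumption: follow page redirects
    }
    if intro_only:
        params["exintro"] = 1  # only the content before the first section
    return API_URL + "?" + urllib.parse.urlencode(params)


def extract_from_response(data):
    """Return the first 'extract' found in a parsed API reply, or None."""
    pages = data.get("query", {}).get("pages", {})
    for page in pages.values():
        if "extract" in page:
            return page["extract"]
    return None


def fetch_intro(title):
    """Fetch the lead section of a page as plain text (needs network access)."""
    request = urllib.request.Request(
        build_extract_url(title),
        # Hypothetical User-Agent; replace it with one that identifies your script.
        headers={"User-Agent": "intro-extract-example/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        return extract_from_response(json.loads(response.read().decode("utf-8")))
```

Calling `fetch_intro("Albert Einstein")` should then give the introductory text for that page, and the same call works for any other title. A missing page has no `extract` key, so the function returns `None` in that case.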

score -1 · Accepted Answer

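I'm leaving my answer here because it is directly what the OP asked for. The proper way to do this is to use python-wikitools, as suggested in @ChristophD's answer below.

I slightly modified the code in your question to use BeautifulSoup. Other options exist. You may also want to try lxml.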

import urllib2
import re, sys
from HTMLParser import HTMLParser

# EDIT 1: import the package
from BeautifulSoup import BeautifulSoup

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def stripHTMLTags(html):
    html = re.sub(r'<{1}br{1}>', '\n', html)
    s = MLStripper()
    s.feed(html)
    text = s.get_data()
    if "External links" in text:
        text, sep, tail = text.partition('External links')
    if "External Links" in text:
        text, sep, tail = text.partition('External Links')
    text = text.replace("See also","\n\n See Also - \n")
    text = text.replace("*","- ")
    text = text.replace(".", ". ")
    text = text.replace("  "," ")
    text = text.replace("""   /
 / ""","")
    return text

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
page = infile.read()

# EDIT 2: convert the page and extract text from the first <p> tag
soup = BeautifulSoup(page)
para = soup.findAll("p", limit=1)[0].text

print stripHTMLTags(para)
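One caveat with taking the very first `<p>` tag: on some pages it can be empty or hold only whitespace. As a rough sketch of the same idea in Python 3, using only the standard library's `html.parser` rather than BeautifulSoup, the following picks the first paragraph that actually contains text. The class and function names are made up for this example, and it assumes the first non-empty `<p>` is the paragraph you want, which may not hold for every page.

```python
from html.parser import HTMLParser


class FirstParagraphParser(HTMLParser):
    """Collect the text of the first <p> element that has visible text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_p = False    # True while inside a <p> element
        self.chunks = []     # text fragments of the current paragraph
        self.result = None   # first non-empty paragraph, once found

    def _finish_paragraph(self):
        # Keep the paragraph only if it has non-whitespace text.
        text = "".join(self.chunks).strip()
        if text:
            self.result = text
        self.in_p = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "p" or self.result is not None:
            return
        if self.in_p:
            # The previous <p> was never closed explicitly.
            self._finish_paragraph()
        if self.result is None:
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p and self.result is None:
            self._finish_paragraph()

    def handle_data(self, data):
        if self.in_p and self.result is None:
            self.chunks.append(data)


def first_paragraph(html):
    """Return the text of the first non-empty <p> in html, or None."""
    parser = FirstParagraphParser()
    parser.feed(html)
    parser.close()
    if parser.result is None and parser.in_p:
        # The document ended while a <p> was still open.
        parser._finish_paragraph()
    return parser.result
```

Passing the downloaded page to `first_paragraph(page)` would replace the `soup.findAll("p", limit=1)[0].text` step. Text inside nested inline tags such as `<b>` or `<sup>` is kept, so footnote markers like `[1]` remain in the result.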
