
I am using BeautifulSoup4 to scrape this web page, but I'm getting weird unicode text back from BeautifulSoup.

Here is my code:

    site = "http://en.wikipedia.org/wiki/"+a+"_"+str(b)
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    req.add_header('Accept-Encoding', 'gzip')  # advertise gzip support to the server
    page = urllib2.urlopen(req)
    if page.info().get('Content-Encoding') == 'gzip':  # response is gzipped
        data = page.read()
        data = StringIO.StringIO(data)
        gzipper = gzip.GzipFile(fileobj=data)
        html = gzipper.read()
        soup = BeautifulSoup(html, fromEncoding='gbk')
    else:
        soup = BeautifulSoup(page)

    section = soup.find('span', id='Events').parent
    events = section.find_next('ul').find_all('li')
    print soup.originalEncoding
    for x in events:
        print x

Basically, I want x to be in plain English. Instead, I get things that look like this:

<li><a href="/wiki/153_BC" title="153 BC">153 BC</a> – <a href="/wiki/Roman_consul" title="Roman consul">Roman consuls</a> begin their year in office.</li>

There's only one example in this particular string, but you get the idea.

Related: I go on to cut this string up with some regex and other string-slicing methods. Should I convert it to plain text before or after I cut it up? I assume it doesn't matter, but since I'm deferring to SO anyway, I thought I'd ask.

If anyone knows how to fix this, I'd appreciate it. Thanks

EDIT: Thanks, J.F., for the tip; I now use this in my for loop:

    for x in events:
        x = x.encode('ascii')
        x = str(x)
        #Find Content
        regex2 = re.compile(">[^>]*<")
        textList = re.findall(regex2, x)
        text = "".join(textList)
        text = text.replace(">", "")
        text = text.replace("<", "")
        contents.append(text)
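
As an aside, the findall/join/replace dance above can be collapsed into a single substitution; a minimal sketch (the sample fragment is illustrative):

```python
import re

def strip_tags(fragment):
    """Drop every <...> tag, keeping only the text between tags."""
    return re.sub(r'<[^>]*>', '', fragment)

li = '<li><a href="/wiki/153_BC" title="153 BC">153 BC</a> text</li>'
print(strip_tags(li))  # 153 BC text
```

Newer BeautifulSoup versions also offer get_text() on a tag, which extracts the same text without any regex.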

However, I still get things like this:

2013 &#8211; At least 60 people are killed and 200 injured in a stampede after celebrations at F&#233;lix Houphou&#235;t-Boigny Stadium in Abidjan, Ivory Coast.
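
Those are HTML numeric character references. Assuming only decimal references like &#8211; appear, a small substitution decodes them (a sketch; under Python 2, unichr would replace chr):

```python
import re

def decode_entities(text):
    # Replace decimal character references like &#233; with the character itself.
    # (Use unichr instead of chr under Python 2.)
    return re.sub(r'&#(\d+);', lambda m: chr(int(m.group(1))), text)

print(decode_entities("F&#233;lix Houphou&#235;t-Boigny"))  # Félix Houphouët-Boigny
```

The stdlib's HTMLParser.HTMLParser().unescape() should also cover named entities such as &amp;amp;.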

EDIT: Here is how I make my excel spreadsheet (csv) and send in my list

    rows = zip(days, contents)
    with open("events.csv", "wb") as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow(row)

So the csv file is created while the program runs, and everything is imported after the lists are generated. I just need it to be readable text at that point.
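
One caveat: Python 2's csv module only handles byte strings, so unicode cells should be encoded to UTF-8 just before writerow. A minimal sketch of that encoding step (the sample row is illustrative):

```python
# Encode each unicode cell to UTF-8 bytes so Python 2's csv module
# can write it without raising UnicodeEncodeError.
row = [u"July 18", u"F\xe9lix Houphou\xebt-Boigny Stadium"]
encoded = [cell.encode("utf-8") for cell in row]
print(encoded)
```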


2 Answers


fromEncoding (renamed to from_encoding to comply with PEP 8) tells the parser how to interpret the data in the input. What you (your browser or urllib) receive from the server is just a stream of bytes. To make sense of it, i.e. to build a sequence of abstract characters from this byte stream (a process called decoding), one has to know how the information was encoded. This piece of information is required, and you have to provide it to make sure your code behaves correctly. Wikipedia tells you how they encode their data; it is stated at the top of the source of each of their web pages, e.g.

<meta charset="UTF-8" />

The byte stream received from Wikipedia's web server should therefore be interpreted with the UTF-8 codec. You should call

soup = BeautifulSoup(html, from_encoding='utf-8')

instead of BeautifulSoup(html, fromEncoding='gbk'), which tries to decode the byte stream with a Chinese character codec (I guess you blindly copied that piece of code from somewhere).

You really need to make sure you understand the basic concepts of text encoding. Actually, what you want in the output is unicode, which is an abstract representation of a sequence of characters/symbols. In this context, there is no such thing as "plain English".
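
To make the point concrete, here is the same byte stream decoded with two different codecs (an illustrative snippet):

```python
raw = b'F\xc3\xa9lix'           # the UTF-8 bytes a server would send for "Félix"
print(raw.decode('utf-8'))      # Félix  -- correct codec
print(raw.decode('latin-1'))    # FÃ©lix -- wrong codec: mojibake
```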

Answered 2013-07-18T20:26:43.857

There is no such thing as plain text. What you see are bytes interpreted as text using an incorrect character encoding, i.e. the encoding of the strings is different from the one your terminal uses, unless the error was introduced earlier by using an incorrect character encoding for the web page.

print x calls str(x), which for BeautifulSoup objects returns a UTF-8 encoded byte string.

Try:

print unicode(x)

Or:

print x.encode('ascii')
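
Note that for a plain unicode string, encode('ascii') raises on any non-ASCII character; BeautifulSoup tags appear to substitute XML character references instead, which would explain the &#8211; output in the question's edit. A sketch of both behaviours (the string is illustrative):

```python
u = u"2013 \u2013 stampede"      # \u2013 is the en dash
try:
    u.encode("ascii")            # strict mode fails on the en dash
except UnicodeEncodeError:
    pass
# The 'xmlcharrefreplace' error handler substitutes numeric references:
print(u.encode("ascii", "xmlcharrefreplace"))
```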
Answered 2013-07-18T20:19:58.220