
I'm a beginner trying to build a script to pull an e-book from Project Gutenberg, break it into chapters and paragraphs and then do some basic analysis on the text.

I'd got as far as being able to reliably find the chapter titles because they are conveniently in 'h2' tags. However, since upgrading from Linux Mint Nadia to Olivia, only the first few tags are detected.

Some of the fine folks at reddit have been trying to help, but we've come to a dead end. However, the diagnostics we worked on are probably useful.

>>> import bs4
>>> from urllib import urlopen
>>> url = "http://www.gutenberg.org/files/82/82-h/82-h.htm"
>>> newtext = urlopen(url).read()
>>> soup = bs4.BeautifulSoup(newtext)
>>> def chap_list(htmlbook):
...     print 'A:', len(htmlbook)
...     soup = bs4.BeautifulSoup(htmlbook)
...     print 'B:', len(soup)
...     chapters = soup('h2')
...     print 'C:', chapters
...     return

>>> chap_list(newtext)

For me, this returns:

A: 1317420
B: 2
C: [<h2>
      A ROMANCE
    </h2>, <h2>
      By Sir Walter Scott
    </h2>, <h2>
      INTRODUCTION TO IVANHOE.
    </h2>, <h2>
      DEDICATORY EPISTLE
    </h2>]

Also, now when I simply call the soup object as defined above, only the first part of the book is returned - up to "Residing in the Castle-Gate, York." I'm sure this used to return the entire text. Therefore my assessment is that BS is no longer pulling in the entire text.

Versions: Python 2.7.4, BeautifulSoup 4.2.1, lxml 3.1.0

I don't know which versions I was using while everything worked. I tried running under Python 3.3 and got the same results. I need to use 2.7 because I want to use NLTK a bit later on.
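For reference, the versions above can be printed from within Python itself (a quick sketch; the `lxml` import is guarded since it may not be installed):

```python
import sys
import bs4

# Print the interpreter and library versions in one place.
print(sys.version.split()[0])   # e.g. 2.7.4
print(bs4.__version__)          # e.g. 4.2.1

# lxml is optional; guard the import in case it's absent.
try:
    import lxml.etree
    print(lxml.etree.LXML_VERSION)
except ImportError:
    print("lxml not installed")
```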

Can anyone help me get this working again?


1 Answer


I tried this code with BeautifulSoup 3 and it seems to work:

In [1]: import BeautifulSoup
In [2]: from urllib import urlopen
In [3]: html = urlopen('http://www.gutenberg.org/files/82/82-h/82-h.htm').read()
In [4]: soup = BeautifulSoup.BeautifulSoup(html)
In [5]: len(soup('h2'))
Out[5]: 58

Note that version 4 uses a different HTML parser under the hood, and I remember reading somewhere that version 3 could cope with more. Perhaps you could try version 4 with a different HTML parser.
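In version 4 the parser can be named explicitly when constructing the soup (a minimal sketch; the toy markup and the tag count here are illustrative, not the Gutenberg page):

```python
import bs4

# Toy document standing in for the real page.
html = "<html><body><h2>A</h2><h2>B</h2><h2>C</h2></body></html>"

# Naming the parser pins the behaviour; with no second argument bs4
# picks the "best" parser installed (often lxml), which is why results
# can change after a distro upgrade swaps library versions.
soup = bs4.BeautifulSoup(html, "html.parser")  # pure-Python stdlib parser
print(len(soup("h2")))  # 3

# Other choices, if the corresponding packages are installed:
#   bs4.BeautifulSoup(html, "lxml")       # fast, stricter with broken markup
#   bs4.BeautifulSoup(html, "html5lib")   # slowest, most lenient
```

If the page's HTML is slightly malformed, a lenient parser may recover tags that a stricter one silently drops, which would match the symptom of only the first few `h2` tags being found.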

Answered 2013-06-09T14:04:57.097