def dcrawl(link):
    #importing the req. libraries & modules
    from bs4 import BeautifulSoup
    import urllib

    #fetching the document
    op = urllib.FancyURLopener({})
    f = op.open(link)
    h_doc = f.read()

    #trimming for the base document
    idoc1 = BeautifulSoup(h_doc)
    idoc2 = str(idoc1.find(id = "bwStory"))
    bdoc = BeautifulSoup(idoc2)

    #extract the date as a string
    dat = str(bdoc.div.div.string)[0:13]
    date = dst(dat) #dst() is a helper defined elsewhere (not shown)

    #extract the title as a string
    title = str(bdoc.b.string)
    #extract the full report as a string
    freport = str(bdoc.find_all("p"))

    #extract the place as a string
    plc = bdoc.find(id = "bwStoryBody")
    puni = plc.p.string

    #encoding to ascii to eliminate discrepancies
    pasi = puni.encode('ascii', 'ignore')
    com = pasi.find("-")
    place = pasi[:com]

The same conversion, str(bdoc.b.string), works here:

#extract the full report as a string
freport = str(bdoc.find_all("p"))

In the line:

plc = bdoc.find(id = "bwStoryBody")

plc returns some data, and plc.p returns the first <p> ... </p> element, but converting it to a string does not work.

Since puni returned a string object earlier, I stumbled into Unicode errors and therefore had to use encoding to process the pasi result.
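
For illustration, a minimal sketch of that encoding workaround under Python 2, as in the question (the sample string is made up; plc.p.string would return something similar):

    #hypothetical unicode value, shaped like what plc.p.string returns
    puni = u"Z\u00fcrich - Company announces quarterly results"
    pasi = puni.encode('ascii', 'ignore')  #drops non-ASCII chars: "Zrich - Company ..."
    com = pasi.find("-")                   #index of the place/headline separator
    place = pasi[:com]                     #text before the dash: "Zrich "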


1 Answer


.find() returns None when the element is not found. Evidently some of the pages do not have the element you are looking for.
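
A quick demonstration of that behaviour (the markup here is invented):

    from bs4 import BeautifulSoup

    page = BeautifulSoup("<div id='other'>no story here</div>")
    print(page.find(id="bwStoryBody"))  #prints None; the id is absent
    #page.find(id="bwStoryBody").p would then raise:
    #AttributeError: 'NoneType' object has no attribute 'p'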

If you want to guard against the AttributeError, test for it explicitly:

plc = bdoc.find(id = "bwStoryBody")
if plc is not None:
    puni = plc.p.string
    #encoding to ascii to eliminate discrepancies
    #By default python processes in unicode
    pasi = puni.encode('ascii', 'ignore')
    com = pasi.find("-")
    place = pasi[:com]
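
Note that plc.p can also be None if the story body happens to contain no <p> tag at all; if that can occur on some pages, extend the test accordingly (a sketch, not part of the original answer):

    plc = bdoc.find(id = "bwStoryBody")
    if plc is not None and plc.p is not None:
        puni = plc.p.string
        pasi = puni.encode('ascii', 'ignore')
        com = pasi.find("-")
        place = pasi[:com]

(.string itself returns None when the tag has more than one child node, so puni may still warrant a check of its own.)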
answered 2013-10-23T07:39:31