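I can print the information I extract from the website without any problem. The trouble starts when I try to write it to a CSV file with the street names in one column and the zip codes in another. All I get in the CSV is the two column names, followed by everything from the page spread across separate columns. Here is my code. I am using Python 2.7.5 and Beautiful Soup 4.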

from bs4 import BeautifulSoup
import csv
import urllib2

url="http://www.conakat.com/states/ohio/cities/defiance/road_maps/"

page=urllib2.urlopen(url)

soup = BeautifulSoup(page.read())

f = csv.writer(open("Defiance Steets1.csv", "w"))
f.writerow(["Name", "ZipCodes"]) # Write column headers as the first line

links = soup.find_all(['i','a'])

for link in links:
    names = link.contents[0]
    print unicode(names)

f.writerow(names)   

2 Answers

Another approach (Python 3) is to look up the next sibling <i> after each <a> link, check that it is a tag, and extract its value:

from bs4 import BeautifulSoup
import csv 
import urllib.request as urllib2

url="http://www.conakat.com/states/ohio/cities/defiance/road_maps/"

page=urllib2.urlopen(url)

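# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(page.read(), "html.parser")

# newline="" lets the csv module control line endings (avoids blank rows on Windows)
f = csv.writer(open("Defiance Steets1.csv", "w", newline=""))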
f.writerow(["Name", "ZipCodes"]) # Write column headers as the first line

links = soup.find_all('a')

for link in links:
    i = link.find_next_sibling('i')
    if getattr(i, 'name', None):
        a, i = link.string, i.string
        f.writerow([a, i])

It produces:

Name,ZipCodes
1ST ST,(43512)
E 1ST ST,(43512)
W 1ST ST,(43512)
2ND ST,(43512)
E 2ND ST,(43512)
W 2ND ST,(43512)
3 RIVERS CT,(43512)
3RD ST,(43512)
E 3RD ST,(43512)
W 3RD ST,(43512)
...
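
The sibling lookup can be tried offline with a minimal sketch like the one below. The `SAMPLE_HTML` markup and the `street_rows` helper are hypothetical: they assume the page puts each zip code in an `<i>` tag that shares a parent with its street link, which is what the code above relies on but is not confirmed here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, assumed to resemble the real page: navigation links
# live in their own parent, and each street link is followed by an <i>
# sibling that holds the zip code.
SAMPLE_HTML = """
<p><a href="/">Home</a></p>
<div>
<a href="http://www.conakat.com/map/?p=1">1ST ST</a> <i>(43512)</i><br>
<a href="http://www.conakat.com/map/?p=2">E 1ST ST</a> <i>(43512)</i><br>
</div>
"""


def street_rows(html):
    """Return [name, zip] pairs for every <a> that has a later <i> sibling."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for link in soup.find_all("a"):
        # find_next_sibling returns None when no later sibling matches.
        zip_tag = link.find_next_sibling("i")
        if zip_tag is not None:
            rows.append([link.string, zip_tag.string])
    return rows


rows = street_rows(SAMPLE_HTML)
```

Note that `find_next_sibling` searches all later siblings, not only the adjacent one, so an unrelated link that shares a parent with the street entries would be paired with the first zip code that follows it.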
answered 2013-10-28T16:00:41.267

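The data you retrieve from the URL contains more a elements than i elements. You have to filter the a elements and then build the pairs with the Python built-in zip: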

links = soup.find_all('a')
# Keep only the street links; .get() avoids a KeyError on <a> tags without href
links = [link for link in links
         if link.get("href", "").startswith("http://www.conakat.com/map/?p=")]
zips = soup.find_all('i')

for l, z in zip(links, zips):
    f.writerow((l.contents[0], z.contents[0]))

Output:

Name,ZipCodes
1ST ST,(43512)
E 1ST ST,(43512)
W 1ST ST,(43512)
2ND ST,(43512)
E 2ND ST,(43512)
W 2ND ST,(43512)
3 RIVERS CT,(43512)
3RD ST,(43512)
E 3RD ST,(43512)
...
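
To look at the pairing step on its own, here is a minimal sketch that uses only the standard library. The two lists are hypothetical stand-ins for the text of the filtered a and i elements, and `rows_to_csv` is an illustrative helper, not part of the code above.

```python
import csv
import io

# Hypothetical stand-ins for the text of the filtered <a> and <i> tags.
street_names = ["1ST ST", "E 1ST ST", "W 1ST ST"]
zip_codes = ["(43512)", "(43512)", "(43512)"]


def rows_to_csv(names, zips):
    """Write one (name, zip) pair per CSV row and return the text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Name", "ZipCodes"])
    for name, code in zip(names, zips):
        # Pass a list: each element becomes one column in the row.
        writer.writerow([name, code])
    return buffer.getvalue()


csv_text = rows_to_csv(street_names, zip_codes)
print(csv_text)
```

Two details matter here. `writerow` treats its argument as a sequence of fields, so passing a bare string instead of a list puts every character in its own column. And `zip` stops at the shorter input, so if the filter leaves more links than zip codes, the extra links are dropped silently.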
answered 2013-10-28T15:50:10.810