I have the following code to scrape data. The data is being scraped, but the output is a bit messy.
from bs4 import BeautifulSoup
import urllib2
import re
import csv

with open('ccccc.csv', 'wb') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    for i in xrange(1,3):
        try:
            page = urllib2.urlopen("http://www.codissia.com/member/members-directory/?mode=paging&Keyword=&Type=&pg={}".format(i))
        except urllib2.HTTPError:
            continue
        else:
            soup = BeautifulSoup(page.read(), from_encoding=page.info().getparam('charset'))
            eachbox = soup.find_all('div', {'class':re.compile(r'members_box[12]')})
            for pair in zip(*[iter(eachbox)]*2):
                writer.writerow([text.strip() for item in pair for text in item.stripped_strings])
In the image I added, you can see that the columns don't line up.
This is the structure of the data I'm scraping:
<div class="members_box_second">
    <div class="members_box0">
        <p>1</p>
    </div>
    <div class="members_box1">
        <p class="clear"><b>Name:</b><span>Mr.Jagadhesan.S</span></p>
        <p class="clear"><b>Designation:</b><span>Proprietor</span></p>
        <p class="clear"><b>CODISSIA - Designation:</b><span>(Founder President, CODISSIA)</span></p>
        <p class="clear"><b>Name of the Industry:</b><span>Govardhana Engineering Industries</span></p>
        <p class="clear"><b>Specification:</b><span>LIFE</span></p>
        <p class="clear"><b>Date of Admission:</b><span>19.12.1969</span></p>
    </div>
    <div class="members_box2">
        <p>Ukkadam South</p>
        <p class="clear"><b>Phone:</b><span>2320085, 2320067</span></p>
        <p class="clear"><b>Email:</b><span><a href="mailto:jagadhesan@infognana.com">jagadhesan@infognana.com</a></span></p>
    </div>
</div>
<div class="members_box">
    <div class="members_box0">
        <p>2</p>
    </div>
    <div class="members_box1">
        <p class="clear"><b>Name:</b><span>Mr.Somasundaram.A</span></p>
        <p class="clear"><b>Designation:</b><span>Proprietor</span></p>
        <p class="clear"><b>Name of the Industry:</b><span>Everest Engineering Works</span></p>
        <p class="clear"><b>Specification:</b><span>LIFE</span></p>
        <p class="clear"><b>Date of Admission:</b><span>19.12.1969</span></p>
    </div>
    <div class="members_box2">
        <p>Alagar Nivas, 284 NSR Road</p>
        <p class="clear"><b>Phone:</b><span>2435674</span></p>
        <h4>Factory Address</h4>
        Coimbatore - 641 027
        <p class="clear"><b>Phone:</b><span>2435674</span></p>
    </div>
</div>
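I want each piece of data to land in its corresponding column. For example, all the names should be under the same column, and likewise the phone numbers, emails, and so on. If a phone number is not present, the CSV file should have an empty cell in that position. I'm not even close to an idea of how to implement this.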
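
The column alignment described above can be sketched with `csv.DictWriter`, which writes each record by field name and fills any missing field with an empty cell. This is only a minimal sketch under assumptions: it targets Python 3 and the standard library, the `FIELDS` list and the helper `rows_to_csv` are hypothetical names, and the two records are typed in by hand from the sample HTML rather than scraped.

```python
import csv
import io

# Assumed column order, based on the labels seen in the sample HTML.
FIELDS = ["Name", "Designation", "Name of the Industry", "Phone", "Email"]


def rows_to_csv(records, fields=FIELDS):
    """Return CSV text for a list of dicts; missing fields become empty cells."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(
        buf,
        fieldnames=fields,
        restval="",             # value written when a record lacks a field
        extrasaction="ignore",  # silently skip labels not listed in fields
        quoting=csv.QUOTE_ALL,
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buf.getvalue()


# Hand-built records mirroring the two members in the sample HTML.
records = [
    {
        "Name": "Mr.Jagadhesan.S",
        "Designation": "Proprietor",
        "CODISSIA - Designation": "(Founder President, CODISSIA)",
        "Name of the Industry": "Govardhana Engineering Industries",
        "Phone": "2320085, 2320067",
        "Email": "jagadhesan@infognana.com",
    },
    {
        "Name": "Mr.Somasundaram.A",
        "Designation": "Proprietor",
        "Name of the Industry": "Everest Engineering Works",
        "Phone": "2435674",
        # No Email for this member, so that column stays empty.
    },
]

csv_text = rows_to_csv(records)
print(csv_text)
```

In the scraping loop, each record could presumably be built by pairing the text of every `<b>` label (with the trailing colon stripped) with the text of the `<span>` that follows it, instead of flattening all strings into one list. Keying by label is what keeps the columns aligned when a member has extra or missing entries.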