I wrote the following script to scrape some data from download.com:

from bs4 import BeautifulSoup
import urllib2
import csv

query = raw_input("Please enter the query value: ")
limit = raw_input("Please enter the results limit (1-100): ")
minreview = raw_input("Please enter the min user rating (1-5): ")
maxreview = raw_input("Please enter the max user rating (1-5): ")
csvname = raw_input("Enter a filename.csv for CSV output: ")

cnetFile = urllib2.urlopen("http://developer.api.cnet.com/rest/v1.0/softwareProductSearch?partKey=APIKEYGOESHERE&partTag=APIKEYGOESHERE&query=" + query + "$
cnetXml = cnetFile.read()
cnetFile.close()

soup = BeautifulSoup(cnetXml, features="xml")
#print soup.prettify()

f = csv.writer(open(csvname, "w"))
f.writerow(["Name", "Link", "Mfg", "Mfg Link", "Price", "Downloads", "User Rating Summary", "User Rating Product"])

data = soup.find_all(['Name', 'Price', 'TotalDownloads', 'LinkURL', 'Rating'])
#print data

for x in data:
        strip1 = x.contents
        print strip1
        f.writerow(strip1)

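The CSV output for a query that returns 2 products looks like this. (It should contain eight fields per product, matching the header row in the code, but occasionally a field is missing, for example the eighth field of the first product below.)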

Name,Link,Mfg,Mfg Link,Price,Downloads,User Rating Summary,User Rating Product

Firegraphic

http://www.download.com/firegraphic/3000-2192_4-10367545.html?tag=api

Firegraphic

http://www.firegraphic.com

$49.95

2546868

2.0



MP3 CD Maker

http://www.download.com/mp3-cd-maker/3000-2140_4-10065486.html?tag=api

ZY Computing

http://www.dvdsanta.com

$24.95

1653394

2.0

2.0

Here is a sample of the data in the soup variable:

<?xml version="1.0" encoding="utf-8"?>
<CNETResponse realm="cnet" version="1.0" xmlns="http://developer.api.cnet.com/re
st/v1.0/ns" xmlns:xlink="http://www.w3.org/1999/xlink">
<SoftwareProducts numFound="898" numReturned="2" start="0">
<SoftwareProduct id="11889531" setId="10367545" xlink:href="http://developer.api
.cnet.com/rest/v1.0/softwareProduct?productSetId=10367545&amp;iod=userRatings&am
p;partKey=572kjgq8h2mqbsup36cubkys&amp;partTag=572kjgq8h2mqbsup36cubkys">
<Name>Firegraphic</Name>
<Version>11.0</Version>
<LinkURL>http://www.download.com/firegraphic/3000-2192_4-10367545.html?tag=api</
LinkURL>
<Publisher id="6268727">
<Name>Firegraphic</Name>
<LinkURL>http://www.firegraphic.com</LinkURL>
<UrsRegId/>
</Publisher>
<License>Free to try</License>
<BetaRelease>false</BetaRelease>
<Price currency="USD">$49.95</Price>
<Summary>Import, organize, view, edit, print, and share your digital images.</Su
mmary>
<Description>&lt;p&gt;Firegraphic is an image viewer for photography professiona
ls, Web, and graphic designers to import, organize, view, edit, print, and share
 their digital images. The new Firegraphic has improved its memory usage and con
sumes very low memory, which leaves more memory for you to edit your photos in t
he image editor. Firegraphic now supports the RAW file formats from digital came
ras. Firegraphic gives you the ability to open multiple photos in the Viewer and
 compare photos side-by-side to choose your best shot. You also can customize th
e tools in your toolbar and the Context menu in the Viewer. The Firegraphic user
 interface lets you change the skin color and edit photos with a third-party ima
ge editor.&lt;/p&gt;</Description>
<WhatsNew/>
<Requirements> </Requirements>
<Platform>Windows</Platform>
<OperatingSystems>
<OperatingSystem id="3">Windows</OperatingSystem>
<OperatingSystem id="17">Windows 2000</OperatingSystem>
<OperatingSystem id="25">Windows XP</OperatingSystem>
<OperatingSystem id="43">Windows 2003</OperatingSystem>
<OperatingSystem id="52">Windows Vista</OperatingSystem>
<OperatingSystem id="133">Windows 7</OperatingSystem>
</OperatingSystems>
<EditorsRating outOf="5">3.0</EditorsRating>
<EditorsNote/>
<PreferredNode id="2192"/>
<WeeklyDownloads>8</WeeklyDownloads>
<TotalDownloads>2546868</TotalDownloads>
<CreatedDate>2011-04-21 17:41:19.0</CreatedDate>
<ReleaseDate>2011-04-21 00:00:00.0</ReleaseDate>
<ReviewDate>2008-11-09 00:00:00.0</ReviewDate>
<Limitations>30-day trial</Limitations>
<BuyNowUrl type=""> </BuyNowUrl>
<TrialPayUrl/>
<CleverBridgeUrl/>
<UpsellUnit/>
<ButtonPartner/>
<CNETContentIds/>
<FileSize>8358576</FileSize>
<Category id="2192" xlink:href="http://developer.api.cnet.com/rest/v1.0/category
?categoryId=2192&amp;siteId=4&amp;partKey=572kjgq8h2mqbsup36cubkys&amp;partTag=5
72kjgq8h2mqbsup36cubkys"/>
<UserRatingSummary>
<Rating outOf="5">2.0</Rating>
<TotalVotes>7</TotalVotes>
</UserRatingSummary>
<UserRatingProduct>
<Rating outOf="5"/>
<TotalVotes>0</TotalVotes>
</UserRatingProduct>
<EditorsPick/>
<ListingType>STANDARD</ListingType>
</SoftwareProduct>
<SoftwareProduct id="10296367" setId="10065486" xlink:href="http://developer.api
.cnet.com/rest/v1.0/softwareProduct?productSetId=10065486&amp;iod=userRatings&am
p;partKey=572kjgq8h2mqbsup36cubkys&amp;partTag=572kjgq8h2mqbsup36cubkys">
<Name>MP3 CD Maker</Name>
<Version>2.0</Version>
<LinkURL>http://www.download.com/mp3-cd-maker/3000-2140_4-10065486.html?tag=api<
/LinkURL>
<Publisher id="83016">
<Name>ZY Computing</Name>
<LinkURL>http://www.dvdsanta.com</LinkURL>
<UrsRegId/>
</Publisher>
<License>Free to try</License>
<BetaRelease>false</BetaRelease>
<Price currency="USD">$24.95</Price>
<Summary>Create audio CDs from your MP3 collection.</Summary>
<Description>&lt;p&gt;MP3 CD Maker works with a CD recorder to create audio CDs
from collections of MP3 audio files. It directly converts MP3 files into the CD
audio format and can decode any MP3 file into WAV or raw audio. A normalization
feature lets you ensure that all MP3s in a set have the same volume level. &lt;/
p&gt;&lt;p&gt;Version 2.0 adds support for 200 more CD-R/RW drives.&lt;/p&gt;</D
escription>
<WhatsNew/>
<Requirements>Windows 95/98/Me/NT/2000/XP</Requirements>
<Platform>Windows</Platform>
<OperatingSystems>
<OperatingSystem id="3">Windows</OperatingSystem>
<OperatingSystem id="5">Windows 95</OperatingSystem>
<OperatingSystem id="8">Windows NT</OperatingSystem>
<OperatingSystem id="6">Windows 98</OperatingSystem>
<OperatingSystem id="7">Windows Me</OperatingSystem>
<OperatingSystem id="17">Windows 2000</OperatingSystem>
<OperatingSystem id="25">Windows XP</OperatingSystem>
</OperatingSystems>
<EditorsRating outOf="5">4.0</EditorsRating>
<EditorsNote/>
<PreferredNode id="2140"/>
<WeeklyDownloads>103</WeeklyDownloads>
<TotalDownloads>1653394</TotalDownloads>
<CreatedDate>2004-06-16 19:07:46.0</CreatedDate>
<ReleaseDate>2004-06-16 00:00:00.0</ReleaseDate>
<ReviewDate>2009-02-27 00:00:00.0</ReviewDate>
<Limitations>limited to 4 songs on a CD</Limitations>
<BuyNowUrl type="dl_buy_ond">http://send.onenetworkdirect.net/z/126524/CD103284/
</BuyNowUrl>
<TrialPayUrl/>
<CleverBridgeUrl/>
<UpsellUnit/>
<ButtonPartner/>
<CNETContentIds/>
<FileSize>1283187</FileSize>
<Category id="2140" xlink:href="http://developer.api.cnet.com/rest/v1.0/category
?categoryId=2140&amp;siteId=4&amp;partKey=572kjgq8h2mqbsup36cubkys&amp;partTag=5
72kjgq8h2mqbsup36cubkys"/>
<UserRatingSummary>
<Rating outOf="5">2.0</Rating>
<TotalVotes>3</TotalVotes>
</UserRatingSummary>
<UserRatingProduct>
<Rating outOf="5">2.0</Rating>
<TotalVotes>3</TotalVotes>
</UserRatingProduct>
<EditorsPick/>
<ListingType>STANDARD</ListingType>
</SoftwareProduct>
</SoftwareProducts>
</CNETResponse>

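How can I fix my loop so that the data for the first product goes into the 8 columns, and each subsequent product starts on a new row with its data in the same columns?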

Thanks!


With Birei's help I was able to get the data, and I figured out how to start a new row after every 8 items returned, using this code:

strip1 = []
for y in data:
    strip1.extend(y.contents)
    print strip1
for x in xrange(0,len(strip1),8):
    f.writerow(strip1[x:x+8])

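The only problem I have left is that find_all for 'Rating' sometimes gets 2 ratings and sometimes only 1. That breaks my starting a new row every 8 items, because sometimes only 7 items are returned. If only 1 rating is returned, how can I print "None" for the second 'Rating'?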
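
To make the misalignment concrete, here is a minimal sketch using the values from the CSV output above. The helper name chunk and the two product lists are made up for illustration; the first product has only 7 values because its second Rating element is empty.

```python
def chunk(values, size):
    """Split a flat list into consecutive rows of `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


# Values taken from the sample output; product_1 lacks the eighth field.
product_1 = [
    "Firegraphic",
    "http://www.download.com/firegraphic/3000-2192_4-10367545.html?tag=api",
    "Firegraphic",
    "http://www.firegraphic.com",
    "$49.95",
    "2546868",
    "2.0",
]
product_2 = [
    "MP3 CD Maker",
    "http://www.download.com/mp3-cd-maker/3000-2140_4-10065486.html?tag=api",
    "ZY Computing",
    "http://www.dvdsanta.com",
    "$24.95",
    "1653394",
    "2.0",
    "2.0",
]

# Chunking the flat list by 8 pulls the second product's name into row 1.
rows = chunk(product_1 + product_2, 8)

# Padding the short product before chunking keeps the rows aligned.
padded_rows = chunk(product_1 + [None] + product_2, 8)
```

With the unpadded list, the first row ends with "MP3 CD Maker" and the second row has only 7 items, so every later product would be shifted by one column.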

1 Answer

Use writerow() for the data, just as you already do for the header. You don't need to convert anything, because the contents attribute returns a list:

for x in data:
    strip1 = x.contents
    f.writerow(strip1)
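
As a small illustration of how writerow() treats a list, here is a sketch that writes to an in-memory buffer instead of a file. It is written for Python 3 (the script in the question is Python 2), and the two-column header is shortened for the example.

```python
import csv
import io

# An in-memory text buffer stands in for the CSV file.
buf = io.StringIO()
writer = csv.writer(buf)

writer.writerow(["Name", "Price"])  # one call, one row with two fields
writer.writerow(["Firegraphic"])    # a one-item list becomes a one-field row
writer.writerow(["$49.95"])         # so each value lands on its own line

lines = buf.getvalue().splitlines()
```

Each call produces exactly one row, so a list holding a single value ends up as a row with a single field.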

EDIT: If the solution above doesn't work because contents returns one element at a time, try saving the elements to a list and writing it at the end:

strip1 = []
for x in data:
    strip1.extend(x.contents)
f.writerow(strip1)

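NEW EDIT: After looking at your xml file, my approach would be to iterate over each <SoftwareProduct> element and extract the fields you want from there, for example: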

for product in soup.find_all('SoftwareProduct'):
    strip1 = []
    strip1.extend(product.Name.contents)
    strip1.extend(product.Price.contents)
    ...
    f.writerow(strip1)
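
A fuller sketch of the same per-product idea, under some assumptions: it uses the standard library's xml.etree.ElementTree instead of BeautifulSoup so that it runs without extra packages, it targets Python 3, and the XML is a trimmed copy of the sample from the question. FIELDS and product_row are names made up for this example. A field that is absent or empty is written as the string "None", so every row keeps 8 columns.

```python
import xml.etree.ElementTree as ET

# The response declares a default namespace, so every path needs a prefix.
NS = {"c": "http://developer.api.cnet.com/rest/v1.0/ns"}

# Trimmed copy of the sample response from the question.
SAMPLE = """
<CNETResponse xmlns="http://developer.api.cnet.com/rest/v1.0/ns">
<SoftwareProducts numFound="898" numReturned="2" start="0">
<SoftwareProduct id="11889531" setId="10367545">
<Name>Firegraphic</Name>
<LinkURL>http://www.download.com/firegraphic/3000-2192_4-10367545.html?tag=api</LinkURL>
<Publisher id="6268727">
<Name>Firegraphic</Name>
<LinkURL>http://www.firegraphic.com</LinkURL>
</Publisher>
<Price currency="USD">$49.95</Price>
<TotalDownloads>2546868</TotalDownloads>
<UserRatingSummary><Rating outOf="5">2.0</Rating></UserRatingSummary>
<UserRatingProduct><Rating outOf="5"/></UserRatingProduct>
</SoftwareProduct>
<SoftwareProduct id="10296367" setId="10065486">
<Name>MP3 CD Maker</Name>
<LinkURL>http://www.download.com/mp3-cd-maker/3000-2140_4-10065486.html?tag=api</LinkURL>
<Publisher id="83016">
<Name>ZY Computing</Name>
<LinkURL>http://www.dvdsanta.com</LinkURL>
</Publisher>
<Price currency="USD">$24.95</Price>
<TotalDownloads>1653394</TotalDownloads>
<UserRatingSummary><Rating outOf="5">2.0</Rating></UserRatingSummary>
<UserRatingProduct><Rating outOf="5">2.0</Rating></UserRatingProduct>
</SoftwareProduct>
</SoftwareProducts>
</CNETResponse>
"""

# One path per CSV column, relative to a <SoftwareProduct> element.
FIELDS = [
    "c:Name",
    "c:LinkURL",
    "c:Publisher/c:Name",
    "c:Publisher/c:LinkURL",
    "c:Price",
    "c:TotalDownloads",
    "c:UserRatingSummary/c:Rating",
    "c:UserRatingProduct/c:Rating",
]


def product_row(product):
    """Return one value per entry in FIELDS, using "None" for missing data."""
    row = []
    for path in FIELDS:
        node = product.find(path, NS)
        text = node.text if node is not None else None
        row.append(text if text else "None")
    return row


root = ET.fromstring(SAMPLE.strip())
rows = [product_row(p) for p in root.findall(".//c:SoftwareProduct", NS)]
```

Each entry in rows can then be passed to f.writerow(). Because the lookup is done per product and per field, a missing rating no longer shifts the following values.
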
answered 2013-11-01T20:58:40.280