I am trying to collect all the data about a house using Zillow's API. I get some of the fields, but others come back empty.

Here is my Python code:

from bs4 import BeautifulSoup
import requests
import urllib, urllib2
import csv


url = requests.get("https://raw.github.com/rfarley90/random/master/zillowresults.html")
pageText = url.text
soup = BeautifulSoup(pageText)

useCode = soup.find('useCode')
taxAssessmentYear = soup.find('taxAssessmentYear')
taxAssessment = soup.find('taxAssessment')
yearBuilt = soup.find('yearBuilt')
lotSizeSqFt = soup.find('lotSizeSqFt')
finishedSqFt = soup.find('finishedSqFt')
bathrooms = soup.find('bathrooms')
lastSoldDate = soup.find('lastSoldDate')
lastSoldPrice = soup.find('lastSoldPrice')
zestimate = soup.find('zestimate')
amount = soup.find('amount')
lastupdated = soup.find('last-updated')
valueChangeduration = soup.find('valueChange')
valuationRange = soup.find('valuationRange')
lowcurrency = soup.find('low')
highcurrency = soup.find('high')
percentile = soup.find('percentile')
localRealEstate = soup.find('localRealEstate')
region = soup.find('region')
links = soup.find('links')
overview = soup.find('overview')
forSaleByOwner = soup.find('forSaleByOwner')
forSale = soup.find('forSale')




array = [
            ['useCode ' , useCode],
            ['taxAssessmentYear ' , taxAssessmentYear],
            ['taxAssessment ' , taxAssessment],
            ['yearBuilt ' , yearBuilt],
            ['lotSizeSqFt ' , lotSizeSqFt],
            ['finishedSqFt ' , finishedSqFt],
            ['bathrooms ' , bathrooms],
            ['lastSoldDate ' , lastSoldDate],
            ['lastSoldPrice ' , lastSoldPrice],
            ['zestimate ' , zestimate],
            ['amount ' , amount],
            ['lastupdated ' , lastupdated],
            ['valueChangeduration ' , valueChangeduration],
            ['valuationRange ' , valuationRange],
            ['lowcurrency ' , lowcurrency],
            ['highcurrency ' , highcurrency],
            ['percentile ' , percentile],
            ['localRealEstate ' , localRealEstate],
            ['region ' , region],
            ['links ' , links],
            ['overview ' , overview],
            ['forSaleByOwner ' , forSaleByOwner],
            ['forSale ' , forSale]]


for x in array:
    print x

The result I get has a lot of missing values, as shown below:

['useCode ', None]
['taxAssessmentYear ', None]
['taxAssessment ', None]
['yearBuilt ', None]
['lotSizeSqFt ', None]
['finishedSqFt ', None]
['bathrooms ', <bathrooms>2.0</bathrooms>]
['lastSoldDate ', None]
['lastSoldPrice ', None]
['zestimate ', <zestimate>
<amount currency="USD">977262</amount>
<last-updated>01/23/2014</last-updated>
<oneweekchange deprecated="true">
<valuechange currency="USD" duration="30">-25723</valuechange>
<valuationrange>
<low currency="USD">928399</low>
<high currency="USD">1055443</high>
</valuationrange>
<percentile>0</percentile>
</oneweekchange></zestimate>]
['amount ', <amount currency="USD">977262</amount>]
['lastupdated ', <last-updated>01/23/2014</last-updated>]
['valueChangeduration ', None]
['valuationRange ', None]
['lowcurrency ', <low currency="USD">928399</low>]
['highcurrency ', <high currency="USD">1055443</high>]
['percentile ', <percentile>0</percentile>]
['localRealEstate ', None]
['region ', <region id="46465" name="Mc Lean" type="city">
<links>
<overview>
http://www.zillow.com/local-info/VA-Mc-Lean/r_46465/
</overview>
<forsalebyowner>http://www.zillow.com/mc-lean-va/fsbo/</forsalebyowner>
<forsale>http://www.zillow.com/mc-lean-va/</forsale>
</links>
</region>]
['links ', <links>
<homedetails>
http://www.zillow.com/homedetails/6870-Churchill-Rd-Mc-Lean-VA-22101/51751742_zpid/
</homedetails>
<graphsanddata>
http://www.zillow.com/homedetails/6870-Churchill-Rd-Mc-Lean-VA-22101/51751742_zpid/#charts-and-data
</graphsanddata>
<mapthishome>http://www.zillow.com/homes/51751742_zpid/</mapthishome>
<comparables>http://www.zillow.com/homes/comps/51751742_zpid/</comparables>
</links>]
['overview ', <overview>
http://www.zillow.com/local-info/VA-Mc-Lean/r_46465/
</overview>]
['forSaleByOwner ', None]
['forSale ', None]
[Finished in 0.6s]

Any ideas on what is causing this?

2 Answers

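By default, BeautifulSoup parses the page as HTML and coerces all tag names to lowercase. You can see this in your result data above: the region tag contains forsalebyowner and forsale as part of its contents, whereas they are forSaleByOwner and forSale in the original data.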
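To make the lowercasing concrete, here is a minimal Python 3 sketch that parses a tiny inline fragment with the standard-library-backed "html.parser". The fragment is only a stand-in modelled on the output above, not the real page, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Stand-in fragment modelled on the question's output (not the real page).
snippet = (
    "<links>"
    "<forSaleByOwner>http://www.zillow.com/mc-lean-va/fsbo/</forSaleByOwner>"
    "</links>"
)

# An HTML parser treats tag names as case-insensitive and stores them lowercased.
soup = BeautifulSoup(snippet, "html.parser")

camel_case_hit = soup.find("forSaleByOwner")   # no tag is stored under this name
lower_case_hit = soup.find("forsalebyowner")   # the name the parser actually kept

print(camel_case_hit)
print(lower_case_hit.name, lower_case_hit.text)
```

The camelCase lookup finds nothing because the tree only contains the lowercased name, which matches the None values in the question.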

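Thankfully, you can override this behaviour by telling BeautifulSoup to use its XML parser when you create the object, but you need to trim off some non-XML page content before doing so: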

url = requests.get("https://raw.github.com/rfarley90/random/master/zillowresults.html")
pageText = url.text.split('\n')
# exclude initial text & end comment
pageXML = ''.join( pageText[1:pageText.index(u'<!--')] )
soup = BeautifulSoup(pageXML, "xml")
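
As a rough, self-contained illustration of the same idea (Python 3, and assuming lxml is installed, which BeautifulSoup's "xml" parser needs), the sketch below works on a small inline string instead of the downloaded page. The page_text layout (one leading non-XML line, the XML, then a trailing comment) is an assumption based on the trimming step above, and the sample fields are only a stand-in for the actual response.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real file may differ.
page_text = """Some leading text that is not XML
<response>
<results><result>
<useCode>SingleFamily</useCode>
<bathrooms>2.0</bathrooms>
<forSaleByOwner>http://www.zillow.com/mc-lean-va/fsbo/</forSaleByOwner>
</result></results>
</response>
<!--
trailing comment
-->"""

lines = page_text.split("\n")
# Drop the first line and everything from the comment opener onwards.
page_xml = "\n".join(lines[1:lines.index("<!--")])

# The "xml" parser keeps tag names exactly as written in the document.
soup = BeautifulSoup(page_xml, "xml")

use_code = soup.find("useCode")
fsbo = soup.find("forSaleByOwner")
print(use_code.text, fsbo.text)
```

With the XML parser the original camelCase names work, and the lowercase spellings no longer match.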
answered 2014-01-28T02:07:36.207

With the default parser, BeautifulSoup stores tag names in lowercase, so find queries have to use lowercase names:

>>> url = requests.get("https://raw.github.com/rfarley90/random/master/zillowresults.html")
>>> soup = BeautifulSoup(url.text)
>>> soup.find('usecode')
<usecode>SingleFamily</usecode>
>>> soup.find('usecode').text
u'SingleFamily'

Or:

>>> soup.response.results.result.usecode
<usecode>SingleFamily</usecode>
>>> soup.response.results.result.usecode.text
u'SingleFamily'
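
If you would rather keep the default HTML parsing, one option is to lowercase each name before looking it up and to guard against tags that are missing. This is a minimal Python 3 sketch under that assumption; field_text is a hypothetical helper, and the inline sample stands in for the real page.

```python
from bs4 import BeautifulSoup

# Stand-in for the real response (assumption: same nesting as the lookups above).
sample = (
    "<response><results><result>"
    "<useCode>SingleFamily</useCode>"
    "<bathrooms>2.0</bathrooms>"
    "</result></results></response>"
)
soup = BeautifulSoup(sample, "html.parser")


def field_text(soup, name):
    """Return the stripped text of the first tag called `name`, or None.

    The name is lowercased first because the HTML parser stores tag
    names in lowercase.
    """
    tag = soup.find(name.lower())
    return tag.get_text(strip=True) if tag is not None else None


# Field names can stay in the camelCase spelling used by the API.
field_names = ["useCode", "bathrooms", "yearBuilt"]
values = {name: field_text(soup, name) for name in field_names}

for name, value in values.items():
    print(name, value)
```

Fields that are absent from the document come back as None instead of raising an AttributeError on .text.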
answered 2014-01-28T02:06:31.757