
I've been spending a lot of time trying to figure out this issue, but unfortunately I've had no luck yet. I hope someone can point me in the right direction.

I'm trying to extract the address elements for all properties listed in the link provided in the script below. However, the output is always an empty list. I've tried different variations but none worked.

lapply(c('XML','httr'),require,character.only=TRUE)
link <- "http://www.realtor.ca/Map.aspx?CultureId=1&ApplicationId=1&RecordsPerPage=9&MaximumResults=9&PropertyTypeId=300&TransactionTypeId=2&SortOrder=A&SortBy=1&LongitudeMin=-114.52066040039104&LongitudeMax=-113.60536193847697&LatitudeMin=50.94776904194829&LatitudeMax=51.14246522072541&PriceMin=0&PriceMax=0&BedRange=0-0&BathRange=0-0&ParkingSpaceRange=0-0&viewState=m&Longitude=-114.063011169434&Latitude=51.0452194213867&ZoomLevel=11&CurrentPage=1#CultureId=1&ApplicationId=1&RecordsPerPage=9&MaximumResults=9&PropertyTypeId=300&TransactionTypeId=2&SortOrder=A&SortBy=1&LongitudeMin=-114.9913558959965&LongitudeMax=-113.1346664428715&LatitudeMin=50.91552869934793&LatitudeMax=51.1745480567661&PriceMin=0&PriceMax=0&BedRange=0-0&BathRange=0-0&ParkingSpaceRange=0-0&viewState=l&Longitude=-114.063011169434&Latitude=51.0452194213867&ZoomLevel=11&CurrentPage=1"
doc <- htmlTreeParse(link,useInternalNodes = T)
addresses <- xpathSApply(doc,"//div[@id='listView']//span",xmlValue)

The output of addresses is as follows:

> addresses
list()

In fact, I was not able to fetch any of the other HTML elements from the link above. I wonder if it's because the page takes a while to load, whereas GET {httr} and htmlTreeParse {XML} scrape the webpage instantly without giving it a chance to fully load first. I'm not sure if my reasoning makes sense. I would appreciate the community's assistance with this issue.
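For example, here is a quick check I sketched (the search string is just a sample of text that shows up in the listings when the page is viewed in a browser) to see whether that content is present in the fetched HTML at all:

raw_html <- content(GET(link), as = "text")          # static HTML; no JavaScript is executed
grepl("Calgary, Alberta", raw_html, fixed = TRUE)    # FALSE would suggest the listings are injected later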


2 Answers


The site is using an AJAX POST call to http://www.realtor.ca/api/Listing.svc/PropertySearch_Post to dynamically retrieve the data for the resulting list. You'll need to do the same thing to get the raw data, then extract the addresses from the resulting R list structure:

library(httr)

params <- list(CultureId=1,
               ApplicationId=1,
               RecordsPerPage=9,
               MaximumResults=9,
               PropertyTypeId=300,
               TransactionTypeId=2,
               SortOrder="A",
               SortBy=1,
               LongitudeMin="-114.9913558959965",
               LongitudeMax="-113.1346664428715",
               LatitudeMin="50.91552869934793",
               LatitudeMax="51.1745480567661",
               PriceMin=0,
               PriceMax=0,
               BedRange="0-0",
               BathRange="0-0",
               ParkingSpaceRange="0-0",
               viewState="l",
               Longitude="-114.063011169434",
               Latitude="51.0452194213867",
               ZoomLevel=11,
               CurrentPage=1)


# POST the same parameters the site's JavaScript sends (form-encoded)
pg <- POST("http://www.realtor.ca/api/Listing.svc/PropertySearch_Post",
           body=params, encode="form")

# parse the JSON response into an R list
data <- content(pg)

# pull the address text out of each listing result
sapply(data$Results, function(x) { x$Property$Address$AddressText })

## [1] "# 297 6220 17 AV SE|Penbrooke, Calgary, Alberta T2A0W6"         
## [2] "# 298 6220 17 AV|Redcarpet Mountview, Calgary, Alberta T2A0W6"  
## [3] "10 VILLAGE WY|Westpark Village, Strathmore, Alberta T1P1A2"     
## [4] "51 Village WY|Downtown Strathmore, Strathmore, Alberta T1P1A2"  
## [5] "# 324 6220 17 AV SE|Penbrooke, Calgary, Alberta T2A7H4"         
## [6] "# 345 6220 17 AV SE|Penbrooke, Calgary, Alberta T2A7H4"         
## [7] "# 28 6724 17 AV SE|Redcarpet Mountview, Calgary, Alberta T2A0W5"
## [8] "# 328 6220 17 AV SE|Penbrooke, Calgary, Alberta T2A7H4"         
## [9] "# 253 99 Arbour Lake RD NW|Arbour Lake, Calgary, Alberta T3G4E4"
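
If you want the street part and the community/city part as separate columns, here's a minimal sketch (assuming the "|" separator shown above is consistent across listings):

addr  <- sapply(data$Results, function(x) x$Property$Address$AddressText)
parts <- strsplit(addr, "|", fixed = TRUE)           # split on the literal pipe

addresses <- data.frame(street = sapply(parts, `[`, 1),
                        area   = sapply(parts, `[`, 2),
                        stringsAsFactors = FALSE)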

Caveat Scraper

I feel compelled to point out that using this code violates the copyright notice of the site:

This database and all materials on this website are protected by copyright laws and are owned by CREA, or by the member who has supplied the data, and/or by other third parties. Property listings and other data available on this website are intended for the private, non-commercial use by individuals. Any commercial use of the listings or data in whole or in part, directly or indirectly, is specifically forbidden except with the prior written authority of the owner of the copyright.

Users may, subject to these Terms of Use, print or otherwise save individual pages for private use. However, property listings and/or data may not be modified or altered in any respect, merged with other data or published in any form, in whole or in part. The prohibited uses include "screen scraping", "database scraping" and any other activity intended to collect, store, reorganize or manipulate data on the pages produced by, or displayed on the CREA websites.

and that using this code would definitely be a violation of said TOS.

I'm only mentioning it since orgs like Realtor.[com|ca] do look for said scraping activities and trace them back to IP addresses.

answered 2014-12-03T09:25:10.443

I checked your code: it is correct. The problem is that the HTML page you have to deal with is not well formed. For this reason, XPath is not able to obtain the addresses.

Either you have control over this page and/or the way it is generated, in which case you should make it well formed.

Or you don't, in which case you have to follow another approach: load the HTML page as a string and extract the addresses using either substrings or, better, regular expressions.
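
A minimal sketch of that string-based approach, assuming the addresses actually appear in the raw HTML (the pattern below is only illustrative):

library(httr)

raw_html <- content(GET(link), as = "text")   # page source as a single string

# illustrative pattern: text ending in an Alberta postal code such as "T2A0W6"
pattern <- "[^<>|\"]+, Alberta T[0-9][A-Z][0-9][A-Z][0-9]"
regmatches(raw_html, gregexpr(pattern, raw_html))[[1]]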

answered 2014-12-03T09:22:36.170