I have my code working. Now I want to modify it to fetch data from multiple URLs, where the URLs differ by only one word.

Here is my code, which fetches from only one URL.

from string import punctuation, whitespace
import urllib2
import datetime
import re
from bs4 import BeautifulSoup as Soup
import csv
today = datetime.date.today()
html = urllib2.urlopen("http://www.99acres.com/property-in-velachery-chennai-south-ffid").read()

soup = Soup(html)
print "INSERT INTO `property` (`date`,`Url`,`Rooms`,`place`,`PId`,`Phonenumber1`,`Phonenumber2`,`Phonenumber3`,`Typeofperson`,` Nameofperson`,`typeofproperty`,`Sq.Ft`,`PerSq.Ft`,`AdDate`,`AdYear`)"
print 'VALUES'
re_digit = re.compile('(\d+)')
properties = soup.findAll('a', title=re.compile('Bedroom'))

for eachproperty in soup.findAll('div', {'class':'sT'}):
  a      = eachproperty.find('a', title=re.compile('Bedroom'))
  pdate  = eachproperty.find('i', {'class':'pdate'})
  pdates = re.sub('(\s{2,})', ' ', pdate.text)
  div    = eachproperty.find('div', {'class': 'sT_disc grey'})
  try:
    project = div.find('span').find('b').text.strip()
  except:
    project = 'NULL'        
  area = re.findall(re_digit, div.find('i', {'class': 'blk'}).text.strip())
  print ' ('
  print today,","+ (a['href'] if a else '`NULL`')+",", (a.string if a else 'NULL, NULL')+ "," +",".join(re.findall("'([a-zA-Z0-9,\s]*)'", (a['onclick'] if a else 'NULL, NULL, NULL, NULL, NULL, NULL')))+","+ ", ".join([project] + area),","+pdates+""
  print ' ), '

These are the URLs I want to fetch in the same run:

http://www.99acres.com/property-in-velachery-chennai-south-ffid
http://www.99acres.com/property-in-thoraipakkam-chennai-south-ffid
http://www.99acres.com/property-in-madipakkam-chennai-south-ffid

As you can see, only one word differs from one URL to the next.

I am trying to create an array like this:

for locality in areas (http://www.99acres.com/property-in-velachery-chennai-south-ffid
, http://www.99acres.com/property-in-thoraipakkam-chennai-south-ffid,    http://www.99acres.com/property-in-madipakkam-chennai-south-ffid):
link = "str(locality)"
html = urllib2.urlopen(link)
soup = Soup(html)

This doesn't seem to work. What I actually want is to pass just the one word into the URL, like this:

for locality in areas(madipakkam, thoraipakkam, velachery):
    link = “http://www.99acres.com/property-in-+ str(locality)+-chennai-south-ffid"
    html= urllib2.urlopen(link)
    soup = BeautifulSoup(html)

I hope I've made myself clear.

1 Answer

This:

for locality in areas (http://www.99acres.com/property-in-velachery-chennai-south-ffid, http://www.99acres.com/property-in-thoraipakkam-chennai-south-ffid,    http://www.99acres.com/property-in-madipakkam-chennai-south-ffid):
link = "str(locality)"

... won't work, for a number of reasons.

First, you're calling a function named areas that is never defined anywhere. And I'm not sure what you would expect that function to do.

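Second, you're trying to pass http://www.99acres.com/property-in-velachery-chennai-south-ffid to it as if it were a meaningful Python expression, when it isn't even parseable. If you want to pass a string, you have to put it in quotes.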

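Third, "str(locality)" is the literal string str(locality). If you want to call the str function on the locality variable, don't put quotes around it. But really, there's no reason to call str at all, because locality is already a string.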

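Finally, you didn't indent the body of the for loop. You have to indent the link = line, along with everything you previously did at the top level, so that it all belongs to the for. That way, it runs once for each value in the loop, instead of just once in total after the whole loop has finished.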
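To make the third point concrete, here is a minimal sketch of the difference between the quoted text and the actual call. The variable names quoted and called are just illustrative, and print is written with parentheses and a single argument so the snippet behaves the same on Python 2 and Python 3.

```python
locality = "velachery"

# With quotes, this is just a string of characters; nothing is called.
quoted = "str(locality)"

# Without quotes, str is really called on the variable. Since locality
# is already a string, the call gives back an equal string.
called = str(locality)

print(quoted)
print(called)
```

The first print shows the text str(locality) itself, which is exactly what urlopen was being handed in place of a URL.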

Try this:

for link in ("http://www.99acres.com/property-in-velachery-chennai-south-ffid",
             "http://www.99acres.com/property-in-thoraipakkam-chennai-south-ffid",
             "http://www.99acres.com/property-in-madipakkam-chennai-south-ffid"):
    # all the stuff you do for each URL
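As a rough sketch of what the indentation buys you, the snippet below runs an indented body once per URL and an unindented line once after the loop. The process function is a hypothetical stand-in for your download-and-parse code (it makes no network request and only pulls the locality word out of the URL), so treat it as an illustration rather than the real scraping logic.

```python
links = ("http://www.99acres.com/property-in-velachery-chennai-south-ffid",
         "http://www.99acres.com/property-in-thoraipakkam-chennai-south-ffid",
         "http://www.99acres.com/property-in-madipakkam-chennai-south-ffid")


def process(link):
    # Hypothetical placeholder for the real work (urlopen, Soup, prints).
    # It returns the third hyphen-separated piece, which is the locality.
    return link.split("-")[2]


results = []
for link in links:
    # Indented: this line runs once for each URL.
    results.append(process(link))

# Not indented: this line runs a single time, after the loop has finished.
print(len(results))
```

If the append line were not indented, Python would reject the loop for having no body, and even with a body above it, the line would only run once with the last link.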

You were on the right track here:

for locality in areas(madipakkam, thoraipakkam, velachery):
link = “http://www.99acres.com/property-in-+ str(locality)+-chennai-south-ffid"

Using a "template string" to avoid repeating yourself is almost always a good idea.

But again, there are a number of problems.

First, you're again calling an areas function that doesn't exist, and trying to use bare strings with no quotes around them.

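Second, you have the opposite of the previous problem: you're trying to put + str(locality) +, an expression you want evaluated, in the middle of a string. You need to break the string into two separate strings, each of which can be an operand of the + expression.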

And again, you haven't indented the loop body, and you're calling str unnecessarily.

So:

for locality in "velachery", "thoraipakkam", "madipakkam":
    link = "http://www.99acres.com/property-in-" + locality + "-chennai-south-ffid"
    # all the stuff you do for each URL
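If it helps to see the two side by side, here is a small sketch comparing a string that merely contains the text of the expression with a real concatenation. The names inside and joined are illustrative only.

```python
locality = "madipakkam"

# Everything between the quotes is literal text, including the + signs
# and the words str(locality); nothing here is evaluated.
inside = "http://www.99acres.com/property-in-+ str(locality)+-chennai-south-ffid"

# Two separate string literals with the variable between them, joined by +.
joined = "http://www.99acres.com/property-in-" + locality + "-chennai-south-ffid"

print(inside)
print(joined)
```

Only the second line produces a URL with the locality filled in.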

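While we're at it, your code is usually easier to read, and it's easier to make sure you haven't gotten anything wrong, when you use a formatting function instead of trying to concatenate strings together. For example: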

for locality in "velachery", "thoraipakkam", "madipakkam":
    link = "http://www.99acres.com/property-in-{}-chennai-south-ffid".format(locality)
    # all the stuff you do for each URL

Here, it's obvious where each locality fits into the string, what the string will look like, where the hyphens go, and so on.
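One possible way to take this a step further, sketched under the assumption that you'd like the template kept in one place: the names URL_TEMPLATE and build_url below are made up for this example and are not part of your code or of any library.

```python
# Hypothetical names, for illustration only.
URL_TEMPLATE = "http://www.99acres.com/property-in-{}-chennai-south-ffid"


def build_url(locality):
    """Fill the locality into the one spot where the URLs differ."""
    return URL_TEMPLATE.format(locality)


localities = ["velachery", "thoraipakkam", "madipakkam"]
links = [build_url(locality) for locality in localities]

for link in links:
    # In the real script, the download-and-parse code would go here.
    print(link)
```

Adding another locality then means adding one word to the list, and the URL pattern only has to be correct in a single place.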

answered 2013-09-16T06:20:26.350