python - Scraping from multiple URLs with BeautifulSoup

Question

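I have my code working. Now I want to modify it to fetch data from multiple URLs, where the URLs differ by only one word.

Here is my code, which fetches from only a single URL.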

from string import punctuation, whitespace
import urllib2
import datetime
import re
from bs4 import BeautifulSoup as Soup
import csv
today = datetime.date.today()
html = urllib2.urlopen("http://www.99acres.com/property-in-velachery-chennai-south-ffid").read()

soup = Soup(html)
print "INSERT INTO `property` (`date`,`Url`,`Rooms`,`place`,`PId`,`Phonenumber1`,`Phonenumber2`,`Phonenumber3`,`Typeofperson`,` Nameofperson`,`typeofproperty`,`Sq.Ft`,`PerSq.Ft`,`AdDate`,`AdYear`)"
print 'VALUES'
re_digit = re.compile('(\d+)')
properties = soup.findAll('a', title=re.compile('Bedroom'))

for eachproperty in soup.findAll('div', {'class':'sT'}):
  a      = eachproperty.find('a', title=re.compile('Bedroom'))
  pdate  = eachproperty.find('i', {'class':'pdate'})
  pdates = re.sub('(\s{2,})', ' ', pdate.text)
  div    = eachproperty.find('div', {'class': 'sT_disc grey'})
  try:
    project = div.find('span').find('b').text.strip()
  except:
    project = 'NULL'        
  area = re.findall(re_digit, div.find('i', {'class': 'blk'}).text.strip())
  print ' ('
  print today,","+ (a['href'] if a else '`NULL`')+",", (a.string if a else 'NULL, NULL')+ "," +",".join(re.findall("'([a-zA-Z0-9,\s]*)'", (a['onclick'] if a else 'NULL, NULL, NULL, NULL, NULL, NULL')))+","+ ", ".join([project] + area),","+pdates+""
  print ' ), '

These are the URLs I want to fetch in one go:

http://www.99acres.com/property-in-velachery-chennai-south-ffid
http://www.99acres.com/property-in-thoraipakkam-chennai-south-ffid
http://www.99acres.com/property-in-madipakkam-chennai-south-ffid

So you can see that only one word differs between the URLs.

I am trying to create an array like the one below:

for locality in areas (http://www.99acres.com/property-in-velachery-chennai-south-ffid
, http://www.99acres.com/property-in-thoraipakkam-chennai-south-ffid,    http://www.99acres.com/property-in-madipakkam-chennai-south-ffid):
link = "str(locality)"
html = urllib2.urlopen(link)
soup = Soup(html)

This doesn't seem to work. What I actually want is to pass just one word into the URL, like this:

for locality in areas(madipakkam, thoraipakkam, velachery):
    link = “http://www.99acres.com/property-in-+ str(locality)+-chennai-south-ffid"
    html= urllib2.urlopen(link)
    soup = BeautifulSoup(html)

I hope I've made myself clear.

score 2 · Accepted Answer

This:

for locality in areas (http://www.99acres.com/property-in-velachery-chennai-south-ffid, http://www.99acres.com/property-in-thoraipakkam-chennai-south-ffid,    http://www.99acres.com/property-in-madipakkam-chennai-south-ffid):
link = "str(locality)"

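... won't work, for several reasons.

First, you're calling a function, areas, that is never defined anywhere. And I'm not sure what you expect that function to do.

Second, you're trying to pass http://www.99acres.com/property-in-velachery-chennai-south-ffid to it as if that were a meaningful Python expression, when it isn't even parseable. If you want to pass a string, you have to put it in quotes.

Third, "str(locality)" is the literal string str(locality). If you want to call the str function on the locality variable, don't put quotes around it. But really, there's no reason to call str at all; locality is already a string.

Finally, you haven't indented the body of the for loop. You have to indent the link = line, and everything you previously did at the top level, so that it belongs to the for. That way, it happens once for each value in the loop, rather than once in total after the whole loop has finished.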
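The third point is easy to check in an interpreter. A small sketch, using a sample locality value taken from the question's URLs (the names quoted and unquoted are only for illustration):

```python
locality = "velachery"  # sample value taken from the question's URLs

quoted = "str(locality)"   # a literal string; nothing is called here
unquoted = str(locality)   # calls str, which returns the string unchanged

print(quoted)
print(unquoted)
print(unquoted == locality)
```

The quoted version never looks at the variable at all, which is why urlopen ends up being handed text that is not a URL.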

Try this:

for link in ("http://www.99acres.com/property-in-velachery-chennai-south-ffid",
             "http://www.99acres.com/property-in-thoraipakkam-chennai-south-ffid",
             "http://www.99acres.com/property-in-madipakkam-chennai-south-ffid"):
    # all the stuff you do for each URL
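One way to fill in the placeholder comment is to move the per-URL work into a function and call it from the loop body. A minimal sketch, assuming a hypothetical scrape function that stands in for the fetching and parsing code from the question (here it only labels the URL, so the example runs without network access):

```python
URLS = (
    "http://www.99acres.com/property-in-velachery-chennai-south-ffid",
    "http://www.99acres.com/property-in-thoraipakkam-chennai-south-ffid",
    "http://www.99acres.com/property-in-madipakkam-chennai-south-ffid",
)


def scrape(link):
    # Hypothetical stand-in: the real version would open the URL,
    # build the soup, and print the rows, as in the question's code.
    return "scraped " + link


results = []
for link in URLS:
    # Indented, so this runs once per URL.
    results.append(scrape(link))

# Not indented, so this runs once, after the loop has finished.
print(len(results))
```

Keeping the per-URL work in one function also makes it harder to leave a line outside the loop by accident.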

You're on the right track with this:

for locality in areas(madipakkam, thoraipakkam, velachery):
link = “http://www.99acres.com/property-in-+ str(locality)+-chennai-south-ffid"

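Using a "template string" to avoid repeating yourself is almost always a good idea.

But again, there are a number of problems.

First, you're again calling a function, areas, that doesn't exist, and trying to use bare strings without quotes.

Second, you have the opposite of the previous problem: you've put + and str(locality), which you want evaluated as expressions, in the middle of a string. You need to split it into two separate strings that can be part of the + expression.

And once again, you haven't indented the loop body, and you're calling str unnecessarily.

So: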

for locality in "velachery", "thoraipakkam", "madipakkam":
    link = "http://www.99acres.com/property-in-" + locality + "-chennai-south-ffid"
    # all the stuff you do for each URL

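While we're at it, your code is usually easier to read, and it's easier to make sure you haven't made a mistake, when you use a formatting function instead of trying to concatenate strings together. For example: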

for locality in "velachery", "thoraipakkam", "madipakkam":
    link = "http://www.99acres.com/property-in-{}-chennai-south-ffid".format(locality)
    # all the stuff you do for each URL

Here, it's obvious where each locality fits into the string, what the string will look like, where the hyphens go, and so on.
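The template can also be pulled out into a named constant so that the shape of the URL is defined in exactly one place. A minimal sketch, assuming the hypothetical names URL_TEMPLATE and build_url, neither of which appears in the original code:

```python
# Hypothetical constant: the part of the URL shared by every locality.
URL_TEMPLATE = "http://www.99acres.com/property-in-{}-chennai-south-ffid"


def build_url(locality):
    # Substitute the locality name into the single {} placeholder.
    return URL_TEMPLATE.format(locality)


links = [build_url(name) for name in ("velachery", "thoraipakkam", "madipakkam")]
for link in links:
    print(link)
```

Adding another locality then means adding one word to the tuple rather than another full URL.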
