python - URL error in Beautiful Soup

Question

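I am trying to use BeautifulSoup to get the PID and price of postings from Craigslist. I wrote a separate script that produces the file CLallsites.txt. In this code, I am trying to read each site from that txt file and get the PIDs of all entries on the first 10 pages. My code is: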

  from bs4 import BeautifulSoup       
  from urllib2 import urlopen 
  readfile = open("CLallsites.txt")
  product = "mcy"
  while 1:
    u = ""
    count = 0
    line = readfile.readline()
    commaposition = line.find(',')
    site = line[0:commaposition]
    location = line[commaposition+1:]
    site_filename = location + '.txt'
    f = open(site_filename, "a")
    while (count < 10):
       sitenow = site + "\\" + product + "\\" + str(u)
       html = urlopen(str(sitenow))                      
       soup = BeautifulSoup(html)                
       postings = soup('p',{"class":"row"})
       for post in postings:
            y = post['data-pid']
            print y
       count = count +1
       index = count*100
       u = "index" + str(index) + ".html"
    if not line:
        break
    pass

My CLallsites.txt looks like this:

craigslist site, location (Stack Overflow does not allow posts containing Craigslist links, so I cannot show the actual text. I can try to attach the text file if that helps.)

When I run the code, I get the following error:

  Traceback (most recent call last):
    File "reading.py", line 16, in <module>
      html = urlopen(str(sitenow))
    File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
      return _opener.open(url, data, timeout)
    File "/usr/lib/python2.7/urllib2.py", line 400, in open
      response = self._open(req, data)
    File "/usr/lib/python2.7/urllib2.py", line 418, in _open
      '_open', req)
    File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
      result = func(*args)
    File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
      return self.do_open(httplib.HTTPConnection, req)
    File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
      raise URLError(err)
  urllib2.URLError:

Any ideas about what I am doing wrong?

score 0 · Accepted Answer

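I don't know what sitenow contains, but it looks like an invalid URL. Note that URLs use forward slashes, not backslashes, so the statement should be something like sitenow = site + "/" + product + "/" + str(u).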
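
To check what sitenow actually contains, it may help to build the URLs in a small function and print them with repr() before calling urlopen. The following is only a sketch under assumptions: build_page_urls is a hypothetical helper, "http://example.org" is a placeholder rather than a real Craigslist site, and the index100.html, index200.html, ... pattern is taken from the question's code. It also strips surrounding whitespace, since a line returned by readline() usually ends with a newline.

```python
def build_page_urls(site, product, pages=10, step=100):
    """Return the listing URLs for the first `pages` pages of one site."""
    # Drop stray whitespace/newlines and any trailing slash so that
    # joining with "/" never yields a doubled or missing separator.
    base = site.strip().rstrip("/")
    urls = []
    for page in range(pages):
        # Page 0 is the bare listing; later pages follow the
        # index100.html, index200.html, ... pattern from the question.
        suffix = "" if page == 0 else "index%d.html" % (page * step)
        # URLs are joined with forward slashes, never backslashes.
        urls.append("/".join([base, product, suffix]))
    return urls


if __name__ == "__main__":
    # "http://example.org" is a placeholder, not a real Craigslist site.
    for url in build_page_urls("http://example.org\n", "mcy", pages=3):
        # repr() makes hidden characters such as "\n" or "\\" visible.
        print(repr(url))
```

Each returned string can then be passed to urlopen in place of sitenow. If a printed URL still shows a backslash or a newline, the problem is in the input file rather than in the joining step.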
