python - Iterating over multiple URLs from a .txt file with Python/BeautifulSoup

Question

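I'm trying to write a script that takes a .txt file containing one YouTube username per line, appends each username to the YouTube user page URL, and then scrapes the resulting page for profile data.

The code below gets me the information I want for a single user, but I don't know where to start with importing and iterating over multiple URLs.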

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import re
import urllib2

# download the page
response = urllib2.urlopen("http://youtube.com/user/alxlvt")
html = response.read()

# create a beautiful soup object
soup = BeautifulSoup(html)

# find the profile info & display it
profileinfo = soup.findAll("div", { "class" : "user-profile-item" })
for info in profileinfo:
    print info.get_text()

Does anyone have any suggestions?

For example, if I have a .txt file that reads:

username1
username2
username3
etc.

How can I iterate over these, append each one to http://youtube.com/user/%s, and create a loop that pulls all the information?

score 2 · Accepted Answer

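If you don't want to use a dedicated scraping module (such as scrapy, mechanize, or selenium), you can keep iterating on what you've already written.

Use the file object's own iteration to read line by line. A neat fact about file objects is that they are iterable and yield one line at a time, much like repeated calls to readline(). This holds for files opened with 'rb' as well as in text mode, so you can write for line in file_obj to walk through a document line by line.
Build each URL from the base address and the username. I originally joined them with +; the code below uses % string formatting, and a join function would work too.

Make a list of URLs. This lets you stagger your requests, so you can scrape considerately.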

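# Goal: make a list of urls
url_list = []

# use a try-finally to make sure you close your file.
# open the file before the try, so f always exists when finally runs
f = open('pathtofile.txt', 'rb')
try:
    for line in f:
        username = line.strip()  # drop the trailing newline
        if username:  # skip blank lines
            url_list.append('http://youtube.com/user/%s' % username)
    # do something with url list (like call a scraper, or use urllib2)
finally:
    f.close()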
Edit: Andrew G's string formatting is cleaner. :)
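
The point about staggering requests can be sketched roughly as follows. This is only an illustration, not part of the original answer: it assumes Python 3 (where urlopen lives in urllib.request rather than urllib2), and the names fetch, fetch_all and delay_seconds, as well as the two-second default, are hypothetical choices.

```python
import time
import urllib.request


def fetch(url):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def fetch_all(urls, delay_seconds=2.0, fetcher=fetch, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests.

    fetcher and sleep are parameters so they can be swapped out
    (for example in tests) without touching the network.
    """
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request after the first
        pages[url] = fetcher(url)
    return pages
```

With the url_list built above, pages = fetch_all(url_list) would return a dict mapping each URL to its HTML, which can then be handed to BeautifulSoup one page at a time.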

score 0 · Accepted Answer

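You need to open the file (preferably with the with open('/path/to/file', 'r') as f: syntax) and then call f.readline() in a loop. Assign the result of readline() to a string such as "username", then run your current code inside the loop, with the request changed to response = urllib2.urlopen("http://youtube.com/user/%s" % username). Note that readline() keeps the trailing newline, so strip it before building the URL.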
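
A minimal sketch of that approach, assuming one username per line. The helper names read_usernames and profile_urls are hypothetical, and the sketch iterates over the file directly instead of calling readline() by hand, which reads the same lines.

```python
def read_usernames(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    usernames = []
    with open(path, "r") as f:  # the with block closes the file automatically
        for line in f:
            username = line.strip()  # each line still ends with its newline
            if username:  # ignore blank lines
                usernames.append(username)
    return usernames


def profile_urls(usernames, template="http://youtube.com/user/%s"):
    """Fill each username into the URL template from the question."""
    return [template % username for username in usernames]
```

Each URL returned by profile_urls(read_usernames('/path/to/file')) can then be passed to the download-and-parse code from the question inside a for loop.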
