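I'm not aware of any existing package that does everything you need either. However, Python can connect to your database, make web requests easily, and handle dirty HTML. Assuming you already have Python installed, you will need three packages: MySQLdb, Requests, and BeautifulSoup.
You can install these packages with pip or with a Windows installer. Each project's site has instructions. The whole process should take no more than 10 minutes.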
import MySQLdb as db
import os.path
import requests
from bs4 import BeautifulSoup
# Connect to the database. Fill in these fields as necessary.
con = db.connect(host='hostname', user='username', passwd='password',
                 db='dbname')
# Create and execute our SELECT sql statement.
select = con.cursor()
# MySQLdb uses %s placeholders, and NULL must be tested with IS NULL.
select.execute('SELECT filename FROM table_name '
               'WHERE format = %s AND description IS NULL',
               ('Still Image (JPEG)',))
while True:
    # Fetch a row from the result of the SELECT statement.
    row = select.fetchone()
    if row is None:
        break

    # Use Python's built-in os.path.splitext to split the extension
    # and get the url_name.
    filename = row[0]
    url_name = os.path.splitext(filename)[0]
    url = 'http://www.website.com/content/' + url_name

    # Make the web request. You may want to rate-limit your requests
    # so that the website doesn't get angry. You can slow down the
    # rate by inserting a pause with:
    #
    #   import time    # You can put this at the top with other imports
    #   time.sleep(1)  # This will wait 1 second.
    response = requests.get(url)
    if response.status_code != 200:
        # Don't worry about skipped urls. Just re-run this script
        # on spurious or network-related errors.
        print('Error accessing:', url, 'SKIPPING')
        continue

    # Parse the result. BeautifulSoup does a great job handling
    # mal-formed input.
    soup = BeautifulSoup(response.content, 'html.parser')
    div = soup.find('div', {'id': 'description'})
    if div is None:
        print('No description found at:', url, 'SKIPPING')
        continue
    # .contents is a list of nodes, which can't be bound as a query
    # parameter, so store the element's text instead.
    description = div.get_text().strip()

    # And finally, update the database with another query.
    update = con.cursor()
    update.execute('UPDATE table_name SET description = %s '
                   'WHERE filename = %s',
                   (description, filename))

# Commit so the updates are saved (MySQLdb does not autocommit by default).
con.commit()
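A word of warning: I've made an effort to make this code "look right", but I haven't actually tested it. You will need to fill in the private details.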
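
One step that can be checked without a database or network access is the filename-to-URL conversion. The sketch below wraps it in a helper; the name build_url and the sample filenames are made up for illustration, and the base URL is the same placeholder used in the script.

```python
import os.path

# Placeholder base URL, same as in the script; replace with the real site.
BASE_URL = 'http://www.website.com/content/'


def build_url(filename, base_url=BASE_URL):
    """Drop the final extension from filename and append the rest to base_url."""
    url_name = os.path.splitext(filename)[0]
    return base_url + url_name


# Hypothetical filenames, just to see what the helper produces.
for name in ('sunset.jpg', 'archive.tar.gz', 'noext'):
    print(name, '->', build_url(name))
```

Note that os.path.splitext only removes the last extension, so a name like archive.tar.gz keeps its ".tar" part. If your filenames can contain more than one dot, it is worth checking a few of them this way before running the full script.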