mysql - Pairing content on an external website with entries in a mySQL database

Question

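tl;dr: I'm looking for a way to find entries in our database that are missing information, fetch that information from a website, and add it to the database entry.

We have a media management program that uses a mySQL table to store its information. When employees download media (video files, images, audio files) and import it into the media manager, they are supposed to also copy the media's description (from the source website) and add it to the description in the media manager. However, this has not been done for thousands of files.

The filename (e.g. file123.mov) is unique, and the detail page for that file can be reached by going to a URL on the source website:

website.com/content/file123

The information we want to scrape from that page is in an element whose ID is always the same.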

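In my mind the process would be:

Connect to the database and load the table

Filter: "format" is "Still Image (JPEG)"

Filter: "description" is "NULL"

Get the first result

Get "FILENAME" without the extension

Load the URL: website.com/content/FILENAME

Copy the contents of the element "description" (on the website)

Paste the contents into "description" (SQL entry)

Get the second result

Rinse and repeat until the last result is reached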

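My questions are:

Is there software that could perform such a task, or is this something that needs to be scripted?
If scripted, what would be the best type of script (e.g. could I achieve this using AppleScript, or would it need to be written in Java or PHP, etc.)?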

score 2 · Accepted Answer

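Is there software that could perform such a task, or is this something that needs to be scripted?

I'm not aware of anything that will do what you want out of the box (and even if there were, the configuration required wouldn't be much less work than the scripting involved in rolling your own solution).

If scripted, what would be the best type of script (e.g. could I achieve this using AppleScript, or would it need to be written in Java or PHP, etc.)?

AppleScript can't connect to databases on its own, so you'll definitely need to add something else to the mix. If the choice is between Java and PHP (and you're equally familiar with both), I'd definitely recommend PHP for this purpose, as there will be considerably less code involved.

Your PHP script would look something like this: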

$BASEURL  = 'http://website.com/content/';

// connect to the database
$dbh = new PDO($DSN, $USERNAME, $PASSWORD);

// query for files without descriptions
$qry = $dbh->query("
  SELECT FILENAME FROM mytable
  WHERE  format = 'Still Image (JPEG)' AND description IS NULL
");

// prepare an update statement
$update = $dbh->prepare('
  UPDATE mytable SET description = :d WHERE FILENAME = :f
');

$update->bindParam(':d', $DESCRIPTION);
$update->bindParam(':f', $FILENAME);

// loop over the files
while ($FILENAME = $qry->fetchColumn()) {
  // construct URL
  $i = strrpos($FILENAME, '.');
  $url = $BASEURL . (($i === false) ? $FILENAME : substr($FILENAME, 0, $i));

  // fetch the document
  $doc = new DOMDocument();
  $doc->loadHTMLFile($url);

  // get the description
  $DESCRIPTION = $doc->getElementById('description')->nodeValue;

  // update the database
  $update->execute();
}

score 1 · Accepted Answer

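PHP is a good scraper. I have made a class that wraps PHP's port of cURL here:

http://semlabs.co.uk/journal/object-oriented-curl-class-with-multi-threading

You'll probably need to use some of the options:

http://www.php.net/manual/en/function.curl-setopt.php

To scrape HTML I usually use regular expressions, but here is a class I made that should be able to query HTML without problems:

http://pastebin.com/Jm9jKjAU

Usage is: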

$h = new HTMLQuery();
$h->load( $string_containing_html );
$h->getElements( 'p', 'id' ); // Returns all p tags with an id attribute
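
If you'd rather not depend on an external class, the same kind of lookup can be sketched with a standard-library parser. The following is a rough Python sketch, not the HTMLQuery class above: `TextById` and `text_by_id` are hypothetical names, and the sample markup is made up. It relies on Python's `html.parser`, which does not reject messy markup.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'}

class TextById(HTMLParser):
    """Collect the text inside the first element whose id matches."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0          # > 0 while inside the target element
        self.finished = False   # True once the target element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.finished or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get('id') == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.finished or tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        self.finished = self.depth == 0

    def handle_data(self, data):
        if self.depth and not self.finished:
            self.chunks.append(data)

def text_by_id(html, target_id):
    """Return the whitespace-normalised text of the element, or ''."""
    parser = TextById(target_id)
    parser.feed(html)
    parser.close()
    return ' '.join(''.join(parser.chunks).split())

# Made-up page with an unclosed <p> after the element we care about.
sample = ('<html><body><div id="description">A <b>sunset</b><br>'
          ' over the bay</div><p>stray')
print(text_by_id(sample, 'description'))  # A sunset over the bay
```

The depth counter assumes that tags inside the target element are balanced; an unclosed non-void tag inside it would make the sketch collect too much text, so treat it as a starting point only.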

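The best option for scraping is XPath, but it can't handle dirty HTML. You can use it to do things like:

//div[@class = 'itm']/p[last()][text() = 'Hello World'] <- selects the last p in each div with class 'itm', provided its text is 'Hello World'

You can use this in PHP with the DOM classes (built in).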
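
To illustrate both points (what such an expression selects, and what happens with dirty markup), here is a small Python sketch using the standard library's `xml.etree.ElementTree`. It supports only a subset of XPath, so the text test is written as `[.='Hello World']` rather than with `text()`; the sample markup is made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed page: two div.itm blocks with two paragraphs each.
page = """<html><body>
<div class="itm"><p>first</p><p>Hello World</p></div>
<div class="itm"><p>Hello World</p><p>other</p></div>
</body></html>"""

root = ET.fromstring(page)

# Last p of each div.itm, kept only if its text is 'Hello World'.
hits = root.findall(".//div[@class='itm']/p[last()][.='Hello World']")
print(len(hits), hits[0].text)  # 1 Hello World

# An unclosed tag is fine for a browser but not for an XML parser.
try:
    ET.fromstring("<div><p>unclosed</div>")
    parsed_dirty = True
except ET.ParseError:
    parsed_dirty = False
print(parsed_dirty)  # False
```

Only the first block matches, because in the second block the last paragraph has different text. The parse error in the second half is the "dirty HTML" problem in practice: the markup has to be cleaned up, or parsed with something more tolerant, before XPath can be applied.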

score 1 · Accepted Answer

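I'm also not aware of any existing software package that will do everything you're looking for. However, Python can connect to your database, make web requests easily, and handle dirty html. Assuming you already have Python installed, you'll need three packages:

MySQLdb for connecting to the database.
Requests for easily making http web requests.
BeautifulSoup for robust html parsing.

You can install these packages with the pip command or with Windows installers. Each site has the relevant instructions. The whole process shouldn't take more than 10 minutes.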

import MySQLdb as db
import os.path
import requests
from bs4 import BeautifulSoup

# Connect to the database. Fill in these fields as necessary.

con = db.connect(host='hostname', user='username', passwd='password',
                 db='dbname')

# Create and execute our SELECT sql statement.

select = con.cursor()
select.execute('SELECT filename FROM table_name \
                WHERE format = %s AND description IS NULL',
               ('Still Image (JPEG)',))

while True:
    # Fetch a row from the result of the SELECT statement.

    row = select.fetchone()
    if row is None: break

    # Use Python's built-in os.path.splitext to split the extension
    # and get the url_name.

    filename = row[0]
    url_name = os.path.splitext(filename)[0]
    url = 'http://www.website.com/content/' + url_name

    # Make the web request. You may want to rate-limit your requests
    # so that the website doesn't get angry. You can slow down the
    # rate by inserting a pause with:
    #               
    # import time   # You can put this at the top with other imports
    # time.sleep(1) # This will wait 1 second.

    response = requests.get(url)
    if response.status_code != 200:

        # Don't worry about skipped urls. Just re-run this script
        # on spurious or network-related errors.

        print('Error accessing:', url, 'SKIPPING')
        continue

    # Parse the result. BeautifulSoup does a great job handling
    # mal-formed input.

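    soup = BeautifulSoup(response.content, 'html.parser')
    element = soup.find(id='description')
    if element is None:
        print('No description found at:', url, 'SKIPPING')
        continue
    description = element.get_text(strip=True)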

    # And finally, update the database with another query.

    update = con.cursor()
    update.execute('UPDATE table_name SET description = %s \
                    WHERE filename = %s',
                   (description, filename))
    con.commit()

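I'll caveat that I've made a good effort to make that code "look right", but I haven't actually tested it. You'll need to fill in the private details.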
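
Since the script above is untested, one way to check the select/update logic before pointing it at the real database is a dry run against a throwaway table. The sketch below is only that: it uses the standard library's `sqlite3` with an in-memory database as a stand-in for MySQL, a hypothetical `media` table, and a dictionary in place of the website. Note that `sqlite3` uses `?` placeholders where MySQLdb uses `%s`.

```python
import os.path
import sqlite3

def fill_descriptions(con, fetch_description):
    """Fill NULL descriptions for JPEG stills; return the number updated."""
    rows = con.execute(
        "SELECT filename FROM media "
        "WHERE format = ? AND description IS NULL",
        ('Still Image (JPEG)',)).fetchall()
    updated = 0
    for (filename,) in rows:
        url_name = os.path.splitext(filename)[0]
        description = fetch_description(url_name)
        if description is None:
            continue  # leave the row untouched so a later run can retry it
        con.execute("UPDATE media SET description = ? WHERE filename = ?",
                    (description, filename))
        updated += 1
    con.commit()
    return updated

# In-memory stand-in for the real table (hypothetical schema and rows).
con = sqlite3.connect(':memory:')
con.execute(
    "CREATE TABLE media (filename TEXT, format TEXT, description TEXT)")
con.executemany("INSERT INTO media VALUES (?, ?, ?)", [
    ('file123.jpg', 'Still Image (JPEG)', None),
    ('file456.jpg', 'Still Image (JPEG)', 'already described'),
    ('file789.mov', 'Video', None),
    ('file999.jpg', 'Still Image (JPEG)', None),
])

# Stub for the web request: only file123 has a page on the fake site.
fake_site = {'file123': 'A sunset over the bay'}
count = fill_descriptions(con, fake_site.get)
print(count)  # 1
```

Only `file123.jpg` is updated: `file456.jpg` already has a description, `file789.mov` has a different format, and `file999.jpg` has no page on the fake site, so it is skipped and stays NULL for a later run.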
