content-management - Best way to get website data (content)?

Question

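I need to scrape data (content) from some websites. Those sites provide lists; I need to scrape those lists and filter them based on their content.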

Is there any software that can do this? A PHP script? If not, where can I start programming this functionality myself?

score 1 · Accepted Answer

Use file_get_contents() to return the whole file as a string, then parse the string to extract the content.
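A minimal sketch of the "parse the string" step, applied to the question's case of filtering a list by content. The answer does not say how to parse, so this uses PHP's bundled DOM extension as one possible approach. The sample markup and the helper name filter_list_items() are hypothetical; in practice the string would come from file_get_contents().

```php
<?php
// Hypothetical sample markup standing in for a fetched page; in practice
// this string would come from file_get_contents($url).
$html = <<<HTML
<ul>
  <li><a href="/a">PHP scraping basics</a></li>
  <li><a href="/b">Gardening tips</a></li>
  <li><a href="/c">Advanced PHP regex</a></li>
</ul>
HTML;

// Hypothetical helper: return the text of every <li> that contains $keyword
// (case-insensitive).
function filter_list_items(string $html, string $keyword): array
{
    $doc = new DOMDocument();
    // Real-world HTML is often malformed; collect parser warnings
    // internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $matches = [];
    foreach ($doc->getElementsByTagName('li') as $item) {
        $text = trim($item->textContent);
        if (stripos($text, $keyword) !== false) {
            $matches[] = $text;
        }
    }
    return $matches;
}

print_r(filter_list_items($html, 'php'));
```

A DOM parser tends to cope with attribute order and whitespace changes better than hand-written string matching, at the cost of loading the whole document into memory.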

Other options are cURL or wget, which fetch the whole file; you can then process it with AWK and sed, or Perl.

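Which one to choose depends on how often you need to scrape the target pages. For occasional use, PHP is fine, but you need to trigger it from a browser, and keep in mind that regular expressions in PHP can be slow.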

If you want to scrape files on a regular schedule, you can run a Bash script with cURL/wget plus sed and awk from cron, so it runs in the background without intervention.

score 1 · Accepted Answer

If it's PHP, this may help you: http://www.thefutureoftheweb.com/blog/web-scrape-with-php-tutorial

// get the HTML
$html = file_get_contents("http://www.thefutureoftheweb.com/blog/");


preg_match_all(
    '/<li>.*?<h1><a href="(.*?)">(.*?)<\/a><\/h1>.*?<span class="date">(.*?)<\/span>.*?<div class="section">(.*?)<\/div>.*?<\/li>/s',
    $html,
    $posts, // will contain the blog posts
    PREG_SET_ORDER // formats data into an array of posts
);

foreach ($posts as $post) {
    $link = $post[1];
    $title = $post[2];
    $date = $post[3];
    $content = $post[4];

    // do something with data
}

Of course, you will need to customize the regular expression to your requirements.

There are plenty of other examples you can find: http://www.google.com/search?source=ig&hl=en&rlz=&=&q=php+web+scraper&aq=f&oq=&aqi=

score 0 · Accepted Answer

There is no magic tool, because every page's content is different.
Since you mention PHP, I'll give you some pointers for that language.

You can use cURL to fetch the web page.
Once you have the content, you can parse it with regular expressions.
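The two steps above could be sketched roughly as follows, assuming PHP's cURL extension is installed. The function names fetch_page() and extract_links(), the 10-second timeout, and the choice to extract anchor tags are illustrative assumptions, not part of the answer; the regular expression would need adapting to each target page.

```php
<?php
// Hypothetical helper: fetch a page with the cURL extension.
// Returns null when the request fails.
function fetch_page(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumed limit: give up after 10 seconds
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Hypothetical helper: pull href/text pairs out of anchor tags with a
// regular expression. Only double-quoted href attributes are matched.
function extract_links(string $html): array
{
    preg_match_all(
        '/<a\s[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>/is',
        $html,
        $found,
        PREG_SET_ORDER
    );
    $links = [];
    foreach ($found as $m) {
        $links[] = ['url' => $m[1], 'text' => trim(strip_tags($m[2]))];
    }
    return $links;
}

// Usage (example.com is a placeholder URL):
// $html = fetch_page('https://example.com/');
// if ($html !== null) {
//     print_r(extract_links($html));
// }
```

Keeping the fetch and the parse in separate functions lets the parsing be tried on a saved copy of the page without hitting the site on every run.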

Depending on what you want to do, you will have to develop the application yourself.
