php - Extract text from a DIV that appears on multiple pages of a website, then output it to .txt?

Question

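Just to note from the start: the content is uncopyrighted, and I want to automate the process of getting the text for a project.

I want to extract text from a specific, recurring DIV (it has its own "class", in case that makes it easier) that sits on every page of a simply designed website.

The site has an archive page that lists all the pages containing the content I want.

The site is www.zenhabits.net

I imagine this could be done with some kind of script, but I don't know where to start.

I'd appreciate any help.

-Nathan.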

score 0 · Accepted Answer

It's simple.

First, get all the links from the site and put them all into an array:

set_time_limit(0); // this could take a while...
ignore_user_abort(true); // keep running even if the browser times out

$links = array();
$html_output = file_get_contents("http://zenhabits.net/archives/");

# -- Do a preg_match_all on the html, and grab all links:
# (the dot is escaped and the capture stops at the closing quote, so several links on one line stay separate)
if ($html_output !== false && preg_match_all('/<a href="http:\/\/zenhabits\.net\/([^"]*)">/', $html_output, $matches)) {
    # -- Append Data To Array
    foreach ($matches[1] as $secLink) {
        $links[] = "http://zenhabits.net/" . $secLink;
    }
}
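As an alternative to the regex, the link collection step can be sketched with PHP's built-in DOMDocument, which copes with attribute order and extra attributes on the anchor tags. This is only a minimal sketch: the helper name `extractLinks` and the inline sample markup are assumptions made for illustration, not the real archive page.

```php
<?php
// Hypothetical helper: collect the unique links under $prefix found in an HTML string.
function extractLinks(string $html, string $prefix): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        // Keep only links that start with the prefix and point past it.
        if (strpos($href, $prefix) === 0 && strlen($href) > strlen($prefix)) {
            $links[$href] = true; // array keys de-duplicate repeated links
        }
    }
    return array_keys($links);
}

// Inline sample markup standing in for the archive page (assumed structure).
$sample = '<ul>'
    . '<li><a href="http://zenhabits.net/">Home</a></li>'
    . '<li><a href="http://zenhabits.net/first-post/">First</a> '
    . '<a class="x" href="http://zenhabits.net/second-post/">Second</a></li>'
    . '<li><a href="http://example.com/elsewhere">Other</a></li>'
    . '<li><a href="http://zenhabits.net/first-post/">First again</a></li>'
    . '</ul>';

$links = extractLinks($sample, 'http://zenhabits.net/');
print_r($links);
```

In a real run, `$sample` would be replaced by the string returned from `file_get_contents()` for the archive page.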

I tested this for you, and:

//first 3 are returning something weird, but you don't need them - so I shall remove them xD
unset($links[0]);
unset($links[1]);
unset($links[2]);

Now that this is done, it's time to go through all of those links (in the array $links) and get their content:

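$contentFromPage = array();

foreach ($links as $contLink) {
    $html_output_c = file_get_contents($contLink);

    # The "s" flag lets "." match newlines; the greedy (.*) runs to the last </div> on the page,
    # so the capture may include markup that follows the post.
    if ($html_output_c !== false && preg_match('|<div class="post">(.*)</div>|s', $html_output_c, $c_matches)) {
        # -- Append Data To Array
        echo "data found <br>";
        $contentFromPage[] = $c_matches[1];
    } else {
        echo "no content found in: $contLink -- <br><br><br>";
    }
} // end of foreach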
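Because a regex cannot reliably find the closing tag of a DIV that contains nested DIVs, the extraction step can also be sketched with DOMDocument and DOMXPath. This is a minimal sketch under assumptions: the helper name `extractPostText` and the sample page are made up for illustration, and the class name "post" is taken from the regex above.

```php
<?php
// Hypothetical helper: return the plain text of the first <div class="post">, or null if none exists.
function extractPostText(string $html): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match "post" as a whole word inside the class attribute.
    $query = '//div[contains(concat(" ", normalize-space(@class), " "), " post ")]';
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    // textContent joins the text of all descendants, nested divs included.
    $text = $nodes->item(0)->textContent;
    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Sample page with a nested div inside the post (assumed structure).
$sample = <<<HTML
<html><body>
<div class="header">Menu</div>
<div class="post">
  <h2>Title</h2>
  <div class="inner"><p>First paragraph.</p></div>
  <p>Second paragraph.</p>
</div>
<div class="footer">Footer</div>
</body></html>
HTML;

$text = extractPostText($sample);
echo $text, "\n";
```

Note that `loadHTML()` assumes ISO-8859-1 when the page declares no character set, so pages in other encodings may need extra handling.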

I basically just wrote you a complete crawler script..

Now, loop over the content array and do whatever you want with it (here we'll put it into text files):

// $contentFromPage now contains all of the div class="post" content (in an array) - so do what you want with it

foreach ($contentFromPage as $content) {
    # -- We need a name for each text file --
    $textName = rand() . "_content_" . rand() . ".txt"; // we'll just use some numbers and text

    // define file path (where you want the txt file to be saved)
    $path = "../"; // we'll just put it in the folder above the script
    $full_path = $path . $textName;

    // now save the file..
    file_put_contents($full_path, $content);

    // and that's it
} // end of foreach
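The captured content is still HTML, and random file names make it hard to tell which file came from which page. A minimal sketch of one way to handle both is shown here, assuming each piece of content is kept together with its page URL. The helper names `urlToFilename` and `htmlToText`, the example URL, and the use of the system temp directory are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: derive a readable .txt file name from a page URL.
function urlToFilename(string $url): string
{
    $path = trim((string) parse_url($url, PHP_URL_PATH), '/');
    // Replace anything that is not a letter, digit, or hyphen with a hyphen.
    $slug = trim(preg_replace('/[^A-Za-z0-9-]+/', '-', $path), '-');
    return ($slug === '' ? 'index' : $slug) . '.txt';
}

// Hypothetical helper: reduce an HTML fragment to plain text.
function htmlToText(string $html): string
{
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of spaces and tabs, but keep line breaks.
    return trim(preg_replace('/[ \t]+/', ' ', $text));
}

// Made-up URL and fragment standing in for one crawled page.
$url = 'http://zenhabits.net/some-post/';
$fragment = '<h2>Title</h2> <p>Less &amp; more.</p>';

// Written to the system temp directory here; adjust the folder as needed.
$file = sys_get_temp_dir() . DIRECTORY_SEPARATOR . urlToFilename($url);
file_put_contents($file, htmlToText($fragment));
echo $file, "\n";
```

Deriving the name from the URL also means that re-running the script overwrites the same files instead of creating new ones.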

score 0 · Accepted Answer

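You can also use the Simple HTML DOM Parser script to extract the content. It is a very useful script that I have been using for 1.6 years. You can download it from http://simplehtmldom.sourceforge.net/. It is well documented, with good examples. I hope this helps you solve your problem.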
