php - Why doesn't array_unique return a list of unique items?

Question

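I'm trying to grab all the URLs on the home page of a client's site so that I can migrate it to WordPress. The problem is that I can't seem to get a de-duplicated list of URLs.

Here is the code: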

$html = file_get_contents('http://www.catwalkyourself.com');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the anchor elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
   $href = $hrefs->item($i);
   $url = $href->getAttribute('href');

   if($url = preg_match_all('((www|http://)(www)?.catwalkyourself.com\/?.*)', $url, $matches[0])){
    $urls = $matches[0][0][0];
    $list = implode( ', ', array_unique( explode(", ", $urls) ) );
    echo $list . '<br/>';
    //print_r($list);
   }
}

(Also posted here.)

Instead, I get duplicates like this:

http://www.catwalkyourself.com/rss.php
http://www.catwalkyourself.com/rss.php

How can I fix this?

score 3 · Accepted Answer

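With the loop structured the way it is now, you are always calling array_unique on an array of size 1, so there is never anything for it to remove.

You need to build up a list of URLs first and then call array_unique once on the whole list. Try this: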

<?php

$html = file_get_contents('http://www.catwalkyourself.com');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the anchor elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
$urls  = array();

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url  = $href->getAttribute('href');

    if( ($count = preg_match_all('((www|http://)(www)?.catwalkyourself.com\/?.*)', $url, $matches[0])) > 0) {
        $urls[] = $matches[0][0][0]; // build list of URLs in the loop
    }
}

$list = implode( ', ', array_unique( $urls ) );
echo $list . '<br/>';
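To make the difference concrete, here is a minimal sketch that skips the network request and the regex entirely. The `$hrefs` array is made-up sample data standing in for whatever the XPath query returns (the `/about` URL is hypothetical), and `$perIteration` and `$collected` are illustrative names only.

```php
<?php
// Hypothetical sample data standing in for the hrefs found on the page.
$hrefs = array(
    'http://www.catwalkyourself.com/rss.php',
    'http://www.catwalkyourself.com/rss.php',
    'http://www.catwalkyourself.com/about',
);

// Inside the loop: every call sees a one-element array,
// so array_unique has nothing to remove and the repeat survives.
$perIteration = array();
foreach ($hrefs as $href) {
    $perIteration[] = implode(', ', array_unique(array($href)));
}

// After the loop: array_unique sees the whole list and drops the repeat.
$collected = array_unique($hrefs);

echo count($perIteration), "\n"; // 3
echo count($collected), "\n";    // 2
```

The first count stays at 3 because each URL was de-duplicated only against itself; the second drops to 2 because the comparison ran across the full list.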

score 1 · Accepted Answer

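The last part of your code should not be inside the loop. You are iterating over an array that contains every link on the page. Since each element of that array holds only a single link, you are applying array_unique to an array that can never contain more than one element.

Try something like this: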

$html = file_get_contents('http://www.catwalkyourself.com');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the anchor elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
$urls = array();

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');

    if (preg_match_all('((www|http://)(www)?.catwalkyourself.com\/?.*)', $url, $matches[0]) > 0) {
        $urls[] = $matches[0][0][0];
    }
}
$list = implode(', ', array_unique($urls));
echo $list . '<br/>';
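One detail that may matter if you go on to process the list rather than just echo it: array_unique keeps the original keys, so the result can have gaps in its numeric indexes. The sketch below uses made-up sample URLs (the `/about` path is hypothetical) and assumes you want a plain 0-based list, which array_values gives you.

```php
<?php
// Hypothetical sample data; the duplicate sits at index 2.
$urls = array(
    'http://www.catwalkyourself.com/rss.php',
    'http://www.catwalkyourself.com/about',
    'http://www.catwalkyourself.com/rss.php',
    'http://www.catwalkyourself.com/',
);

// array_unique keeps the first occurrence and preserves keys: 0, 1, 3.
$unique = array_unique($urls);

// array_values re-indexes to 0, 1, 2 so a counting for-loop is safe.
$reindexed = array_values($unique);

print_r(array_keys($unique));
print_r(array_keys($reindexed));
```

With implode the gap is harmless, but a `for ($i = 0; $i < count($unique); $i++)` loop over the un-reindexed array would hit a missing index.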
