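Good day,

I am using cURL and various parsing techniques to retrieve information from a number of websites. I wrote the code so that, if needed, I can add other websites to scan for information.

The retrieved information looks like this (note that the information may be inaccurate and may not reflect real prices/names):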

Array
(
    [website1.com] => Array
        (
            [0] => Array
                (
                    [0] => 60" BRAVIA LX900 Series 3D HDTV
                    [1] => website1.com
                    [2] => 5299.99
                )
            [1] => Array
                (
                    [0] => 52" BRAVIA LX900 Series 3D HDTV
                    [1] => website1.com
                    [2] => 4499.99
                )
            [2] => Array
                (
                    [0] => 46" BRAVIA LX900 Series 3D HDTV
                    [1] => website1.com
                    [2] => 3699.99
                )
            [3] => Array
                (
                    [0] => 40" BRAVIA LX900 Series 3D HDTV
                    [1] => website1.com
                    [2] => 2999.99
                )
        )
    [website2.com] => Array
        (
            [0] => Array
                (
                    [0] => Sony 3D 60" LX900 HDTV BRAVIA
                    [1] => website2.com
                    [2] => 5400.99
                )
            [1] => Array
                (
                    [0] => Sony 3D 52" LX900 HDTV BRAVIA
                    [1] => website2.com
                    [2] => 4699.99
                )
            [2] => Array
                (
                    [0] => Sony 3D 46" LX900 HDTV BRAVIA
                    [1] => website2.com
                    [2] => 3899.99
                )
        )
)

The desired output must be:

Array
(
    [0] => Array
        (
            [Name] => 60" BRAVIA LX900 Series 3D HDTV
            [website1.com] => 5299.99
            [website2.com] => 5400.99
        )
    [1] => Array
        (
            [Name] => 52" BRAVIA LX900 Series 3D HDTV
            [website1.com] => 4499.99
            [website2.com] => 4699.99
        )
    [2] => Array
        (
            [Name] => 46" BRAVIA LX900 Series 3D HDTV
            [website1.com] => 3699.99
            [website2.com] => 3899.99
        )
    [3] => Array
        (
            [Name] => 40" BRAVIA LX900 Series 3D HDTV
            [website1.com] => 2999.99
        )
)

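Note that the names may differ between sites, which is why similar_text is needed. Also, some items may not appear on every website. I understand that only one TV name can be kept per item, so I will use the one from the most relevant source (website1.com).

Here is the code I am trying to use.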

<?php
    $_Retreived = array(
        "website1.com" => array(
            array('60" BRAVIA LX900 Series 3D HDTV', 'website1.com', 5299.99),
            array('52" BRAVIA LX900 Series 3D HDTV', 'website1.com', 4499.99),
            array('46" BRAVIA LX900 Series 3D HDTV', 'website1.com', 3699.99),
            array('40" BRAVIA LX900 Series 3D HDTV', 'website1.com', 2999.99)
        ),
        "website2.com" => array(
            array('Sony 3D 60" LX900 HDTV BRAVIA', 'website2.com', 5400.99),
            array('Sony 3D 52" LX900 HDTV BRAVIA', 'website2.com', 4699.99),
            array('Sony 3D 46" LX900 HDTV BRAVIA', 'website2.com', 3899.99),
        )
    );

    $_Prices = array();
    $_PricesTemp = array();
    $_Sites = array("website1.com", "website2.com");

    for($i = 0; $i < sizeOf($_Sites); $i++)
    {
        $_PricesTemp = array_merge($_PricesTemp, $_Retreived[ $_Sites[$i] ]);
    }

    /*
        print_r($_PricesTemp);

        Array
        (
            [0] => Array
                (
                    [0] => 60" BRAVIA LX900 Series 3D HDTV
                    [1] => website1.com
                    [2] => 5299.99
                )
            [1] => Array
                (
                    [0] => 52" BRAVIA LX900 Series 3D HDTV
                    [1] => website1.com
                    [2] => 4499.99
                )
            [2] => Array
                (
                    [0] => 46" BRAVIA LX900 Series 3D HDTV
                    [1] => website1.com
                    [2] => 3699.99
                )
            [3] => Array
                (
                    [0] => 40" BRAVIA LX900 Series 3D HDTV
                    [1] => website1.com
                    [2] => 2999.99
                )
            [4] => Array
                (
                    [0] => Sony 3D 60" LX900 HDTV BRAVIA
                    [1] => website2.com
                    [2] => 5400.99
                )
            [5] => Array
                (
                    [0] => Sony 3D 52" LX900 HDTV BRAVIA
                    [1] => website2.com
                    [2] => 4699.99
                )
            [6] => Array
                (
                    [0] => Sony 3D 46" LX900 HDTV BRAVIA
                    [1] => website2.com
                    [2] => 3899.99
                )
        )
    */

    foreach($_PricesTemp As $_KeyOne => $_EntryOne)
    {
        foreach(array_reverse($_PricesTemp, true) As $_KeyTwo => $_EntryTwo)
        {
            if ($_KeyOne != $_KeyTwo)
            {
                $_Percent = 0;

                similar_text(strtoupper($_EntryOne[0]), strtoupper($_EntryTwo[0]), $_Percent);

                if ($_Percent >= 90) //If names matches 90%+
                {
                    echo "Similar : <b>" . $_KeyOne . "</b> " . $_EntryOne[0] . " and <b>" . $_KeyTwo . "</b> " . $_EntryTwo[0] . " Percent : " . $_Percent . "<br />";

                    $_Prices[] = array();
                    $_Prices[ sizeOf($_Prices)-1 ]['Name'] = $_EntryOne[0]; //Use the product name of the most relevant website (website1.com)

                    foreach($_Sites As $_Site)
                    {
                        if (isset($_EntryOne[ 1 ]) && $_EntryOne[ 1 ] == $_Site) //Check if it contains price from website1.com
                        {
                            $_Prices[ sizeOf($_Prices)-1 ][ $_Site ] = $_EntryOne[ 2 ];
                        }
                        if (isset($_EntryTwo[ 1 ]) && $_EntryTwo[ 1 ] == $_Site) //Check if it contains price from website2.com
                        {
                            $_Prices[ sizeOf($_Prices)-1 ][ $_Site ] = $_EntryTwo[ 2 ];
                        }
                    }
                }
            }
        }
    }

    /*
        print_r($_Prices);

        Array
        (
            [0] => Array
                (
                    [Name] => 60" BRAVIA LX900 Series 3D HDTV
                    [website1.com] => 2999.99
                )
            [1] => Array
                (
                    [Name] => 60" BRAVIA LX900 Series 3D HDTV
                    [website1.com] => 3699.99
                )
            [2] => Array
                (
                    [Name] => 60" BRAVIA LX900 Series 3D HDTV
                    [website1.com] => 4499.99
                )
            [3] => Array
                (
                    [Name] => 52" BRAVIA LX900 Series 3D HDTV
                    [website1.com] => 2999.99
                )
            [4] => Array
                (
                    [Name] => 52" BRAVIA LX900 Series 3D HDTV
                    [website1.com] => 3699.99
                )
            [5] => Array
                (
                    [Name] => 52" BRAVIA LX900 Series 3D HDTV
                    [website1.com] => 5299.99
                )
            [6] => Array
                (
                    [Name] => 46" BRAVIA LX900 Series 3D HDTV
                    [website1.com] => 2999.99
                )
            [7] => Array
                (
                    [Name] => 46" BRAVIA LX900 Series 3D HDTV
                    [website1.com] => 4499.99
                )
            [8] => Array
                (
                    [Name] => 46" BRAVIA LX900 Series 3D HDTV
                    [website1.com] => 5299.99
                )
            [9] => Array
                (
                    [Name] => 40" BRAVIA LX900 Series 3D HDTV
                    [website1.com] => 3699.99
                )
            [10] => Array
                (
                    [Name] => 40" BRAVIA LX900 Series 3D HDTV
                    [website1.com] => 4499.99
                )
            [11] => Array
                (
                    [Name] => 40" BRAVIA LX900 Series 3D HDTV
                    [website1.com] => 5299.99
                )
            [12] => Array
                (
                    [Name] => Sony 3D 60" LX900 HDTV BRAVIA
                    [website2.com] => 3899.99
                )
            [13] => Array
                (
                    [Name] => Sony 3D 60" LX900 HDTV BRAVIA
                    [website2.com] => 4699.99
                )
            [14] => Array
                (
                    [Name] => Sony 3D 52" LX900 HDTV BRAVIA
                    [website2.com] => 3899.99
                )
            [15] => Array
                (
                    [Name] => Sony 3D 52" LX900 HDTV BRAVIA
                    [website2.com] => 5400.99
                )
            [16] => Array
                (
                    [Name] => Sony 3D 46" LX900 HDTV BRAVIA
                    [website2.com] => 4699.99
                )
            [17] => Array
                (
                    [Name] => Sony 3D 46" LX900 HDTV BRAVIA
                    [website2.com] => 5400.99
                )
        )
    */
?>

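First, the code above does not work. There must be a logic error that I cannot pinpoint. Also, I do not believe this code would work if I added a third website to the list.

Any ideas? I have been working on this since this morning.

Edit 2011-02-16:

I have added a bounty to this question.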

3 Answers

Try this gist, it is clearer: https://gist.github.com/835099

It produces the result you want for me.

answered 2011-02-19T06:57:09.663

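A high-level overview would look like this:

  • Create the final result array, $items
  • Loop over every item found on every website
  • For each one, check whether its name is similar enough to any existing item name in $items
  • If so, add the price under that key; if not, create a new entry and add the price there

Instead of similar_text() you should consider using levenshtein(), which is similar in practice but considerably faster.

Here is some (untested, written on the spot) code: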

$levThreshold = 3 ;

$_Prices = array() ;
foreach ($_Retreived as $website => $websiteItems) {
    //each website maps to a list of array(name, website, price) entries
    foreach ($websiteItems as $item) {
        $currName = $item[0] ;
        $currPrice = $item[2] ;

        $foundItemKey = false ;

        //check current price structure. Get $priceData by reference
        //so we can modify it in the loop and keep the change instead
        //of modifying the loop copy.
        foreach ($_Prices as &$priceData) {

            if (isset($priceData[$website])) {
                //this entry already has a price from this website
                continue ;
            }

            //check if this is the item name we are looping over
            $lev = levenshtein($priceData['Name'], $currName) ;

            if ($lev < $levThreshold) {
                //item exists, add price and break
                $priceData[$website] = $currPrice ;
                $foundItemKey = true ;
                break ;
            }

        }
        //drop the reference so later writes to $priceData cannot touch $_Prices
        unset($priceData) ;

        //if we haven't found the item key, create a new one
        if (!$foundItemKey) {
            $newItem = array() ;
            $newItem['Name'] = $currName ;
            $newItem[$website] = $currPrice ;
            $_Prices[] = $newItem ;
        }
    }
}

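$levThreshold is the minimum edit distance at which two strings are treated as different; anything below it counts as the same item. Adjust it as needed.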
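To get a feel for how the threshold behaves, here is a small sketch that applies the same comparison (`levenshtein() < threshold`) to three names from the question. The helper name isSameProduct is made up for illustration and is not part of the code above.

```php
<?php
// Hypothetical helper: same test as in the loop above, i.e. two names count
// as the same product when their edit distance is below the threshold.
function isSameProduct($a, $b, $threshold)
{
    return levenshtein($a, $b) < $threshold;
}

$a = '60" BRAVIA LX900 Series 3D HDTV';
$b = '52" BRAVIA LX900 Series 3D HDTV';
$c = 'Sony 3D 60" LX900 HDTV BRAVIA';

// Only "60" vs "52" differs: two single-character substitutions.
var_dump(levenshtein($a, $b));       // int(2)

// With a threshold of 3 the two different sizes are treated as one product;
// a threshold of 2 keeps them apart.
var_dump(isSameProduct($a, $b, 3));  // bool(true)
var_dump(isSameProduct($a, $b, 2));  // bool(false)

// The reordered name from the second site is far more than 3 edits away.
var_dump(isSameProduct($a, $c, 3));  // bool(false)
```

So the threshold has to be chosen with the actual name variations in mind: too high merges different sizes, and reordered names will not match at any small value.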

answered 2010-09-09T16:47:19.043

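The question cannot be solved with similar_text. You want to match 60" BRAVIA LX900 Series 3D HDTV with Sony 3D 60" LX900 HDTV BRAVIA. However, 60" BRAVIA LX900 Series 3D HDTV is actually more similar to 52" BRAVIA LX900 Series 3D HDTV, which differs by only two characters.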
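This is easy to check directly. The sketch below feeds the names from the question to similar_text() and prints both percentages; the variable names are only for illustration.

```php
<?php
$site1Name60 = '60" BRAVIA LX900 Series 3D HDTV';
$site1Name52 = '52" BRAVIA LX900 Series 3D HDTV';
$site2Name60 = 'Sony 3D 60" LX900 HDTV BRAVIA';

// similar_text() writes the similarity percentage into its third argument.
similar_text($site1Name60, $site1Name52, $differentSizePercent);
similar_text($site1Name60, $site2Name60, $sameSizePercent);

// The pair that should NOT be merged scores above the 90% cut-off used in
// the question, while the pair that should be merged scores below it.
printf("60\" vs 52\" on the same site: %.1f%%\n", $differentSizePercent);
printf("60\" on site 1 vs site 2:    %.1f%%\n", $sameSizePercent);
```

That ordering is the reverse of what the matching needs, which is why no threshold on the raw percentage can produce the desired grouping for these names.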

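I suspect you need a custom handler that matches on details specific to the products you are trying to match. For TVs, for example, you would probably want to match on size (xx") and product line (BRAVIA LX900).

This doesn't give you a solution to your problem, but I'm afraid that is the answer.

answered 2011-02-21T20:30:08.240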
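As a rough sketch of what such a handler could look like for the sample data: build a key from the screen size and the model code, then group prices by that key. The function names productKey and mergePrices and both regular expressions are assumptions made for this example; real listings would need patterns tuned to the names actually scraped.

```php
<?php
// Hypothetical helper: derive a comparison key such as "60-LX900".
// Assumes the size is written as digits followed by a double quote and the
// model code is 1-3 letters followed by 3-4 digits.
function productKey(string $name): ?string
{
    if (!preg_match('/(\d{2,3})"/', $name, $size)) {
        return null;
    }
    if (!preg_match('/\b([A-Z]{1,3}\d{3,4})\b/i', $name, $model)) {
        return null;
    }
    return $size[1] . '-' . strtoupper($model[1]);
}

// Group prices from every site under the derived key. The first name seen
// wins, so list the most relevant site first.
function mergePrices(array $retrieved): array
{
    $byKey = array();
    foreach ($retrieved as $site => $items) {
        foreach ($items as $item) {
            list($name, , $price) = $item;
            // Fall back to the raw name when no key can be derived.
            $key = productKey($name) ?? $name;
            if (!isset($byKey[$key])) {
                $byKey[$key] = array('Name' => $name);
            }
            $byKey[$key][$site] = $price;
        }
    }
    return array_values($byKey);
}

$_Retreived = array(
    'website1.com' => array(
        array('60" BRAVIA LX900 Series 3D HDTV', 'website1.com', 5299.99),
        array('52" BRAVIA LX900 Series 3D HDTV', 'website1.com', 4499.99),
        array('46" BRAVIA LX900 Series 3D HDTV', 'website1.com', 3699.99),
        array('40" BRAVIA LX900 Series 3D HDTV', 'website1.com', 2999.99),
    ),
    'website2.com' => array(
        array('Sony 3D 60" LX900 HDTV BRAVIA', 'website2.com', 5400.99),
        array('Sony 3D 52" LX900 HDTV BRAVIA', 'website2.com', 4699.99),
        array('Sony 3D 46" LX900 HDTV BRAVIA', 'website2.com', 3899.99),
    ),
);

$result = mergePrices($_Retreived);
print_r($result);
```

Because items are grouped by key rather than compared pairwise, adding a third website only means adding another entry to the input array.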