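In my data-mining project I am handed a huge, complex multidimensional array that contains all the information I need, except that I have to "fix" it before I can process it. I have written some code that does this, but it takes far too long for the volume of data I have to fix, and I am hoping someone can help me find a more efficient solution.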

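Essentially, the array I am working with is indexed by integer at the top level, like any ordinary array, i.e. $x[0], $x[1], $x[2]. Each element is an associative array holding the key/value pairs I need (e.g. $x[0]['item'], $x[0]['price']), but one of the values is stored one level deeper: the ID.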

The ID number lives at $x[0]['@attributes']['id'], and I want to simplify the structure by copying it up alongside the other key/value pairs, e.g. $x[0]['id'].

The data set I am working with is large, but here is a simplified example of my situation:

$attrib1 = array('id'=>'101');
$item1 = array('@attributes'=>$attrib1, 'item'=>'milk', 'price'=>'3.50');
$attrib2 = array('id'=>'102');
$item2 = array('@attributes'=>$attrib2, 'item'=>'butter', 'price'=>'2.45');
$attrib3 = array('id'=>'103');
$item3 = array('@attributes'=>$attrib3, 'item'=>'bread', 'price'=>'1.19');
$items = array($item1, $item2, $item3);
echo "Starting data - items using itemid as attribute:\n";
print_r($items);

# set item numbers by key instead of attribute
$i=0;
while(isset($items[$i]['@attributes']['id'])) {
   $items[$i]['itemid'] = $items[$i]['@attributes']['id'];
   #unset($items[$i]['@attributes']);
   $i++;
} # while
echo "\nDesired result - items using itemid as key:\n";
print_r($items);

Here is the output of the example above:

Starting data - items using itemid as attribute:
Array
(
    [0] => Array
        (
            [@attributes] => Array
                (
                    [id] => 101
                )

            [item] => milk
            [price] => 3.50
        )

    [1] => Array
        (
            [@attributes] => Array
                (
                    [id] => 102
                )

            [item] => butter
            [price] => 2.45
        )

    [2] => Array
        (
            [@attributes] => Array
                (
                    [id] => 103
                )

            [item] => bread
            [price] => 1.19
        )

)

Desired result - items using itemid as key:
Array
(
    [0] => Array
        (
            [@attributes] => Array
                (
                    [id] => 101
                )

            [item] => milk
            [price] => 3.50
            [itemid] => 101
        )

    [1] => Array
        (
            [@attributes] => Array
                (
                    [id] => 102
                )

            [item] => butter
            [price] => 2.45
            [itemid] => 102
        )

    [2] => Array
        (
            [@attributes] => Array
                (
                    [id] => 103
                )

            [item] => bread
            [price] => 1.19
            [itemid] => 103
        )

)

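Note the [itemid] key/value pair added in the desired result. Is there a faster or more elegant way to achieve this? I have looked at some of PHP's fancy array functions, but I can't wrap my head around applying them to this more complex case. Any ideas?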

3 Answers

Memory efficiency

A comment in the PHP documentation notes that the memory footprint of SplFixedArray is about 37% of that of a regular array of the same size.

SplFixedArray also implements Iterator, which means it encapsulates the list and exposes one element at a time, which makes iterating over it more efficient.

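A foreach loop may work on a copy of the array passed to it. If you are processing a large amount of data, using it directly on your array can become a performance problem.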

See also: How big are PHP arrays (and values) really? (Hint: BIG!)

You can try:

$it = SplFixedArray::fromArray($items);
foreach ( $it as $value ) {
    // Play with big array
}
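The memory claim above is easy to check on your own PHP build. The following is a minimal sketch, not part of the original answer: the helper name bytesUsedBy and the element count are assumptions chosen for illustration. It fills a plain array and an SplFixedArray with the same integers and reports how much memory each one holds. The ratio depends heavily on the PHP version, so treat the 37% figure as something to verify rather than a constant.

```php
<?php
// Hypothetical helper (not from the answer): bytes still allocated
// after $build() has run, measured while its return value is alive.
function bytesUsedBy($build)
{
    $before = memory_get_usage();
    $data = $build();              // hold the result so it is not freed yet
    $used = memory_get_usage() - $before;
    unset($data);                  // release it before the next measurement
    return $used;
}

$n = 100000; // arbitrary element count chosen for illustration

$plainBytes = bytesUsedBy(function () use ($n) {
    $a = array();
    for ($i = 0; $i < $n; $i++) {
        $a[] = $i;
    }
    return $a;
});

$fixedBytes = bytesUsedBy(function () use ($n) {
    $a = new SplFixedArray($n);
    for ($i = 0; $i < $n; $i++) {
        $a[$i] = $i;
    }
    return $a;
});

printf("array:         %d bytes\n", $plainBytes);
printf("SplFixedArray: %d bytes\n", $fixedBytes);
```

Note that this only measures a flat list of integers. In the question's data each element is itself an associative array, and SplFixedArray::fromArray() converts only the outer list, so the nested arrays keep their usual cost.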

Speed

Here is a simple benchmark:

set_time_limit(0);
echo "<pre>";

$total = 10000;
$item = array("milk","butter","bread");
$items = array();

// Generating Random Data
for($i = 0; $i < $total; $i ++) {
    $att = array('id' => $i);
    $items[] = array('@attributes' => $att,'item' => $item[$i % 3],'price' => mt_rand(100, 5000) / 100);
}
// Pure array no copy
function m1($array) {
    foreach ( $array as $k => $v ) {
        isset($v['@attributes']) and $array[$k]['id'] = $v['@attributes']['id'];
        unset($array[$k]['@attributes']);
    }
    return $array;
}

// Array clean copy
function m2($array) {
    $items = array();
    foreach ( $array as $k => $v ) {
        isset($v['@attributes']) and $items[$k]['id'] = $v['@attributes']['id'];
        $items[$k]['item'] = $v['item'];
        $items[$k]['price'] = $v['price'];
    }
    return $items;
}

// Array Iterator
function m3($array) {
    $it = new ArrayIterator($array);
    $items = array();
    foreach ( $it as $k => $v ) {
        isset($v['@attributes']) and $items[$k]['id'] = $v['@attributes']['id'];
        $items[$k]['item'] = $v['item'];
        $items[$k]['price'] = $v['price'];
    }
    return $items;
}

// SplFixedArray Array
function m4($array) {
    $it = SplFixedArray::fromArray($array);
    $items = array();
    foreach ( $it as $k => $v ) {
        isset($v['@attributes']) and $items[$k]['id'] = $v['@attributes']['id'];
        $items[$k]['item'] = $v['item'];
        $items[$k]['price'] = $v['price'];
    }
    return $items;
}

// Array Map
function m5($array) {
    $items = array_map(function ($v) {
        isset($v['@attributes']) and $v['id'] = $v['@attributes']['id'];
        unset($v['@attributes']);
        return $v;
    }, $array);
    return $items;
}

// Array Walk
function m6($array) {
    array_walk($array, function (&$v, $k) {
        isset($v['@attributes']) and $v['id'] = $v['@attributes']['id'];
        unset($v['@attributes']);
        return $v;
    });
    return $array;
}

$result = array('m1' => 0,'m2' => 0,'m3' => 0,'m4' => 0,'m5' => 0,'m6' => 0);

for($i = 0; $i < 1; ++ $i) {
    foreach ( array_keys($result) as $key ) {
        $alpha = microtime(true);
        $key($items);
        $result[$key] += microtime(true) - $alpha;
    }
}

echo '<pre>';
echo "Single Run\n";
print_r($result);
echo '</pre>';

$result = array('m1' => 0,'m2' => 0,'m3' => 0,'m4' => 0,'m5' => 0,'m6' => 0);

for($i = 0; $i < 2; ++ $i) {
    foreach ( array_keys($result) as $key ) {
        $alpha = microtime(true);
        $key($items);
        $result[$key] += microtime(true) - $alpha;
    }
}

echo '<pre>';
echo "Dual Run\n";
print_r($result);
echo '</pre>';

The results are quite interesting:

PHP 5.3.10

Single Run
Array
(
    [m1] => 0.029280185699463 <--------------- fastest
    [m2] => 0.038463115692139
    [m3] => 0.049274921417236
    [m4] => 0.03856086730957
    [m5] => 0.032699823379517
    [m6] => 0.032186985015869
)

Dual Run
Array
(
    [m1] => 0.068470001220703
    [m2] => 0.077174663543701
    [m3] => 0.085768938064575
    [m4] => 0.07695198059082
    [m5] => 0.073209047317505
    [m6] => 0.065080165863037 <--------------- fastest over 2 runs
)

PHP 5.4.1

Single Run
Array
(
    [m1] => 0.029529094696045
    [m2] => 0.035377979278564
    [m3] => 0.03830099105835
    [m4] => 0.034613132476807
    [m5] => 0.031363010406494
    [m6] => 0.028403043746948  <---------- fastest
)

Dual Run
Array
(
    [m1] => 0.072367191314697
    [m2] => 0.071731090545654
    [m3] => 0.078131914138794
    [m4] => 0.075049877166748
    [m5] => 0.065959930419922
    [m6] => 0.060923099517822  <---------- Fastest
)
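
For reference, the exact transformation the question asks for (keep @attributes and add itemid) can also be written with a by-reference foreach. This variant is not one of m1 to m6 above and was not timed here, so no speed claim is made for it. It is a minimal sketch using the question's sample data.

```php
<?php
// Sample data copied from the question.
$items = array(
    array('@attributes' => array('id' => '101'), 'item' => 'milk',   'price' => '3.50'),
    array('@attributes' => array('id' => '102'), 'item' => 'butter', 'price' => '2.45'),
    array('@attributes' => array('id' => '103'), 'item' => 'bread',  'price' => '1.19'),
);

// Iterate by reference so each element is modified in place
// instead of building a second array.
foreach ($items as &$item) {
    if (isset($item['@attributes']['id'])) {
        $item['itemid'] = $item['@attributes']['id'];
    }
}
unset($item); // break the reference so later writes to $item do not touch $items

print_r($items);
```

The unset($item) after the loop matters: without it, $item stays bound to the last element and a later assignment to $item would silently overwrite that element.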
answered 2012-10-25T21:37:16.470

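This looks like it came from XML, so I will add that @attributes may contain more than just the ID. Assuming that does not happen, you could try a foreach instead, although I am not sure about the speed gain.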

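It may make a difference that you are modifying the same array you are looping over (I could not find evidence for this, though, so it needs experimenting).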

$cleanedArray = array();
foreach ($bigArray as $subArray)
{
    if (isset($subArray['@attributes']))
    {
        $subArray['itemid'] = $subArray['@attributes']['id'];
        unset($subArray['@attributes']); // Optional
        $cleanedArray[] = $subArray;
    }
}

Sorry if this ends up being slower.

Edit: added the missing index.
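
If @attributes does carry more than the ID, as cautioned at the top of this answer, one option is to hoist every attribute rather than only id. The sketch below is an illustration under assumptions: the function name flattenAttributes, the attr_ prefix, and the extra 'unit' attribute in the sample item are all made up for the example and do not come from the question.

```php
<?php
// Hypothetical helper: copies every entry of '@attributes' to the top level
// under a prefixed key, then removes the '@attributes' sub-array.
function flattenAttributes(array $item, $prefix = 'attr_')
{
    if (!isset($item['@attributes']) || !is_array($item['@attributes'])) {
        return $item; // nothing to hoist
    }
    foreach ($item['@attributes'] as $name => $value) {
        // The prefix avoids clobbering an existing key of the same name.
        $item[$prefix . $name] = $value;
    }
    unset($item['@attributes']);
    return $item;
}

// 'unit' is an invented attribute, added only to show the multi-attribute case.
$item = array(
    '@attributes' => array('id' => '101', 'unit' => 'litre'),
    'item'        => 'milk',
    'price'       => '3.50',
);

$flat = flattenAttributes($item);
print_r($flat);
```

Applied to the whole data set this would be array_map('flattenAttributes', $bigArray), at the cost of one function call per item.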

answered 2012-10-25T21:33:52.643

This is less an answer than a comparison of the approaches offered:

I used this script to average the time each algorithm takes:

<?php
//base data
$attrib1 = array('id'=>'101');
$item1 = array('@attributes'=>$attrib1, 'item'=>'milk', 'price'=>'3.50');
$attrib2 = array('id'=>'102');
$item2 = array('@attributes'=>$attrib2, 'item'=>'butter', 'price'=>'2.45');
$attrib3 = array('id'=>'103');
$item3 = array('@attributes'=>$attrib3, 'item'=>'bread', 'price'=>'1.19');
$results = array('test1'=>array(),'test2'=>array(),'test3'=>array());

//set trials
$trials=1000;

//test 1
for ($count = 0; $count < $trials; $count++) {
    unset($items);
    $items = array($item1, $item2, $item3);
    // microtime(true) returns float seconds; the default string form
    // ("msec sec") cannot be subtracted reliably.
    $timer1 = microtime(true);
    $i = 0;
    while (isset($items[$i]['@attributes']['id'])) {
        $items[$i]['itemid'] = $items[$i]['@attributes']['id'];
        $i++;
    }
    $timer1 = microtime(true) - $timer1;
    $results['test1'][$count] = $timer1;
}

//test 2
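for ($count = 0; $count < $trials; $count++) {
    unset($items);
    unset($cleanedArray);
    $items = array($item1, $item2, $item3);
    $cleanedArray = array();
    $timer2 = microtime(true); // float seconds, safe to subtract
    foreach ($items as $subArray) {
        if (isset($subArray['@attributes'])) {
            // Copy the id up, as in the answer being compared, so that
            // this test does the same work as test 1.
            $subArray['itemid'] = $subArray['@attributes']['id'];
            unset($subArray['@attributes']);
            $cleanedArray[] = $subArray;
        }
    }
    $timer2 = microtime(true) - $timer2;
    $results['test2'][$count] = $timer2;
}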

//test 3
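for ($count = 0; $count < $trials; $count++) {
    unset($items);
    unset($it);
    $items = array($item1, $item2, $item3);
    $it = SplFixedArray::fromArray($items);
    // Reset the output array so it does not keep growing across trials.
    $cleanedArray = array();
    $timer3 = microtime(true); // float seconds, safe to subtract
    foreach ($it as $subArray) {
        if (isset($subArray['@attributes'])) {
            // Copy the id up so this test does the same work as test 1.
            $subArray['itemid'] = $subArray['@attributes']['id'];
            unset($subArray['@attributes']);
            $cleanedArray[] = $subArray;
        }
    }
    $timer3 = microtime(true) - $timer3;
    $results['test3'][$count] = $timer3;
}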

//results
$factor=pow(10,-6);
echo "Test 1 averaged " . round(array_sum($results['test1']) / count($results['test1'])/$factor,1) . " µs, with range: " . round((max($results['test1'])-min($results['test1']))/$factor,1) . " µs - (min: " . (min($results['test1'])/$factor) . ", max: " . (max($results['test1'])/$factor) . ")<br/>";

echo "Test 2 averaged " . round(array_sum($results['test2']) / count($results['test2'])/$factor,1) . " µs, with range: " . round((max($results['test2'])-min($results['test2']))/$factor,1) . " µs - (min: " . (min($results['test2'])/$factor) . ", max: " . (max($results['test2'])/$factor) . ")<br/>";

echo "Test 3 averaged " . round(array_sum($results['test3']) / count($results['test3'])/$factor,1) . " µs, with range: " . round((max($results['test3'])-min($results['test3']))/$factor,1) . " µs - (min: " . (min($results['test3'])/$factor) . ", max: " . (max($results['test3'])/$factor) . ")<br/>";

echo "<pre>";
var_dump($results);
echo "</pre>";

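The results here vary a lot when the number of trials is small, but with a larger base array and more trials the differences between the methods should become more pronounced.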
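
One way to act on that observation is to time each approach against a larger generated array instead of three items. The following is a minimal sketch, not the script used for this answer: the helper name averageSeconds, the item count of 10000, the trial count of 20, and the placeholder price are all assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: mean wall-clock seconds per call of $fn($data).
function averageSeconds($fn, array $data, $trials)
{
    $total = 0.0;
    for ($t = 0; $t < $trials; $t++) {
        $start = microtime(true);   // float seconds
        $fn($data);                 // passed by value, so $data is unchanged between trials
        $total += microtime(true) - $start;
    }
    return $total / $trials;
}

// Build a larger input with the same shape as the question's data.
$names = array('milk', 'butter', 'bread');
$items = array();
for ($i = 0; $i < 10000; $i++) {
    $items[] = array(
        '@attributes' => array('id' => (string) (101 + $i)),
        'item'        => $names[$i % 3],
        'price'       => '1.00', // placeholder price
    );
}

// The while loop from the question, wrapped in a closure so it can be timed.
$whileLoop = function (array $items) {
    $i = 0;
    while (isset($items[$i]['@attributes']['id'])) {
        $items[$i]['itemid'] = $items[$i]['@attributes']['id'];
        $i++;
    }
    return $items;
};

$avg = averageSeconds($whileLoop, $items, 20);
printf("while loop: %.1f µs per run over %d items\n", $avg * 1e6, count($items));
```

Because the closure receives the array by value and then writes to it, each timed run includes the cost of PHP separating its own copy. That applies equally to every approach timed this way, so the comparison stays fair, but the absolute numbers are higher than for a purely in-place loop.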

answered 2012-10-25T22:54:09.147