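I have a SimpleXML object built by merging several XML documents from PubMed (snippet below), but the merge contains duplicates. How can I compare all of the first-level sub-arrays (array[][0], array[][1], etc.) and discard any duplicates? I thought serialization might be the answer, but you can't serialize SimpleXML objects, AFAIK.

I don't know where to start.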

Array
(
  [0] => Array
    (
        [title] => SimpleXMLElement Object
            (
                [0] => Superstructure of the centromeric complex of TubZRC plasmid partitioning systems.
            )

        [link] => SimpleXMLElement Object
            (
                [@attributes] => Array
                    (
                        [Version] => 1
                    )

                [0] => 23010931
            )

        [author] => Aylett, CH., Löwe, J.
        [journal] => SimpleXMLElement Object
            (
                [0] => Proc. Natl. Acad. Sci. U.S.A.
            )

        [pubdate] => 2012-9-27
        [day] => SimpleXMLElement Object
            (
                [0] => 25
            )

        [month] => SimpleXMLElement Object
            (
                [0] => Sep
            )

        [year] => SimpleXMLElement Object
            (
                [0] => 2012
            )

    )
    [1] => Array
    (
        [title] => SimpleXMLElement Object
            (
                [0] => Superstructure of the centromeric complex of TubZRC plasmid partitioning systems.
            )

        [link] => SimpleXMLElement Object
            (
                [@attributes] => Array
                    (
                        [Version] => 1
                    )

                [0] => 23010931
            )

        [author] => Aylett, CH., Löwe, J.
        [journal] => SimpleXMLElement Object
            (
                [0] => Proc. Natl. Acad. Sci. U.S.A.
            )

        [pubdate] => 2012-9-27
        [day] => SimpleXMLElement Object
            (
                [0] => 25
            )

        [month] => SimpleXMLElement Object
            (
                [0] => Sep
            )

        [year] => SimpleXMLElement Object
            (
                [0] => 2012
            )

    )

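Alternatively, this could be done at the initial XML merge stage. Can anyone suggest how to modify the code below, which I currently use, so that it removes duplicates?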

function simplexml_merge (SimpleXMLElement &$xml1, SimpleXMLElement $xml2) {
    $dom1 = new DomDocument();
    $dom2 = new DomDocument();

    $dom1->loadXML($xml1->asXML());
    $dom2->loadXML($xml2->asXML());

    $xpath = new domXPath($dom2);
    $xpathQuery = $xpath->query('/*/*');
    for ($i = 0; $i < $xpathQuery->length; $i++) {
        $dom1->documentElement->appendChild(
        $dom1->importNode($xpathQuery->item($i), true));
    }
    $xml1 = simplexml_import_dom($dom1);
}

$xml1 = new SimpleXMLElement($search1);
$xml2 = new SimpleXMLElement($search2);

simplexml_merge($xml1, $xml2);

Thanks.

...

For clarity, this is the layout of the source XML that I import into SimpleXML. Each PubmedArticle is an "element" that I want to compare to make sure there are no duplicates:

    <xml...>
    <Document>
        <PubmedArticle>
            <MedlineCitation>
                <PMID version="1">xxx</PMID>
                ...
            </MedlineCitation>
            ...
        </PubmedArticle>
        <PubmedArticle>
            <MedlineCitation>
                <PMID version="1">xxx</PMID>
                ...
            </MedlineCitation>
            ...
        </PubmedArticle>
        etc
     </Document>
     </xml>

The PMID node is unique, so it can be used to check for duplicates.

...

Using the link from @Gordon, I now have this:

//Get my source XML
$xml1 = new SimpleXMLElement($search1);
$xml2 = new SimpleXMLElement($search2);

//Run through $xml1 and build a query based on its PMIDs
$query = array();
foreach ($xml1->PubmedArticle as $paper) {
    $query[] = sprintf('(PMID != %s)',$paper->MedlineCitation->PMID);
}
$query = implode(' and ', $query);

//Run through $xml2 and get the nodes that don't have a PMID matching $xml1
foreach ($xml2->xpath(sprintf('PubmedArticle/MedlineCitation[%s]', $query)) as $paper) {
    echo $paper->asXml();
}

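But I still have a problem: merging the output. First, the output from $xml2 is missing the <PubmedArticle> node around each "match". After that, I assume I can use the same merge code (above) to do the merge. Can you point me in the right direction?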

2 Answers

Convert it to an array (I won't write that for you; just iterate and add), then use array_diff().
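
A minimal sketch of that idea, assuming the PubMed layout shown in the question (PubmedArticle/MedlineCitation/PMID). Since array_diff() compares values as strings and is awkward with nested arrays or objects, this sketch keys each article by its PMID and uses array_diff_key() instead. The helper name articlesByPmid and the sample documents are hypothetical, not part of the original code:

```php
<?php
// Hypothetical helper: index the <PubmedArticle> elements of a document by PMID.
function articlesByPmid(SimpleXMLElement $xml) {
    $articles = array();
    foreach ($xml->PubmedArticle as $paper) {
        // Cast to string so the key is the PMID text rather than an object
        $articles[(string) $paper->MedlineCitation->PMID] = $paper;
    }
    return $articles;
}

// Made-up sample documents standing in for $search1 and $search2
$article = '<PubmedArticle><MedlineCitation><PMID Version="1">%s</PMID></MedlineCitation></PubmedArticle>';
$search1 = '<Document>' . sprintf($article, 111) . sprintf($article, 222) . '</Document>';
$search2 = '<Document>' . sprintf($article, 222) . sprintf($article, 333) . '</Document>';

$xml1 = new SimpleXMLElement($search1);
$xml2 = new SimpleXMLElement($search2);

$first  = articlesByPmid($xml1);
$second = articlesByPmid($xml2);

// Articles whose PMID appears in $second but not in $first
$onlyInSecond = array_diff_key($second, $first);

// Union by key: every article from $first plus the new ones from $second
$unique = $first + $onlyInSecond;

foreach ($unique as $pmid => $paper) {
    echo $pmid, "\n";
}
```

The union operator (+) is used rather than array_merge() because PHP stores numeric string keys as integers, and array_merge() would renumber them and lose the PMIDs.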

Answered 2012-10-01T13:41:25.287

I decided to follow @Gordon's route, since it preserves the XML. In the end everything works:

// Return a <Document> holding only the articles in $xml2 whose PMID is not already in $xml1
function dedupeXML(SimpleXMLElement $xml1, SimpleXMLElement $xml2) {
    // build one XPath condition per PMID already present in $xml1
    $query = array();
    foreach ($xml1->PubmedArticle as $paper) {
        $query[] = sprintf('(MedlineCitation/PMID != %s)', $paper->MedlineCitation->PMID);
    }
    // with nothing to exclude, select every article; an empty predicate "[]" is invalid XPath
    $xpath = $query ? sprintf('PubmedArticle[%s]', implode(' and ', $query)) : 'PubmedArticle';

    // rebuild a document around the surviving <PubmedArticle> nodes
    $xmlClean = '<Document>';
    foreach ($xml2->xpath($xpath) as $paper) {
        $xmlClean .= $paper->asXML();
    }
    $xmlClean .= '</Document>';
    return new SimpleXMLElement($xmlClean);
}

// Append every child element of $xml2 to a copy of $xml1 and return the merged document
function mergeXML(SimpleXMLElement $xml1, SimpleXMLElement $xml2) {
    // convert SimpleXML objects into DOM ones
    $dom1 = new DOMDocument();
    $dom2 = new DOMDocument();
    $dom1->loadXML($xml1->asXML());
    $dom2->loadXML($xml2->asXML());
    // pull all child elements of the second XML
    $xpath = new DOMXPath($dom2);
    foreach ($xpath->query('/*/*') as $node) {
        // and append them to the first one
        $dom1->documentElement->appendChild($dom1->importNode($node, true));
    }
    return simplexml_import_dom($dom1);
}

$xml1 = new SimpleXMLElement($search1);
$xml2 = new SimpleXMLElement($search2);
$xml3 = new SimpleXMLElement($search3);
// dedupe and merge inputs
// input 1 & 2
$xml2Clean = dedupeXML($xml1, $xml2);
$xml12 = mergeXML($xml1, $xml2Clean);
// input 1+2 & 3
$xml3Clean = dedupeXML($xml12, $xml3);
$xml123 = mergeXML($xml12, $xml3Clean);

This should be easy to adapt to other data sources: just modify the dedupeXML function to match the structure of your XML.
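
As a rough illustration of that adaptation, here is a sketch that takes the record element name and the path to the unique key as parameters, and dedupes while merging in a single pass. This is an alternative shape rather than the code above: the function name mergeUnique, its parameters, and the sample documents are all hypothetical, and it assumes the records are direct children of the root element:

```php
<?php
// Hypothetical generalisation: merge any number of documents, keeping the first
// occurrence of each record. $recordName and $keyPath describe the data source.
function mergeUnique(array $documents, $recordName, $keyPath, $rootName = 'Document') {
    $merged = new DOMDocument();
    $merged->appendChild($merged->createElement($rootName));
    $seen = array(); // keys already copied into $merged

    foreach ($documents as $xml) {
        foreach ($xml->$recordName as $record) {
            // Evaluate the key path relative to the current record
            $matches = $record->xpath($keyPath);
            if ($matches) {
                $key = (string) $matches[0];
                if (isset($seen[$key])) {
                    continue; // duplicate key: skip this record
                }
                $seen[$key] = true;
            }
            // Records without a key are copied as-is, since they cannot be compared
            $node = dom_import_simplexml($record);
            $merged->documentElement->appendChild($merged->importNode($node, true));
        }
    }
    return simplexml_import_dom($merged);
}

// Made-up sample documents in the PubMed layout from the question
$article = '<PubmedArticle><MedlineCitation><PMID Version="1">%s</PMID></MedlineCitation></PubmedArticle>';
$docA = new SimpleXMLElement('<Document>' . sprintf($article, 111) . sprintf($article, 222) . '</Document>');
$docB = new SimpleXMLElement('<Document>' . sprintf($article, 222) . sprintf($article, 333) . '</Document>');
$docC = new SimpleXMLElement('<Document>' . sprintf($article, 333) . sprintf($article, 111) . sprintf($article, 444) . '</Document>');

$all = mergeUnique(array($docA, $docB, $docC), 'PubmedArticle', 'MedlineCitation/PMID');

$pmids = array();
foreach ($all->PubmedArticle as $paper) {
    $pmids[] = (string) $paper->MedlineCitation->PMID;
}
echo implode(', ', $pmids), "\n";
```

Tracking seen keys in an array avoids building one long XPath predicate per merge, and it also catches duplicates inside a single input document, which the pairwise dedupeXML approach does not check.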

Answered 2012-10-11T08:45:46.817