
I'm trying to do a simple extraction, but I keep getting unpredictable results.

I have this HTML:

<div class="thread" style="margin-bottom:25px;"> 

<div class="message"> 

<span class="profile">Suzy Creamcheese</span> 

<span class="time">December 22, 2010 at 11:10 pm</span> 

<div class="msgbody"> 

<div class="subject">New digs</div> 

Hello thank you for trying our soap. <BR>  Jim.

</div> 
</div> 


<div class="message reply"> 

<span class="profile">Lars Jörgenmeier</span> 

<span class="time">December 22, 2010 at 11:45 pm</span> 

<div class="msgbody"> 

I never sold you any soap.

</div> 

</div> 

</div> 

I'm trying to extract the outer text from "msgbody", but only if "profile" equals a certain value. Like this:

$contents = $html->find('.msgbody');
$elements = $html->find('.profile');

$length = sizeof($contents);

while ($x != sizeof($elements)) {

    $var = $elements[$x]->outertext;

    // If profile = the right name
    if ($var = $name) {

        $text = $contents[$x]->outertext;
        echo $text;

    }

    $x++;
}

I'm getting text from the wrong profiles instead of from the ones I need. Is there a way to extract the required information with a single line of code?

Something like: if span-profile = "correct name", then pull its div-msgbody.


2 Answers


OK, I'll use DOMXPath for this one. I'm not sure what "outer text" is supposed to mean, but I'll go by this requirement:

Something like: if span-profile = "correct name", then pull its div-msgbody.

First, here is the reduced HTML test case I used:

<html>
<body>
<div class="thread" style="margin-bottom:25px;"> 

<div class="message"> 

<span class="profile">Suzy Creamcheese</span> 

<span class="time">December 22, 2010 at 11:10 pm</span> 

<div class="msgbody"> 

<div class="subject">New digs</div> 

Hello thank you for trying our soap. <BR>  Jim.

</div> 
</div> 


<div class="message reply"> 

<span class="profile">Lars Jörgenmeier</span> 

<span class="time">December 22, 2010 at 11:45 pm</span> 

<div class="msgbody"> 

I never sold you any soap.

</div> 

</div> 

</div>
</body>
</html>

So we'll write an XPath query for this. Here is the whole thing, followed by a breakdown:

$messages = $xpath->query("//span[@class='profile' and contains(.,'$profile_name')]/../div[@class='msgbody']");

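Breakdown:

//span

Give me the spans

//span[@class='profile']

Give me the spans whose class is profile

//span[@class='profile' and contains(.,'$profile_name')]

Give me the spans whose class is profile and whose content contains $profile_name, which is the name you're after

//span[@class='profile' and contains(.,'$profile_name')]/../

Give me the spans whose class is profile and whose content contains $profile_name, which is the name you're after; now go up one level, which gets us to <div class="message">

//span[@class='profile' and contains(.,'$profile_name')]/../div[@class='msgbody']

Give me the spans whose class is profile and whose content contains $profile_name, which is the name you're after; go up one level, which gets us to <div class="message">; finally, give me all the divs under <div class="message"> whose class is msgbody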
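One caveat about the query above: contains() is a substring test, so searching for "Suzy" would also match "Suzy Creamcheese". If you need an exact match, a variation like the following may help. This is only a sketch: the helper name messagesFor and the trimmed-down HTML are made up for illustration, it assumes the name contains no double quote (XPath 1.0 string literals cannot escape one), and it uses following-sibling instead of going up and back down.

```php
<?php
// Hypothetical helper: returns the trimmed msgbody text for an exact profile name.
function messagesFor(DOMXPath $xpath, string $name): array
{
    // Assumption: $name contains no double quote, since XPath 1.0 has no way
    // to escape a quote inside a string literal.
    $query = '//span[@class="profile" and normalize-space(.)="' . $name . '"]'
           . '/following-sibling::div[@class="msgbody"]';

    $out = [];
    foreach ($xpath->query($query) as $node) {
        $out[] = trim($node->textContent);
    }
    return $out;
}

// Cut-down illustration markup (not the full test case).
$html = <<<HTML
<div class="thread">
  <div class="message">
    <span class="profile">Suzy Creamcheese</span>
    <div class="msgbody">Hello thank you for trying our soap.</div>
  </div>
  <div class="message reply">
    <span class="profile">Suzy</span>
    <div class="msgbody">I never sold you any soap.</div>
  </div>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Only the profile that is exactly "Suzy" matches, not "Suzy Creamcheese".
print_r(messagesFor($xpath, 'Suzy'));
```

normalize-space(.) strips leading and trailing whitespace before comparing, so a span written with extra spaces or line breaks around the name still matches.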

Now, here is a PHP code example:

$doc = new DOMDocument();
$doc->loadHTMLFile("test.html");

$xpath = new DOMXpath($doc);
$profile_name = 'Lars Jörgenmeier';
$messages = $xpath->query("//span[@class='profile' and contains(.,'$profile_name')]/../div[@class='msgbody']");
foreach ($messages as $message) {
  echo trim("{$message->nodeValue}") . "\n";
}
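One thing to watch with a name like "Lars Jörgenmeier": the test case declares no character set, and depending on the libxml version behind DOMDocument, undeclared UTF-8 input may be read as a single-byte encoding, in which case the name never matches. A minimal sketch of one workaround, assuming the markup is UTF-8 and is available as a string (the prepended meta tag is my assumption, not part of the code above):

```php
<?php
// Assumption: this file and the markup are both saved as UTF-8.
$html = '<div class="message reply">'
      . '<span class="profile">Lars Jörgenmeier</span>'
      . '<div class="msgbody">I never sold you any soap.</div>'
      . '</div>';

// Declare the charset up front so the HTML parser does not have to guess.
$declared = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html;

$doc = new DOMDocument();
$doc->loadHTML($declared);

$xpath = new DOMXPath($doc);
$profile_name = 'Lars Jörgenmeier';
$messages = $xpath->query(
    "//span[@class='profile' and contains(.,'$profile_name')]/../div[@class='msgbody']"
);

// Collect the trimmed text of every matching msgbody.
$found = [];
foreach ($messages as $message) {
    $found[] = trim($message->nodeValue);
}
print_r($found);
```

If the file already carries a charset declaration of its own, none of this should be needed.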

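XPath is very powerful like that. I recommend looking at a basic tutorial, and then, if you want to see more advanced usage, you can look at the XPath standard.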

Answered 2011-05-22T00:41:41.650

Here is a working example using Simple HTML DOM.

I changed your sample HTML so that Suzy Creamcheese has more than one profile, as follows (file: test_class_class.htm):

 <div class="message"> 
   <span class="profile">Suzy Creamcheese</span> 
   <span class="time">December 22, 2010 at 11:10 pm</span> 
   <div class="msgbody"> 
     <div class="subject">New digs</div> 
       Hello thank you for trying our soap. <BR>  Jim.
     </div> 
   </div> 

   <div class="message reply"> 
     <span class="profile">Lars Jörgenmeier</span> 
     <span class="time">December 22, 2010 at 11:45 pm</span> 
     <div class="msgbody"> 
       I never sold you any soap.
     </div> 
   </div> 
 </div>

 <div class="message"> 
   <span class="profile">Suzy Yogurt</span> 
   <span class="time">December 22, 2010 at 11:10 pm</span> 
   <div class="msgbody"> 
     <div class="subject">No Creamcheese</div> 
       This is not Suzy Creamcheese <BR>  Jim.
     </div> 
   </div> 

   <div class="message reply"> 
     <span class="profile">Suzy Creamcheese</span> 
     <span class="time">December 22, 2010 at 11:45 pm</span> 
     <div class="msgbody"> 
       A reply from Suzy Creamcheese.
     </div> 
   </div> 
 </div>

</div>

Here is my test using Simple HTML DOM:

include('simple_html_dom.php');

function getMessage_for_profile($iUrl,$iProfile)
{
    // create HTML DOM
    $html = file_get_html($iUrl);

    // get text elements
    $aoProfile = $html->find('span[class=profile]'); 
    echo "Found ".count($aoProfile)." profiles.<br />";

    foreach ($aoProfile as $key=>$oProfile)
    {
      if ($oProfile->plaintext == $iProfile)
      {
        echo "<b>Profile ".$key.": ".$oProfile->plaintext."</b><br />";
        // Using $e->next_sibling()
        $oCurrent = $oProfile;
        while ($oNext = $oCurrent->next_sibling())
        {
           if ( $oNext->class == "msgbody" )
           {
             echo "<hr />";
             echo $oNext->outertext;
             echo "<hr />";
           }
           $oCurrent = $oNext;
        }
      }         
    }

    // clean up memory
    $html->clear();
    unset($html);

    return;
}
// --------------------------------------------
// test it!
// user_agent header...
ini_set('user_agent', 'My-Application/2.5');

getMessage_for_profile('test_class_class.htm','Suzy Creamcheese');
echo "<br /><br /><br />";
getMessage_for_profile('test_class_class.htm','Suzy Yogurt');

My output is:

Found 4 profiles.
Profile 0: Suzy Creamcheese
--------------------------------
New digs
Hello thank you for trying our soap.
Jim.
---------------------------------
Profile 3: Suzy Creamcheese
---------------------------------
A reply from Suzy Creamcheese.
---------------------------------



Found 4 profiles.
Profile 2: Suzy Yogurt
---------------------------------
No Creamcheese
This is not Suzy Creamcheese
Jim.
---------------------------------

So it can be done with Simple HTML DOM, and since I already know how the DOM works... or at least enough to get into trouble... I didn't have to learn any new syntax!
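For what it's worth, the check `$oProfile->plaintext == $iProfile` is where this differs from the loop in the question, which used `=` (assignment) and compared against outertext (which includes the tags). As far as I can tell, that is why the wrong profiles came back. A small plain-PHP illustration, with the strings written out by hand as stand-ins for what outertext and plaintext would hold, so no parser library is needed:

```php
<?php
// Hand-written stand-ins for one profile span (assumed values, for illustration).
$outertext = '<span class="profile">Lars Jörgenmeier</span>';
$plaintext = 'Lars Jörgenmeier';
$name      = 'Suzy Creamcheese';

// Assignment: copies $name into $var and then tests the copied string,
// which is truthy for any non-empty name other than "0".
$var = $outertext;
$assignmentBranchTaken = ($var = $name) ? true : false;

// Comparing against outertext fails even for the right name, because of the tags.
$outerMatches = ($outertext === 'Lars Jörgenmeier');

// Comparing against the text alone behaves as intended.
$plainMatches = ($plaintext === 'Lars Jörgenmeier');

var_dump($assignmentBranchTaken, $outerMatches, $plainMatches);
```

The first value is true even though the names differ, which means the original `if` fired for every profile.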

Answered 2011-07-02T09:17:19.400