php - Get the inner text using the curl concept in PHP

Question

This is the HTML from the website. I want to grab this text:

1,000 Places To See Before You Die

<ul class="listings">
<li>
<a href="http://watchseries.eu/serie/1,000_places_to_see_before_you_die" title="1,000 Places To See Before You Die">
1,000 Places To See Before You Die
<span class="epnum">2009</span>
</a>
</li>

I used code like this:

foreach($html->find('ul.listings li a') as $e)
echo $e->innertext. '<br/>';

The output I get looks like this:

 999: Whats Your Emergency<span class="epnum">2012</span>

It includes the span. Please help me.

score 4 · Accepted Answer

Why not use DOMDocument and read the title attribute?

$string = '<ul class="listings">
<li>
<a href="http://watchseries.eu/serie/1,000_places_to_see_before_you_die" title="1,000 Places To See Before You Die">
1,000 Places To See Before You Die
<span class="epnum">2009</span>
</a>
</li>';

$dom = new DOMDocument;
$dom->loadHTML($string);
$xpath = new DOMXPath($dom);
$text = $xpath->query('//ul[@class="listings"]/li/a/@title')->item(0)->nodeValue;
echo $text;

Or:

$text = explode("\n", trim($xpath->query('//ul[@class="listings"]/li/a')->item(0)->nodeValue));
echo $text[0];
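
To handle every link in the list rather than only the first, one option is to select each anchor's direct text nodes with `text()`, which skips the nested span. This is a minimal sketch assuming the markup from the question; the function name `listingTitles` is hypothetical and not part of any library.

```php
// Hypothetical helper: collect the direct text of each listing link,
// ignoring nested elements such as <span class="epnum">.
function listingTitles(string $html): array
{
    // Collect libxml warnings internally; real-world HTML is often imperfect.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $titles = [];
    foreach ($xpath->query('//ul[@class="listings"]/li/a') as $a) {
        $text = '';
        // text() selects only the text nodes that are direct children of <a>.
        foreach ($xpath->query('text()', $a) as $node) {
            $text .= $node->nodeValue;
        }
        $titles[] = trim($text);
    }
    return $titles;
}

$html = '<ul class="listings">
<li>
<a href="http://watchseries.eu/serie/1,000_places_to_see_before_you_die" title="1,000 Places To See Before You Die">
1,000 Places To See Before You Die
<span class="epnum">2009</span>
</a>
</li>
</ul>';

print_r(listingTitles($html));
```

Unlike the `explode("\n", ...)` variant, this does not depend on the title and the year sitting on separate lines.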


score 1 · Accepted Answer

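I can think of two ways to solve this. One is to read the title attribute from the anchor tag. Of course, not everyone sets a title attribute on their anchor tags, and when they do, its value may differ from the link text. The other solution is to take the innertext property and then replace each child tag of the anchor with an empty string.

So, either do this: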

$e->title;

or this:

$text = $e->innertext;
foreach ($e->children() as $child)
{
    $text = str_replace($child, '', $text);
}

It might be a good idea to use DOMDocument for this, though.

score 0 · Accepted Answer

You can use strip_tags() on it:

echo trim(strip_tags($e->innertext));

Or try preg_replace() to remove the unwanted tags together with their content:

echo preg_replace('/<span[^>]*>([\s\S]*?)<\/span[^>]*>/', '', $e->innertext);
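
The two calls above behave differently, which a small standalone comparison may make clearer. This sketch assumes the anchor's inner HTML looks like the snippet in the question; the variable names are illustrative only.

```php
// Inner HTML of the anchor, modelled on the question's markup (assumption).
$inner = "1,000 Places To See Before You Die\n<span class=\"epnum\">2009</span>";

// strip_tags() drops the tags but keeps the text inside them,
// so the year is still part of the result.
$stripped = trim(strip_tags($inner));

// Removing the whole <span>...</span> element drops the year as well.
// The s modifier lets . match newlines inside the span.
$withoutSpan = trim(preg_replace('/<span[^>]*>.*?<\/span>/s', '', $inner));

echo $stripped, "\n---\n", $withoutSpan, "\n";
```

Only the second result is free of the year, so strip_tags() alone is not enough when the span's text is unwanted.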

score -1 · Accepted Answer

First, check your HTML. Right now it looks like this:

  $string = '<ul class="listings">
               <li>
                  <a href="http://watchseries.eu/serie/1,000_places_to_see_before_you_die" title="1,000 Places To See Before You Die">
 1,000 Places To See Before You Die
                    <span class="epnum">2009</span>
                 </a>
             </li>';

The ul has no closing tag; maybe you missed it.

  $string = '<ul class="listings">
               <li>
                  <a href="http://watchseries.eu/serie/1,000_places_to_see_before_you_die" title="1,000 Places To See Before You Die">
 1,000 Places To See Before You Die
                    <span class="epnum">2009</span>
                 </a>
             </li>
            </ul>';

Then try it like this:

 $xml = simplexml_load_string($string);
 echo $xml->li->a['title'];

score -1 · Accepted Answer

Use plaintext instead.

echo $e->plaintext;

The year will still appear, though; you can trim it off with a regular expression.

Here is the example from the documentation:

$html = str_get_html("<div>foo <b>bar</b></div>");
$e = $html->find("div", 0);

echo $e->tag; // Returns: " div"
echo $e->outertext; // Returns: " <div>foo <b>bar</b></div>"
echo $e->innertext; // Returns: " foo <b>bar</b>"
echo $e->plaintext; // Returns: " foo bar"
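
The regular-expression trimming mentioned above could look roughly like this. It is only a sketch: `stripTrailingYear` is a hypothetical helper, and it assumes every entry ends with a four-digit year, so a title that itself ends in four digits and has no year after it would be cut incorrectly.

```php
// Hypothetical helper: strip a trailing four-digit year from a title.
function stripTrailingYear(string $text): string
{
    // Collapse runs of whitespace first, since the text may contain line breaks.
    $text = trim(preg_replace('/\s+/', ' ', $text));
    // Remove the last four digits and any whitespace right before them.
    return preg_replace('/\s*\d{4}$/', '', $text);
}

echo stripTrailingYear("999: Whats Your Emergency2012"), "\n";
echo stripTrailingYear("1,000 Places To See Before You Die\n2009"), "\n";
```

In the loop from the question this would be called as `stripTrailingYear($e->plaintext)`.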
