php - How to filter child node values from a parent div where style = "..." using Goutte and Symfony DomCrawler?

Question

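I'm trying to scrape quotes from a given Wikiquote page using the PHP package Goutte, which wraps the Symfony components BrowserKit, CssSelector, and DomCrawler.

However, my result set contains some quotes I don't want, namely those from the Misattributed section.

Here is what I have so far: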

use Goutte\Client;

$client = new Client();

$crawler = $client->request('GET', 'http://en.wikiquote.org/wiki/Thomas_Jefferson');

//grab all the children li's from the wikiquote page
$quotes = $crawler->filter('ul > li');

$quoteArray = [];

//foreach li with a node value that does not start with a number, push the node value onto quote array
//this filters out the table of contents <li> node values which I do not want

foreach($quotes as $quote)
{
    if(!is_numeric(substr($quote->nodeValue, 0, 1)))
    {
        array_push($quoteArray, $quote->nodeValue);
    }
}

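The problem I'm focused on now is how to filter out the quotes from the Misattributed section. This section is contained in a parent div that has the following style attribute: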

style="padding: .5em; border: 1px solid black; background-color:#FFE7CC"

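I was thinking that if I could somehow get the li node values from this particular section, I could filter them out of $quoteArray above. The problem is that I can't figure out how to select the child li node values from this section.

I have tried selecting the children with variations of the following: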

$badQuotes = $crawler->filter('div[style="padding: .5em; border: 1px solid black; background-color:#FFE7CC"] > ul > li');

But this does not return the node values I need. Does anyone know how to do this, or what I'm doing wrong?

score 0 · Accepted Answer

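According to its documentation, the DomCrawler filter method:

Filters the list of nodes with a CSS selector.

This is not as powerful as using XPath. My guess is that the CssSelector component cannot convert your complex query into an equivalent XPath expression. A complex filter like this one is better written with the filterXPath method, which:

Filters the list of nodes with an XPath expression.

So, in your case, try the filterXPath method as follows: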

$crawler->filterXPath("//div[contains(@style,'padding: .5em; border: 1px solid black; background-color:#FFE7CC')]");
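Note that the expression above selects the div itself, not the li elements inside it. Below is a minimal sketch of the remaining steps: extend the XPath to reach the nested li nodes, then subtract their text from the full list. To keep it runnable without Composer, it uses PHP's built-in DOMDocument and DOMXPath instead of Goutte. With DomCrawler, the same expression should be usable with filterXPath. The inline HTML is a made-up stand-in for the Wikiquote page, the nodeTexts helper is hypothetical, and matching only the background-color fragment of the style attribute is an assumption that the live markup may not satisfy exactly.

```php
<?php
// Hypothetical, trimmed-down stand-in for the Wikiquote markup.
$html = <<<'HTML'
<html><body>
<ul><li>Genuine quote A</li><li>Genuine quote B</li></ul>
<div style="padding: .5em; border: 1px solid black; background-color:#FFE7CC">
  <ul><li>Misattributed quote C</li></ul>
</div>
</body></html>
HTML;

// Hypothetical helper: collect the trimmed text of each node in a list.
function nodeTexts(DOMNodeList $nodes): array
{
    $texts = [];
    foreach ($nodes as $node) {
        $texts[] = trim($node->nodeValue);
    }
    return $texts;
}

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Every <li> that is a direct child of a <ul>, as in the question.
$allQuotes = nodeTexts($xpath->query('//ul/li'));

// Only the <li> nodes nested anywhere inside the styled div.
// contains() matches a fragment of the style attribute, so small
// differences elsewhere in the attribute do not break the match.
$badQuotes = nodeTexts(
    $xpath->query("//div[contains(@style, 'background-color:#FFE7CC')]//li")
);

// Drop the misattributed quotes and reindex the result.
$goodQuotes = array_values(array_diff($allQuotes, $badQuotes));

print_r($goodQuotes);
```

array_diff compares the quote strings by value, so a genuine quote whose text is identical to a misattributed one would also be dropped. If that matters, excluding the nodes in the XPath itself, for example with a not(ancestor::div[...]) predicate, may be a better fit.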
