xpath - Phantom element when using ImportXML and XPath in Google Spreadsheets

Question

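I'm trying to use XPath to get the value of an element's attribute from this site via importXML in Google Spreadsheets.

The attribute value I'm after is content, found in the <span> with itemprop="price".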

<div class="left" style="margin-top: 10px;">
    <meta itemprop="currency" content="RON">
        <span class="pret" itemprop="price" content="698,31 RON">
            <p class="pret">Pretul tau:</p>
            698,31 RON
        </span>
...
</div>

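I can access <div class="left">, but I can't access the <span> element.

I have tried:

//span[@class='pret']/@content returns #N/A;
//span[@itemprop='price']/@content returns #N/A;
//div[@class='left']/span[@class='pret' and @itemprop='price']/@content returns #N/A;
//div[@class='left']/span[1]/@content returns #N/A;
//div[@class='left']/span/text(), meant to get the span's text node, returns #N/A;
//div[@class='left']//span/text() returns the text node of a span lower down in div.left.

To get the span's text node I have to use //div[@class='left']/text(). But I can't rely on that text node, because the layout of the span changes when the product is discounted, so I need the attribute.

It's as if the span I'm looking for doesn't exist, even though it shows up in Chrome's dev tools and in the page source, and all of these XPaths work in the console with $x("").

I tried generating the XPath directly from the dev tools by right-clicking, but I get //*[@id='produs']/div[4]/div[4]/div[1]/span, which doesn't work. I also tried generating the XPath with Firefox and with plugins for FF and Chrome, to no avail. XPaths generated this way didn't even work on sites that I had managed to scrape with hand-coded XPath.

Now, the strangest thing is that on another site with an apparently similar code structure, the XPath //span[@itemprop='price']/@content works.

I've been struggling with this for 4 days now. I'm starting to think it has something to do with the self-closing meta tag, but why doesn't this happen on other sites?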

score 2 · Accepted Answer

Perhaps the following formulas can help you:

=ImportXML("http://...","//div[@class='product-info-price']//div[@class='left']/text()")

Or

=INDEX(ImportXML("http://...","//div[@class='product-info-price']//div[@class='left']"), 1, 2)

UPDATE

It seems that ImportXML cannot properly parse the entire document, so the query fails. An extract of the document, something like:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<div class="product-info-price">
    <div class="left" style="margin-top: 10px;">
        <meta itemprop="currency" content="RON">
        <span class="pret" itemprop="price" content="698,31 RON">
            <p class="pret">Pretul tau:</p>
            698,31 RON
        </span>
        <div class="resealed-info">
            <a href="/resigilate/componente-pc/placi-de-baza/" rel="nofollow">» Vezi 1 resigilat din aceasta categorie</a>
        </div>
        <ul style="margin-left: auto;margin-right: auto;width: 200px;text-align: center;margin-top: 20px;">
            <li style="color: #000000; font-size: 11px;">Rata de la <b>28,18 RON</b> prin <a href="http://www.marketonline.ro/rate-sapte-stele?amount=698.31#brdfinance" title="BRD Finance" target="_blank" class="rate" rel="nofollow">BRD</a></li>
            <li style="color: #5F5F5F;text-align: center;">Pretul include TVA</li>
            <li style="color: #5F5F5F;">Cod produs: <span style="margin-left: 0;text-align: center;font-weight: bold;" itemprop="identifier" content="mol:GA-Z87X-UD3H">GA-Z87X-UD3H</span> </li>
        </ul>
    </div>
    <div class="right" style="height: 103px;line-height: 103px;">
        <form action="/?a=shopping&amp;sa=addtocart" method="post" id="add_to_cart_form">
            <input type="hidden" name="product-183641" value="on"/>
            <a href="/adaugaincos-183641" rel="nofollow"><img src="/templates/marketonline/images/pag-prod/buton_cumpara.jpg"/></a>
        </form>
    </div>
</div>
</html>

works with the following XPath query:

"//div[@class='product-info-price']//div[@class='left']//span[@itemprop='price']/@content"

UPDATE

It occurs to me that another option is to use Apps Script to create your own ImportXML function, something like:

/* CODE FOR DEMONSTRATION PURPOSES */
function MyImportXML(url) {
  var content = '';
  var response = UrlFetchApp.fetch(url);
  if (response) {
    var html = response.getContentText();
    // [^"]* stops at the first closing quote; the null check avoids a
    // TypeError when the page has no matching span.
    var match = html ? html.match(/<span class="pret" itemprop="price" content="([^"]*)"/i) : null;
    if (match) content = match[1];
  }
  return content;
}

Then you can use it as follows:

=MyImportXML("http://...")
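
The regex idea can be checked outside the spreadsheet before wiring it into Apps Script. What follows is a minimal Python sketch, not part of the original answer: the name extract_price and the sample string are assumptions made for illustration. It captures with [^"]* so the match stops at the first closing quote, and it returns an empty string when the span is missing.

```python
import re

# Hypothetical helper (not from the answer): matches the price span from the
# question and captures its content attribute.
PRICE_RE = re.compile(
    r'<span\s+class="pret"\s+itemprop="price"\s+content="([^"]*)"',
    re.IGNORECASE,
)


def extract_price(html):
    """Return the content attribute of the price span, or '' if it is absent."""
    match = PRICE_RE.search(html)
    return match.group(1) if match else ""


# Sample markup copied from the question, collapsed onto one line.
sample = (
    '<meta itemprop="currency" content="RON">'
    '<span class="pret" itemprop="price" content="698,31 RON">'
)

print(extract_price(sample))
print(repr(extract_price("<p>no price here</p>")))
```

Like the Apps Script version, this depends on the exact attribute order in the page, so it will need adjusting if the site changes its markup.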

score 1 · Accepted Answer

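At this time, the web page referenced by the first link doesn't contain a span tag with itemprop="price", but the following XPath returns 639:

//b[@itemprop='price']

It seems to me that the problem was that the meta tag was not XHTML compliant, but now all the meta tags are properly closed.

Before:

<meta itemprop="currency" content="RON">

Now:

<meta itemprop="priceCurrency" content="RON" />

For web pages that are not XHTML compliant, use another solution instead of IMPORTXML, such as IMPORTDATA together with REGEXEXTRACT, or Google Apps Script with the UrlFetch service and the JavaScript match function.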
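
To illustrate why an unclosed meta tag can matter, here is a minimal Python sketch using the standard library's strict XML parser. It only shows how a strict parser treats the two variants of the markup; whether IMPORTXML parses pages the same way is an assumption, and it may well be more lenient. The helper name price_content and the trimmed sample markup are made up for this example.

```python
import xml.etree.ElementTree as ET

# Markup as it appeared in the question: <meta> is never closed.
unclosed = (
    '<div class="left">'
    '<meta itemprop="currency" content="RON">'
    '<span class="pret" itemprop="price" content="698,31 RON">698,31 RON</span>'
    '</div>'
)

# The same markup with the <meta> tag self-closed, as on the updated page.
closed = unclosed.replace('content="RON">', 'content="RON" />')


def price_content(markup):
    """Return the price span's content attribute, or None if the markup
    is not well-formed XML or the span is missing."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return None
    span = root.find(".//span[@itemprop='price']")
    return span.get("content") if span is not None else None


print(price_content(unclosed))  # strict parser rejects the unclosed <meta>
print(price_content(closed))    # well-formed variant can be queried
```

With a strict parser the unclosed variant never produces a tree at all, so no XPath against it can succeed, which is consistent with the span appearing not to exist.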

score 0 · Accepted Answer

Try it like this:

# Assumes `tree` is an lxml tree already parsed from the page.
print('content by key', tree.xpath('//*[@itemprop="price"]')[0].get('content'))

Or

nodes = tree.xpath('//div/meta/span')
for node in nodes:
    print('content =', node.get('content'))

But I haven't tried it.
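
The snippets in this answer rely on a tree object that is never constructed. As a self-contained alternative, here is a minimal sketch that swaps in Python's standard html.parser module, which tolerates unclosed tags such as the meta element in the question. The class name PriceFinder and the sample markup are assumptions for illustration, and this is a different technique from the XPath calls in the answer.

```python
from html.parser import HTMLParser


class PriceFinder(HTMLParser):
    """Collect the content attribute of every tag carrying itemprop="price"."""

    def __init__(self):
        super().__init__()
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attributes = dict(attrs)
        if attributes.get("itemprop") == "price" and "content" in attributes:
            self.prices.append(attributes["content"])


# Sample markup based on the question, including the unclosed <meta> tag.
page = """
<div class="left" style="margin-top: 10px;">
    <meta itemprop="currency" content="RON">
    <span class="pret" itemprop="price" content="698,31 RON">
        <p class="pret">Pretul tau:</p>
        698,31 RON
    </span>
    <span itemprop="identifier" content="mol:GA-Z87X-UD3H">GA-Z87X-UD3H</span>
</div>
"""

finder = PriceFinder()
finder.feed(page)
print(finder.prices)
```

Because the parser only reacts to start tags and their attributes, it does not depend on how the span is nested or whether earlier tags were closed.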
