html - Is there a way to make YQL return HTML?

Question

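I am trying to use YQL to extract a portion of HTML from a series of web pages. The pages themselves have slightly different structures (so a Yahoo Pipes "Fetch Page" with its "Cut content" feature does not work well), but the fragment I am interested in always has the same class attribute.

If I have an HTML page like this: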

<html>
  <body>
    <div class="foo">
      <p>Wolf</p>
      <ul>
        <li>Dog</li>
        <li>Cat</li>
      </ul>
    </div>
  </body>
</html>

and use a YQL expression like this:

SELECT * FROM html 
WHERE url="http://example.com/containing-the-fragment-above" 
AND xpath="//div[@class='foo']"

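What I get back are the (apparently unordered?) DOM elements, whereas what I want is the HTML content itself. I also tried SELECT content, but that selects only the text content. I want the HTML. Is this possible?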

score 8 · Accepted Answer

You can write a small Open Data Table that sends a normal YQL html table query and stringifies the results. Something like the following:

<?xml version="1.0" encoding="UTF-8" ?>
<table xmlns="http://query.yahooapis.com/v1/schema/table.xsd">
  <meta>
    <sampleQuery>select * from {table} where url="http://finance.yahoo.com/q?s=yhoo" and xpath='//div[@id="yfi_headlines"]/div[2]/ul/li/a'</sampleQuery>
    <description>Retrieve HTML document fragments</description>
    <author>Peter Cowburn</author>
  </meta>
  <bindings>
    <select itemPath="result.html" produces="JSON">
      <inputs>
        <key id="url" type="xs:string" paramType="variable" required="true"/>
        <key id="xpath" type="xs:string" paramType="variable" required="true"/>
      </inputs>
      <execute><![CDATA[
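// Run the ordinary html table query; .results.* is the E4X list of matched nodes.
var results = y.query("select * from html where url=@url and xpath=@xpath", {url:url, xpath:xpath}).results.*;
var html_strings = [];
// Serialize each matched node back to a markup string (E4X toXMLString).
for each (var item in results) html_strings.push(item.toXMLString());
// Expose the strings as result.html, which the itemPath above selects.
response.object = {html: html_strings};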
]]></execute>
    </select>
  </bindings>
</table>

You can then query that custom table with a YQL query such as:

use "http://url.to/your/datatable.xml" as html.tostring;
select * from html.tostring where 
  url="http://finance.yahoo.com/q?s=yhoo" 
  and xpath='//div[@id="yfi_headlines"]/div[2]/ul/li'

Edit: just realised this is a pretty old question. Oh well, at least there is now an answer for anyone who stumbles on it. :)
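
For anyone calling this from a script rather than a console, here is a rough sketch of how the two-statement query above could be packed into a single request URL. It only builds the URL and sends nothing. The endpoint and the q/format parameters are my assumption (the historical public YQL REST endpoint, not something stated in the answer), build_yql_url is a made-up helper name, and the data table URL is the same placeholder used above. No escaping of quotes inside the arguments is attempted.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: the historical public YQL REST endpoint (not given in the answer).
YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql"


def build_yql_url(table_url, page_url, xpath, alias="html.tostring"):
    """Return a GET URL carrying the USE + SELECT statements for the custom table.

    Hypothetical helper: page_url must not contain double quotes and xpath
    must not contain single quotes, because nothing is escaped here.
    """
    statement = (
        f'use "{table_url}" as {alias}; '
        f"select * from {alias} where url=\"{page_url}\" and xpath='{xpath}'"
    )
    return YQL_ENDPOINT + "?" + urlencode({"q": statement, "format": "json"})


url = build_yql_url(
    "http://url.to/your/datatable.xml",  # placeholder from the answer
    "http://finance.yahoo.com/q?s=yhoo",
    '//div[@id="yfi_headlines"]/div[2]/ul/li',
)

# Decode the query string again to check that the statement survived encoding.
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"][0])
```

With the table above, each entry of the returned html array would be one serialized node.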

score 2 · Accepted Answer

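I had the same problem. The only way I got around it was to avoid YQL and just use regular expressions to match the start and end tags :/. Not the best solution, but if the HTML is relatively stable and the pattern simply runs from, say, <div class='name'> to <div class='just_after'>, you can get away with it. Then you can grab the HTML in between.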
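
A minimal sketch of that approach in Python, assuming the page has already been downloaded into a string. The helper name html_between and the sample page are made up for illustration; the two marker tags are the ones mentioned above.

```python
import re


def html_between(page, start_tag, end_tag):
    """Return the raw HTML between start_tag and the next end_tag, or None."""
    # re.escape makes the tags literal; DOTALL lets '.' span line breaks;
    # the non-greedy '.*?' stops at the first end_tag after start_tag.
    pattern = re.escape(start_tag) + r"(.*?)" + re.escape(end_tag)
    match = re.search(pattern, page, flags=re.DOTALL)
    return match.group(1) if match else None


# Made-up sample page for illustration only.
page = (
    "<div class='name'><p>Wolf</p><ul><li>Dog</li></ul></div>"
    "<div class='just_after'>footer</div>"
)

fragment = html_between(page, "<div class='name'>", "<div class='just_after'>")
print(fragment)
```

Note that the captured text still ends with the closing </div> of the first block, since the match runs up to the next opening tag. It also breaks as soon as the markup changes its quoting or attribute order, which is why this only works when the HTML is stable.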

score 0 · Accepted Answer

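YQL converts the page to XML, runs the XPath against it, then takes the resulting DOMNodeList and serializes it back to XML for output (converting that to JSON if requested). You cannot get at the original source.

Why can't you work with the XML instead of the HTML?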
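
To illustrate the parse, select, re-serialize pipeline described above, here is a small local sketch using Python's standard library instead of YQL. It assumes the markup is already well-formed XML (as the sample page in the question is); the function name fragments is made up, and ElementTree only supports a subset of XPath.

```python
import xml.etree.ElementTree as ET

# The sample page from the question; it happens to be well-formed XML.
PAGE = """<html>
  <body>
    <div class="foo">
      <p>Wolf</p>
      <ul>
        <li>Dog</li>
        <li>Cat</li>
      </ul>
    </div>
  </body>
</html>"""


def fragments(markup, path):
    """Parse markup, select nodes with a limited XPath, and re-serialize each one."""
    root = ET.fromstring(markup)
    # tostring() also emits the whitespace that follows the element, so strip it.
    return [
        ET.tostring(node, encoding="unicode").strip()
        for node in root.findall(path)
    ]


result = fragments(PAGE, ".//div[@class='foo']")
print(result[0])
```

The serialized node is markup again, just not byte-for-byte the original source, which is the point made above.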
