python - Python CGI script (using XML and minidom) returns unexpected results

Question

I am trying to parse the XML returned by search engine APIs (Bing, Yahoo and Blekko). The XML returned by Blekko (for the sample search query 'sushi') has the following form:

<rss version="2.0">
<channel>
    <title>blekko | rss for &quot;sushi/rss /ps=100&quot;</title>
    <link>http://blekko.com/?q=sushi%2Frss+%2Fps%3D100</link>
    <description>Blekko search for &quot;sushi/rss /ps=100&quot;</description>
    <language>en-us</language>
    <copyright>Copyright 2011 Blekko, Inc.</copyright>
    <docs>http://cyber.law.harvard.edu/rss/rss.html</docs>
    <webMaster>webmaster@blekko.com</webMaster>
    <rescount>3M</rescount>
    <item>
        <title>Sushi - Wikipedia</title>
        <link>http://en.wikipedia.org/wiki/Sushi</link>
        <guid>http://en.wikipedia.org/wiki/Sushi</guid>
        <description>Article about sushi, a food made of vinegared rice combined with various toppings or fillings.  Sushi ( &#x3059;&#x3057;&#x3001;&#x5bff;&#x53f8;, &#x9ba8;, &#x9b93;, &#x5bff;&#x6597;, &#x5bff;&#x3057;, &#x58fd;&#x53f8;.</description>
    </item>
</channel>
</rss>

The part of the Python code that extracts the required search result data is:

for counter in range(100):
    try:
        for item in BlekkoSearchResultsXML.getElementsByTagName('item'):
            Blekko_PageTitle = item.getElementsByTagName('title')[counter].toxml(encoding="utf-8")
            Blekko_PageDesc = item.getElementsByTagName('description')[counter].toxml(encoding="utf-8")
            Blekko_DisplayURL = item.getElementsByTagName('guid')[counter].toxml(encoding="utf-8")
            Blekko_URL = item.getElementsByTagName('link')[counter].toxml(encoding="utf-8")
            print "<h2>" + Blekko_PageTitle + "</h2><br />"
            print Blekko_PageDesc + "<br />"
            print Blekko_DisplayURL + "<br />"
            print Blekko_URL + "<br />"
    except IndexError:
        break

This code does not extract the page title of each returned search result, but it does extract the rest of the information.

Also, if I do not have the line:

print "<title>Page title to appear on browser tab</title>"

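somewhere in the script, the title of the first search result is treated as the page title (that is, the page shows up in the browser with the title "Sushi - Wikipedia"). If I do include a page title, the code still does not extract the page titles from the search results.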

The same code (with different tag names and so on) has the same problem with the Yahoo search API, but works fine with the Bing search API.

score 1 · Accepted Answer

My guess is that the .toxml() method returns the element's XML including its enclosing tags, so you end up with something like this:

<h2><title>...</title></h2><br />
<description>...</description><br />
<guid>...</guid><br />

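The title element is therefore interpreted as the title of the page, unless you have already specified a title of your own earlier in the output. The browser does not recognize the other elements, so it simply displays their content as-is.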
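One way to avoid emitting the enclosing tags is to read the text inside each element instead of calling .toxml(). The following is a minimal sketch of that idea, not the asker's actual script: it is written for Python 3 (the question uses Python 2 print statements), and SAMPLE_RSS, child_text and render_items are hypothetical names introduced only for illustration, with the sample data trimmed from the XML in the question.

```python
from html import escape
from xml.dom import Node, minidom

# Hypothetical sample, trimmed from the Blekko response shown in the question.
SAMPLE_RSS = """<rss version="2.0">
<channel>
    <title>blekko | rss for &quot;sushi/rss /ps=100&quot;</title>
    <item>
        <title>Sushi - Wikipedia</title>
        <link>http://en.wikipedia.org/wiki/Sushi</link>
        <guid>http://en.wikipedia.org/wiki/Sushi</guid>
        <description>Article about sushi.</description>
    </item>
</channel>
</rss>"""


def child_text(item, tag):
    """Return the text inside the first <tag> element under item, or ''."""
    nodes = item.getElementsByTagName(tag)
    if not nodes:
        return ""
    # Join only the text children, so the <tag>...</tag> wrapper is not included.
    return "".join(
        child.data
        for child in nodes[0].childNodes
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    )


def render_items(xml_text):
    """Build HTML lines for every <item>, one block per search result."""
    dom = minidom.parseString(xml_text)
    lines = []
    for item in dom.getElementsByTagName("item"):
        # escape() keeps characters such as < and & from being read as HTML.
        lines.append("<h2>" + escape(child_text(item, "title")) + "</h2>")
        lines.append(escape(child_text(item, "description")) + "<br />")
        lines.append(escape(child_text(item, "guid")) + "<br />")
        lines.append(escape(child_text(item, "link")) + "<br />")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_items(SAMPLE_RSS))
```

Because only the text is written out, the result title appears inside the h2 heading and no stray title element reaches the browser. The sketch also loops over each item once and takes the first matching child, rather than indexing the children with an outer counter.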
