
I am trying to read all the links inside a tag and then create wiki links from them... basically I want to read each link from an XML file and then build a wiki link using its last word (see below for what I mean by the last word of a link). For some reason I am getting the error below. What am I missing? Please advise.

http://wiki.build.com/ca_builds/CIT (last word is CIT)
http://wiki.build.com/ca_builds/1.2_Archive (last word is 1.2_Archive)

Input XML:

<returnLink>
    http://wiki.build.com/ca_builds/CIT
    http://wiki.build.com/ca_builds/1.2_Archive
</returnLink>

Python code:

def getReturnLink(xml):
    """Collects the link to return to the PL home page from the config file."""
    if xml.find('<returnLink>') == -1:
        return None
    else:
        linkStart=xml.find('<returnLink>')
        linkEnd=xml.find('</returnLink>')
        link=xml[linkStart+12:linkEnd].strip()
        link = link.split('\n')
        #if link.find('.com') == -1:
            #return None
        for line in link:
            line = line.strip()
            print "LINE"
            print line
            lastword = line.rfind('/') + 1
            line = '['+link+' lastword]<br>'
            linklis.append(line)
        return linklis

Output:

   line = '['+link+' lastword]<br>'
 TypeError: cannot concatenate 'str' and 'list' objects

Expected output:

CIT (this will point to http://wiki.build.com/ca_builds/CIT)
1.2_Archive (this will point to http://wiki.build.com/ca_builds/1.2_Archive)

3 Answers


The Python standard library has an XML parser. You can also support multiple <returnLink> elements and Unicode words in a URL:

import posixpath
import urllib
import urlparse
from xml.etree import cElementTree as etree

def get_word(url):
    # last path component of the url, with any percent-escapes decoded
    basename = posixpath.basename(urlparse.urlsplit(url).path)
    return urllib.unquote(basename).decode("utf-8")

# one url per non-empty line inside each <returnLink> element
urls = (url.strip()
        for links in etree.parse(input_filename_or_file).iter('returnLink')
        for url in links.text.splitlines())
wikilinks = [u"[{} {}]".format(url, get_word(url))
             for url in urls if url]
print(wikilinks)
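
A possible usage sketch: before running the snippet above, input_filename_or_file could be bound either to a path to the config file or to a file-like object wrapping the question's XML (io.BytesIO and the xml_bytes name here are purely illustrative):

import io

xml_bytes = b"""<returnLink>
    http://wiki.build.com/ca_builds/CIT
    http://wiki.build.com/ca_builds/1.2_Archive
</returnLink>"""

# etree.parse() accepts either a filename or a file-like object
input_filename_or_file = io.BytesIO(xml_bytes)
# running the snippet above then prints something like:
# [u'[http://wiki.build.com/ca_builds/CIT CIT]', u'[http://wiki.build.com/ca_builds/1.2_Archive 1.2_Archive]']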

Note: use Unicode internally. Convert text to bytes only to communicate with the outside world, e.g., when writing to a file.

Example

[http://wiki.build.com/ca_builds/CIT#some-fragment CIT]
[http://wiki.build.com/ca_builds/Unicode%20%28%E2%99%A5%29 Unicode (♥)]
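
Following the note above about keeping Unicode internal, here is a minimal sketch of encoding only at the boundary (the write_wikilinks helper and the wikilinks.txt filename are only illustrative):

import io

def write_wikilinks(wikilinks, path='wikilinks.txt'):
    # wikilinks are unicode objects; io.open() encodes them to UTF-8 on write
    with io.open(path, 'w', encoding='utf-8') as f:
        for link in wikilinks:
            f.write(link + u'\n')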
Answered 2012-11-11T11:39:06.243

I had some trouble understanding your question, but it seems you just want to return the string after the last '/' character in a link? You can do that with a reverse find.

return link[link.rfind('/') + 1:]
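
Dropped into the question's loop, a sketch might look like this (variable names follow the question's code; the question's TypeError came from concatenating the whole link list instead of the current line):

def getReturnLink(xml):
    linkStart = xml.find('<returnLink>')
    if linkStart == -1:
        return None
    linkEnd = xml.find('</returnLink>')
    linklis = []
    for line in xml[linkStart + 12:linkEnd].strip().split('\n'):
        line = line.strip()
        lastword = line[line.rfind('/') + 1:]  # text after the last '/'
        linklis.append('[' + line + ' ' + lastword + ']<br>')
    return linklis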
Answered 2012-11-11T02:57:04.970

There is no need to parse the XML manually; use a library such as lxml:

>>> s = """<returnLink>
...     http://wiki.build.com/ca_builds/CIT
...     http://wiki.build.com/ca_builds/1.2_Archive
... </returnLink>"""
>>> from lxml import etree
>>> xml_tree = etree.fromstring(s)
>>> links = xml_tree.text.split()
>>> for i in links:
...    print '['+i+']'+i[i.rfind('/')+1:]
...
[http://wiki.build.com/ca_builds/CIT]CIT
[http://wiki.build.com/ca_builds/1.2_Archive]1.2_Archive

I am not sure what you mean by wiki links, but the above should give you an idea of how to parse the string.
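
If the wiki links are meant to look like [url word], a sketch along the same lines (continuing the session above):

>>> ['[' + i + ' ' + i[i.rfind('/') + 1:] + ']' for i in links]
['[http://wiki.build.com/ca_builds/CIT CIT]', '[http://wiki.build.com/ca_builds/1.2_Archive 1.2_Archive]']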

Answered 2012-11-11T03:59:51.527