
I want to extract tags from an HTML file in Python without using BeautifulSoup. For example, I want to get

class="el" href="atsc__root__raised__cosine.html" target="_self">atsc_root_raised_cosine

from

<a class="el" href="atsc__root__raised__cosine.html" target="_self">atsc_root_raised_cosine</a>

Any ideas?

2 Answers

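For basic DOM parsing, you can use the XML parser from Python's standard library (the xml package, for example xml.dom.minidom).

Here is an example (taken from the documentation) that uses it to convert XML to HTML: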

import xml.dom.minidom

# Sample XML input describing a slideshow.
document = """\
<slideshow>
<title>Demo slideshow</title>
<slide><title>Slide title</title>
<point>This is a demo</point>
<point>Of a program for processing slides</point>
</slide>

<slide><title>Another demo slide</title>
<point>It is important</point>
<point>To have more than</point>
<point>one slide</point>
</slide>
</slideshow>
"""

# Parse the string into a DOM tree.
dom = xml.dom.minidom.parseString(document)

def getText(nodelist):
    # Concatenate the data of all direct text-node children.
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)

def handleSlideshow(slideshow):
    print("<html>")
    # The first <title> in document order is the slideshow's own title.
    handleSlideshowTitle(slideshow.getElementsByTagName("title")[0])
    slides = slideshow.getElementsByTagName("slide")
    handleToc(slides)
    handleSlides(slides)
    print("</html>")

def handleSlides(slides):
    for slide in slides:
        handleSlide(slide)

def handleSlide(slide):
    handleSlideTitle(slide.getElementsByTagName("title")[0])
    handlePoints(slide.getElementsByTagName("point"))

def handleSlideshowTitle(title):
    print("<title>%s</title>" % getText(title.childNodes))

def handleSlideTitle(title):
    print("<h2>%s</h2>" % getText(title.childNodes))

def handlePoints(points):
    print("<ul>")
    for point in points:
        handlePoint(point)
    print("</ul>")

def handlePoint(point):
    print("<li>%s</li>" % getText(point.childNodes))

def handleToc(slides):
    # Table of contents: one paragraph per slide title.
    for slide in slides:
        title = slide.getElementsByTagName("title")[0]
        print("<p>%s</p>" % getText(title.childNodes))

handleSlideshow(dom)
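
Applied to the anchor tag from the question, the same module might be used roughly as follows. This is only a sketch assuming Python 3; the helper name extract_links is made up for illustration, and minidom only accepts well-formed XML, so it may raise a parse error on real-world HTML that is not valid XML.

```python
import xml.dom.minidom

# The anchor tag from the question. minidom needs well-formed XML,
# so this works for a clean fragment but may fail on messy HTML.
snippet = ('<a class="el" href="atsc__root__raised__cosine.html" '
           'target="_self">atsc_root_raised_cosine</a>')

def extract_links(markup):
    """Return a list of (attributes, text) pairs, one per <a> element.

    extract_links is a hypothetical helper, not part of minidom.
    """
    dom = xml.dom.minidom.parseString(markup)
    results = []
    for a in dom.getElementsByTagName("a"):
        # NamedNodeMap.items() yields (name, value) pairs.
        attrs = dict(a.attributes.items())
        # Join the direct text-node children of the element.
        text = "".join(n.data for n in a.childNodes
                       if n.nodeType == n.TEXT_NODE)
        results.append((attrs, text))
    return results

links = extract_links(snippet)
print(links)
```

Each entry pairs the tag's attributes (as a dict) with its text, which is the information the question asks for.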
answered 2013-07-01T01:31:31.353

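Take a look at the XML API that ships with Python; its documentation explains how to access attributes and elements and includes some HTML examples. You can also create parser objects.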
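
As one possible illustration of a parser object, the sketch below subclasses HTMLParser from the standard library's html.parser module (Python 3) to collect the attributes and text of each <a> tag. Whether this is the API the answer had in mind is an assumption; the class name AnchorCollector and the sample markup are invented for the example.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (attributes, text) for every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.anchors = []    # finished (attrs, text) pairs
        self._attrs = None   # attributes of the <a> currently open, if any
        self._text = []      # text pieces seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._attrs = dict(attrs)
            self._text = []

    def handle_data(self, data):
        # Only record text while we are inside an <a> element.
        if self._attrs is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._attrs is not None:
            self.anchors.append((self._attrs, "".join(self._text)))
            self._attrs = None

# Sample markup (made up); note the unclosed <br>, which is not valid XML.
page = ('<p>See <a class="el" href="atsc__root__raised__cosine.html" '
        'target="_self">atsc_root_raised_cosine</a><br></p>')

collector = AnchorCollector()
collector.feed(page)
collector.close()
print(collector.anchors)
```

Unlike a strict XML parser, HTMLParser tolerates markup that is not well-formed XML, which tends to matter for pages found in the wild.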

answered 2013-07-01T04:25:30.123