
I have the following snippet of a large XML file. I would like to extract elements from specific namespaces, such as xmlns:dc="http://purl.org/dc/elements/1.1/". Currently I am able to do this as follows:

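from lxml import etree  # assumed; the question does not show its import

tree = etree.parse(file)
# iter() replaces the deprecated getiterator()
for element in tree.iter('{http://www.openarchives.org/OAI/2.0/}record'):
    for leaf in element.iter('{http://purl.org/dc/elements/1.1/}subject'):
        print(leaf)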

The problem is that I want to get data for multiple tags in the {http://purl.org/dc/elements/1.1/} namespace. I would also like to simplify things and have been looking at how to use XPath, but I cannot figure it out. Can I use XPath, and if so, how? More importantly, is it better for my goals?

Here is the XML:

<?xml version="1.0" encoding="UTF-8" ?>



<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
 http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2013-08-15T23:24:55Z</responseDate>
<request verb="ListRecords" resumptionToken="0/500/121403/nsdl_dc/null/null/null">http://nsdldev.org/oai</request>

<!-- Showing records 501 through 1000 out of 121403 total  -->

<ListRecords>


  <record>
    <header>
      <identifier>oai:nsdl.org:2200/20110926115158975T</identifier>
      <datestamp>2013-05-29T16:44:49Z</datestamp>
       <setSpec>ncs-NSDL-COLLECTION-000-003-112-056</setSpec>
      </header>
    <metadata>
    <nsdl_dc:nsdl_dc xmlns:nsdl_dc="http://ns.nsdl.org/nsdl_dc_v1.02/"
                 xmlns:dc="http://purl.org/dc/elements/1.1/"
                 xmlns:dct="http://purl.org/dc/terms/"
                 xmlns:lar="http://ns.nsdl.org/schemas/dc/lar"
                 xmlns:ieee="http://www.ieee.org/xsd/LOMv1p0"
                 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                 schemaVersion="1.02.020"
                 xsi:schemaLocation="http://ns.nsdl.org/nsdl_dc_v1.02/ http://ns.nsdl.org/schemas/nsdl_dc/nsdl_dc_v1.02.xsd">
   <lar:readiness xsi:type="lar:Ready">Fully ready</lar:readiness>
   <dc:identifier xsi:type="dct:URI">http://www.exo.net/~emuller/activities/Hot%20Sauce%20Hot%20Spots.pdf</dc:identifier>
   <dc:relation xsi:type="nsdl_dc:NSDLPartnerURL">http://howtosmile.org/record/4427</dc:relation>
   <dc:title>Hot Sauce Hot Spots</dc:title>
   <dc:description>In this activity, learners model hot spot island formation, orientation and progression with condiments. Learners squirt a thick condiment sauce on a coarsely woven fabric to model how volcanic island hot spots form.</dc:description>
   <dc:subject>Geoscience</dc:subject>
   <dc:subject>Earth system science</dc:subject>
   <dc:subject>Geoscience</dc:subject>
   <dc:subject>Earth system science</dc:subject>
   <dc:subject>Geoscience</dc:subject>
   <dc:subject>Geoscience</dc:subject>
   <dc:subject>Geoscience</dc:subject>
   <dc:subject>Oceanography</dc:subject>
   <dc:subject>Earth system science</dc:subject>
   <dc:subject>Geoscience</dc:subject>
   <dc:subject>Anthropology</dc:subject>
   <dc:subject>Physical science</dc:subject>
   <dc:subject>Physics</dc:subject>
   <dc:subject>General science</dc:subject>
   <dc:subject>hot spot island</dc:subject>
   <dc:subject>volcano</dc:subject>
   <dc:subject>tectonic plates</dc:subject>
   <dc:subject>Earth</dc:subject>
   <dc:subject>molten</dc:subject>
   <dc:subject>magma</dc:subject>
   <dc:subject>eruption</dc:subject>
   <dc:subject>undersea</dc:subject>
   <dc:subject>ocean</dc:subject>
   <dc:subject>island</dc:subject>
   <dc:subject>Earth Processes</dc:subject>
   <dc:subject>Volcanoes and Plate Tectonics</dc:subject>
   <dc:subject>Earth Structure</dc:subject>
   <dc:subject>Rocks and Minerals</dc:subject>
   <dc:subject>Oceans and Water</dc:subject>
   <dc:subject>Geologic Time</dc:subject>
   <dc:subject>Heat and Temperature</dc:subject>
   <dc:subject>Conducting Investigations</dc:subject>
   <dc:language>en-US</dc:language>
   <dc:format>application/pdf</dc:format>
   <lar:accessMode xsi:type="lar:ModeAcc">visual</lar:accessMode>
   <lar:accessMode xsi:type="lar:ModeAcc">tactile</lar:accessMode>
   <dct:educationLevel xsi:type="nsdl_dc:NSDLEdLevel">Upper Elementary</dct:educationLevel>
   <dct:educationLevel xsi:type="nsdl_dc:NSDLEdLevel">Middle School</dct:educationLevel>
   <dct:educationLevel xsi:type="nsdl_dc:NSDLEdLevel">High School</dct:educationLevel>
   <dct:educationLevel xsi:type="nsdl_dc:NSDLEdLevel">Informal Education</dct:educationLevel>
   <dct:audience xsi:type="nsdl_dc:NSDLAudience">Learner</dct:audience>
   <dc:type xsi:type="nsdl_dc:NSDLType">Activity</dc:type>
   <dc:type xsi:type="nsdl_dc:NSDLType">Model</dc:type>
   <dct:isPartOf>http://www.exo.net/~emuller/activities/index.html</dct:isPartOf>
   <dc:date xsi:type="dct:W3CDTF">2007</dc:date>
   <dc:creator>Eric Muller</dc:creator>
   <dc:contributor>The Exploratorium</dc:contributor>
   <dct:accessRights xsi:type="nsdl_dc:NSDLAccess">Free access</dct:accessRights>
   <dc:rights>Copyright 2007 Do Science</dc:rights>
   <dct:license>Owner license</dct:license>
   <lar:licenseProperty xsi:type="lar:LicProp">Terms of use unknown</lar:licenseProperty>
   <dct:rightsHolder>Do Science</dct:rightsHolder>
   <lar:metadataTerms>The following entity, University Corporation for Atmospheric Research (UCAR), has claims on the use of this metadata. This claim is as follows: The National Science Digital Library (NSDL), located at the University Corporation for Atmospheric Research (UCAR), provides these metadata terms: These data and metadata may not be reproduced, duplicated, copied, sold, or otherwise exploited for any commercial purpose that is not expressly permitted by NSDL. The entity provided more information at: http://nsdl.org/help/terms-of-use</lar:metadataTerms>
   <lar:metadataTerms>The National Science Digital Library (NSDL), located at the University Corporation for Atmospheric Research (UCAR), provides these metadata terms: These data and metadata may not be reproduced, duplicated, copied, sold, or otherwise exploited for any commercial purpose that is not expressly permitted by NSDL. More information is available at: http://nsdl.org/help/terms-of-use.</lar:metadataTerms>
</nsdl_dc:nsdl_dc>

    </metadata>
  </record>

2 Answers


It is not clear exactly what you want to access, but try the following:

from lxml import etree

doc = etree.parse(xmlfile)  # xmlfile: path to the XML document
ns = {'dc': 'http://purl.org/dc/elements/1.1/',
      'oai': 'http://www.openarchives.org/OAI/2.0/'}
doc.xpath('//dc:subject', namespaces=ns)  # get all of the dc:subjects
doc.xpath('//dc:*', namespaces=ns)  # get all elements in the dc: namespace
# more specific path
doc.xpath('/oai:OAI-PMH/oai:ListRecords/oai:record/oai:metadata/*/dc:*', namespaces=ns)
# the prefix map is required whenever the expression uses a prefix
x = doc.xpath('/oai:OAI-PMH/oai:ListRecords/oai:record/oai:metadata/*', namespaces=ns)
x[0].xpath('*[contains(.,"Geo")]')  # you can also call xpath from non-document nodes
x[0].xpath('dc:subject/text()', namespaces=ns)  # get the text of dc:subjects
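
If the goal is several dc: tags per record, the calls above can be combined into one pass over each record. The following is only a sketch: the helper name dc_fields, its wanted parameter, and the trimmed SAMPLE document are made up for illustration and are not part of the original data or of lxml.

```python
from collections import defaultdict

from lxml import etree

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Trimmed-down stand-in for the document in the question (illustrative only).
SAMPLE = b"""<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <nsdl_dc:nsdl_dc xmlns:nsdl_dc="http://ns.nsdl.org/nsdl_dc_v1.02/"
                         xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Hot Sauce Hot Spots</dc:title>
          <dc:subject>Geoscience</dc:subject>
          <dc:subject>Oceanography</dc:subject>
          <dc:creator>Eric Muller</dc:creator>
        </nsdl_dc:nsdl_dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def dc_fields(record, wanted=("title", "subject", "creator")):
    """Group the text of the wanted dc:* elements of one record by tag name."""
    # Build a union expression such as ".//dc:title | .//dc:subject".
    expr = " | ".join(".//dc:%s" % name for name in wanted)
    fields = defaultdict(list)
    for el in record.xpath(expr, namespaces=NS):
        # QName strips the {namespace} part and leaves the local tag name.
        fields[etree.QName(el).localname].append(el.text)
    return dict(fields)


root = etree.fromstring(SAMPLE)
records = [dc_fields(r) for r in root.xpath("//oai:record", namespaces=NS)]
print(records)
```

Evaluating the expression relative to each record (note the leading ".") keeps the values of one record together, which a document-wide '//dc:*' query does not.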

Also read some XPath documentation beyond the Python or lxml docs. Those tell you how to use XPath from Python, but they are not really XPath tutorials.

Note that the find() and findall() methods take ElementPath expressions, which are a limited subset of XPath-like expressions.
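
As a rough illustration of that ElementPath subset, the sketch below uses find-style calls with a prefix map instead of xpath(). It assumes the standard library's xml.etree.ElementTree, whose findall() and findtext() accept a namespaces argument; lxml's versions take the same argument. The SAMPLE document is a trimmed stand-in, not the real file.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Trimmed-down stand-in for the document in the question (illustrative only).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <nsdl_dc:nsdl_dc xmlns:nsdl_dc="http://ns.nsdl.org/nsdl_dc_v1.02/"
                         xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Hot Sauce Hot Spots</dc:title>
          <dc:subject>Geoscience</dc:subject>
          <dc:subject>Oceanography</dc:subject>
        </nsdl_dc:nsdl_dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

root = ET.fromstring(SAMPLE)
summary = []
for record in root.findall(".//oai:record", namespaces=NS):
    # findtext() returns the text of the first match, or None if there is none.
    title = record.findtext(".//dc:title", namespaces=NS)
    subjects = [el.text for el in record.findall(".//dc:subject", namespaces=NS)]
    summary.append((title, subjects))
print(summary)
```

Simple paths like these work with find-style methods; expressions that rely on XPath functions such as contains() or on text() steps need xpath() instead.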

Answered 2013-08-16T04:47:04.740
for element in tree.findall(".//{http://purl.org/dc/elements/1.1/}subject"):
    print(element)
Answered 2013-08-16T03:02:51.153