python - How do I parse XML hierarchies with Python?

Question

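I'm new to Python and have been taking on a variety of projects to get up to speed. At the moment I'm working on a routine that will read through the Code of Federal Regulations (CFR) and, for each paragraph, print the organizational hierarchy for that paragraph. For example, a simplified version of the CFR's XML scheme looks like this: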

<CHAPTER>
<HD SOURCE="HED">PART 229—NONDISCRIMINATION ON THE BASIS OF SEX IN EDUCATION PROGRAMS OR ACTIVITIES RECEIVING FEDERAL FINANCIAL ASSISTANCE</HD>
     <SECTION>
        <SECTNO>### 229.120</SECTNO>
        <SUBJECT>Transfers of property.</SUBJECT>
        <P>If a recipient sells or otherwise transfers property (…) subject to the provisions of ### 229.205 through 229.235(a).</P>
     </SECTION>

I'd like to be able to print this out to a CSV file so that I can run text analysis on it:

Title 22, Volume 2, Part 229, Section 229.120, If a recipient sells or otherwise transfers property (…) subject to the provisions of ### 229.205 through 229.235(a).

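Note that I'm not getting the Title and Volume numbers from the XML, because they are actually included in the file name in a much more standardized format.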

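Because I'm such a Python newbie, the code is mostly based on the search engine code from Udacity's computer science course. Here is the Python I've written/adapted so far: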

import os
import urllib2
from xml.dom.minidom import parseString
file_path = '/Users/owner1/Downloads/CFR-2012/title-22/CFR-2012-title22-vol1.xml'
file_name = os.path.basename(file_path) #Gets the filename from the path.
doc = open(file_path)
page = doc.read()

def clean_title(file_name): #Gets the title number from the filename.
    start_title = file_name.find('title')
    end_title = file_name.find("-", start_title+1)
    title = file_name[start_title+5:end_title]
    return title

def clean_volume(file_name): #Gets the volume number from the filename.
    start_volume = file_name.find('vol')
    end_volume = file_name.find('.xml', start_volume)
    volume = file_name[start_volume+3:end_volume]
    return volume

def get_next_section(page): #Gets all of the text between <SECTION> tags.
    start_section = page.find('<SECTION')
    if start_section == -1:
        return None, 0
    start_text = page.find('>', start_section)
    end_quote = page.find('</SECTION>', start_text + 1)
    section = page[start_text + 1:end_quote]
    return section, end_quote

def get_section_number(section): #Within the <SECTION> tag, find the section number based on the <SECTNO> tag.
    start_section_number = section.find('<SECTNO>###')
    if start_section_number == -1:
        return None, 0
    end_section_number = section.find('</SECTNO>', start_section_number)
    section_number = section[start_section_number+11:end_section_number]
    return section_number, end_section_number

def get_paragraph(section): #Within the <SECTION> tag, finds <P> paragraphs.
    start_paragraph = section.find('<P>')
    if start_paragraph == -1:
        return None, 0
    end_paragraph = section.find('</P>', start_paragraph)
    paragraph = section[start_paragraph+3:end_paragraph]
    return start_paragraph, paragraph, end_paragraph


def print_all_paragraphs(page): #This is the section that I would *like* to have print each paragraph and the citation hierarchy.
    section, endpos = get_next_section(page)
    for pragraph in section:
        title = clean_title(file_name)
        volume = clean_volume(file_name)
        section, endpos = get_next_section(page)
        section_number, end_section_number = get_section_number(section)
        start_paragraph, paragraph, end_paragraph = get_paragraph(section)
        if paragraph:
            print "Title: "+ title + " Volume: "+ volume +" Section Number: "+ section_number + " Text: "+ paragraph
            page = page[end_paragraph:]
        else:
            break

print print_all_paragraphs(page)
doc.close()

Currently, this code has the following issues (example output is below):

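It prints the first paragraph multiple times. How can I print each <P> tag with its own title number, volume number, etc.?

The CFR has empty sections that are "Reserved". These sections don't have <P> tags, so the if statement hits the break. I've tried implementing for/while loops, but for some reason when I do this the code just prints the first paragraph it finds over and over.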

Here is an example of the output:

Title: 22 Volume: 1 Section Number:  9.10 Text: All requests to the Department by a member of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number:  9.10 Text: All requests to the Department by a member of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number:  9.10 Text: All requests to the Department by a member of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number:  9.11 Text: The Information and Privacy Coordinator shall be responsible for conducting a program for systematic declassification review of historically valuable records that were exempted from the automatic declassification provisions of section 3.3 of the Executive Order. The Information and Privacy Coordinator shall prioritize such review on the basis of researcher interest and the likelihood of declassification upon review.
Title: 22 Volume: 1 Section Number:  9.12 Text: For Department procedures regarding the access to classified information by historical researchers and certain former government personnel, see Sec. 171.24 of this Title.
Title: 22 Volume: 1 Section Number:  9.13 Text: Specific controls on the use, processing, storage, reproduction, and transmittal of classified information within the Department to provide protection for such information and to prevent access by unauthorized persons are contained in Volume 12 of the Department's Foreign Affairs Manual.
Title: 22 Volume: 1 Section Number:  9a.1 Text: These regulations implement Executive Order 11932 dated August 4, 1976 (41 FR 32691, August 5, 1976) entitled “Classification of Certain Information and Material Obtained from Advisory Bodies Created to Implement the International Energy Program.”
Title: 22 Volume: 1 Section Number:  9a.1 Text: These regulations implement Executive Order 11932 dated August 4, 1976 (41 FR 32691, August 5, 1976) entitled “Classification of Certain Information and Material Obtained from Advisory Bodies Created to Implement the International Energy Program.”
None

Ideally, each entry after the citation information would be different.

What kind of loop should I run to print this properly? Is there a more "pythonic" way of doing this kind of text extraction?

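I know I'm a complete novice, and one of the major problems I'm facing is that I simply don't have the vocabulary or topic knowledge to find detailed answers about parsing XML at this level of detail. Any recommended reading would also be welcome.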

score 0 · Accepted Answer

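I like to solve problems like this with XPath or XSLT. You can find a great implementation in lxml (not in the standard distribution, so it needs to be installed). For instance, against the simplified XML in the question, where SECTION is a child of CHAPTER and a sibling of HD, the XPath //CHAPTER/SECTION[SECTNO] selects all sections that contain data. From there you use relative XPath expressions to grab the values you want. The multiple nested for loops disappear. XPath has a bit of a learning curve, but there are plenty of examples out there.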
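As a rough sketch of that idea (not a definitive implementation), the example below uses the standard library's xml.etree.ElementTree rather than lxml; its limited XPath subset is enough for a predicate like SECTION[SECTNO]. It is written for Python 3, unlike the Python 2 code in the question. The sample XML, the helper names (title_and_volume, paragraph_rows, to_csv) and the CSV column names are all made up for illustration, and it assumes one PART heading per CHAPTER, as in the simplified sample.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical sample modelled on the simplified XML in the question.
SAMPLE_XML = """\
<CHAPTER>
  <HD SOURCE="HED">PART 229: SAMPLE HEADING</HD>
  <SECTION>
    <SECTNO>### 229.120</SECTNO>
    <SUBJECT>Transfers of property.</SUBJECT>
    <P>First paragraph.</P>
    <P>Second paragraph.</P>
  </SECTION>
  <SECTION>
    <SECTNO>### 229.125</SECTNO>
    <SUBJECT>[Reserved]</SUBJECT>
  </SECTION>
</CHAPTER>
"""


def title_and_volume(file_name):
    """Pull title and volume out of a name like CFR-2012-title22-vol1.xml."""
    match = re.search(r"title(\d+)-vol(\d+)", file_name)
    if match is None:
        raise ValueError("unexpected file name: " + file_name)
    return match.group(1), match.group(2)


def paragraph_rows(xml_text, file_name):
    """Yield one (title, volume, part, section, text) tuple per <P> element."""
    title, volume = title_and_volume(file_name)
    root = ET.fromstring(xml_text)
    # iter() includes the root itself, so this works whether <CHAPTER>
    # is the document element or nested deeper.
    for chapter in root.iter("CHAPTER"):
        heading = chapter.findtext("HD", default="")
        part_match = re.search(r"PART\s+(\d+)", heading)
        part = part_match.group(1) if part_match else ""
        # The [SECTNO] predicate keeps only sections with a <SECTNO> child.
        for section in chapter.findall(".//SECTION[SECTNO]"):
            raw_number = section.findtext("SECTNO", default="")
            # Skip whatever prefix precedes the number (### in the question).
            number_match = re.search(r"\d[\w.]*", raw_number)
            number = number_match.group(0) if number_match else raw_number.strip()
            # A reserved section has no <P> children, so this loop simply
            # runs zero times for it instead of breaking out early.
            for paragraph in section.findall("P"):
                text = "".join(paragraph.itertext()).strip()
                yield title, volume, part, number, text


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["title", "volume", "part", "section", "text"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = list(paragraph_rows(SAMPLE_XML, "CFR-2012-title22-vol1.xml"))
print(to_csv(rows))
```

Because each paragraph is reached by iterating over elements rather than by re-searching the page string, every <P> gets its own row, and sections without paragraphs are skipped without ending the run. With lxml, the same selection could be expressed through its xpath() method instead of findall().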
