python - Extracting a field list from reStructuredText

Question

Say I have the following reST input:

Some text ...

:foo: bar

Some text ...

What I would like to end up with is a dictionary like this:

{"foo": "bar"}

I tried using this:

tree = docutils.core.publish_parts(text)

It does parse the field list, but I end up with some pseudo XML in tree["whole"]:

<document source="<string>">
    <docinfo>
        <field>
            <field_name>
                foo
            <field_body>
                <paragraph>
                    bar

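Since the tree dict does not contain any other useful information, and that value is just a string, I am not sure how to parse the field list out of the reST document. How would I do that?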

score 7 · Accepted Answer

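You can try something like the following code. Rather than publish_parts, I have used publish_doctree to get the doctree (the pseudo-XML representation) of the document. I then convert it to an XML DOM in order to extract all the field elements, and for each field element I take its first field_name and field_body elements.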

from docutils.core import publish_doctree

source = """Some text ...

:foo: bar

Some text ...
"""

# Parse reStructuredText input, returning the Docutils doctree as
# an `xml.dom.minidom.Document` instance.
doctree = publish_doctree(source).asdom()

# Get all field elements in the document.
fields = doctree.getElementsByTagName('field')

d = {}

for field in fields:
    # I am assuming that `getElementsByTagName` only returns one element.
    field_name = field.getElementsByTagName('field_name')[0]
    field_body = field.getElementsByTagName('field_body')[0]

    d[field_name.firstChild.nodeValue] = \
        " ".join(c.firstChild.nodeValue for c in field_body.childNodes)

print(d)  # Prints {'foo': 'bar'}

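The xml.dom module isn't the easiest to work with (why do I need to use .firstChild.nodeValue rather than just .nodeValue, for example?), so you may prefer the xml.etree.ElementTree module, which I find a lot easier to use. If you use lxml, you can also use XPath notation to find all of the field, field_name and field_body elements.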
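As a rough illustration of the ElementTree suggestion, here is a minimal sketch that uses the limited XPath subset built into xml.etree.ElementTree (not lxml). To stay self-contained it parses a hard-coded XML string; SAMPLE_XML and fields_from_xml are hypothetical names, and the sample is only assumed to mirror the shape of the doctree shown in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: XML assumed to have the same shape as the doctree
# in the question (field > field_name + field_body > paragraph).
SAMPLE_XML = """<document source="&lt;string&gt;">
    <paragraph>Some text ...</paragraph>
    <field_list>
        <field>
            <field_name>foo</field_name>
            <field_body>
                <paragraph>bar</paragraph>
            </field_body>
        </field>
    </field_list>
    <paragraph>Some text ...</paragraph>
</document>"""


def fields_from_xml(xml_text):
    """Collect {field_name: field_body text} from doctree-shaped XML."""
    root = ET.fromstring(xml_text)
    fields = {}
    # ".//field" is an XPath-style path: every <field> at any depth.
    for field in root.findall(".//field"):
        name = field.findtext("field_name", default="").strip()
        body = field.find("field_body")
        if body is None:
            continue
        # A body may hold several paragraphs; join their text with spaces.
        paragraphs = ["".join(p.itertext()).strip()
                      for p in body.findall("paragraph")]
        fields[name] = " ".join(paragraphs)
    return fields


print(fields_from_xml(SAMPLE_XML))
```

With real input, the XML string would presumably come from publish_doctree(source).asdom().toxml() instead of the hard-coded sample.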

score 0 · Accepted Answer

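I have an alternative solution that I find less of a burden, though it may be more brittle. If you look at the implementation of the node classes at https://sourceforge.net/p/docutils/code/HEAD/tree/trunk/docutils/docutils/nodes.py you will see that they support a walk method, which can be used to pull out the data you want without creating two different XML representations of your document. This is what I am using in my prototype code right now: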

https://github.com/h4ck3rm1k3/gcc-introspector/blob/master/peewee_adaptor.py#L33

import pprint

from docutils.core import publish_doctree
import docutils.nodes

and then

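def walk_docstring(prop):
    # Parse the object's docstring into a Docutils doctree.
    doc = prop.__doc__
    doctree = publish_doctree(doc)
    class Walker:
        def __init__(self, doc):
            # `walk` expects the visitor to expose the document it visits.
            self.document = doc
            self.fields = {}
        def dispatch_visit(self, x):
            # Called once per node; only `field` nodes are of interest.
            if isinstance(x, docutils.nodes.field):
                # children[0] is the field_name, children[1] the field_body.
                field_name = x.children[0].rawsource
                field_value = x.children[1].rawsource
                self.fields[field_name] = field_value
    w = Walker(doctree)
    doctree.walk(w)
    # the collected fields I wanted
    pprint.pprint(w.fields)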
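A variation on the walker above, as a minimal sketch rather than a definitive implementation: docutils also provides nodes.SparseNodeVisitor, whose visit_* methods do nothing by default, so only visit_field needs to be overridden. The names extract_fields and FieldCollector are illustrative, not from the original code. This version reads node text with astext() instead of rawsource, and returns the dictionary instead of printing it.

```python
from docutils import nodes
from docutils.core import publish_doctree


def extract_fields(source):
    """Return {field name: field body text} for every field in `source`."""
    doctree = publish_doctree(source)
    fields = {}

    class FieldCollector(nodes.SparseNodeVisitor):
        # SparseNodeVisitor ignores every node type by default, so only
        # the `field` nodes need handling here.
        def visit_field(self, node):
            # A field node holds a field_name followed by a field_body.
            name, body = node.children[0], node.children[1]
            fields[name.astext()] = body.astext()

    doctree.walk(FieldCollector(doctree))
    return fields


source = """Some text ...

:foo: bar

Some text ...
"""

print(extract_fields(source))
```

Subclassing the visitor base class avoids hand-writing dispatch_visit, at the cost of relying on the docutils visitor naming convention (visit_ plus the node class name).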

score 0 · Accepted Answer

Here is my ElementTree implementation:

from docutils.core import publish_doctree
from xml.etree.ElementTree import fromstring

source = """Some text ...

:foo: bar

Some text ...
"""


def gen_fields(source):
    dom = publish_doctree(source).asdom()
    tree = fromstring(dom.toxml())

    for field in tree.iter(tag='field'):
        name = next(field.iter(tag='field_name'))
        body = next(field.iter(tag='field_body'))
        yield {name.text: ''.join(body.itertext())}

Usage

>>> next(gen_fields(source))
{'foo': 'bar'}
