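I only started using pyparsing this evening, and I have already built a complex grammar that describes some of the sources I am working with very effectively. It was very easy and very powerful. However, I am having trouble working with ParseResults. I need to be able to iterate over nested tokens in the order they were found, and I am finding it a bit frustrating. I have abstracted my problem into a simple case: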

import pyparsing as pp

word = pp.Word(pp.alphas + ',.')('word*')
direct_speech = pp.Suppress('“') + pp.Group(pp.OneOrMore(word))('direct_speech*') + pp.Suppress('”')
sentence = pp.Group(pp.OneOrMore(word | direct_speech))('sentence')

test_string = 'Lorem ipsum “dolor sit” amet, consectetur.'

r = sentence.parseString(test_string)

print r.asXML('div')

print ''

for name, item in r.sentence.items():
    print name, item

print ''

for item in r.sentence:
    print item.getName(), item.asList()

As far as I can tell, this ought to work? Here is the output:

<div>
  <sentence>
    <word>Lorem</word>
    <word>ipsum</word>
    <direct_speech>
      <word>dolor</word>
      <word>sit</word>
    </direct_speech>
    <word>amet,</word>
    <word>consectetur.</word>
  </sentence>
</div>

word ['Lorem', 'ipsum', 'amet,', 'consectetur.']
direct_speech [['dolor', 'sit']]

Traceback (most recent call last):
  File "./test.py", line 27, in <module>
    print item.getName(), item.asList()
AttributeError: 'str' object has no attribute 'getName'

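The XML output seems to show that the string was parsed exactly as I wanted, but I cannot iterate over the sentence in order to, for example, rebuild it.

Is there a way to do what I need?

Thanks!

EDIT:

I have been using this: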

for item in r.sentence:
    if isinstance(item, basestring):
        print item
    else:
        print item.getName(), item

But this does not help much, because I cannot distinguish between different kinds of strings. Here is a slightly expanded example:

word = pp.Word(pp.alphas + ',.')('word*')
number = pp.Word(pp.nums + ',.')('number*')

direct_speech = pp.Suppress('“') + pp.Group(pp.OneOrMore(word | number))('direct_speech*') + pp.Suppress('”')
sentence = pp.Group(pp.OneOrMore(word | number | direct_speech))('sentence')

test_string = 'Lorem 14 ipsum “dolor 22 sit” amet, consectetur.'

r = sentence.parseString(test_string)

for i, item in enumerate(r.sentence):
    if isinstance(item, basestring):
        print i, item
    else:
        print i, item.getName(), item

The output is:

0 Lorem
1 14
2 ipsum
3 word ['dolor', '22', 'sit']
4 amet,
5 consectetur.

Not very helpful. I cannot tell word and number apart, and the direct_speech element is labelled as word?!

I am clearly missing something. All I want to do is:

for item in r.sentence:
    if (item is a number):
        do something
    elif (item is a word):
        do something else
etc. ...

Should I be approaching this differently?

2 Answers

r.sentence contains a mix of strings and ParseResults, and only ParseResults support getName(). Have you tried just iterating over r.sentence? If I print it out using asList(), I get:

['Lorem', 'ipsum', ['dolor', 'sit'], 'amet,', 'consectetur.']

Or this snippet:

for item in r.sentence:
    print type(item),item.asList() if isinstance(item,pp.ParseResults) else item

gives:

<type 'str'> Lorem
<type 'str'> ipsum
<class 'pyparsing.ParseResults'> ['dolor', 'sit']
<type 'str'> amet,
<type 'str'> consectetur.
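Since the question mentions rebuilding the sentence, here is a minimal sketch of that, written for Python 3 and working on a plain nested list. It assumes the tokens have the shape of the asList() output above and that every nested list stands for a direct_speech group; rebuild is a hypothetical helper, not part of pyparsing.

```python
def rebuild(tokens):
    """Join a nested token list back into a sentence.

    Assumption: any nested list is a direct_speech group, so it is
    wrapped in the curly quotes that the grammar suppressed.
    """
    parts = []
    for tok in tokens:
        if isinstance(tok, str):
            parts.append(tok)
        else:
            parts.append('“' + rebuild(tok) + '”')
    return ' '.join(parts)


# Same shape as the asList() output shown above.
nested = ['Lorem', 'ipsum', ['dolor', 'sit'], 'amet,', 'consectetur.']
print(rebuild(nested))
```

This only works while strings and groups are the two kinds of item. It cannot tell a word from a number, which is the harder part of the question.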

I'm not sure I have answered your question, but does this show where to go next?

(Welcome to pyparsing)
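One possible next step, offered as a sketch rather than the definitive approach: attach a class to each expression as its parse action, so that every token arrives as a typed object and isinstance() can tell them apart in parse order. The class names WordToken, NumberToken and DirectSpeech and the describe helper are made up for illustration. The code is Python 3 and keeps the camelCase method names used in the question.

```python
import pyparsing as pp


class WordToken:
    def __init__(self, tokens):
        self.text = tokens[0]


class NumberToken:
    def __init__(self, tokens):
        self.text = tokens[0]


class DirectSpeech:
    def __init__(self, tokens):
        # tokens[0] is the Group's nested result; its items have already
        # been converted to WordToken / NumberToken objects.
        self.items = list(tokens[0])


# A class used as a parse action is called with the matched tokens.
word = pp.Word(pp.alphas + ',.').setParseAction(WordToken)
number = pp.Word(pp.nums + ',.').setParseAction(NumberToken)
direct_speech = (
    pp.Suppress('“')
    + pp.Group(pp.OneOrMore(word | number)).setParseAction(DirectSpeech)
    + pp.Suppress('”')
)
sentence = pp.OneOrMore(word | number | direct_speech)


def describe(items):
    """Return (kind, value) pairs in parse order, recursing into groups."""
    out = []
    for item in items:
        if isinstance(item, NumberToken):
            out.append(('number', item.text))
        elif isinstance(item, WordToken):
            out.append(('word', item.text))
        else:
            out.append(('direct_speech', describe(item.items)))
    return out


result = sentence.parseString('Lorem 14 ipsum “dolor 22 sit” amet, consectetur.')
for entry in describe(result):
    print(entry)
```

The design choice here is to carry the type on the token itself instead of relying on results names, because results names are attached to the enclosing ParseResults and plain strings inside it have no getName().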

Answered 2013-05-20T07:42:51.323

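OK, I have now tried a number of different approaches, but I cannot get what I need, so (ridiculous as it seems) I am using .asXML() and parsing the resulting XML. Here is my example: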

import pyparsing as pp

word = pp.Word(pp.alphas + ',.')('word*')
number = pp.Word(pp.nums + ',.')('number*')
direct_speech = pp.Suppress('“') + pp.Group(pp.OneOrMore(word | number))('direct_speech*') + pp.Suppress('”')
sentence = pp.Group(pp.OneOrMore(word | number | direct_speech))('sentence')

test_string = 'Lorem 14 ipsum “dolor 22 sit” amet, consectetur.'
r = sentence.parseString(test_string)

from lxml import etree
xml = etree.fromstring(r.sentence.asXML('sentence'))
for el in xml:
    if len(el):
        print el.tag
        for sub_el in el:
            print '  ', sub_el.tag, ':', sub_el.text
    else:
        print el.tag, ':',  el.text

Output:

word : Lorem
number : 14
word : ipsum
direct_speech
   word : dolor
   number : 22
   word : sit
word : amet,
word : consectetur.

It seems like going all round the houses, but there does not appear to be a better way.

Answered 2013-05-23T07:07:11.123
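If lxml is not available, the same walk can be sketched with the standard library's xml.etree.ElementTree. The XML string below is hand-written to mirror the structure of the output above; it is an assumption for illustration, not captured asXML() output, and walk is a hypothetical helper. The code is Python 3.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for r.sentence.asXML('sentence').
xml_text = """
<sentence>
  <word>Lorem</word>
  <number>14</number>
  <word>ipsum</word>
  <direct_speech>
    <word>dolor</word>
    <number>22</number>
    <word>sit</word>
  </direct_speech>
  <word>amet,</word>
  <word>consectetur.</word>
</sentence>
"""


def walk(root):
    """Return one line per element, indenting children of nested groups."""
    lines = []
    for el in root:
        if len(el):  # element has children, so it is a nested group
            lines.append(el.tag)
            for sub_el in el:
                lines.append('   %s : %s' % (sub_el.tag, sub_el.text))
        else:
            lines.append('%s : %s' % (el.tag, el.text))
    return lines


root = ET.fromstring(xml_text.strip())
for line in walk(root):
    print(line)
```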