python - How to write a grammar for this in pyparsing: match a set of words but not a given pattern

Question

I am new to Python and pyparsing. I need to accomplish the following.

My sample lines of text look like this:

12 items - Ironing Service    11 Mar 2009 to 10 Apr 2009
Washing service (3 Shirt)  23 Mar 2009

I need to extract the item description and the period.

from pyparsing import Combine, Word, alphanums, alphas, nums

tok_date_in_ddmmmyyyy = Combine(Word(nums,min=1,max=2)+ " " + Word(alphas, exact=3) + " " + Word(nums,exact=4))
tok_period = Combine((tok_date_in_ddmmmyyyy + " to " + tok_date_in_ddmmmyyyy)|tok_date_in_ddmmmyyyy)

tok_desc =  Word(alphanums+"-()")  # ...but it should stop before tok_period

How can this be done?

score 5 · Accepted Answer

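I would suggest SkipTo as the most appropriate pyparsing class, since you have a good definition of the text you do not want, but will accept pretty much anything before it. Here are a couple of ways to use SkipTo: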

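from pyparsing import Combine, SkipTo, Word, alphas, nums

text = """\
12 items - Ironing Service    11 Mar 2009 to 10 Apr 2009
Washing service (3 Shirt)  23 Mar 2009"""

# tok_period as defined in the OP
tok_date_in_ddmmmyyyy = Combine(Word(nums, min=1, max=2) + " " + Word(alphas, exact=3) + " " + Word(nums, exact=4))
tok_period = Combine((tok_date_in_ddmmmyyyy + " to " + tok_date_in_ddmmmyyyy) | tok_date_in_ddmmmyyyy)

# parse each line separately
for tx in text.splitlines():
    print(SkipTo(tok_period).parseString(tx)[0])

# or have pyparsing search through the whole input string using searchString;
# with include=True the first token of each match is the skipped text
# (recent pyparsing releases return flat tokens; older ones nested them as [[td, period]])
for tokens in SkipTo(tok_period, include=True).searchString(text):
    print(tokens[0])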

Both for loops print the following:

12 items - Ironing Service    
Washing service (3 Shirt)

score 3 · Accepted Answer

MK Saravanan, this particular parsing problem is not that hard to solve with good ol' re:

import re

text='''
12 items - Ironing Service    11 Mar 2009 to 10 Apr 2009
Washing service (3 Shirt)  23 Mar 2009
This line does not match
'''

date_pat=re.compile(
    r'(\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}(?:\s+to\s+\d{1,2}\s+[a-zA-Z]{3}\s+\d{4})?)')
for line in text.splitlines():
    if line:
        try:
            # str.strip replaces Python 2's string.strip, which Python 3 no longer has
            description,period=map(str.strip,date_pat.split(line)[:2])
            print((description,period))
        except ValueError:
            # The line does not match
            pass

yields

# ('12 items - Ironing Service', '11 Mar 2009 to 10 Apr 2009')
# ('Washing service (3 Shirt)', '23 Mar 2009')

The main workhorse here is, of course, the re pattern. Let's break it apart:

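\d{1,2}\s+[a-zA-Z]{3}\s+\d{4} is the regex for a date, the equivalent of tok_date_in_ddmmmyyyy. \d{1,2} matches one or two digits, \s+ matches one or more whitespace characters, [a-zA-Z]{3} matches 3 letters, and \d{4} matches four digits.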
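To see what the date sub-pattern accepts and rejects, here is a small sketch that checks it against a few sample strings. The sample strings and the names date_re, samples and results are made up for illustration. Note that the pattern only checks the shape of a date, not whether the month or day is real.

```python
import re

# The date sub-pattern from the answer, used on its own.
date_re = re.compile(r'\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}')

# Illustrative inputs (not from the original post).
samples = ['11 Mar 2009', '3 Apr 2009', '23  Mar   2009',
           '11 March 2009', '111 Mar 2009', '99 Xyz 0000']

# fullmatch() requires the whole string to fit the pattern.
results = {s: bool(date_re.fullmatch(s)) for s in samples}
for s, ok in results.items():
    print(f'{s!r}: {ok}')
```

The last sample is accepted even though it is not a real date, because the pattern only looks at digits, letters and whitespace.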

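(?:\s+to\s+\d{1,2}\s+[a-zA-Z]{3}\s+\d{4})? is a regex wrapped in (?:...). This denotes a non-capturing group: no group (e.g. match.group(2)) is assigned to this part of the pattern. That matters because date_pat.split() returns a list in which every capturing group becomes a member. By suppressing the capture, we keep the whole period 11 Mar 2009 to 10 Apr 2009 together as one item. The question mark at the end means this part may occur zero times or once, which lets the regex match both 23 Mar 2009 and 11 Mar 2009 to 10 Apr 2009.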
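The effect of the non-capturing group on split() can be seen by comparing it with a capturing variant. This is only a sketch: the capturing pattern is a hypothetical alternative built for comparison, and the variable names are illustrative.

```python
import re

line = '12 items - Ironing Service    11 Mar 2009 to 10 Apr 2009'
date = r'\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}'

# As in the answer: the optional " to <date>" tail is non-capturing.
non_capturing = re.compile(r'(' + date + r'(?:\s+to\s+' + date + r')?)')
# Hypothetical variant for comparison: the tail is a capturing group.
capturing = re.compile(r'(' + date + r'(\s+to\s+' + date + r')?)')

parts_nc = non_capturing.split(line)
parts_c = capturing.split(line)

print(parts_nc)  # description, whole period, trailing text
print(parts_c)   # the tail shows up as an extra list element
```

With the capturing variant the list gains an extra element for the inner group, so the description and period would no longer be the only two leading items of interest.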

text.splitlines() splits the text on \n.
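As a quick illustration of why the loop checks `if line:`, here is what splitlines() returns for the sample text. Because the triple-quoted string starts with a newline, the first element is an empty string. The names lines and non_empty are illustrative.

```python
text = '''
12 items - Ironing Service    11 Mar 2009 to 10 Apr 2009
Washing service (3 Shirt)  23 Mar 2009
This line does not match
'''

lines = text.splitlines()
print(lines)

# Skipping empty strings leaves only the three real lines.
non_empty = [line for line in lines if line]
print(len(non_empty))
```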

date_pat.split('12 items - Ironing Service 11 Mar 2009 to 10 Apr 2009')

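splits the string on the date_pat regex. Because the whole pattern is wrapped in a capturing group, the matched text is included in the returned list. Thus we get: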

['12 items - Ironing Service ', '11 Mar 2009 to 10 Apr 2009', '']
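The third element is whatever follows the match, which is empty here. A small sketch with a made-up line that has extra text after the date shows why the answer slices the result with [:2]:

```python
import re

date_pat = re.compile(
    r'(\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}(?:\s+to\s+\d{1,2}\s+[a-zA-Z]{3}\s+\d{4})?)')

# Hypothetical line with extra text after the date (not from the original post).
parts = date_pat.split('Washing service (3 Shirt)  23 Mar 2009  paid in cash')

print(parts)      # [text before, captured period, text after]
print(parts[:2])  # keeps only the description and the period
```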

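map(str.strip, date_pat.split(line)[:2]) (written string.strip under Python 2) tidies up the result by stripping the surrounding whitespace.

If line does not match date_pat, then date_pat.split(line) returns [line], so

description,period=map(str.strip,date_pat.split(line)[:2])

raises a ValueError, because a list with only one element cannot be unpacked into a 2-tuple. We catch this exception and simply pass on to the next line.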
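If relying on the ValueError feels too implicit, one possible alternative (a sketch, not the approach used in this answer) is to call search() with named groups and test for None. The helper split_line and the names DATE and LINE_RE are hypothetical.

```python
import re

DATE = r'\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}'
# Lazy description, optional whitespace, then a single date or a "date to date" period.
LINE_RE = re.compile(
    r'^(?P<description>.*?)\s*(?P<period>' + DATE + r'(?:\s+to\s+' + DATE + r')?)')

def split_line(line):
    """Return (description, period), or None if the line has no date."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return m.group('description').strip(), m.group('period')

for line in ['12 items - Ironing Service    11 Mar 2009 to 10 Apr 2009',
             'Washing service (3 Shirt)  23 Mar 2009',
             'This line does not match']:
    print(split_line(line))
```

The trade-off is a slightly longer pattern in exchange for an explicit "no match" result instead of an exception.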
