python - Splitting lines in Python

Question

I have data in the following form:

<a> <b> <c> <This is a string>
<World Bank> <provides> <loans for> <"a Country's Welfare">
<Facebook> <is a> <social networking site> <"Happy Facebooking => Enjoy">

Now I want to split each of the lines above on the <> delimiters. That is, I want to get:

['<a>', '<b>', '<c>', '<This is a string>']
['<World Bank>', '<provides>', '<loans for>', '<"a Country\'s Welfare">']
['<Facebook>', '<is a>', '<social networking site>', '<"Happy Facebooking => Enjoy">']

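I tried splitting on whitespace and on ">", but neither works. Is there another way in Python to split the lines as shown above? My file is 1 TB in size, so I cannot do this by hand.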

score 7 · Accepted Answer

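You want to split on the whitespace between the > and < characters. For that, you need a regular expression with look-behind and look-ahead assertions:

import re

re.split(r'(?<=>)\s+(?=<)', line)

This splits on any run of whitespace (\s+) that is preceded by a > character and followed by a < character.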

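The (?<=...) expression is a look-behind assertion; it matches a position in the input text where the pattern inside the assertion matches the text immediately before that position. Here, (?<=>) matches any position that has a > character right before it.

The (?=...) expression works like the look-behind assertion, but it looks for the matching characters after the current position. It is called a look-ahead assertion. (?=<) matches any position that is immediately followed by a < character.

Together these form two anchors, and the \s+ between them only matches whitespace that sits between a > and a <, not those two characters themselves. Splitting breaks up the input string by removing the matched text; because only the whitespace is matched, the > and < characters stay attached to the pieces on either side.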
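To see what the two assertions buy you, here is a small sketch that compares a plain whitespace split with the look-around version. The sample string is taken from the question; the variable names naive and anchored are just for illustration.

```python
import re

line = '<World Bank> <provides> <loans for>'

# Plain whitespace split: every run of whitespace is a separator,
# so the spaces inside the tags break the tags apart as well.
naive = re.split(r'\s+', line)

# Look-around split: whitespace only counts as a separator when it
# sits directly between a '>' and a '<'.
anchored = re.split(r'(?<=>)\s+(?=<)', line)

print(naive)     # ['<World', 'Bank>', '<provides>', '<loans', 'for>']
print(anchored)  # ['<World Bank>', '<provides>', '<loans for>']
```

Because the assertions are zero-width, the > and < characters are never consumed and therefore never removed from the result.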

Demo:

>>> re.split(r'(?<=>)\s+(?=<)', '<a> <b> <c> <This is a string>')
['<a>', '<b>', '<c>', '<This is a string>']
>>> re.split(r'(?<=>)\s+(?=<)', '''<World Bank> <provides> <loans for> <"a Country's Welfare">''')
['<World Bank>', '<provides>', '<loans for>', '<"a Country\'s Welfare">']
>>> re.split(r'(?<=>)\s+(?=<)', '<Facebook> <is a> <social networking site> <"Happy Facebooking => Enjoy">')
['<Facebook>', '<is a>', '<social networking site>', '<"Happy Facebooking => Enjoy">']
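Since the question mentions a 1 TB file, the split would presumably be applied one line at a time rather than to the whole file at once. The following is only a sketch under that assumption (one record per line). The names TAG_SEPARATOR and iter_records are hypothetical, and an in-memory io.StringIO stands in for the real file.

```python
import io
import re

# Compile the pattern once so it is not re-parsed for every line.
TAG_SEPARATOR = re.compile(r'(?<=>)\s+(?=<)')

def iter_records(lines):
    """Yield the list of tags for each non-empty line of an iterable of lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield TAG_SEPARATOR.split(line)

# Stand-in for a real file; an open file object can be passed the same way,
# and iterating over it reads one line at a time.
sample = io.StringIO(
    '<a> <b> <c> <This is a string>\n'
    '<Facebook> <is a> <social networking site>\n'
)
records = list(iter_records(sample))
print(records)
```

With a real file, you would consume the generator lazily (for example in a for loop) instead of calling list() on it, so that only one line is held in memory at a time.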

score 0 · Answer

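Here is a "build your own parser" approach that simply walks through the file character by character and does not use any fancy regular expression features: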

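def tag_yielder(line):
    """Yield each <...> tag in line; a '>' right after '=' does not close a tag."""
    in_tag = False
    escape = False  # True when the previous character inside a tag was '='
    current_tag = ''
    for char in line:
        if in_tag:
            current_tag += char
            if char == '>' and not escape:
                # An unescaped '>' closes the current tag.
                yield current_tag
                current_tag = ''
                in_tag = False
            # '=' escapes an immediately following '>', as in '=>'.
            escape = (char == '=')
        elif char == '<':
            # Start a new tag; characters outside tags are ignored.
            current_tag = '<'
            in_tag = True

with open('tag_text.txt') as f:
    for line in f:
        print(list(tag_yielder(line.strip())))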

Output:

['<a>', '<b>', '<c>', '<This is a string>']
['<World Bank>', '<provides>', '<loans for>', '<"a Country\'s Welfare">']
['<Facebook>', '<is a>', '<social networking site>', '<"Happy Facebooking => Enjoy">']
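The two answers agree on the sample data but can differ on input that does not follow the "tags separated by whitespace" layout. The sketch below compares them on two made-up inputs that are not from the question. tag_yielder is repeated from the answer above so the block runs on its own, and regex_split is a hypothetical wrapper around the regular expression from the first answer.

```python
import re

def tag_yielder(line):
    """Character-by-character tag scanner, as in the answer above."""
    in_tag = False
    escape = False
    current_tag = ''
    for char in line:
        if in_tag:
            current_tag += char
            if char == '>' and not escape:
                yield current_tag
                current_tag = ''
                in_tag = False
            escape = (char == '=')
        elif char == '<':
            current_tag = '<'
            in_tag = True

def regex_split(line):
    """Hypothetical wrapper around the look-around split from the first answer."""
    return re.split(r'(?<=>)\s+(?=<)', line)

# Made-up input 1: tags with no whitespace between them.
no_space = '<a><b>'
print(list(tag_yielder(no_space)))  # ['<a>', '<b>']
print(regex_split(no_space))        # ['<a><b>']

# Made-up input 2: stray text between tags.
stray = '<a> junk <b>'
print(list(tag_yielder(stray)))     # ['<a>', '<b>']
print(regex_split(stray))           # ['<a> junk <b>']
```

Which behavior is preferable depends on what the real data looks like. The regular expression only ever splits on whitespace, whereas the scanner extracts tags and silently drops anything outside them.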
