python - Need help splitting a string in Python

Question

I am trying to tokenize a string using the following pattern.

>>> splitter = re.compile(r'((\w*)(\d*)\-\s?(\w*)(\d*)|(?x)\$?\d+(\.\d+)?(\,\d+)?|([A-Z]\.)+|(Mr)\.|(Sen)\.|(Miss)\.|.$|\w+|[^\w\s])')
>>> splitter.split("Hello! Hi, I am debating this predicament called life. Can you help me?")

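I get the following output. Could somebody point out what I need to correct? I am confused by the bunch of None values. Also, if there is a better way to tokenize a string, I would really appreciate the extra help.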

['', 'Hello', None, None, None, None, None, None, None, None, None, None, '', '!', None, None, None, None, None, None, None, None, None, None, ' ', 'Hi', None,None, None, None, None, None, None, None, None, None, '', ',', None, None, None, None, None, None, None, None, None, None, ' ', 'I', None, None, None, None, None, None, None, None, None, None, ' ', 'am', None, None, None, None, None, None,None, None, None, None, ' ', 'debating', None, None, None, None, None, None, None, None, None, None, ' ', 'this', None, None, None, None, None, None, None, None, None, None, ' ', 'predicament', None, None, None, None, None, None, None, None, None, None, ' ', 'called', None, None, None, None, None, None, None, None, None, None, ' ', 'life', None, None, None, None, None, None, None, None, None, None, '', '.', None, None, None, None, None, None, None, None, None, None, ' ', 'Can', None, None, None, None, None, None, None, None, None, None, ' ', 'you', None, None, None, None, None, None, None, None, None, None, ' ', 'help', None, None,None, None, None, None, None, None, None, None, ' ', 'me', None, None, None, None, None, None, None, None, None, None, '', '?', None, None, None, None, None, None, None, None, None, None, '']

The output I want is:

['Hello', '!', 'Hi', ',', 'I', 'am', 'debating', 'this', 'predicament', 'called', 'life', '.', 'Can', 'you', 'help', 'me', '?']

Thank you.

score 4 · Accepted Answer

re.split rapidly runs out of puff when used as a tokenizer. Preferable is findall (or match in a loop) with a pattern of alternatives such as this|that|another|more.

>>> s = "Hello! Hi, I am debating this predicament called life. Can you help me?"
>>> import re
>>> re.findall(r"\w+|\S", s)
['Hello', '!', 'Hi', ',', 'I', 'am', 'debating', 'this', 'predicament', 'called', 'life', '.', 'Can', 'you', 'help', 'me', '?']
>>>

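This defines a token as one or more "word" characters, or a single character that is not whitespace. You may prefer [A-Za-z] or [A-Za-z0-9] or something else instead of \w (which allows underscores). You might even want something like r"[A-Za-z]+|[0-9]+|\S".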
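To see how the choice of character class changes the tokens, here is a small sketch. The sample string is made up for illustration and does not come from the question.

```python
import re

sample = "snake_case x2 = 10!"  # hypothetical input, not from the question

# \w+ keeps underscores and digits inside a "word" token.
loose = re.findall(r"\w+|\S", sample)

# Separate letter and digit classes split mixed tokens apart.
strict = re.findall(r"[A-Za-z]+|[0-9]+|\S", sample)

print(loose)   # ['snake_case', 'x2', '=', '10', '!']
print(strict)  # ['snake', '_', 'case', 'x', '2', '=', '10', '!']
```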

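If things like Sen., Mr. and Miss (and what happened to Mrs and Ms?) are significant to you, your regex should not list them. It should just define a token that ends in ., and you should keep a dictionary or set of probable abbreviations.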
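A minimal sketch of that idea follows. The names ABBREVIATIONS, TOKEN_RE and tokenize are invented for this example, and the abbreviation set is only a placeholder that you would extend for your own text.

```python
import re

# Hypothetical abbreviation set; extend it to suit your own text.
ABBREVIATIONS = {"Mr.", "Mrs.", "Ms.", "Sen."}

# A run of letters optionally followed by one period, a run of digits,
# or any single non-whitespace character.
TOKEN_RE = re.compile(r"[A-Za-z]+\.?|[0-9]+|\S")

def tokenize(text):
    tokens = []
    for tok in TOKEN_RE.findall(text):
        if len(tok) > 1 and tok.endswith(".") and tok not in ABBREVIATIONS:
            # Not a known abbreviation: emit the period as its own token.
            tokens.extend([tok[:-1], "."])
        else:
            tokens.append(tok)
    return tokens

print(tokenize("Mr. Smith met Sen. Jones. They talked."))
# ['Mr.', 'Smith', 'met', 'Sen.', 'Jones', '.', 'They', 'talked', '.']
```

The regex stays generic, and the decision about what counts as an abbreviation lives in plain data that is easy to change.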

Splitting text into sentences is complicated. You may like to look at the nltk package instead of trying to reinvent the wheel.

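Update: if you need or want to distinguish between the types of tokens, you can get an index or a name like this, without a (possibly long) chain of if/elif/elif/.../else: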

>>> s = "Hello! Hi, I we 0 1 987?"

>>> pattern = r"([A-Za-z]+)|([0-9]+)|(\S)"
>>> list((m.lastindex, m.group()) for m in re.finditer(pattern, s))
[(1, 'Hello'), (3, '!'), (1, 'Hi'), (3, ','), (1, 'I'), (1, 'we'), (2, '0'), (2, '1'), (2, '987'), (3, '?')]

>>> pattern = r"(?P<word>[A-Za-z]+)|(?P<number>[0-9]+)|(?P<other>\S)"
>>> list((m.lastgroup, m.group()) for m in re.finditer(pattern, s))
[('word', 'Hello'), ('other', '!'), ('word', 'Hi'), ('other', ','), ('word', 'I'), ('word', 'we'), ('number', '0'), ('number', '1'), ('number', '987'), ('other', '?')]
>>>

score 4 · Accepted Answer

I recommend NLTK's tokenizer. Then you don't need to worry about tedious regular expressions yourself:

>>> import nltk
>>> nltk.word_tokenize("Hello! Hi, I am debating this predicament called life. Can you help me?")
['Hello', '!', 'Hi', ',', 'I', 'am', 'debating', 'this', 'predicament', 'called', 'life.', 'Can', 'you', 'help', 'me', '?']

score 2 · Accepted Answer

I might be missing something, but I believe something like the following would work:

s = "Hello! Hi, I am debating this predicament called life. Can you help me?"
s.split(" ")

This assumes you want to split on spaces. You should get something like:

['Hello!', 'Hi,', 'I', 'am', 'debating', 'this', 'predicament', 'called', 'life.', 'Can', 'you', 'help', 'me?']

With this, if you need a specific piece, you can loop through the list to get what you need.

Hope this helps....

score 1 · Accepted Answer

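The reason you are getting all of those None values is that you have lots of parenthesized groups in your regular expression, separated by |'s. Every time your regular expression finds a match, it matches only one of the alternatives given by the |'s. The parenthesized groups in the other, unused alternatives are set to None. And re.split, by definition, reports the values of all parenthesized groups every time it gets a match, hence the many None's in your result.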
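The effect is easier to see on a much smaller pattern. This is only an illustrative sketch; the short input string and the variable names are invented here.

```python
import re

text = "Hi, you"

# Two capturing groups separated by |: only one takes part in each match,
# and re.split reports the other one as None.
two_groups = re.split(r"(\w+)|(,)", text)
print(two_groups)
# ['', 'Hi', None, '', None, ',', ' ', 'you', None, '']

# A single capturing group around the whole alternation leaves no unused
# groups, so no None values appear (empty and whitespace pieces remain).
one_group = re.split(r"(\w+|,)", text)
print(one_group)
# ['', 'Hi', '', ',', ' ', 'you', '']
```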

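You could filter those out pretty easily (e.g. tokens = [t for t in tokens if t] or something similar), but I think split is not really the tool you want for tokenizing. split is meant for just throwing away whitespace. If you really want to use regular expressions to tokenize something, here is a toy example of another method (I'm not even going to try to unpick that monster you're using... use the re.VERBOSE option, for the love of Ned... but hopefully this toy example will give you the idea):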

import re

tokenpattern = re.compile(r"""
(?P<words>[A-Za-z_]+) # Runs of letters and underscores (\w+ would also swallow digits)
|(?P<numbers>\d+) # Runs of digits
|(?P<other>.+?) # Anything else, one character at a time
""", re.VERBOSE)

The (?P<something>...) business lets you identify, by name, the type of token you found in the code below:

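for match in tokenpattern.finditer("99 bottles of beer"):
  if match.group('words'):
    # This token is a word
    word = match.group('words')
    #...
  elif match.group('numbers'):
    # This token is a number
    number = int(match.group('numbers'))
  else:
    # Anything else, including whitespace
    other = match.group('other')

Note that this still uses a bunch of parenthesized groups separated by |'s, so the same thing happens as in your code: for each match, one group is defined and the others are set to None. This method checks for that explicitly.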

score 0 · Accepted Answer

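Maybe he didn't mean it this way, but John Machin's comment "str.split is not a place to start" (part of the exchange after Frank V's answer) came across as a bit of a challenge. So ...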

the_string = "Hello! Hi, I am debating this predicament called life. Can you help me?"
tokens = the_string.split()
punctuation = ['!', ',', '.', '?']
output_list = []
for token in tokens:
    if token[-1] in punctuation:
        output_list.append(token[:-1])
        output_list.append(token[-1])
    else:
        output_list.append(token)
print(output_list)

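This seems to provide the requested output.

Granted, John's answer is simpler in terms of the number of lines of code. However, I have a couple of points in support of this sort of solution.

I don't completely agree with Jamie Zawinski's "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." (Neither did he, from what I have read.) My point in quoting this is that regular expressions can be a pain to get working if you are not accustomed to them.

Also, while it normally won't be an issue, the performance of the above solution was consistently better than the regex solution when measured with timeit. The above solution (with the print statement removed) came in at about 8.9 seconds; John's regex solution came in at about 11.8 seconds. This involved 10 attempts, each of 1 million iterations, on a quad-core dual-processor system running at 2.4 GHz.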
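For anyone who wants to repeat a comparison like this, a rough timeit sketch might look as follows. The function names and the iteration count are choices made for this example, not the setup used for the figures above, and timings will differ between machines and Python versions.

```python
import re
import timeit

S = "Hello! Hi, I am debating this predicament called life. Can you help me?"
PUNCTUATION = {"!", ",", ".", "?"}

def split_tokens(text):
    # str.split approach: peel one trailing punctuation mark off each piece.
    out = []
    for token in text.split():
        if token[-1] in PUNCTUATION:
            out.append(token[:-1])
            out.append(token[-1])
        else:
            out.append(token)
    return out

def regex_tokens(text):
    # Regex approach from the findall answer.
    return re.findall(r"\w+|\S", text)

# Both approaches give the same tokens for this sentence.
assert split_tokens(S) == regex_tokens(S)

# 10,000 iterations is an arbitrary, small count chosen for this sketch.
for fn in (split_tokens, regex_tokens):
    seconds = timeit.timeit(lambda: fn(S), number=10_000)
    print(f"{fn.__name__}: {seconds:.3f} s")
```

Note that the str.split version only handles a single trailing punctuation mark from a fixed set, so the two functions agree on this sentence but not on arbitrary text.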
