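I think there are three general approaches that can help you avoid repeating code at the end of a loop. For all three, I'll use an example problem that is slightly different from your own: counting the words in a string. Here's a "default" version that, like your code, repeats some logic after the loop ends: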
from collections import Counter

def countWords0(text):
    counts = Counter()
    word = ""
    for c in text.lower():
        if c not in "abcdefghijklmnopqrstuvwxyz'-":
            if word:
                counts[word] += 1
            word = ""
        else:
            word += c
    if word:
        counts[word] += 1 # repeated code at end of loop
    return counts
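The first approach is to do (some of) the "end of subsequence" processing after every character, so that the bookkeeping is already correct if the sequence ends immediately after that character. In your example, you could eliminate your `else` condition and run the code within it every time. (This is sergerg's answer.)
This may not be easy for some kinds of checks, though. For counting words, you need to add some extra logic to avoid accumulating cruft from the "partial" subsequences you process along the way. Here's code that does that: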
def countWords1(text):
    counts = Counter()
    word = ""
    for c in text.lower():
        if c not in "abcdefghijklmnopqrstuvwxyz'-":
            word = ""
        else:
            if word:
                counts[word] -= 1 # new extra logic
            word += c
            counts[word] += 1 # this line was moved from above
    return counts + Counter() # more new stuff, to remove crufty zero-count items
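For comparison, here is a minimal sketch of the same first approach applied to a run-length problem. It assumes the question is about finding the longest run of equal consecutive items; the function name `longest_run_length` and its variables are hypothetical and not taken from the question. The result is updated after every item, so no cleanup is needed once the loop ends:

```python
def longest_run_length(items):
    """Return the length of the longest run of equal consecutive items.

    Hypothetical helper, written only to illustrate the first approach.
    """
    best = 0
    current = 0
    previous = object()  # placeholder that compares unequal to any real item
    for item in items:
        if item == previous:
            current += 1
        else:
            current = 1
        previous = item
        # Do the "end of run" bookkeeping after every item, so the result
        # is already correct if the sequence stops right here.
        best = max(best, current)
    return best
```

Here the extra work per item is just one `max` call, which is why this approach fits the run-length case more naturally than it fits word counting.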
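The second option is to append a sentinel value to the end of the sequence, which triggers the desired "end of subsequence" behavior. This can be tricky if you need to keep the sentinel from contaminating your data (especially for things like numbers). For your longest consecutive subsequence problem, you could append any value that is not equal to the last item in the sequence. `None` might be a good choice. For my word counting example, a non-word character (such as a newline) will do: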
def countWords2(text):
    counts = Counter()
    word = ""
    for c in text.lower() + "\n": # NOTE: added a sentinel to the string!
        if c not in "abcdefghijklmnopqrstuvwxyz'-":
            if word:
                counts[word] += 1
            word = ""
        else:
            word += c
    # no need to recheck at the end, since we know the text now ends with
    # a non-word character (the newline sentinel)
    return counts
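The sentinel idea could be sketched for a run-length problem as follows, again assuming the goal is the longest run of equal consecutive items. The name `longest_run_with_sentinel` is hypothetical. Instead of `None`, this sketch uses a fresh `object()` as the sentinel, which is my own substitution so that data containing `None` is still handled:

```python
from itertools import chain

def longest_run_with_sentinel(items):
    """Return the length of the longest run of equal consecutive items.

    Hypothetical helper, written only to illustrate the sentinel approach.
    """
    sentinel = object()  # a fresh object compares unequal to any real item
    best = 0
    current = 0          # length of the run in progress (0 = no run yet)
    previous = None
    # Appending the sentinel forces one last "run ended" step inside the loop.
    for item in chain(items, [sentinel]):
        if current and item == previous:
            current += 1
        else:
            best = max(best, current)  # end-of-run handling, in one place only
            current = 1
        previous = item
    return best
```

Using `chain` rather than `items + [sentinel]` means the input can be any iterable, not just a list.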
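The third approach is to change the structure of the code so that it does not iterate over a sequence that may end unexpectedly. You could use generators to preprocess the sequence, as in the other answers that use `groupby` from `itertools`. (Of course, generator functions, if you have to write them yourself, may have similar issues.)
For my word counting example, I can use a regular expression from the `re` module to find the words: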
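from re import finditer

def countWords3(text):
    # raw string avoids an invalid-escape warning; note that \w also matches
    # digits, underscores and non-ASCII letters, unlike the versions above
    return Counter(match.group() for match in
                   finditer(r"[\w'-]+", text.lower()))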
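As a rough sketch of the `groupby` variant of this third approach, the word counter could also be written as below. The names `count_words_groupby` and `WORD_CHARS` are mine, chosen for illustration. `groupby` splits the text into alternating runs of word and non-word characters, so the final word is emitted like any other and there is no end-of-loop special case:

```python
from collections import Counter
from itertools import groupby

# Same character set as the loop-based versions of the word counter.
WORD_CHARS = frozenset("abcdefghijklmnopqrstuvwxyz'-")

def count_words_groupby(text):
    """Count words by grouping consecutive word characters (illustrative sketch)."""
    counts = Counter()
    # The key is True for word characters and False for everything else, so
    # each group is either one whole word or one stretch of separators.
    for is_word, chars in groupby(text.lower(), key=WORD_CHARS.__contains__):
        if is_word:
            counts["".join(chars)] += 1
    return counts
```

Unlike the regular expression version, this keeps exactly the same notion of a "word character" as the explicit loops.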
The output, when given a suitably Pythonic text (it is the same for all four versions of countWords):
>>> text = """Well, there's egg and bacon; egg sausage and bacon;
egg and spam; egg bacon and spam; egg bacon sausage and spam;
spam bacon sausage and spam; spam egg spam spam bacon and spam;
spam sausage spam spam bacon spam tomato and spam;
spam spam spam egg and spam; spam spam spam spam spam spam
baked beans spam spam spam; or Lobster Thermidor a Crevette
with a mornay sauce served in a Provencale manner with shallots
and aubergines garnished with truffle pate, brandy and with a
fried egg on top and spam."""
>>> countWords0(text)
Counter({'spam': 28, 'and': 12, 'egg': 8, 'bacon': 7, 'sausage': 4, 'a': 4,
'with': 4, 'well': 1, 'lobster': 1, 'manner': 1, 'in': 1, 'top': 1,
'thermidor': 1, "there's": 1, 'truffle': 1, 'provencale': 1,
'sauce': 1, 'brandy': 1, 'pate': 1, 'shallots': 1, 'garnished': 1,
'tomato': 1, 'on': 1, 'baked': 1, 'aubergines': 1, 'mornay': 1,
'beans': 1, 'served': 1, 'fried': 1, 'crevette': 1, 'or': 1})