python - How to determine whether a word in a string is a duplicate?

Question

I want to write a function that takes a file name as a string and returns True if the file contains a repeated word, and False otherwise.

So far I have:

def double(filename):
    infile = open(filename, 'r')
    res = False
    l = infile.split()
    infile.close()

    for line in l:
        #if line is in l twice
        res = True
    return res

If my file contains: "there is there a same word"

I should get True.

If my file contains: "there is no same word"

I should get False.

How can I determine whether a string contains a repeated word?

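P.S. The repeated word does not have to come right after its first occurrence, i.e. "there is a same word in the sentence over there" should return True, because "there" is also repeated.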

score 4 · Accepted Answer

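Because of apostrophes and punctuation, the str.split() method does not work well for splitting natural English text into words. For that, you usually need the power of regular expressions: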

>>> text = """I ain't gonna say ain't, because it isn't
in the dictionary. But my dictionary has it anyways."""

>>> text.lower().split()
['i', "ain't", 'gonna', 'say', "ain't,", 'because', 'it', "isn't", 'in', 'the',
 'dictionary.', 'but', 'my', 'dictionary', 'has', 'it', 'anyways.']

>>> import re
>>> words = re.findall(r"[a-z']+", text.lower())
>>> words
['i', "ain't", 'gonna', 'say', "ain't", 'because', 'it', "isn't", 'in', 'the',
 'dictionary', 'but', 'my', 'dictionary', 'has', 'it', 'anyways']

To find out whether there are any repeated words, you can use set operations:

>>> len(words) != len(set(words))
True
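Putting the two steps above together for a file, a minimal sketch might look like the following. The function name `has_repeated_word` is hypothetical, and the pattern `[a-z']+` is simply carried over from the example above, so it only recognizes ASCII letters and apostrophes.

```python
import re

def has_repeated_word(filename):
    """Return True if any word occurs more than once in the file (case-insensitive)."""
    with open(filename, 'r') as infile:
        # Lowercase first so that "Dictionary" and "dictionary" count as the same word.
        words = re.findall(r"[a-z']+", infile.read().lower())
    # A set drops duplicates, so a shorter set means some word was repeated.
    return len(words) != len(set(words))
```

This reads the whole file into memory, which is fine for small files; a line-by-line loop with a set can stop earlier on large ones.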

To list the repeated words, use the multiset operations in collections.Counter:

>>> from collections import Counter
>>> sorted(Counter(words) - Counter(set(words)))
["ain't", 'dictionary', 'it']
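The same list can probably also be obtained by filtering on the counts directly rather than subtracting one counter from another. This is only a sketch; `repeated_words` is a hypothetical helper name, not something from the answer above.

```python
from collections import Counter

def repeated_words(words):
    """Return a sorted list of the words that occur more than once."""
    counts = Counter(words)
    return sorted(word for word, n in counts.items() if n > 1)

print(repeated_words(['it', 'has', 'it', 'my', 'my', 'it']))  # ['it', 'my']
```

Filtering on `n > 1` may read more directly than the subtraction, and the counts stay available if you also want to report how often each word appears.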

score 3 · Accepted Answer

def has_duplicates(filename):
    seen = set()  # words encountered so far
    with open(filename) as infile:  # the file is closed even on an early return
        for line in infile:
            for word in line.split():
                if word in seen:
                    return True
                seen.add(word)
    return False

score 0 · Accepted Answer

Use a set to detect duplicates:

def double(filename):
    seen = set()
    with open(filename, 'r') as infile:
        for line in infile:  # iterate over the open file, line by line
            for word in line.split():
                if word in seen:
                    return True
                seen.add(word)
    return False

You can shorten this to:

def double(filename):
    seen = set()
    with open(filename, 'r') as infile:
        # seen.add() returns None, so the expression is truthy only for a word already in seen
        return any(word in seen or seen.add(word) for line in infile for word in line.split())

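Both versions exit early: as soon as a repeated word is found, the function returns True. It does have to read the whole file to establish that there are no duplicates and return False.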
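The early exit can be made visible with a small sketch that works on any iterable of words instead of a file. Both `has_repeat` and the `counting` wrapper are hypothetical names introduced here for illustration; the wrapper just records which words were actually read.

```python
def has_repeat(words):
    """Return True as soon as a word is seen for the second time."""
    seen = set()
    for word in words:
        if word in seen:
            return True
        seen.add(word)
    return False

def counting(words, consumed):
    """Yield words one by one, appending each to `consumed` as it is read."""
    for word in words:
        consumed.append(word)
        yield word

consumed = []
print(has_repeat(counting("a b a c d e".split(), consumed)))  # True
print(consumed)  # ['a', 'b', 'a']
```

Only the first three words are consumed; the rest of the input is never read once the repeat is found.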

score 0 · Accepted Answer


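def double(filename):
    with open(filename, 'r') as infile:
        l = infile.read().split()  # list of whitespace-separated words
    a = set()
    for word in l:
        if word in a:
            return True
        a.add(word)
    return False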

answered 2013-06-02T17:23:38.797

score 0 · Accepted Answer

Another general way to detect repeated words, using collections.Counter:

from itertools import chain
from collections import Counter

with open('test_file.txt') as f:
    # Count every whitespace-separated word across all lines.
    x = Counter(chain.from_iterable(line.split() for line in f))

for key, value in x.items():  # .items() replaces Python 2's .iteritems()
    if value > 1:
        print(key)