python - Regular expression to get a list of all words with specific letters (Unicode graphemes)

Question

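I'm writing a Python script for a FOSS language-learning initiative. Suppose I have an XML file (or, to keep it simple, a Python list) containing a list of words in a particular language (in my case the words are in Tamil, which uses a Brahmi-based Indic script).

I need to extract the subset of those words that can be spelled using only a given set of letters.

An English example: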

words = ["cat", "dog", "tack", "coat"] 

get_words(['o', 'c', 'a', 't']) should return ["cat", "coat"]
get_words(['k', 'c', 't', 'a']) should return ["cat", "tack"]

A Tamil example:

words = [u"மரம்", u"மடம்", u"படம்", u"பாடம்"]

get_words([u'ம', u'ப', u'ட', u'ம்']) should return [u"மடம்", u"படம்"]
get_words([u'ப', u'ம்', u'ட']) should return [u"படம்"]

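The order in which the words are returned, or the order in which the letters are supplied, should not make a difference.

Although I understand the difference between Unicode code points and graphemes, I'm not sure how they are handled in regular expressions.

In this case, I want to match only those words that are made up of the specific graphemes in the input list and nothing else (i.e. the marks that follow a letter should only follow that letter, but the graphemes themselves can occur in any order).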

score 5 · Accepted Answer

To support characters that can span several Unicode code points:

# -*- coding: utf-8 -*-
import re
import unicodedata
from functools import partial

NFKD = partial(unicodedata.normalize, 'NFKD')

def match(word, letters):
    word, letters = NFKD(word), map(NFKD, letters) # normalize
    return re.match(r"(?:%s)+$" % "|".join(map(re.escape, letters)), word)

words = [u"மரம்", u"மடம்", u"படம்", u"பாடம்"]
get_words = lambda letters: [w for w in words if match(w, letters)]

print(" ".join(get_words([u'ம', u'ப', u'ட', u'ம்'])))
# -> மடம் படம்
print(" ".join(get_words([u'ப', u'ம்', u'ட'])))
# -> படம்

This assumes that the same character can be used zero or more times in a word.
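
Why the pattern joins whole letters with | rather than putting them in a character class can be seen in a small sketch (Python 3; the names letters, alternation and char_class are illustrative and not part of the answer). A Tamil letter such as ம் is two code points, so a character class would accept each code point on its own:

```python
import re

# "ம்" is two code points: U+0BAE (ம) followed by U+0BCD (the pulli mark).
letters = ["\u0baa", "\u0bae\u0bcd", "\u0b9f"]   # ப, ம், ட
print([len(s) for s in letters])                  # [1, 2, 1]

escaped = [re.escape(s) for s in letters]
alternation = re.compile(r"(?:%s)+$" % "|".join(escaped))  # whole letters only
char_class = re.compile(r"[%s]+$" % "".join(escaped))      # loose code points

word = "\u0bae\u0b9f\u0bae\u0bcd"     # மடம், which starts with a bare ம
print(bool(alternation.match(word)))  # False: a bare ம is not in letters
print(bool(char_class.match(word)))   # True: ம and the pulli match separately
```

The alternation keeps each mark attached to its letter, which is what the question asks for.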

If you want only the words that contain exactly the given characters:

import regex # $ pip install regex

chars = regex.compile(r"\X").findall # get all characters

def match(word, letters):
    return sorted(chars(word)) == sorted(letters)

words = ["cat", "dog", "tack", "coat"]

print(" ".join(get_words(['o', 'c', 'a', 't'])))
# -> coat
print(" ".join(get_words(['k', 'c', 't', 'a'])))
# -> tack

Note: cat is not in the output in this case, because cat does not use all of the given characters.
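
If installing the third-party regex module is not an option, the \X split can be approximated with the standard library. The following is a rough sketch, not the answer's method: the helper names graphemes and uses_exactly are made up here, and grouping each base character with the combining marks that follow it (Unicode categories starting with M) is only an approximation of full grapheme-cluster segmentation.

```python
import unicodedata

def graphemes(word):
    """Approximate grapheme clusters: a base character plus trailing marks."""
    clusters = []
    for ch in word:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch      # attach the mark to the preceding base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

def uses_exactly(word, letters):
    """True if the word is made of exactly the given letters, each used once."""
    return sorted(graphemes(word)) == sorted(letters)

print(len(graphemes("\u0baa\u0b9f\u0bae\u0bcd")))        # 3 clusters: ப, ட, ம்
print(uses_exactly("coat", ["o", "c", "a", "t"]))        # True
print(uses_exactly("cat", ["o", "c", "a", "t"]))         # False
```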

What does normalization mean? And could you explain the syntax of the re.match() regular expression?

>>> import re
>>> re.escape('.')
'\\.'
>>> c = u'\u00c7'
>>> cc = u'\u0043\u0327'
>>> cc == c
False
>>> re.match(r'%s$' % (c,), cc) # do not match
>>> import unicodedata
>>> norm = lambda s: unicodedata.normalize('NFKD', s)
>>> re.match(r'%s$' % (norm(c),), norm(cc)) # do match
<_sre.SRE_Match object at 0x1364648>
>>> print c, cc
Ç Ç

Without normalization, c and cc do not match. The characters are taken from the unicodedata.normalize() documentation.
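
As for the regular expression itself, a short Python 3 sketch may help (the sample letters are chosen here purely for illustration). re.escape() protects letters that are regex metacharacters, | joins the letters as alternatives, (?:...)+ repeats the non-capturing group one or more times, and $ forces the match to reach the end of the word:

```python
import re
import unicodedata

letters = ["a", "c", "t", "."]
pattern = r"(?:%s)+$" % "|".join(map(re.escape, letters))
print(pattern)                            # (?:a|c|t|\.)+$

print(bool(re.match(pattern, "cat.")))    # True: every character is allowed
print(bool(re.match(pattern, "cats")))    # False: "s" is not in letters
print(bool(re.match(pattern, "catx.")))   # False: the escaped "." is literal

# Normalization makes the precomposed and decomposed spellings identical.
composed, decomposed = "\u00c7", "C\u0327"
print(composed == decomposed)                                 # False
print(unicodedata.normalize("NFKD", composed) == decomposed)  # True
```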

score 3 · Accepted Answer

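EDIT: Okay, don't use any of the answers from here. I wrote all of this when I thought Python regular expressions had no word-boundary marker, and I was trying to work around that lack. Then @Mark Tolonen added a comment that Python has \b as a word-boundary marker! So I posted another, short answer that uses \b. I'll leave this here in case anyone is interested in seeing solutions that work around the lack of \b, but I don't really expect anyone to be.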

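It's easy to make a regular expression that matches only strings made up of a specific set of characters. What you need is a "character class" that contains only the characters you want to match.

I'll do this example in English.

[ocat] This is a character class that matches a single character from the set [o, c, a, t]. The order of the characters doesn't matter.

[ocat]+ Putting a + on the end makes it match one or more characters from the set. But this is not enough by itself; if you had the word "coach", this would match and return "coac".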

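Sadly, there is no regular expression feature for "word boundary". [EDIT: This turns out to be incorrect, as I said in the first paragraph.] We need to make one ourselves. There are two possible word beginnings: the start of a line, or whitespace separating our word from the previous word. Similarly, there are two possible word endings: the end of a line, or whitespace separating our word from the next word.

Since we will be matching some extra stuff we don't want, we can put parentheses around the part of the pattern we do want.

To match two alternatives, we can make a group in parentheses and separate the alternatives with a vertical bar. Python regular expressions have a special notation for a group whose contents we don't want to keep: (?:)

So, here is the pattern that matches the beginning of a word. Start of line or whitespace: (?:^|\s)

Here is the pattern for the end of a word. Whitespace or end of line: (?:\s|$)

Putting it all together, here is our final pattern: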

(?:^|\s)([ocat]+)(?:\s|$)

You can build this dynamically. You don't need to hard-code the whole thing.

import re

s_pat_start = r'(?:^|\s)(['
s_pat_end = r']+)(?:\s|$)'

set_of_chars = "ocat"  # stand-in for wherever the characters actually come from

s_pat = s_pat_start + set_of_chars + s_pat_end
pat = re.compile(s_pat)
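
One thing to watch when building the class at runtime: characters such as ], ^, - and \ have a special meaning inside [...]. A hedged sketch (the function name build_pattern is hypothetical) that escapes each character with re.escape() before joining:

```python
import re

def build_pattern(chars):
    """Compile the whitespace-delimited pattern for an arbitrary character set."""
    char_class = "".join(re.escape(c) for c in chars)
    return re.compile(r"(?:^|\s)([" + char_class + r"]+)(?:\s|$)")

print(build_pattern("ocat").search("my coat rack").group(1))   # coat
print(build_pattern("a-c]").search("x a-c] y").group(1))       # a-c]
```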

Now, this does not in any way check for valid words. If you have the following text:

This is sensible.  This not: occo cttc

The pattern I showed you will match occo and cttc, and those are not really words. They are just strings made only of letters from [ocat].

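So just do the same thing with Unicode strings. (If you are using Python 3.x, all strings are Unicode strings, so there you go.) Put the Tamil characters in the character class and you are good to go.

This has a confusing problem, though: re.findall() doesn't return all possible matches.

EDIT: Okay, I figured out what was confusing me.

What we want is for our pattern to work with re.findall() so that you can collect all the words. But re.findall() only finds non-overlapping matches. In my example, re.findall() returned only ['occo'] and not ['occo', 'cttc'] as I expected... but this is because my pattern was matching the whitespace after occo. The match group didn't collect the whitespace, but it was matched all the same, and since re.findall() wants no overlap between matches, the whitespace was "used up" and was not available for cttc.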
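
The effect is easy to reproduce. In this sketch the pattern and the text are the ones from above; only the variable names are chosen here:

```python
import re

pat = re.compile(r"(?:^|\s)([ocat]+)(?:\s|$)")
text = "This is sensible.  This not: occo cttc"

print(pat.findall(text))       # ['occo']

# The first match also consumed the space after "occo" ...
m = pat.search(text)
print(repr(m.group(0)))        # ' occo '
# ... so the remaining text has no leading whitespace left to match.
print(repr(text[m.end():]))    # 'cttc'
```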

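The solution is to use a feature of Python regular expressions that I have never used before: special syntax that says "must not be preceded by" or "must not be followed by". The sequence \S matches any non-whitespace character, so we could use that. But punctuation is non-whitespace, and I think we do want punctuation to delimit a word. There is also special syntax for "must be preceded by" or "must be followed by". So here is, I think, the best we can do:

Build a string that means "match when the character class string is at the start of a line and followed by whitespace, or when the character class string is preceded by whitespace and followed by whitespace, or when the character class string is preceded by whitespace and followed by the end of a line, or when the character class string is preceded by the start of a line and followed by the end of a line".

Here is that pattern using ocat: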

r'(?:^([ocat]+)(?=\s)|(?<=\s)([ocat]+)(?=\s)|(?<=\s)([ocat]+)$|^([ocat]+)$)'

I'm very sorry, but I really do think this is the best we can do and still work with re.findall()!

It's actually less confusing in Python code, though:

import re

NMGROUP_BEGIN = r'(?:'  # begin non-capturing group
NMGROUP_END = r')'  # end non-capturing group

WS_BEFORE = r'(?<=\s)'  # require white space before
WS_AFTER = r'(?=\s)'  # require white space after

BOL = r'^' # beginning of line
EOL = r'$' # end of line

CCS_BEGIN = r'(['  #begin a character class string
CCS_END = r']+)'  # end a character class string

PAT_OR = r'|'

set_of_chars = "ocat"  # stand-in for wherever the characters actually come from

CCS = CCS_BEGIN + set_of_chars + CCS_END  # build up character class string pattern

s_pat = (NMGROUP_BEGIN +
    BOL + CCS + WS_AFTER + PAT_OR +
    WS_BEFORE + CCS + WS_AFTER + PAT_OR +
    WS_BEFORE + CCS + EOL + PAT_OR +
    BOL + CCS + EOL +
    NMGROUP_END)

pat = re.compile(s_pat)

text = "This is sensible.  This not: occo cttc"

pat.findall(text)
# returns: [('', 'occo', '', ''), ('', '', 'cttc', '')]

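So, the crazy thing is that when the pattern has several alternatives, each with its own group, re.findall() returns an empty string for the groups of the alternatives that didn't match. So we just need to filter the zero-length strings out of the results: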

import itertools as it

raw_results = pat.findall(text)
results = [s for s in it.chain(*raw_results) if s]
# results set to: ['occo', 'cttc']
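
A more compact variant, offered here as an alternative sketch rather than as part of the original answer, expresses "preceded by start or whitespace" as "not preceded by non-whitespace" with the negative lookarounds (?<!\S) and (?!\S). With no capturing groups and no alternatives, re.findall() returns plain strings and nothing needs filtering:

```python
import re

pat = re.compile(r"(?<!\S)[ocat]+(?!\S)")

print(pat.findall("This is sensible.  This not: occo cttc"))  # ['occo', 'cttc']
print(pat.findall("cat, coat tack"))                          # ['coat']
```

Punctuation still counts as non-whitespace here, so "cat," is rejected; that is the same limitation mentioned above.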

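I guess it might be less confusing to build four separate patterns, run re.findall() on each, and join the results together.

EDIT: Okay, here is the code for building four patterns and trying each. I think this is an improvement.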

import re

WS_BEFORE = r'(?<=\s)'  # require white space before
WS_AFTER = r'(?=\s)'  # require white space after

BOL = r'^' # beginning of line
EOL = r'$' # end of line

CCS_BEGIN = r'(['  #begin a character class string
CCS_END = r']+)'  # end a character class string

set_of_chars = "ocat"  # stand-in for wherever the characters actually come from

CCS = CCS_BEGIN + set_of_chars + CCS_END  # build up character class string pattern

lst_s_pat = [
    BOL + CCS + WS_AFTER,
    WS_BEFORE + CCS + WS_AFTER,
    WS_BEFORE + CCS + EOL,
    BOL + CCS + EOL
]

lst_pat = [re.compile(s) for s in lst_s_pat]

text = "This is sensible.  This not: occo cttc"

result = []
for pat in lst_pat:
    result.extend(pat.findall(text))

# result set to: ['occo', 'cttc']

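EDIT: Okay, here is a very different approach. I like this one best.

First, we match all the words in the text. A word is defined as one or more characters that are neither punctuation nor whitespace.

Then, we use a filter to remove words from that list; we keep only the words that are made only of the characters we want.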

import re
import string

# Create a pattern that matches all characters not part of a word.
#
# Note that '-' has a special meaning inside a character class, but it
# is valid punctuation that we want to match, so put in a backslash in
# front of it to disable the special meaning and just match it.
#
# Use '^' which negates all the chars following.  So, a word is a series
# of characters that are all not whitespace and not punctuation.

WORD_BOUNDARY = string.whitespace + string.punctuation.replace('-', r'\-')

WORD = r'[^' + WORD_BOUNDARY + r']+'


# Create a pattern that matches only the words we want.

set_of_chars = "ocat"  # stand-in for wherever the characters actually come from

# build up character class string pattern; the trailing '$' makes pat.match()
# reject words such as "coach" that merely start with allowed characters
CCS = r'[' + set_of_chars + r']+$'


pat_word = re.compile(WORD)
pat = re.compile(CCS)

text = "This is sensible.  This not: occo cttc"


# This makes it clear how we are doing this.
all_words = pat_word.findall(text)
result = [s for s in all_words if pat.match(s)]

# "lazy" generator expression that yields up good results when iterated
# May be better for very large texts.
result_genexp = (s for s in (m.group(0) for m in pat_word.finditer(text)) if pat.match(s))

# force the expression to expand out to a list
result = list(result_genexp)

# result set to: ['occo', 'cttc']

EDIT: Now I don't like any of the solutions above; please see the other answer, the one using \b, for the best solution in Python.

score 3 · Accepted Answer

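It's easy to make a regular expression that matches only strings made up of a specific set of characters. What you need is a "character class" that contains only the characters you want to match.

I'll do this example in English.

[ocat] This is a character class that matches a single character from the set [o, c, a, t]. The order of the characters doesn't matter.

[ocat]+ Putting a + on the end makes it match one or more characters from the set. But this is not enough by itself; if you had the word "coach", this would match and return "coac".

\b[ocat]+\b Now it only matches on word boundaries. (Thank you very much, @Mark Tolonen, for educating me about \b.)

So, just build up a pattern like the one above, using the desired character set at runtime, and there you go. You can use this pattern with re.findall() or with re.finditer().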

import re

words = ["cat", "dog", "tack", "coat"]

def get_words(chars_seq, words_seq=words):
    s_chars = ''.join(chars_seq)
    s_pat = r'\b[' + s_chars + r']+\b'
    pat = re.compile(s_pat)
    return [word for word in words_seq if pat.match(word)]

assert get_words(['o', 'c', 'a', 't']) == ["cat", "coat"]
assert get_words(['k', 'c', 't', 'a']) == ["cat", "tack"]
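
The same pattern also works on running text. A small sketch (the sample strings are illustrative) using re.findall() and re.finditer():

```python
import re

pat = re.compile(r"\b[ocat]+\b")

print(pat.findall("This is sensible.  This not: occo cttc"))  # ['occo', 'cttc']
print(pat.findall("cat, coat; coach tack."))                  # ['cat', 'coat']

# finditer() yields match objects, so positions are available too.
hits = [(m.start(), m.group(0)) for m in pat.finditer("a cat on a coat")]
print(hits)   # [(0, 'a'), (2, 'cat'), (9, 'a'), (11, 'coat')]
```

Unlike the whitespace-based patterns, \b treats punctuation as a delimiter, so "cat," still yields cat.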

score 2 · Accepted Answer

I wouldn't use a regular expression to solve this problem. I would rather use collections.Counter, like so:

>>> from collections import Counter
>>> def get_words(word_list, letter_string):
...     return [word for word in word_list if Counter(word) & Counter(letter_string) == Counter(word)]
...
>>> words = ["cat", "dog", "tack", "coat"]
>>> letters = 'ocat'
>>> get_words(words, letters)
['cat', 'coat']
>>> letters = 'kcta'
>>> get_words(words, letters)
['cat', 'tack']

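This solution also works for other languages. Counter(word) & Counter(letter_string) finds the intersection of the two counters, i.e. min(c[x], f[x]) for each element x. If that intersection equals the counter of your word, then you want to return the word as a match.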
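
To make the intersection concrete, here is a short sketch (the example words are picked here for illustration). Note that, unlike the regex answers, this check lets each supplied letter be used only as many times as it is given:

```python
from collections import Counter

letters = Counter("ocat")

common = Counter("coat") & letters
print(common == Counter("coat"))    # True: every letter of "coat" is available

common = Counter("tact") & letters
print(sorted(common.items()))       # [('a', 1), ('c', 1), ('t', 1)]
print(common == Counter("tact"))    # False: "tact" needs two t's, one is given

# Supplying a second "t" makes "tact" spellable.
print(Counter("tact") & Counter("ocatt") == Counter("tact"))   # True
```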
