Given that I have a string like this:
'velvet evening purse bags'
How can I get all the word pairs from it? In other words, all the two-word combinations:
'velvet evening'
'velvet purse'
'velvet bags'
'evening purse'
'evening bags'
'purse bags'
I know that Python's nltk package can produce bigrams, but I am looking for something beyond that. Or do I have to write my own custom function in Python?
You can use itertools.combinations for this:
from itertools import combinations
from nltk import word_tokenize

s = 'velvet evening purse bags'
words = word_tokenize(s)
pairs = [' '.join(comb) for comb in combinations(words, 2)]
print(pairs)
Output:
['velvet evening', 'velvet purse', 'velvet bags', 'evening purse', 'evening bags', 'purse bags']
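If installing nltk only for tokenization feels heavy, the same idea can be sketched with the standard library alone. This assumes the words are separated by plain whitespace and that there is no punctuation to handle; `word_pairs` is a hypothetical helper name, not part of any library.

```python
from itertools import combinations

def word_pairs(text):
    """Return every two-word combination, keeping left-to-right word order."""
    words = text.split()  # assumption: whitespace tokenization is good enough
    return [' '.join(pair) for pair in combinations(words, 2)]

pairs = word_pairs('velvet evening purse bags')
print(pairs)
```

For the example string this prints the same six pairs as the nltk-based version above.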
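This should be fun =)
If the input is velvet evening purse bags and the desired output is what @MrGeek produced with itertools.combinations, then what you are after is actually skipgrams, as defined at https://tedboy.github.io/nlps/generated/generated/nltk.skipgrams.html
So you can achieve the same result with: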
from nltk import skipgrams, word_tokenize
s = 'velvet evening purse bags'
tokens = word_tokenize(s)
list(skipgrams(tokens, n=2, k=len(tokens)-1))
[out]:
[('velvet', 'evening'),
('velvet', 'purse'),
('velvet', 'bags'),
('evening', 'purse'),
('evening', 'bags'),
('purse', 'bags')]
With skipgrams, each word is only ever paired with another word to its right, which roughly matches how English is read.
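To make the role of the skip distance concrete, here is a hand-rolled sketch of skip-bigrams. It is not nltk's implementation; `skip_bigrams` is a hypothetical helper, and it assumes `k` means the maximum number of words that may be skipped between the two members of a pair.

```python
def skip_bigrams(tokens, k):
    """Pair each token with the tokens to its right, skipping at most k words."""
    pairs = []
    for i, left in enumerate(tokens):
        # Skipping at most k words means the partner is at most k + 1 positions away.
        for right in tokens[i + 1 : i + k + 2]:
            pairs.append((left, right))
    return pairs

tokens = 'velvet evening purse bags'.split()
adjacent = skip_bigrams(tokens, 0)                 # no skips: ordinary bigrams
one_skip = skip_bigrams(tokens, 1)                 # allow skipping one word
all_pairs = skip_bigrams(tokens, len(tokens) - 1)  # large enough to reach every word
print(adjacent)
print(one_skip)
print(all_pairs)
```

With `k=0` only neighbouring words are paired, with `k=1` the pair ('velvet', 'bags') is still missing, and once `k` is large enough to span the whole sentence every left-to-right pair appears.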
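With itertools.product, on the other hand, every ordered pairing of the words is produced, including each word paired with itself: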
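from itertools import product
from nltk import word_tokenize
s = 'velvet evening purse bags'
# set() drops duplicate tokens but does not preserve order,
# so the pairs may come out in a different order than shown below
tokens = set(word_tokenize(s))
list(product(tokens, tokens))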
[out]:
[('velvet', 'velvet'),
('velvet', 'evening'),
('velvet', 'purse'),
('velvet', 'bags'),
('evening', 'velvet'),
('evening', 'evening'),
('evening', 'purse'),
('evening', 'bags'),
('purse', 'velvet'),
('purse', 'evening'),
('purse', 'purse'),
('purse', 'bags'),
('bags', 'velvet'),
('bags', 'evening'),
('bags', 'purse'),
('bags', 'bags')]
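If both orders of each pair are wanted but a word paired with itself is not, itertools.permutations is a possible middle ground. The sketch below assumes whitespace tokenization via str.split instead of nltk, and that the words are distinct.

```python
from itertools import permutations

tokens = 'velvet evening purse bags'.split()
# Ordered pairs drawn from two different positions, so no ('velvet', 'velvet').
ordered_pairs = list(permutations(tokens, 2))
print(len(ordered_pairs))  # 12: the 16 pairs from product minus the 4 self-pairs
print(ordered_pairs)
```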
You can also go old school...
text = 'velvet evening purse bags'
n = []
ans = []
for i in text.split():
    for j in text.split():
        if j != i:
            if (i, j) not in n:
                ans.append((i, j))
                n.append((i, j))
                n.append((j, i))
Output:
[('velvet', 'evening'),
('velvet', 'purse'),
('velvet', 'bags'),
('evening', 'purse'),
('evening', 'bags'),
('purse', 'bags')]
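A variation on the same old-school loop, offered only as a sketch: starting the inner loop one position after the outer one yields each pair once without keeping a list of pairs already seen. Note one behavioural difference, which is an assumption about what is wanted: pairs are formed by position, so a word that occurs twice in the text would be paired with its own repeat.

```python
text = 'velvet evening purse bags'
words = text.split()
ans = []
for i in range(len(words)):
    # Start j after i so each unordered pair is produced exactly once.
    for j in range(i + 1, len(words)):
        ans.append((words[i], words[j]))
print(ans)
```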