I came across the following snippet on the gensim tutorial page.
What does the syntax of this Python code mean, piece by piece?
>>> texts = [[word for word in document.lower().split() if word not in stoplist]
...          for document in documents]
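This is a list comprehension. The code you posted loops over every element of document.lower().split() and builds a new list containing only the elements that satisfy the if condition. It does this for each document in documents.
Try it out: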
elems = [1, 2, 3, 4]
squares = [e*e for e in elems] # square each element
big = [e for e in elems if e > 2] # keep elements bigger than 2
As your example shows, list comprehensions can be nested.
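To make the nesting concrete, here is a minimal sketch of the snippet from the question run on made-up data. The contents of `documents` and `stoplist` are hypothetical and are not taken from the gensim tutorial.

```python
# Hypothetical sample data, chosen only for illustration.
documents = ["The cat sat", "A dog and the cat"]
stoplist = {"the", "a", "and"}

# Inner comprehension: keep the words of one document that are not stop words.
# Outer comprehension: repeat that for every document in documents.
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

print(texts)  # [['cat', 'sat'], ['dog', 'cat']]
```

The result is a list of lists: one inner list of tokens per document.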
That is a list comprehension. A simpler example might be:
evens = [num for num in range(100) if num % 2 == 0]
I'm fairly sure I've seen this line in some NLP applications.
This list comprehension:
[[word for word in document.lower().split() if word not in stoplist] for document in documents]
is equivalent to:
ending_list = []  # often called a document stream in NLP
for document in documents:  # loop through the list of documents
    internal_list = []  # often called a list of tokens
    for word in document.lower().split():
        if word not in stoplist:
            internal_list.append(word)  # this is the inner [word for word ...] part
    ending_list.append(internal_list)
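One way to convince yourself that the two forms agree is to wrap each in a function and compare the results. This is only a sketch: the function names `tokenize_with_loops` and `tokenize_with_comprehension` and the sample data are hypothetical.

```python
def tokenize_with_loops(documents, stoplist):
    """Explicit nested loops, mirroring the expanded version above."""
    ending_list = []
    for document in documents:
        internal_list = []
        for word in document.lower().split():
            if word not in stoplist:
                internal_list.append(word)
        ending_list.append(internal_list)
    return ending_list


def tokenize_with_comprehension(documents, stoplist):
    """The nested list comprehension from the question."""
    return [[word for word in document.lower().split() if word not in stoplist]
            for document in documents]


# Hypothetical sample data, chosen only for illustration.
sample_documents = ["Human machine interface", "A survey of user opinion"]
sample_stoplist = {"a", "of"}

loop_result = tokenize_with_loops(sample_documents, sample_stoplist)
comprehension_result = tokenize_with_comprehension(sample_documents, sample_stoplist)
print(loop_result == comprehension_result)  # True
```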
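Basically, you want a list of documents, where each document is a list of tokens. So you iterate over the documents,
for document in documents:
then split each document into tokens,
    list_of_tokens = []
    for word in document.lower().split():
and collect those tokens in a list:
        list_of_tokens.append(word)
For example: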
>>> doc = "This is a foo bar sentence ."
>>> [word for word in doc.lower().split()]
['this', 'is', 'a', 'foo', 'bar', 'sentence', '.']
This is the same as:
>>> doc = "This is a foo bar sentence ."
>>> list_of_tokens = []
>>> for word in doc.lower().split():
...     list_of_tokens.append(word)
...
>>> list_of_tokens
['this', 'is', 'a', 'foo', 'bar', 'sentence', '.']
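As a side note, a comprehension with no condition and no transformation just copies its input, so in this simplified example it gives the same result as calling split() directly. The if clause is what makes the comprehension in the question do real work. A small sketch, where the `stoplist` contents are hypothetical:

```python
doc = "This is a foo bar sentence ."

# Without a filter, the comprehension is just a copy of the split result.
copied = [word for word in doc.lower().split()]
print(copied == doc.lower().split())  # True

# With a filter, stop words are dropped (stoplist is made up for this example).
stoplist = {"is", "a"}
filtered = [word for word in doc.lower().split() if word not in stoplist]
print(filtered)  # ['this', 'foo', 'bar', 'sentence', '.']
```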