I have a list of tuples containing blog posts, and it looks like this:
[('category1', 'blablablabla'), ('Category2', 'bla bla bla'), ('category1', 'blabla')].
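Now I need to get the most common words in each category, but I can't tokenize the words without losing the category. The standard ways of tokenizing fail on tuples: I tried the tokenizer from nltk and the .split() approach, but neither works on a tuple. Can anyone offer any help?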
Assuming you have a function tokenize that returns tokens when given a string:
    for cat, text in tuples:
        tokenized = tokenize(text)
        # now do whatever you want with the category and the tokenized text
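To get from that loop to the most common words per category, one option is to keep a separate counter for each category. The following is only a minimal sketch: the `tokenize` here is a hypothetical regex-based stand-in (swap in whatever tokenizer you actually use, such as one from nltk), and `most_common_words` and its `n` parameter are names made up for illustration.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical stand-in tokenizer: lowercase the text and
    # pull out runs of word characters.
    return re.findall(r"\w+", text.lower())


def most_common_words(posts, n=3):
    # One Counter per category, created on first use.
    counts = defaultdict(Counter)
    for cat, text in posts:
        # Unpacking the tuple keeps the category next to its tokens.
        counts[cat].update(tokenize(text))
    # For each category, list the n most frequent (word, count) pairs.
    return {cat: counter.most_common(n) for cat, counter in counts.items()}


posts = [('category1', 'blablablabla'), ('Category2', 'bla bla bla'), ('category1', 'blabla')]
top_words = most_common_words(posts)
print(top_words)
```

Category names are used exactly as given, so 'category1' and 'Category1' would be counted separately; call `cat.lower()` before counting if that is not what you want.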