I have a list that contains a lot of tagged bigrams. Some of the bigrams are not tagged correctly, so I want to remove them from the master list. One of the words in these bigrams keeps repeating, so I can remove a bigram if it contains a given word (say, xyz). A pseudo example is below:

master_list = ['this is', 'is a', 'a sample', 'sample word', 'sample text', 'this book', 'a car', 'literary text', 'new book', 'them about', 'on the' , 'in that', 'tagged corpus', 'on top', 'a car', 'an orange', 'the book', 'them what', 'then how']

unwanted_words = ['this', 'is', 'a', 'on', 'in', 'an', 'the', 'them']

new_list = [item for item in master_list if not [x for x in unwanted_words] in item]

I can remove the items one word at a time, i.e. create a new list each time that drops the items containing, say, 'on'. This is tedious and would take hours of filtering and creating new lists for each unwanted word. I thought a loop would help. However, I get the following TypeError:

Traceback (most recent call last):
File "<pyshell#21>", line 1, in <module>
new_list = [item for item in master_list if not [x for x in  unwanted_words] in item]
File "<pyshell#21>", line 1, in <listcomp>
new_list = [item for item in master_list if not [x for x in unwanted_words] in item]
TypeError: 'in <string>' requires string as left operand, not list

Your help is highly appreciated!

1 Answer

Your condition, if not [x for x in unwanted_words] in item, is equivalent to if not unwanted_words in item: you are checking whether a list is contained in a string.
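To see where the error comes from, here is a minimal sketch using a shortened word list and a single bigram (both picked only for illustration). The list comprehension merely copies the list, and a string only accepts another string on the left-hand side of in:

```python
unwanted_words = ['this', 'is', 'a']
item = 'a car'

# The comprehension builds a plain copy of the list, nothing more.
copy_of_words = [x for x in unwanted_words]
print(copy_of_words == unwanted_words)  # True

# A str on the left-hand side is a substring test and works.
print('a' in item)  # True

# A list on the left-hand side is rejected by the string.
error_message = None
try:
    copy_of_words in item
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```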

Instead, you can use any to check whether any part of the bigram is in unwanted_words. Also, you can make unwanted_words a set to speed up the lookups.

>>> master_list = ['this is', 'is a', 'a sample', 'sample word', 'sample text', 'this book', 'a car', 'literary text', 'new book', 'them about', 'on the' , 'in that', 'tagged corpus', 'on top', 'a car', 'an orange', 'the book', 'them what', 'then how']
>>> unwanted_words = set(['this', 'is', 'a', 'on', 'in', 'an', 'the', 'them'])
>>> [item for item in master_list if not any(x in unwanted_words for x in item.split())]
['sample word', 'sample text', 'literary text', 'new book', 'tagged corpus', 'then how']
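Splitting each bigram into words matters here. A plain substring test would also discard 'then how' (because 'the' occurs inside 'then') and 'tagged corpus' (because 'a' occurs inside 'tagged'). The sketch below contrasts the two on a four-item sample; the names by_substring and by_word are illustrative only, and set.isdisjoint is used as an alternative spelling of the any(...) check above, not as a replacement for it:

```python
master_list = ['the book', 'then how', 'tagged corpus', 'an orange']
unwanted_words = {'this', 'is', 'a', 'on', 'in', 'an', 'the', 'them'}

# Substring test: matches unwanted words anywhere inside the text,
# so 'the' hits 'then' and 'a' hits 'tagged'.
by_substring = [item for item in master_list
                if not any(word in item for word in unwanted_words)]

# Whole-word test: keep a bigram only if none of its words
# appears in the set of unwanted words.
by_word = [item for item in master_list
           if unwanted_words.isdisjoint(item.split())]

print(by_substring)  # []
print(by_word)       # ['then how', 'tagged corpus']
```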
answered 2015-03-21T22:47:32.700