python - Stopword removal with NLTK

Question

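I am trying to process user-entered text by removing stopwords with the nltk toolkit, but stopword removal also strips words such as "and", "or", and "not". I want these words to still be present after stopword removal, because they are operators needed later to process the text as a query. I don't know which words can act as operators in a text query, and I also want to remove unnecessary words from the text.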

score 144 · Accepted Answer

NLTK has a built-in stopword list made up of 2,400 stopwords for 11 languages (Porter et al.); see http://nltk.org/book/ch02.html

>>> from nltk import word_tokenize
>>> from nltk.corpus import stopwords
>>> stop = set(stopwords.words('english'))
>>> sentence = "this is a foo bar sentence"
>>> print([i for i in sentence.lower().split() if i not in stop])
['foo', 'bar', 'sentence']
>>> [i for i in word_tokenize(sentence.lower()) if i not in stop] 
['foo', 'bar', 'sentence']

I recommend looking into using tf-idf to remove stopwords; see "Effects of Stemming on the term frequency?"
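One possible reading of the tf-idf suggestion is to treat terms that occur in nearly every document as stopword candidates, since their inverse document frequency is close to zero. The sketch below is only an illustration under that assumption: the toy corpus, the `low_idf_terms` helper, and the 0.1 threshold are all made up for this example and are not part of the answer.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang on the roof",
]

def low_idf_terms(docs, threshold=0.1):
    """Return terms whose idf = log(N / document_frequency) is below threshold."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        # Count each term once per document
        df.update(set(doc.lower().split()))
    idf = {term: math.log(n / count) for term, count in df.items()}
    return sorted(term for term, value in idf.items() if value < threshold)

candidates = low_idf_terms(docs)
print(candidates)
```

On a real corpus the threshold would need tuning, and operator words such as "and" or "not" would still have to be excluded from the candidates by hand.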

score 71 · Accepted Answer

I suggest you create your own list of operator words and take them out of the stopword list. Sets can be conveniently subtracted, so:

operators = set(('and', 'or', 'not'))
stop = set(stopwords...) - operators

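Then you can simply test whether a word is in or not in the set, without relying on whether your operators are part of the stopword list. Later on you can switch to another stopword list or add an operator.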

if word.lower() not in stop:
    # use word
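A self-contained sketch of the set subtraction described above. To keep it runnable without downloading NLTK data, it uses a small hand-written `SAMPLE_STOPWORDS` list as a stand-in; in real use you would presumably pass `stopwords.words('english')` instead. The helper names are hypothetical.

```python
# Hypothetical stand-in for stopwords.words('english'); pass the NLTK list in real use
SAMPLE_STOPWORDS = ["i", "a", "the", "is", "and", "or", "not", "that", "but", "to", "of"]

# Words that must survive filtering because they act as query operators
OPERATORS = {"and", "or", "not"}

def build_stop_set(stopword_list, keep=OPERATORS):
    """Return the stopwords as a set with the operator words removed."""
    return set(stopword_list) - set(keep)

def filter_query(text, stop):
    """Drop stopwords (case-insensitively) but keep the original token casing."""
    return [w for w in text.split() if w.lower() not in stop]

stop = build_stop_set(SAMPLE_STOPWORDS)
result = filter_query("I want the cats and dogs but not the birds", stop)
print(result)
```

Because the operators are subtracted once when the set is built, the filtering loop itself needs no special cases.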

score 35 · Accepted Answer

@alvas's answer does the job, but it can be done much faster. Assume that you have documents: a list of strings.

from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize

stop_words = set(stopwords.words('english'))
stop_words.update(['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}']) # remove it if you need punctuation 

for doc in documents:
    list_of_words = [i.lower() for i in wordpunct_tokenize(doc) if i.lower() not in stop_words]

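Note that because you are searching in a set here (not in a list), the lookup is in theory about len(stop_words)/2 times faster: a set lookup takes constant time on average, while a list scan inspects about half of the list on average. This matters if you need to process many documents.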

For 5000 documents of approximately 300 words each, the difference is between 1.8 seconds for my example and 20 seconds for @alvas's.
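The set-versus-list difference can be checked with a rough timing sketch like the one below. The vocabulary and tokens are synthetic and made up for this example, so the absolute numbers will not match the figures quoted above; only the relative gap is of interest.

```python
import timeit

# Synthetic "stopword" vocabulary, stored once as a list and once as a set
vocabulary = [f"w{i}" for i in range(200)]
as_list = list(vocabulary)
as_set = set(vocabulary)

# Synthetic tokens: a late list hit, a miss, and an early list hit, repeated
tokens = ["w199", "missing", "w0"] * 1000

def keep(tokens, stop):
    """Return the tokens that are not in the stop collection."""
    return [t for t in tokens if t not in stop]

list_time = timeit.timeit(lambda: keep(tokens, as_list), number=5)
set_time = timeit.timeit(lambda: keep(tokens, as_set), number=5)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

Both variants return the same tokens; only the cost of each membership test differs.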

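P.S. In most cases you need to split the text into words in order to perform some other classification task that uses tf-idf. So it would most probably be better to use a stemmer as well: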

from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

and use [porter.stem(i.lower()) for i in wordpunct_tokenize(doc) if i.lower() not in stop_words] inside the loop.

score 14 · Accepted Answer

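@alvas has a good answer. But again, it depends on the nature of the task. For example, if in your application you want to treat all conjunctions (e.g. and, or, but, if, while) and all determiners (e.g. the, a, some, most, every, no) as stop words and consider all other parts of speech legitimate, then you might want to look into this solution, which uses the part-of-speech tagset to discard words; check table 5.1: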

import nltk

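# pos_tag returns Penn Treebank tags by default; request the coarser universal
# tagset, where determiners are 'DET' and conjunctions are 'CONJ' (not 'CNJ')
STOP_TYPES = ['DET', 'CONJ']

text = "some data here "
tokens = nltk.pos_tag(nltk.word_tokenize(text), tagset='universal')
good_words = [w for w, wtype in tokens if wtype not in STOP_TYPES]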
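For the original question, tag-based filtering would also discard operator words such as "and", so an allow-list may be needed on top of it. The sketch below shows one way to do that. It runs on hand-written (word, tag) pairs that stand in for tagger output, so the tags are illustrative universal-tagset labels and not the result of actually running `nltk.pos_tag`. The names `drop_by_tag` and `KEEP_WORDS` are hypothetical.

```python
# Hand-written (word, tag) pairs standing in for tagger output (illustrative only)
tagged = [
    ("the", "DET"), ("cat", "NOUN"), ("and", "CONJ"),
    ("a", "DET"), ("dog", "NOUN"), ("slept", "VERB"),
]

STOP_TAGS = {"DET", "CONJ"}          # parts of speech to discard
KEEP_WORDS = {"and", "or", "not"}    # operator words that must survive

def drop_by_tag(tagged_tokens, stop_tags=STOP_TAGS, keep_words=KEEP_WORDS):
    """Keep a word unless its tag is a stop tag, except for allow-listed words."""
    return [
        word for word, tag in tagged_tokens
        if tag not in stop_tags or word.lower() in keep_words
    ]

content_words = drop_by_tag(tagged)
print(content_words)
```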

score 6 · Accepted Answer

You can use string.punctuation with the built-in NLTK stopword list:

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from string import punctuation

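def tokenize(text):
    # Split into sentences, then flatten the per-sentence tokens into one list of words
    return [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]

def removeStopWords(words):
    customStopWords = set(stopwords.words('english') + list(punctuation))
    # The NLTK list is lowercase, so lowercase each token before the lookup
    return [word for word in words if word.lower() not in customStopWords]

text = "This is a sample sentence, showing off the stop words filtration."  # placeholder input
words = tokenize(text)
wordsWOStopwords = removeStopWords(words)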

Full list of NLTK stopwords
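The same idea can be tried without NLTK data installed. This is a minimal sketch under two assumptions that are not in the answer: `SAMPLE_STOPWORDS` is a tiny hand-written stand-in for `stopwords.words('english')`, and a simple regular expression replaces `word_tokenize`.

```python
import re
from string import punctuation

# Hypothetical stand-in for stopwords.words('english')
SAMPLE_STOPWORDS = ["this", "is", "a", "the"]

def simple_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def remove_stopwords_and_punct(tokens, stopword_list):
    """Drop stopwords (case-insensitively) and single-character punctuation tokens."""
    blocked = set(stopword_list) | set(punctuation)
    return [t for t in tokens if t.lower() not in blocked]

tokens = simple_tokenize("Hello, world! This is a test.")
cleaned = remove_stopwords_and_punct(tokens, SAMPLE_STOPWORDS)
print(cleaned)
```

Note that `set(punctuation)` only blocks single-character tokens; multi-character tokens such as "..." produced by some tokenizers would pass through.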

score 0 · Accepted Answer

Remove stop words from a string

Here I also add a custom stopword list

import nltk
from nltk.corpus import stopwords                    # Stop words

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
stop_words.update(list(set(['zero'    , 'one'     , 'two'      ,
               'three'   , 'four'    , 'five'     ,
               'six'     , 'seven'   , 'eight'    ,
               'nine'    , 'ten'     ,
               
               'may'     , 'also'    , 'across'   ,
               'among'   , 'beside'  , 'however'  ,
               'yet'     , 'within'  ,
               
               'jan'     ,  'feb'    , 'mar'      ,
               'apr'     ,  'may'    , 'jun'      ,
               'jul'     ,  'aug'    , 'sep'      ,
               'oct'     ,  'nov'    , 'dec'      ,
               
               'january' , 'february', 'march'    ,
               'april'   , 'may'     , 'june'     ,
               'july'    , 'august'  , 'september',
               'october' , 'november', 'december' ,
               
               'summer'  , 'winter'  , 'fall'     ,
               'spring'  ,

               "a"         , "about"     ,   "above"  , "after"   ,
               "again"     , "against"   ,   "ain"    , "aren't"  ,
               "all"       , "am"        ,   "an"     , "and"     ,
               "any"       , "are"       ,   "aren"   ,  "as"     ,
               "at"        ,
               
               "be"        , "because"   ,   "been"   , "before"  ,
               "being"     , "below"     ,   "between", "both"    ,
               "but"       , "by"        ,                  
               
               "can"       , "couldn"    , "couldn't" , "could"   ,
               
               "d"         , "did"       , "didn"     , "didn't"  ,
               "do"        , "does"      , "doesn"    , "doesn't" ,
               "doing"     , "don"       , "don't"    , "down"    ,
               "during"    ,
               
               "each"      ,  
               
               "few"       , "for"      , "from"      , "further" ,
               
               "had"       , "hadn"     , "hadn't"    , "has"     ,
               "hasn"      , "hasn't"   , "have"      , "haven"   ,
               "haven't"   , "having"   , "he"        , "her"     ,
               "here"      , "hers"     , "herself"   , "him"     ,
               "himself"   , "his"      , "how"       ,
               "he'd"      , "he'll"    , "he's"      , "here's"  ,
               "how's"     ,
               
               "i"         , "if"       , "in"        , "into"    ,
               "is"        , "isn"      , "isn't"     , "it"      ,
               "it's"      , "its"      , "itself"    , "i'd"     ,
               "i'll"      , "i'm"      , "i've"      ,
               
               "just"      ,
               
               "ll"        , "let's"    ,
               
               "m"         , "ma"       ,"me"         ,
               "mightn"    , "mightn't" , "more"      , "most"    ,
               "mustn"     , "mustn't"  , "my"        , "myself"  ,
               "needn"     , "needn't"  , "no"        , "nor"     ,
               "not"       , "now"      ,
               
               "o"         , "of"       , "off"       , "on"      ,
               "once"      , "only"     , "or"        , "other"   ,
               "our"       , "ours"     , "ourselves" , "out"     ,
               "over"      , "own"      , "ought"     ,
               
               "re"        ,
               
               "s"         , "same"     , "shan"      , "shan't"   ,
               "she"       , "she's"    , "should"    , "should've",
               "shouldn"   , "shouldn't", "so"        , "some"     ,
               "such"      , "she'd"    , "she'll"    ,
               
               "t"         , "than"     , "that"      , "that'll"  ,
               "the"       , "their"    , "theirs"    , "them"     ,
               "themselves", "then"     , "there"     , "these"    ,
               "they"      , "this"     , "those"     , "through"  ,
               "to"        , "too"      , "that's"    , "there's"  ,
               "they'd"    , "they'll"  , "they're"   , "they've"  ,
               
               "under"     , "until"    , "up"        ,
               
               "ve"        , "very"     ,
               
               "was"       , "wasn"     , "wasn't"    , "we"       ,
               "were"      , "weren"    , "weren't"   , "what"     ,
               "when"      , "where"    , "which"     , "while"    ,
               "who"       , "whom"     , "why"       , "will"     ,
               "with"      , "won"      , "won't"     , "wouldn"   ,
               "wouldn't"  , "we'd"     , "we'll"     , "we're"    ,
               "we've"     , "what's"   , "when's"    , "where's"  ,
               "who's"     , "why's"    , "would"     ,
               
               "y"         , "you"      , "you'd"     , "you'll"   ,
               "you're"    , "you've"   , "your"      , "yours"    , "yourself",
               "yourselves",
               
               'a',"able", "abst", "accordance", "according", "accordingly", "across", "act", "actually"          ,
               "added", "adj", "affected", "affecting", "affects", "afterwards", "ah",      "almost"          ,
               "alone", "along", "already", "also", "although", "always", "among", "amongst", "anyone"        ,  
               "announce", "another", "anybody", "anyhow", "anymore",  "anything", "anyway", "anyways"        ,
               "anywhere", "apparently", "approximately", "arent", "arise", "around", "aside", "ask"          ,
               "asking", "auth", "available", "away", "awfully", "a's", "ain't", "allow", "allows", "apart"   ,
               "appear", "appreciate", "appropriate", "associated"                                            ,
               
               "b", "back", "became", "become", "becomes", "becoming", "beforehand", "begin", "beginning"     ,
               "beginnings", "begins", "behind", "believe", "beside", "besides", "beyond", "biol", "brief"    ,
               "briefly"                                                                                      ,
               
               "c", "ca", "came", "cannot", "can't", "cause", "causes", "certain", "certainly", "co", "com"   ,
               "come", "comes", "contain", "containing", "contains", "couldnt"                                ,
               
               'd',"date", "different", "done", "downwards", "due"                                                ,
               
               "e", "ed", "edu", "effect", "eg", "eight", "eighty", "either", "else", "elsewhere", "end"      ,
               "ending", "enough", "especially", "et", "etc", "even", "ever", "every", "everybody","except"   ,
               "everyone", "everything", "everywhere", "ex"                                                   ,  
               
               "f", "far", "ff", "fifth", "first", "five", "fix", "followed", "following", "follows", "four"  ,
               "former", "formerly", "forth", "found",  "furthermore"                                         ,
               
               "g", "gave", "get", "gets", "getting", "give", "given", "gives",  "go", "goes", "got","gone"   ,  
               "gotten", "giving"                                                                             ,
               
               "h", "happens", "hardly", "hed", "hence", "hereafter", "hereby", "herein", "heres", "however"  ,
               "hereupon", "hes", "hi", "hid", "hither", "home", "howbeit",  "hundred"                        ,
               
               "id", "ie", "im", "immediately", "importance", "important", "inc", "indeed", "itd", "index"    ,
               'i',"information", "instead", "invention",   "it'll", "inward", "immediate"                        ,
               
               "j",
               
               "k", "keep", "keeps", "kept", "kg", "km", "know", "known", "knows"                             ,
               
               "l", "largely", "last", "lately", "later", "latter", "latterly", "least", "less", "lest", "ltd",    
               "let", "lets", "like", "liked", "likely", "line", "little", "'ll", "look", "looking", "looks"  ,  
               
               'm',"made", "mainly", "make", "makes", "many", "maybe", "mean", "means", "meantime", "merely", "mg",
               "might", "million", "miss", "ml", "moreover", "mostly", "mr", "mrs", "much", "mug", "must"     ,
               "meanwhile", "may"                                                                             ,
               
               "n", "na", "name", "namely", "nay", "nd", "near", "nearly", "necessarily", "necessary", "need" ,
               "needs", "neither", "never", "nevertheless", "new", "next", "nine", "ninety", "nobody", "non"  ,
               "none", "nonetheless", "noone", "normally", "nos", "noted", "nothing", "nowhere", "n2", "nc"   ,
               "nd", "ne", "ng", "ni", "nj", "nl", "nn", "nr", "ns", "nt", "ny"                               ,
               
               'o',"obtain", "obtained", "obviously", "often", "oh", "ok", "okay", "old", "omitted", "one", "ones",
               "onto", "ord", "others", "otherwise", "outside", "overall", "owing",  "oa", "ob", "oc", "od"   ,
               "of", "og", "oi", "oj", "ol", "om", "on", "oo", "oq", "or", "os", "ot", "ou", "ow", "ox", "oz" ,
               
               "p", "page", "pages", "part", "particular", "particularly", "past", "per", "perhaps", "placed" ,
               "please", "plus", "poorly", "possible", "possibly", "potentially", "pp", "predominantly"       ,
               "present", "previously", "primarily", "probably", "promptly", "proud", "provides", "put"       ,
               "p1", "p2", "p3", "pc", "pd", "pe", "pf", "ph", "pi", "pj", "pk", "pl", "pm", "pn", "po", "pq" ,
               "pr", "ps", "pt", "pu", "py"                                                                   ,
               
               "q", "que", "quickly", "quite", "qv",  "qj", "qu"                                              ,
               
               'r',"readily", "really", "recent", "recently", "ref", "refs", "regarding", "regardless", "regards" ,
               "related", "relatively", "research", "respectively", "resulted", "resulting", "results", "run" ,
               "right",  "r2", "ra", "rc", "rd", "rf", "rh", "ri", "rj", "rl", "rm", "rn", "ro", "rq", "rr"   ,
               "rs", "rt", "ru", "rv", "ry", "r", "ran", "rather", "rd"                                       ,
               
               's',"said", "saw", "say", "saying", "says", "sec", "section", "see", "seeing", "seem", "seemed"    ,
               "seeming", "seems", "seen", "self", "selves", "sent", "seven", "several", "shall", "shed"      ,
               "shes", "show", "showed", "shown", "showns", "shows", "significant", "significantly"           ,
               "similar", "similarly", "since", "six", "slightly", "somebody", "somehow", "someone", "soon"   ,
               "somewhat", "somewhere", "specifically", "specified", "specify", "specifying", "still", "stop" ,
               "strongly", "sub", "substantially", "successfully", "sufficiently", "suggest", "sup", "sure"   ,
               "s2", "sa", "sc", "sd", "se", "sf", "si", "sj", "sl", "sm", "sn", "sp", "sq", "sr", "ss", "st" ,
               "sy", "sz",   "sorry", "sometime", "somethan", "something", "sometimes"                        ,
               
               't',"take", "taken", "taking", "tell", "tends", "thank", "thanx", "that've", "thence", "thereafter",
               "thereby", "therefore", "therein", "there'll", "thereof", "therere", "thereto", "thereupon"    ,
               "there've", "theyd", "theyre", "think", "thou", "though", "thoughh", "thousand", "throug"      ,
               "throughout", "thru", "thus", "til", "tip", "together", "took", "toward", "towards", "tried"   ,
               "tries", "truly", "try", "trying", "ts", "twice", "two", "thats",  "thanks",  "th",  "thered"  ,
               "theres", "t1", "t2", "t3", "tb", "tc", "td", "te", "tf", "th", "ti", "tj", "tl", "tm", "tn"   ,
               "tp", "tq", "tr", "ts", "tt", "tv", "tx"                                                       ,                                                                                        
               
               "u", "un", "unfortunately", "unless", "unlike", "unlikely", "unto", "upon", "ups", "us", "use" ,
               "used", "useful", "usefully", "usefulness", "uses", "using", "usually", "ue", "ui", "uj", "uk" ,
               "um", "un", "uo", "ur", "ut",
               
               "v", "value", "various", "'ve", "via", "viz", "vol", "vols", "vs", "va", "vd", "vj", "vo", "vq",
               "vt", "vu"                                                                                     ,
               
               "w", "want", "wants", "wasnt", "way", "wed", "welcome", "went", "werent", "whatever", "what'll",
               "whats", "whence", "whenever", "whereas", "whereby", "wherein", "wheres", "wherever", "whether",  
               "whim", "whither", "whod", "whoever", "whole", "who'll", "whomever", "whos", "whose", "widely" ,
               "whereupon", "willing", "wish", "within", "without", "wont", "words", "world", "wouldnt", "www",
               "wi", "wa", "wo",
               
               "x", "x1", "x2", "x3", "xf", "xi", "xj", "xk", "xl", "xn", "xo", "xs", "xt", "xv", "xx",
               
               "yes", "yet", "youd", "youre", "y2", "yj", "yl", "yr", "ys", "yt",
               
               "z", "zero", "zi", "zz",
               
               "best", "better", "c'mon", "c's", "cant", "changes", "clearly", "concerning", "consequently", "consider", "considering", "corresponding", "course", "currently", "definitely", "described", "despite", "entirely", "exactly", "example", "going", "greetings", "hello", "help", "hopefully", "ignored", "inasmuch", "indicate", "indicated", "indicates", "inner", "insofar", "it'd", "keep", "keeps", "novel", "presumably", "reasonably", "second", "secondly", "sensible", "serious", "seriously", "sure", "t's", "third", "thorough", "thoroughly", "three", "well", "wonder", "a", "about", "above", "above", "across", "after", "afterwards", "again", "against", "all", "almost", "alone", "along", "already", "also", "although", "always", "am", "among", "amongst", "amoungst", "amount", "an", "and", "another", "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are", "around", "as", "at", "back", "be", "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "behind", "being", "below", "beside", "besides", "between", "beyond", "bill", "both", "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con", "could", "couldnt", "cry", "de", "describe", "detail", "do", "done", "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "fifteen", "fify", "fill", "find", "fire", "first", "five", "for", "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get", "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "hundred", "ie", "if", "in", "inc", "indeed", "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter", "latterly", "least", "less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mill", "mine", "more", 
"moreover", "most", "mostly", "move", "much", "must", "my", "myself", "name", "namely", "neither", "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps", "please", "put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "serious", "several", "she", "should", "show", "side", "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone", "something", "sometime", "sometimes", "somewhere", "still", "such", "system", "take", "ten", "than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they", "thickv", "thin", "third", "this", "those", "though", "three", "through", "throughout", "thru", "thus", "to", "together", "too", "top", "toward", "towards", "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us", "very", "via", "was", "we", "well", "were", "what", "whatever", "when", "whence", "whenever", "where", "whereafter", "whereas",                   "whereby", "wherein", "whereupon", "wherever", "whether", "which", "while", "whither", "who", "whoever", "whole", "whom", "whose", "why", "will", "with", "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves", "the", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z", "co", "op", "research-articl", "pagecount", "cit", "ibid", "les", "le", "au", "que", "est", "pas", "vol", "el", "los", "pp", "u201d", "well-b", "http", "volumtype", "par",
               "0o", "0s", "3a", "3b", "3d", "6b", "6o",
               "a1", "a2", "a3", "a4", "ab", "ac", "ad", "ae", "af", "ag", "aj", "al", "an", "ao", "ap", "ar", "av", "aw", "ax", "ay", "az",
               "b1", "b2", "b3", "ba", "bc", "bd", "be", "bi", "bj", "bk", "bl", "bn", "bp", "br", "bs", "bt", "bu", "bx",
               "c1", "c2", "c3", "cc", "cd", "ce", "cf", "cg", "ch", "ci", "cj", "cl", "cm", "cn", "cp", "cq", "cr", "cs", "ct", "cu", "cv", "cx", "cy", "cz",
               "d2", "da", "dc", "dd", "de", "df", "di", "dj", "dk", "dl", "do", "dp", "dr", "ds", "dt", "du", "dx", "dy",
               "e2", "e3", "ea", "ec", "ed", "ee", "ef", "ei", "ej", "el", "em", "en", "eo", "ep", "eq", "er", "es", "et", "eu", "ev", "ex", "ey",
               "f2", "fa", "fc", "ff", "fi", "fj", "fl", "fn", "fo", "fr", "fs", "ft", "fu", "fy",
               "ga", "ge", "gi", "gj", "gl", "go", "gr", "gs", "gy",
               "h2", "h3", "hh", "hi", "hj", "ho", "hr", "hs", "hu", "hy",
               "i", "i2", "i3", "i4", "i6", "i7", "i8", "ia", "ib", "ic", "ie", "ig", "ih", "ii", "ij", "il", "in", "io", "ip", "iq", "ir", "iv", "ix", "iy", "iz",
               "jj", "jr", "js", "jt", "ju",
               "ke", "kg", "kj", "km", "ko",
               "l2", "la", "lb", "lc", "lf", "lj", "ln", "lo", "lr", "ls", "lt",
               "m2", "ml", "mn", "mo", "ms", "mt", "mu",
               
               'i',  'ii', 'iii', 'iv', 'v', 'vi', 'vii', 'viii','ix', 'x',
               'xi', 'xii', 'xiii', 'xiv', 'xv', 'xvi', 'xvii', 'xviii', 'xix', 'xx',
                'xxi', 'xxii', 'xxiii', 'xxiv', 'xxv', 'xxvi', 'xxvii', 'xxviii', 'xxix', 'xxx',
                'xxxi', 'xxxii', 'xxxiii', 'xxxiv', 'xxxv', 'xxxvi', 'xxxvii', 'xxxviii', 'xxxix', 'xl',
               'xli', 'xlii', 'xliii', 'xliv', 'xlv', 'xlvi', 'xlvii', 'xlviii', 'xlix', 'l',
               'li', 'lii', 'liii', 'liv', 'lv', 'lvi', 'lvii', 'lviii', 'lix', 'lx',
               'lxi', 'lxii', 'lxiii', 'lxiv', 'lxv', 'lxvi', 'lxvii', 'lxviii', 'lxix', 'lxx',
                'lxxi', 'lxxii', 'lxxiii', 'lxxiv', 'lxxv', 'lxxvi', 'lxxvii', 'lxxviii', 'lxxix', 'lxxx',
                'lxxxi', 'lxxxii', 'lxxxiii', 'lxxxiv', 'lxxxv', 'lxxxvi', 'lxxxvii', 'lxxxviii', 'lxxxix', 'xc',
                'xci', 'xcii', 'xciii', 'xciv', 'xcv', 'xcvi', 'xcvii', 'xcviii', 'xcix', 'c',
               
                "one", "first", "two", "second", "three", "third",
                "four", "fourth", "five", "fifth", "six",  "sixth", "seven",
                "seventh", "eight", "eighth", "nine", "ninth", "ten",
                "tenth", "eleven", "eleventh", "twelve", "twelfth", "thirteen",
                "thirteenth", "fourteen", "fourteenth", "fifteen", "fifteenth",
                "sixteen", "sixteenth",  "seventeen", "seventeenth", "eighteen",
                "eighteenth", "nineteen", "nineteenth", "twenty", "twentieth",
                "one", "22nd", "second", "nd", "st", "rd", "th",
               
                "1","2","3","4","5","6","7","8","9","10th","11th","12th","13th","14th","15th",
                "16th","17th","18th","19th","20th","21st","22nd","23rd","24th","25th","26th","27th",
                "28th","29th","30th","31st","32nd","33rd","34th","35th","36th","37th","38th","39th",
                "40th","41st","42nd","43rd","44th","45th","46th","47th","48th","49th","50th","51st",
                "52nd","53rd","54th","55th","56th","57th","58th","59th","60th","61st","62nd","63rd",
                "64th","65th","66th","67th","68th","69th","70th","71st","72nd","73rd","74th","75th",
                "76th","77th","78th","79th","80th","81st","82nd","83rd","84th","85th","86th","87th",
                "88th","89th","90th", "91st", "92nd", "93rd", "94th", "95th", "96th","97th", "98th",
                "99th","100th","thirty","forty","fifty","thirty","thirtieth","forty","fortieth",
                "fifty", "fiftieth","sixty","sixtieth","seventy","seventieth", "eighty",
                "eightieth", "ninety", "ninetieth","one", "hundred", "100th", "hundredth",
                "order","state","page","file",
                
                "'d","'ll",  "'m",  "'re",  "'s",  "'ve",  'a',  
                'about',  'above',  'across',  'after',  'afterwards',  'again',  'against',  'all',  
                'almost',  'alone',  'along',  'already',  'also',  'although',  'always',  'am',  
                'among',  'amongst',  'amount',  'an',  'and',  'another',  'any',  'anyhow',  'anyone',  
                'anything',  'anyway',  'anywhere',  'are',  'around',  'as',  'at',  'back',  'be',
                'became',  'because',  'become',  'becomes',  'becoming',  'been',  'before',  'beforehand',
                'behind',  'being',  'below',  'beside',  'besides',  'between',  'beyond',  'both',
                'bottom',  'but',  'by',  'ca',  'call',  'can',  'cannot',  'could',  'did',  'do',  'does',
                'doing',  'done',  'down',  'due',  'during',  'each',  'eight',  'either',  'eleven',
                'else',  'elsewhere',  'empty',  'enough',  'even',  'ever',  'every',  'everyone',
                'everything',  'everywhere',  'except',  'few',  'fifteen',  'fifty',  'first',
                'five',  'for',  'former',  'formerly',  'forty',  'four',  'from',  'front',  'full',
                'further',  'get',  'give',  'go',  'had',  'has',  'have',  'he',  'hence',  'her',
                'here',  'hereafter',  'hereby',  'herein',  'hereupon',  'hers',  'herself',  'him',  'himself',
                'his',  'how',  'however',  'hundred',  'i',  'if',  'in',  'indeed',  'into',  'is',  'it',
                'its',  'itself',  'just',  'keep',  'last',  'latter',  'latterly',  'least',  'less',  'made',
                'make',  'many',  'may',  'me',  'meanwhile',  'might',  'mine',  'more',  'moreover',  'most',
                'mostly',  'move',  'much',  'must',  'my',  'myself',  "n't",  'name',  'namely',  'neither',
                'never',  'nevertheless',  'next',  'nine',  'no',  'nobody',  'none',  'noone',  'nor',  'not',
                'nothing',  'now',  'nowhere',  'n‘t',  'n’t',  'of',  'off',  'often',  'on',  'once',  'one',
                'only',  'onto',  'or',  'other',  'others',  'otherwise',  'our',  'ours',  'ourselves',  'out',
                'over',  'own',  'part',  'per',  'perhaps',  'please',  'put',  'quite',  'rather',  're',  'really',
                'regarding',  'same',  'say',  'see',  'seem',  'seemed',  'seeming',  'seems',  'serious',  'several',
                'she',  'should',  'show',  'side',  'since',  'six',  'sixty',  'so',  'some',  'somehow',  'someone',
                'something',  'sometime',  'sometimes',  'somewhere',  'still',  'such',  'take',  'ten',  'than',
                'that',  'the',  'their',  'them',  'themselves',  'then',  'thence',  'there',  'thereafter',
                'thereby',  'therefore',  'therein',  'thereupon',  'these',  'they',  'third',  'this',  'those',
                'though',  'three',  'through',  'throughout',  'thru',  'thus',  'to',  'together',  'too',  'top',
                'toward',  'towards',  'twelve',  'twenty',  'two',  'under',  'unless',  'until',  'up',  'upon',  'us',
                'used',  'using',  'various',  'very',  'via',  'was',  'we',  'well',  'were',  'what',  'whatever',  'when',
                'whence',  'whenever',  'where',  'whereafter',  'whereas',  'whereby',  'wherein',  'whereupon',  'wherever',
                'whether',  'which',  'while',  'whither',  'who',  'whoever',  'whole',  'whom',  'whose',  'why',  'will',
                'with',  'within',  'without',  'would',  'yet',  'you',  'your',  'yours',  'yourself',  'yourselves',  '‘d',
                '‘ll',  '‘m',  '‘re',  '‘s',  '‘ve',  '’d',  '’ll',  '’m',  '’re',  '’s',  '’ve'

                       
                       ])))



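import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models used by word_tokenize; newer NLTK releases may ask for 'punkt_tab' instead

sentence = "PDF.co is a website that contains different tools to read, write and process PDF documents"
words = word_tokenize(sentence)

# Reuse the extended stop_words set built above rather than overwriting it with
# the plain NLTK list, and lowercase each token so capitalized words match too
sentence_wo_stopwords = [word for word in words if word.lower() not in stop_words]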

print(" ".join(sentence_wo_stopwords))
