
I am developing a system that extracts keywords from plain text.

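The requirements for a keyword are:

  1. It is between 1 and 45 letters long
  2. The word must exist in the WordNet database
  3. It must not be a "common" word
  4. It must not be a curse word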

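I have met requirements 1 - 3, but I cannot find a way to identify curse words; how can I filter them out?

I know this will not be a definitive way to filter out every curse word, but all keywords are first set to a "pending" state until a moderator "approves" them. Still, if I can get WordNet to filter out most of the curse words, it will make the moderators' job easier.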


2 Answers


Strangely, the Unix command-line version of WordNet (wn) gives you the desired information with the -domn (domain) option:

wn ass -domnn (-domnv for a verb)

...
>>> USAGE->(noun) obscenity#2, smut#4, vulgarism#1, filth#4, dirty word#1
>>> USAGE->(noun) slang#2, cant#3, jargon#1, lingo#1, argot#1, patois#1, vernacular#1
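
If the wn binary is installed, output like the above can be checked from Python. This is only a sketch under stated assumptions: both helper names are made up for illustration, the marker words are copied from the USAGE lines shown above, and the return code of wn is not inspected.

```python
import subprocess

# Marker words copied from the USAGE lines shown above.
VULGAR_MARKERS = ("obscenity", "vulgarism", "dirty word")


def has_vulgar_usage(wn_output, markers=VULGAR_MARKERS):
    """Return True if any USAGE line of the wn output mentions a marker."""
    for line in wn_output.splitlines():
        if "USAGE->" in line and any(m in line for m in markers):
            return True
    return False


def wn_domain_output(word, option="-domnn"):
    """Run the WordNet CLI (hypothetical wrapper; requires wn on PATH)."""
    result = subprocess.run(["wn", word, option], capture_output=True, text=True)
    return result.stdout
```

A call such as has_vulgar_usage(wn_domain_output("ass")) would then flag the word for the noun senses; use "-domnv" as the option for verbs.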

However, the equivalent method in the NLTK just returns an empty list:

from nltk.corpus import wordnet

# NLTK 3 / Python 3 syntax: lemmas() is a method and print is a function
for synset in wordnet.synsets('ass'):
    for lemma in synset.lemmas():
        print(lemma.usage_domains())

[]
[]
...

As an alternative, you could try to filter words that have "obscene", "coarse" or "slang" in their synset's definition. But it is probably much easier to filter against a fixed list, as suggested before (like the one at noswearing.com).
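
The definition-based idea could look roughly like the sketch below. The function names and the get_definitions callback are hypothetical; the callback keeps the matching logic independent of NLTK, and the commented lines show how it might be wired to WordNet, assuming NLTK 3 with the WordNet corpus downloaded.

```python
import re

# Marker words suggested above; adjust them as needed.
MARKERS = ("obscene", "coarse", "slang")


def definition_is_flagged(definition, markers=MARKERS):
    """True if the definition contains one of the markers as a whole word."""
    words = set(re.findall(r"[a-z]+", definition.lower()))
    return any(m in words for m in markers)


def word_is_flagged(word, get_definitions):
    """get_definitions(word) must return an iterable of definition strings."""
    return any(definition_is_flagged(d) for d in get_definitions(word))


# With NLTK 3 and the WordNet corpus installed, a lookup could be:
#   from nltk.corpus import wordnet
#   lookup = lambda w: [s.definition() for s in wordnet.synsets(w)]
#   word_is_flagged("ass", lookup)
```

Expect false positives, since "coarse" and "slang" also appear in the definitions of harmless words. That is tolerable when every keyword still goes through moderator approval.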

Update: There is also a curse word filter API at Mashape.

Answered 2012-09-11T20:33:14.637

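For the fourth point, it is better to collect a list of curse words and remove them by iterating over the text.

To do this, you can have a look at the blog.

I will summarize it here: 1. Load the swear words text file from here 2. Compare it against the text and remove any match.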

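curse_words = {'fuck'}  # stand-in for the words loaded from the swear words file

def remove_curse_words(text):
    # Keep only the words that are not in curse_words (case-insensitive match)
    return ' '.join(word for word in text.split() if word.lower() not in curse_words)

print(remove_curse_words('Hey Bro Fuck you'))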

The output will be:

Hey Bro you
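
A slightly fuller sketch of the two steps, which also normalizes case and strips surrounding punctuation before comparing. The function names are illustrative, and the one-word-per-line file layout is an assumption.

```python
import string


def load_curse_words(lines):
    """Build a lowercase set from an iterable of lines (e.g. an open file)."""
    return {line.strip().lower() for line in lines if line.strip()}


def remove_curse_words(text, curse_words):
    """Drop every word whose normalized form is in curse_words."""
    kept = []
    for word in text.split():
        # Compare without surrounding punctuation and without case.
        normalized = word.strip(string.punctuation).lower()
        if normalized not in curse_words:
            kept.append(word)
    return " ".join(kept)
```

With a file on disk, load_curse_words can be passed the open file object directly, for example inside with open("swear_words.txt") as f, where the file name is hypothetical.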

Answered 2017-11-01T10:02:08.280