mysql - WordNet 3.0 Curse Words

Question

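I'm developing a system that extracts keywords from plain text.

The requirements for a keyword are:

1. Between 1 and 45 letters long
2. The word must exist in the WordNet database
3. Must not be a "common" word
4. Must not be a curse word

I have fulfilled requirements 1 - 3, but I can't find a way to distinguish curse words; how do I filter them?

I know this will not be a definitive way to filter out every curse word, and it doesn't need to be: all keywords are first set to a "pending" state before a moderator "approves" them. Still, if I can get WordNet to filter out most of the curse words, it would make the moderator's job easier.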

score 3 · Accepted Answer

Strangely, the Unix command-line version of WordNet (wn) gives you the desired information with the -domn (domain) option:

wn ass -domnn (-domnv for a verb)

...
>>> USAGE->(noun) obscenity#2, smut#4, vulgarism#1, filth#4, dirty word#1
>>> USAGE->(noun) slang#2, cant#3, jargon#1, lingo#1, argot#1, patois#1, vernacular#1

However, the equivalent method in the NLTK just returns an empty list:

from nltk.corpus import wordnet

# Print the usage domains of every lemma of every synset of "ass"
for synset in wordnet.synsets('ass'):
    for lemma in synset.lemmas():
        print(lemma.usage_domains())

[]
[]
...
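
One thing that may be worth trying: the loop above asks each lemma for its usage domains, but NLTK also exposes usage_domains() on the synset itself. The sketch below checks at that level. It is untested against real WordNet data here, so whether it returns the domains that wn prints is an assumption. The helper name flagged_usage and the set FLAGGED_DOMAINS are made up for illustration; the two lemma names in the set are taken from the wn output above.

```python
# Hypothetical helper: flag synsets whose usage domain mentions obscenity or slang.
FLAGGED_DOMAINS = {"obscenity", "slang"}  # lemma names seen in the wn output


def flagged_usage(synsets, flagged=FLAGGED_DOMAINS):
    """Return the names of synsets that have a flagged usage domain.

    `synsets` is expected to be what nltk's wordnet.synsets(word) returns;
    any object with name() and usage_domains() works.
    """
    hits = []
    for synset in synsets:
        for domain in synset.usage_domains():
            # A domain is itself a synset; compare its lemma names to the flagged set
            if flagged & set(domain.lemma_names()):
                hits.append(synset.name())
                break
    return hits
```

With NLTK and its WordNet corpus installed, this would be called as flagged_usage(wordnet.synsets('ass')). An empty result would mean the domain pointers are not reachable this way either.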

As an alternative, you could try to filter out words that have "obscene", "coarse" or "slang" in their synset's definition. It is probably much easier, though, to filter against a fixed list as suggested before (like the one at noswearing.com).
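
The definition-based idea could look roughly like the sketch below. It is only an illustration: the function names and DEFAULT_MARKERS are hypothetical, and plain substring matching will also catch harmless senses (for example a definition that merely describes a coarse texture), so it suits a pre-filter ahead of moderation rather than a final decision.

```python
DEFAULT_MARKERS = ("obscene", "coarse", "slang")  # markers named in the answer


def definition_flags(definitions, markers=DEFAULT_MARKERS):
    """Return the set of markers that occur in any of the definition strings."""
    found = set()
    for definition in definitions:
        text = definition.lower()
        found.update(marker for marker in markers if marker in text)
    return found


def wordnet_definitions(word):
    """Fetch all synset definitions for a word (needs nltk and its WordNet corpus)."""
    from nltk.corpus import wordnet  # imported lazily so the filter itself has no dependency
    return [synset.definition() for synset in wordnet.synsets(word)]
```

A keyword would then stay "pending" with a warning whenever definition_flags(wordnet_definitions(keyword)) is non-empty.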

Update: There is also a curse word filter API at Mashape.

score 0 · Accepted Answer

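For the fourth point, it would be better to collect a list of swear words and remove them by iterating over the text.

To do this, you can check out this blog.

I'll summarize it here: 1. Load the swear words text file from here. 2. Compare it against the text and remove any matches.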

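def remove_curse_words(text, curse_words):
    # Drop every word found in the swear-word list (compared in lowercase)
    return ' '.join(word for word in text.split() if word.lower() not in curse_words)

# curse_words: the set of lowercase swear words loaded from the text file
print(remove_curse_words('Hey Bro Fuck you', curse_words))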

The output will be:

Hey Bro you
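
Splitting on whitespace leaves punctuation attached to words, so a token such as "Fuck!" would slip past the list. A slightly more forgiving variant is sketched below; the function name censor is made up for illustration, and the blocklist is assumed to be supplied by the caller.

```python
import re


def censor(text, curse_words):
    """Remove blocklisted words, ignoring case and surrounding punctuation."""
    blocked = {word.lower() for word in curse_words}
    kept = []
    for token in text.split():
        # Strip leading/trailing non-word characters before the lookup
        core = re.sub(r"^\W+|\W+$", "", token).lower()
        if core not in blocked:
            kept.append(token)
    return " ".join(kept)
```

Obfuscated spellings will still get through, which is where the moderator's "pending" queue from the question comes in.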
