Suppose my text data looks like the following, in the form of a list.
l = ['have approved 13 request its showing queue note data been sync move out these request from queue', 'note have approved 12 requests its showing queue note data been sync move out all request from queue', 'have approved 2 request its showing queue note data been sync move out of these 2 request ch 30420 cr 13861']
I am using TfidfVectorizer and DBSCAN clustering to cluster this text and assign each item a label.
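from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Build TF-IDF features from word 3-grams and 4-grams
vect = TfidfVectorizer(ngram_range=(3, 4), min_df=1, max_df=1.0, decode_error="ignore")
tfidf = vect.fit_transform(l)
# Pairwise cosine similarity (TF-IDF rows are L2-normalized by default)
a = (tfidf * tfidf.T).toarray()
db_a = DBSCAN(eps=0.3, min_samples=5).fit(a)
lab = db_a.labels_
print(lab)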
The output I get is
`array([-1, -1, -1])`
So basically DBSCAN labels all of my data as "-1", which means every item is classified as noise, as described in the sklearn DBSCAN documentation.
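For reference, here is a minimal sketch of one likely contributing factor, not a definitive diagnosis. In scikit-learn, `min_samples` counts the point itself, so with only three documents and `min_samples=5` no point can ever become a core point, whatever `eps` is. The sketch also passes cosine distances to DBSCAN via `metric="precomputed"` instead of the similarity matrix. The names `docs`, `dist`, `strict_labels` and `relaxed_labels`, and the relaxed values `eps=0.7` and `min_samples=2`, are assumptions made for illustration only.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Same three documents as in the question
docs = [
    "have approved 13 request its showing queue note data been sync move out these request from queue",
    "note have approved 12 requests its showing queue note data been sync move out all request from queue",
    "have approved 2 request its showing queue note data been sync move out of these 2 request ch 30420 cr 13861",
]

tfidf = TfidfVectorizer(ngram_range=(3, 4)).fit_transform(docs)

# DBSCAN expects distances (small = similar), so convert to cosine distance
dist = cosine_distances(tfidf)
print(dist.round(2))  # inspect the pairwise distances before choosing eps

# Original parameters: min_samples=5 cannot be satisfied by 3 points,
# so every point ends up labelled as noise (-1)
strict_labels = DBSCAN(eps=0.3, min_samples=5, metric="precomputed").fit(dist).labels_
print(strict_labels)

# Assumed, illustrative parameters: a core point needs only itself plus one neighbour
relaxed_labels = DBSCAN(eps=0.7, min_samples=2, metric="precomputed").fit(dist).labels_
print(relaxed_labels)
```

Whether the relaxed run groups any documents depends on the printed distances; `eps` would need to be chosen from those values rather than taken from this sketch.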