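I'm new to scikit-learn and am working on a project that involves multi-label classification of about 70,000 web pages (a ~250 MB file). Because of the file size, I have to use out-of-core classification. The labels for these pages are DMOZ categories, so each page can have multiple labels.
I created the code below by adapting scikit-learn's out-of-core example. However, it prints only one label per document.
1) Is there a way to print the top 5 labels for each document, ranked by probability? I'd appreciate any pointers or modifications to the code.
2) Given that OneVsRest does not provide a partial_fit method, what is a good classifier that supports multi-label classification for this task?
The text in file_training_combined.csv looks like this: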
"http://home.earthlink.net/~rvbears/","RV Resources - Camping Information - RV Accessories","","","","","RV Resources - Camping Information - RV Accessories RV Resources\, Camping Resources\, Camping Information RV\, Camping Resources and Information! For Campers\, Travel Trailers\, Motorhome and Fifth Wheels Owners Camping Games Camping Recipes Camping Cooking Supplies RV Books RV E-Books RV Videos/DVD RV Links Looking for rv and camping information\, this is it! Check in here for lots of great resources and information especially for newbies. From Camping Gear\, to RV Books\, E-Books\, and Videos our pages are filled with information about everything to do with Camping and RVing to get you headed in the right direction\, from companies you can trust. Refer to the RV Links section for lots of camping gear and rv accessories\, find just about anything that you are looking for. Coming Back Soon....Our ""PRODUCT REVIEWS BLOG"" Will we be returning to reviewing our best bets on some of the newest camping gadgets for inside and outside your rv or tent. Emergency medical & travel assistance for less than 22 cents a day. Good Sam TravelAssist. Learn More! With over 2 million rescues and recoveries and counting\, Good Sam Roadside Assistance gives our members peace of mind when they travel. RV Accessories\, RV Decor\, RV Books\, RV E-books\, RV Videos\, RV DVDs RV Resources\, Camping Resources\, Camping Information NOTE: RV Ladders Bears are now SOLD OUT Home | Woodworking Links | Link To Us Copyright 2002-2014 GoCampin'. All Rights Reserved. Go Campin' ~ PO BOX 25417 ~ Greenville\, SC 29616-0417","/Top/Shopping/Crafts/Woodcraft/Decorative|/Top/Shopping/Crafts/Woodcraft/HomeDecor"
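This is just one row of the CSV file. I'm using the text in column 6, and the labels are in column 7, separated by | (zero-based indices, i.e. row[6] and row[7] in the code).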
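To make the row layout concrete, here is a minimal sketch of how one row could be split into its text and its list of labels. The row is a shortened, hypothetical stand-in for the real data, and `parse_row` is an illustrative helper name, not part of the script below.

```python
import csv
import io

# Hypothetical, shortened row using the same 8-field layout as the sample above.
SAMPLE = u'"http://example.com/","Title","","","","","Some page text","/Top/A/B|/Top/A/C"\n'


def parse_row(row):
    """Return (text, labels) for one parsed CSV row."""
    text = row[6]
    # Labels are stored in a single field, separated by '|'.
    labels = [label for label in row[7].strip().split('|') if label]
    return text, labels


reader = csv.reader(io.StringIO(SAMPLE), quotechar='"', delimiter=',')
text, labels = parse_row(next(reader))
print(text)
print(labels)
```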
import codecs
import itertools
import time
import csv
import sys
import re
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
__author__ = 'prateek.jain'
sep = b","
quote_char = b'"'
stop = stopwords.words('english')
porter = PorterStemmer()
text_rows = []
text_labels = []
training_file_object = codecs.open('file_training_combined.csv','r', 'utf-8')
wr1 = csv.reader(training_file_object, dialect='excel', quotechar=quote_char, quoting=csv.QUOTE_ALL, delimiter=sep)
output_file = 'output.csv'
output_file_object = open(output_file, 'w')
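# First pass: collect the labels of every row so the full label set is known up front.
for row in wr1:
    labels = row[7].strip().split('|')
    empty_list = []
    for label in labels:
        # Skip entries that look like URLs rather than DMOZ categories.
        if not ('http:' in label.lower() or 'www:' in label.lower()):
            empty_list.append(label)
    text_labels.append(empty_list)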
def tokenizer(text):
    # Strip HTML tags, keep emoticons, lowercase, drop stop words, then stem.
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    text = [w for w in text.split() if w not in stop]
    tokenized = [porter.stem(w) for w in text]
    # Return the stemmed tokens (the unstemmed list was returned before).
    return tokenized
# dialect='excel'
def stream_docs(path):
    # Yield (text, labels) one row at a time so the file never has to fit in memory.
    training_file_object = codecs.open(path, 'r', 'utf-8')
    wr1 = csv.reader(training_file_object, dialect='excel', quotechar=quote_char, quoting=csv.QUOTE_ALL, delimiter=sep)
    for row in wr1:
        text, label = row[6], row[7]
        labels = label.split('|')
        empty_list = []
        for label in labels:
            if not ('http:' in label.lower() or 'www:' in label.lower()):
                empty_list.append(label)
        yield text, empty_list
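def get_minibatch(doc_stream, size):
    # Pull up to `size` documents and their label lists from the stream.
    docs, y = [], []
    for _ in range(size):
        text, label = next(doc_stream)
        docs.append(text)
        y.append(label)
    return docs, y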
from sklearn.feature_extraction.text import HashingVectorizer
vect = HashingVectorizer(decode_error='ignore',
                         n_features=2 ** 10,
                         non_negative=True, )
clf = MultinomialNB()
doc_stream = stream_docs(path='file_training_combined.csv')
merged = list(itertools.chain(*text_labels))
my_set = set(merged)
class_label_list = list(my_set)
all_class_labels = np.array(class_label_list)
mlb = MultiLabelBinarizer(all_class_labels)
X_test_text, y_test = get_minibatch(doc_stream, 1000)
X_test = vect.transform(X_test_text)
classes = np.array([0, 1])
tick = time.time()
accuracy = 0
total_fit_time = 0
n_train_pos = 0
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    X_train_matrix = vect.fit_transform(X_train)
    y_train = mlb.fit_transform(y_train)
    print X_train_matrix.shape, ' ', y_train.shape
    clf.partial_fit(X_train_matrix.toarray(), y_train, classes=all_class_labels)
    total_fit_time += time.time() - tick
    n_train = X_train_matrix.shape[0]
    n_train_pos += sum(y_train)
    tick = time.time()
predicted = clf.predict(X_test)
all_labels = predicted
for item, labels in zip(X_train, all_labels):
    print '%s => %s' % (item, labels)
    output_file_object.write('%s => %s' % (item, labels) + '\n')
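
For question 1, one possible direction is to rank the per-class probabilities of each document and keep the five largest instead of calling predict. The sketch below is plain Python and assumes one probability per class is already available for a document, for example one row of clf.predict_proba(X_test) paired with clf.classes_, which only applies if the classifier was trained with one class per label. The helper name `top_k_labels` and the sample labels and scores are hypothetical.

```python
import heapq


def top_k_labels(class_labels, probabilities, k=5):
    """Return the k (label, probability) pairs with the highest probability."""
    pairs = zip(class_labels, probabilities)
    return heapq.nlargest(k, pairs, key=lambda pair: pair[1])


# Hypothetical labels and scores for a single document.
labels = ['/Top/A', '/Top/B', '/Top/C', '/Top/D', '/Top/E', '/Top/F']
scores = [0.05, 0.30, 0.10, 0.25, 0.02, 0.28]

for label, score in top_k_labels(labels, scores, k=5):
    print('%s\t%.2f' % (label, score))
```

Applied to every row of the probability matrix, this would give five ranked labels per document rather than the single label that predict returns.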