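I am trying to build a text classification model using Multinomial Naive Bayes. My data has 10 different categories. During training, I represent these categories as integers.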

topics = ["gis","security","photo","mathematica","unix","wordpress","scifi","electronics","android","apple"]
topic2label = {topics[i]:i for i in range(len(topics))}
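
A minimal sketch of how this mapping could be applied to a list of topic names and inverted again, assuming the predicted labels need to be turned back into names. The names `label2topic`, `encode`, and `decode` are illustrative and do not appear in my actual code.

```python
topics = ["gis", "security", "photo", "mathematica", "unix",
          "wordpress", "scifi", "electronics", "android", "apple"]
topic2label = {name: i for i, name in enumerate(topics)}

# Hypothetical inverse mapping: integer label -> topic name.
label2topic = {i: name for name, i in topic2label.items()}


def encode(names):
    """Convert topic names to integer labels."""
    return [topic2label[n] for n in names]


def decode(labels):
    """Convert integer labels back to topic names."""
    return [label2topic[int(label)] for label in labels]


sample = ["electronics", "apple", "gis"]
encoded = encode(sample)
print(encoded)
print(decode(encoded))
```

Keeping a single dictionary as the source of truth for both directions avoids the two representations drifting apart.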

Training data format:

{"topic":"electronics","question":"What is the effective differencial effective of this circuit","excerpt":"I'm trying to work out, in general terms, the effective capacitance of this circuit .  \n\nWhat is the effective capacitance of this circuit and will the ...\r\n        "}
{"topic":"electronics","question":"Heat sensor with fan cooling","excerpt":"Can I know which component senses heat or acts as heat sensor in the following circuit?\nIn the given diagram, it is said that the 4148 diode acts as the sensor. But basically it is a zener diode and ...\r\n        "}

This is what my code looks like:

# ---------------------------------------- Training -------------------------------------
import json

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

topic, question, excerpt, combo = [], [], [], []

with open('training.json') as f:
    next(f)  # skip the first line of the file
    for line in f:
        data = json.loads(line)
        que = data["question"]
        excer = data["excerpt"]
        topic.append(data["topic"])
        question.append(que)
        excerpt.append(excer)
        combo.append(que + " " + excer)

unique_topics = list(set(topic))
# Numeric representation of the labels, stored as strings '1'..'10'.
topic2number = {'gis': '1', 'security': '2', 'photo': '3', 'mathematica': '4',
                'unix': '5', 'wordpress': '6', 'scifi': '7', 'electronics': '8',
                'android': '9', 'apple': '10'}
numeric_topics = [topic2number[name] for name in topic]
x1 = np.array(question)
x2 = np.array(excerpt)
x3 = np.array(combo)
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(x3)  # features come from question + excerpt
# Use np.array(topic) instead to train on the string representation.
Y = np.array(numeric_topics)
clf = MultinomialNB(alpha=0.1).fit(X, Y)

# ----------------------------   Prediction -----------------------------------------

docs_new = []

n_docs = int(input())  # first line of stdin: number of documents to classify
for _ in range(n_docs):
    data = json.loads(input())  # one JSON record per line
    que = data["question"]
    excer = data["excerpt"]
    docs_new.append(que + " " + excer)

# Reuse the fitted vectorizer so the features match the training vocabulary.
X_new_counts = vectorizer.transform(docs_new)
predicted = clf.predict(X_new_counts)
for label in predicted:
    print(label)

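Now I have noticed some strange behavior: when I use the integer representation of the categories, my model's accuracy is 82%, but when I use the string representation, the accuracy jumps to 90%.

My question is: why does the model behave differently (better) in the second case?

PS: I am using the sklearn library.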
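
One way to narrow this down is sketched below. `MultinomialNB` treats `y` as a set of class labels rather than as numeric quantities, so one would expect the string and numeric encodings to give equivalent predictions when the mapping between them is one-to-one and applied to the same examples. The toy corpus, the `name2int` mapping, and all variable names here are made up for illustration; this is not my real data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus (illustrative only, not the real training data).
docs = [
    "capacitor resistor circuit voltage",
    "diode sensor circuit current",
    "iphone macbook ios update",
    "macbook keyboard ios battery",
    "shell script grep command",
    "bash shell pipe command",
]
str_labels = ["electronics", "electronics", "apple", "apple", "unix", "unix"]

# Hypothetical one-to-one mapping from topic name to integer label.
name2int = {"electronics": 8, "apple": 10, "unix": 5}
int_labels = [name2int[name] for name in str_labels]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Train two otherwise identical models, one per label encoding.
clf_str = MultinomialNB(alpha=0.1).fit(X, np.array(str_labels))
clf_int = MultinomialNB(alpha=0.1).fit(X, np.array(int_labels))

new_docs = ["voltage across the diode", "grep in a bash script", "ios battery drain"]
X_new = vectorizer.transform(new_docs)

pred_str = [str(p) for p in clf_str.predict(X_new)]
pred_int = [int(p) for p in clf_int.predict(X_new)]

# Each string prediction should map onto the corresponding integer prediction.
same = [name2int[s] == i for s, i in zip(pred_str, pred_int)]
print(pred_str)
print(pred_int)
print(all(same))
```

If the same comparison on the real data shows disagreements, the place to look is probably how the numeric labels are built and compared during evaluation (for example `'8'` versus `8`, or a mapping that starts at 0 in one place and at 1 in another) rather than the classifier itself.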
