Here's what I'm trying to do. I have a .csv file. Column 1 contains people's names (e.g. "Michael Jordan", "Anderson Silva", "Muhammad Ali") and column 2 contains their ethnicity (e.g. English, French, Chinese).
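For reference, the file is laid out roughly like this (the rows below are placeholders just to show the two-column layout, not actual rows from my data; I rename the columns in the code):

Some Name,English
Another Name,French
Third Name,Chinese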
In my code I load all of the data into a pandas DataFrame, then create additional DataFrames: one with only the Chinese names and another with only the non-Chinese names. From those I build separate lists.
The three_split function extracts features from each name by splitting it into three-character substrings. For example, "Katy Perry" is turned into "kat", "aty", "ty_", "y_p", and so on (spaces are replaced with underscores first).
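Concretely, this is what I expect the function (shown further down) to return for that name in an interactive session (key order may differ):

>>> three_split("Katy Perry")
{'contains(kat)': True, 'contains(aty)': True, 'contains(ty_)': True,
 'contains(y_p)': True, 'contains(_pe)': True, 'contains(per)': True,
 'contains(err)': True, 'contains(rry)': True}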
Then I train a Naive Bayes classifier and finally test the results.
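As far as I can tell from the NLTK documentation, PositiveNaiveBayesClassifier.train takes the labelled (positive) feature sets, the unlabeled feature sets, and an optional prior probability for the positive label. This sketch shows the call I am making, with the prior left at what I believe is its default of 0.5:

classifier = PositiveNaiveBayesClassifier.train(
    positive_featuresets,     # features built from the Chinese names (the labelled, positive set)
    unlabeled_featuresets,    # features built from the remaining names (treated as unlabeled)
    positive_prob_prior=0.5,  # prior P(Chinese); 0.5 should be the default, per my reading of the docs
)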
The code runs without any errors, but when I take a non-Chinese name straight from the dataset and expect the program to return False (not Chinese), it returns True (Chinese) for every name I test. Any ideas?
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import nltk
from nltk.classify import NaiveBayesClassifier as nbc
from nltk.classify import PositiveNaiveBayesClassifier
# Read the csv file into a data frame (raw string so "\U..." in the path is not treated as an escape sequence)
data = pd.read_csv(r"C:\Users\KubiK\Dropbox\Python exercises_KW\_Scraping\BeautifulSoup\FamilySearch.org\FamSearch_Analysis\OddNames_sampleData3.csv",
                   encoding="utf-8")
df = DataFrame(data)
df.columns = ["name", "ethnicity"]
# Recategorize different ethnicities into 1) Chinese or 2) non-Chinese; and then create separate lists
df_chinese = df[(df["ethnicity"] == "chinese") | (df["ethnicity"] == "Chinese")]
chinese_names = list(df_chinese["name"])
df_nonchinese = df[(df["ethnicity"] != "chinese") & (df["ethnicity"] != "Chinese") & (df["ethnicity"].notnull() == True)]
nonchinese_names = list(df_nonchinese["name"])
# Split a name into overlapping three-character substrings and return them as a feature dict
def three_split(word):
    word = str(word).lower().replace(" ", "_")
    split = 3
    return dict(("contains(%s)" % word[start:start + split], True)
                for start in range(0, len(word) - 2))
# Train the positive Naive Bayes classifier: Chinese names form the labelled (positive) set,
# the remaining names are passed in as the unlabeled set
positive_featuresets = list(map(three_split, chinese_names))
unlabeled_featuresets = list(map(three_split, nonchinese_names))
classifier = PositiveNaiveBayesClassifier.train(positive_featuresets, unlabeled_featuresets)
# Test the results
name = "Hubert Gillies"  # a non-Chinese name taken from the dataset
print(classifier.classify(three_split(name)))
# Prints: True  (wrong output; expected False)