Here's what I'm trying to do. I have a .csv file. Column 1 contains people's names (e.g. "Michael Jordan", "Anderson Silva", "Muhammad Ali") and column 2 contains their ethnicity (e.g. English, French, Chinese).
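For reference, the file is laid out roughly like this (the rows below are placeholders just to show the two-column layout, not actual rows from my data; I rename the columns in the code):

Some Name,English
Another Name,French
Third Name,Chinese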
In my code I load all of the data into a pandas DataFrame, then create additional DataFrames: one with only the Chinese names and another with only the non-Chinese names. From those I build separate lists.
The three_split function extracts features from each name by splitting it into three-character substrings. For example, "Katy Perry" is turned into "kat", "aty", "ty_", "y_p", and so on (spaces are replaced with underscores first).
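Concretely, this is what I expect the function (shown further down) to return for that name in an interactive session (key order may differ):

>>> three_split("Katy Perry")
{'contains(kat)': True, 'contains(aty)': True, 'contains(ty_)': True,
 'contains(y_p)': True, 'contains(_pe)': True, 'contains(per)': True,
 'contains(err)': True, 'contains(rry)': True}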
Then I train a Naive Bayes classifier and finally test the results.
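As far as I can tell from the NLTK documentation, PositiveNaiveBayesClassifier.train takes the labelled (positive) feature sets, the unlabeled feature sets, and an optional prior probability for the positive label. This sketch shows the call I am making, with the prior left at what I believe is its default of 0.5:

classifier = PositiveNaiveBayesClassifier.train(
    positive_featuresets,     # features built from the Chinese names (the labelled, positive set)
    unlabeled_featuresets,    # features built from the remaining names (treated as unlabeled)
    positive_prob_prior=0.5,  # prior P(Chinese); 0.5 should be the default, per my reading of the docs
)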
The code runs without any errors, but when I take a non-Chinese name straight from the dataset and expect the program to return False (not Chinese), it returns True (Chinese) for every name I test. Any ideas?
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import nltk
from nltk.classify import NaiveBayesClassifier as nbc
from nltk.classify import PositiveNaiveBayesClassifier
# Read the csv file into a data frame (raw string so "\U..." in the path is not treated as an escape sequence)
data = pd.read_csv(r"C:\Users\KubiK\Dropbox\Python exercises_KW\_Scraping\BeautifulSoup\FamilySearch.org\FamSearch_Analysis\OddNames_sampleData3.csv",
                   encoding="utf-8")
df = DataFrame(data)
df.columns = ["name", "ethnicity"]
# Recategorize different ethnicities into 1) Chinese or 2) non-Chinese; and then create separate lists
df_chinese = df[(df["ethnicity"] == "chinese") | (df["ethnicity"] == "Chinese")]
chinese_names = list(df_chinese["name"])
df_nonchinese = df[(df["ethnicity"] != "chinese") & (df["ethnicity"] != "Chinese") & (df["ethnicity"].notnull() == True)]
nonchinese_names = list(df_nonchinese["name"])
# Split a name into overlapping three-character substrings and return them as a feature dict
def three_split(word):
    word = str(word).lower().replace(" ", "_")
    split = 3
    return dict(("contains(%s)" % word[start:start + split], True)
                for start in range(0, len(word) - 2))
# Train the positive Naive Bayes classifier: Chinese names form the labelled (positive) set,
# the remaining names are passed in as the unlabeled set
positive_featuresets = list(map(three_split, chinese_names))
unlabeled_featuresets = list(map(three_split, nonchinese_names))
classifier = PositiveNaiveBayesClassifier.train(positive_featuresets, unlabeled_featuresets)
# Test the results
name = "Hubert Gillies"  # a non-Chinese name taken from the dataset
print(classifier.classify(three_split(name)))
# Prints: True  (wrong output; expected False)