Objective
- Given an Excel file (full of typos), use FuzzyWuzzy to compare and match the typos against an accepted list.
- Correct the typo-filled Excel file with the closest accepted match.
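Approach
- Import the Excel file using pandas
- Push the original, typo-filled Excel file into a dataframe
- Create an accepted dataframe
- Use FuzzyWuzzy to compare the typo dataframe with the accepted dataframe
- Return the original spelling, the accepted spelling, and the match score
- Append the associated accepted spelling to the original Excel file/row for every spelling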
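The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the sample rows are made up (the real ones would come from the Excel file), and the standard library's difflib is used as a stand-in scorer so the snippet runs without FuzzyWuzzy installed. The helper name closest is hypothetical.

```python
import difflib
import pandas as pd

# Hypothetical stand-in rows; the real data would come from pd.read_excel(...).
raw = pd.DataFrame({"Expense": ["Legal Fee", "Bord Fees", "Severence", "I.T. Fees"]})

accepted = ["Severance", "Legal Fees", "Import & Export Fees",
            "I.T. Fees", "Board Fees", "Acquisition Fees"]

def closest(value, choices):
    """Return (best_choice, score 0-100); difflib stands in for FuzzyWuzzy here."""
    scored = [
        (c, round(100 * difflib.SequenceMatcher(None, value.lower(), c.lower()).ratio()))
        for c in choices
    ]
    return max(scored, key=lambda pair: pair[1])

# Match every original spelling, then append the accepted spelling and score
# as new columns next to the original value.
pairs = [closest(v, accepted) for v in raw["Expense"]]
raw["Accepted"] = [p[0] for p in pairs]
raw["Score"] = [p[1] for p in pairs]
print(raw)
```

With FuzzyWuzzy available, the body of closest could presumably be replaced by a call to process.extractOne with the scorer of your choice; the scores would differ from the difflib ones.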
Code
import pandas as pd
from pandasql import sqldf
from fuzzywuzzy import fuzz, process
#Load Excel File into dataframe
xl = pd.read_excel(open("/../data/expenses.xlsx",'rb'))
#Let's clarify how many similar categories exist...
q = """
SELECT DISTINCT Expense
FROM xl
ORDER BY Expense ASC
"""
expenses = sqldf(q)
print(expenses)
#Let's add some acceptable categories and use fuzzywuzzy to match
accepted = ['Severance', 'Legal Fees', 'Import & Export Fees', 'I.T. Fees', 'Board Fees', 'Acquisition Fees']
#select from the list of accepted values and return the closest match
process.extractOne("Company Acquired",accepted,scorer=fuzz.token_set_ratio)
('Acquisition Fees', 38)
The score is not high, but it is high enough to return the expected output.
!!!!!PROBLEM!!!!!
#Time to loop through all the expenses and use FuzzyWuzzy to generate and return the closest matches.
def correct_expense(expense):
    for expense in expenses:
        return expense, process.extractOne(expense,accepted,scorer = fuzz.token_set_ratio)
correct_expense(expenses)
('Expense', ('Legal Fees', 47))
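Problem
- As you can see, process.extractOne works correctly when tested on a single value. However, when run inside the loop, the returned value is unexpected. I believe I may be grabbing the first or last column, but even if that were the case, I would expect "Director Fees" or "Acquisition" to show up (see the original Excel file).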
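A likely explanation, sketched here under assumptions rather than confirmed against the original file: if expenses is a pandas DataFrame (which is what sqldf returns), iterating over it yields column labels, not row values, so the loop sees the string 'Expense'. In addition, a return inside the loop exits on the first iteration. The snippet below uses made-up rows and a difflib-based stand-in for process.extractOne so that it runs without FuzzyWuzzy; best_match and correct_expenses are hypothetical names.

```python
import difflib
import pandas as pd

# Hypothetical stand-in for the result of sqldf(q).
expenses = pd.DataFrame({"Expense": ["Board Fee", "Company Acquired", "Legal Fes"]})

accepted = ["Severance", "Legal Fees", "Import & Export Fees",
            "I.T. Fees", "Board Fees", "Acquisition Fees"]

# Iterating over a DataFrame yields its column labels, not its rows.
print(list(expenses))

def best_match(value, choices):
    """Stand-in for process.extractOne: return (choice, score 0-100)."""
    scored = [
        (c, round(100 * difflib.SequenceMatcher(None, value.lower(), c.lower()).ratio()))
        for c in choices
    ]
    return max(scored, key=lambda pair: pair[1])

def correct_expenses(values, choices):
    """Match every value and collect the results instead of returning mid-loop."""
    results = []
    for value in values:
        results.append((value, best_match(value, choices)))
    return results

# Pass the column (a Series of values), not the whole DataFrame.
corrected = correct_expenses(expenses["Expense"], accepted)
for original, (match, score) in corrected:
    print(original, "->", match, score)
```

In the original code, the same two changes would apply: loop over expenses['Expense'] rather than expenses, and append each process.extractOne result to a list that is returned after the loop finishes.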