python - Strange behavior of to_dict

Question

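I'm building a fuzzy search program that uses FuzzyWuzzy to find matching names in a dataset. My data is in a DataFrame of about 10378 rows, and len(df['Full name']) is 10378, as expected. But len(choices) is only 1695.

I'm running Python 2.7.10 and pandas 0.17.0 in an IPython Notebook.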

import pandas as pd
from fuzzywuzzy import process

choices = df['Full name'].astype(str).to_dict()
def fuzzy_search_to_df(term, choices=choices):
    search = process.extract(term, choices, limit=len(choices)) # does the search itself
    rslts = pd.DataFrame(data=search, index=None, columns=['name', 'rel', 'df_ind']) # puts the results in DataFrame form
    return rslts
results = fuzzy_search_to_df(term='Ben Franklin') # returns the search result for the given term
matches = results[results.rel > 85] # subset of results, these are the best search results
find = df.iloc[matches['df_ind']] # matches in the main df

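As you can probably tell, I get the index of each result in the choices dict as df_ind, which I had assumed would be the same as the index in the main dataframe.

I'm fairly sure the problem is in the first line, with the to_dict() function: len(df['Full name'].astype(str)) gives 10378, while len(df['Full name'].to_dict()) gives 1695.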

score 3 · Accepted Answer

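The problem is that your dataframe has multiple rows with the same index. A Python dictionary can hold only a single value per key, and Series.to_dict() uses the index as the key, so the values from those rows get overwritten by the values that come later.

A very simple example to show this behavior -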

In [36]: df = pd.DataFrame([[1],[2]],index=[1,1],columns=['A'])

In [37]: df
Out[37]:
   A
1  1
1  2

In [38]: df['A'].to_dict()
Out[38]: {1: 2}

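This is what is happening in your case. As noted in the comments, the number of unique index values is only 1695, which we can confirm by checking the value of len(df.index.unique()).

If you are content with having numbers as the keys (the index of the dataframe), you can reset the index using DataFrame.reset_index() and then use .to_dict() on that. Example -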
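
A quick way to run that check is sketched below on a small made-up frame. The names and index labels are assumptions for illustration, not the OP's data, and the `keep` argument of `Index.duplicated` assumes pandas 0.17.0 or later.

```python
import pandas as pd

# Hypothetical data: three rows, but only two distinct index labels.
df = pd.DataFrame({'Full name': ['Ann Lee', 'Bob Ray', 'Cy Dole']},
                  index=[0, 1, 1])

n_rows = len(df)                         # number of rows
n_unique = len(df.index.unique())        # number of distinct index labels
is_unique = df.index.is_unique           # False when any label repeats

# Every row whose index label occurs more than once.
dup_rows = df[df.index.duplicated(keep=False)]

# to_dict() keeps one entry per label, so its length matches n_unique.
n_keys = len(df['Full name'].to_dict())
```

If n_rows and n_unique differ, to_dict() will return fewer entries than the Series has rows.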

choices = df.reset_index()['Full name'].astype(str).to_dict()

Demo with the example above -

In [40]: df.reset_index()['A'].to_dict()
Out[40]: {0: 1, 1: 2}

This is the same as the solution OP found, choices = dict(zip(df['n'],df['Full name'].astype(str))) (as can be seen from the comments), but this method would be faster than using zip and dict.
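
As a rough sketch of why the reset index fits the question's code: after the reset, the dictionary keys are row positions, which is what df.iloc expects. The frame below is made up for illustration, and drop=True is a small variation on the line above that discards the old index instead of keeping it as a column.

```python
import pandas as pd

# Hypothetical data with a repeated index label (7 appears twice).
df = pd.DataFrame({'Full name': ['Ann Lee', 'Bob Ray', 'Cy Dole']},
                  index=[7, 7, 9])

# Keys become 0..len(df)-1, one per row, so nothing is overwritten.
choices = df.reset_index(drop=True)['Full name'].astype(str).to_dict()

# The keys are row positions, so they can be passed to df.iloc,
# as the question does with matches['df_ind'].
picked = df.iloc[[2]]
```

With the original index, the same lookup through df.iloc would have used labels as if they were positions, which only works when the index is already 0, 1, 2, and so on.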
