python - What is wrong with fillna in pandas when it is run twice?

Question

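I am new to Pandas and Numpy. I am working on the Kaggle Titanic dataset, and I have to fix the "Age" and "Embarked" columns because they contain NaN.

I first tried fillna without any success and soon found out that I was missing inplace=True.

So I added it. The first imputation then succeeded, but the second did not. I searched SO and Google and found nothing useful. Please help me.

Here is the code I am trying.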

# imputing "Age" with mean
titanic_df["Age"].fillna(titanic_df["Age"].mean(), inplace = True)
    
# imputing "Embarked" with mode
titanic_df["Embarked"].fillna(titanic_df["Embarked"].mode(), inplace = True)

print(titanic_df["Age"][titanic_df["Age"].isnull()].size)
print(titanic_df["Embarked"][titanic_df["Embarked"].isnull()].size)

The output I get is

0
2

However, I managed to get what I wanted without using inplace=True:

titanic_df["Age"] =titanic_df["Age"].fillna(titanic_df["Age"].mean())
titanic_df["Embarked"] = titanic_df.fillna(titanic_df["Embarked"].mode())

But I am curious what is wrong with the second use of inplace=True.

Please bear with me if this is a silly question; I am completely new and may be missing something small. Any help is appreciated. Thanks in advance.

score 4 · Accepted Answer

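pd.Series.mode returns a Series.

A variable has a single arithmetic mean and a single median, but it may have several modes. If more than one value has the highest frequency, there are multiple modes.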
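To make the point about multiple modes concrete, here is a minimal sketch. The `ports` Series is hypothetical toy data, not the Titanic column.

```python
import pandas as pd

# Hypothetical toy data: "S" and "C" both occur twice, so both are modes.
ports = pd.Series(["S", "C", "S", "C", "Q"])

modes = ports.mode()          # always a Series, even with a single mode
mode_values = modes.tolist()  # pandas returns the modes in sorted order
print(mode_values)            # ['C', 'S']
```

Because the result is a Series and not a scalar, it carries an index (0, 1, ...), which is what matters in the rest of this answer.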

pandas operates on labels.

titanic_df.mean()
Out: 
PassengerId    446.000000
Survived         0.383838
Pclass           2.308642
Age             29.699118
SibSp            0.523008
Parch            0.381594
Fare            32.204208
dtype: float64

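If I were to use titanic_df.fillna(titanic_df.mean()), it would return a new DataFrame in which the column PassengerId is filled with 446.0, the column Survived with 0.38, and so on.

However, if I call the mean method on a Series, the return value is a float: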

titanic_df['Age'].mean()
Out: 29.69911764705882

There is no label associated with it. So if I use titanic_df.fillna(titanic_df['Age'].mean()), all the missing values in all the columns will be filled with 29.699.
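The difference between filling with a labelled Series and filling with a scalar can be seen on a small frame. This is only a sketch; the column names `a` and `b` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame with one missing value per column.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, np.nan]})

# Series of column means: the labels "a" and "b" are matched to the columns,
# so each column is filled with its own mean (2.0 and 15.0).
by_label = df.fillna(df.mean())

# Scalar: there is no label to match, so every missing value gets 2.0.
by_scalar = df.fillna(df["a"].mean())

print(by_label)
print(by_scalar)
```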

Why the first attempt was not successful

You are trying to fill titanic_df with titanic_df["Embarked"].mode(). Let's check the output first:

titanic_df["Embarked"].mode()
Out: 
0    S
dtype: object

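It is a Series with a single element: the index is 0 and the value is S. Now recall how filling with titanic_df.mean() works: each column is filled with the mean that carries its label. Here we have only one label, 0, so values are filled only in a column named 0. Try adding one with df[0] = np.nan and running your code again. You will see that the new column is filled with S.

Why the second attempt was (not) successful

The right-hand side of the assignment, titanic_df.fillna(titanic_df["Embarked"].mode()), returns a new DataFrame. In this new DataFrame, the Embarked column still has NaN values: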
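The label matching can be reproduced on a toy frame. The data below is hypothetical; the last two lines go slightly beyond the text above and show that, when the target is a single column, the fill Series is matched against the row index instead of the column labels.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic frame; row 1 is missing.
df = pd.DataFrame({"Embarked": ["S", np.nan, "C", "S"]})
df[0] = np.nan  # extra column whose label is the integer 0

mode_series = df["Embarked"].mode()  # index 0, value "S"

# DataFrame.fillna matches the label 0 against the column labels:
# column 0 is filled with "S", Embarked is left untouched.
filled = df.fillna(mode_series)
print(filled)

# Series.fillna matches the label 0 against the row index:
# only a missing value in row 0 would be filled, so row 1 stays NaN.
column_filled = df["Embarked"].fillna(mode_series)
print(column_filled.isnull().sum())
```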

titanic_df.fillna(titanic_df["Embarked"].mode())['Embarked'].isnull().sum()
Out: 2

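However, you are not assigning it back to the whole DataFrame. You are assigning this DataFrame to a Series, titanic_df['Embarked']. That does not fill the missing values in the Embarked column; it just uses the index values of the DataFrame. If you check the new column, you will see the numbers 1, 2, ... instead of S, C and Q.

What you should do instead

You are trying to fill a single column with a single value. First, separate that value from its label: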

titanic_df['Embarked'].mode()[0]
Out: 'S'

Now it does not matter whether you use inplace=True or assign the result back. Both

titanic_df['Embarked'] = titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0])

and

titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0], inplace=True)

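will fill the missing values in the Embarked column with S.

Of course, this assumes that you want to use the first value when there are multiple modes. You may need to refine your algorithm there (for example, choose randomly among the values if there are multiple modes).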
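A minimal end-to-end sketch of the fix on hypothetical toy data. It uses the assignment form, since recent pandas versions warn about calling an inplace method on a column selected from a DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic frame, with two missing ports.
titanic_df = pd.DataFrame({"Embarked": ["S", np.nan, "C", "S", np.nan]})

# Take the scalar out of the Series returned by mode().
most_common = titanic_df["Embarked"].mode()[0]

# Fill with the scalar and assign the result back to the column.
titanic_df["Embarked"] = titanic_df["Embarked"].fillna(most_common)

remaining = titanic_df["Embarked"].isnull().sum()
print(most_common, remaining)  # S 0
```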
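One possible way to handle ties is to pick one of the modes at random, as suggested above. The sketch below is only an illustration: the `ports` data, the seed, and the choice to use one random value for all missing entries are assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Hypothetical toy data in which "S" and "C" are tied as modes.
ports = pd.Series(["S", "C", np.nan, "S", "C", np.nan])

candidates = ports.mode()                      # contains both "C" and "S"
fill_value = rng.choice(candidates.to_numpy()) # pick one mode at random
filled = ports.fillna(fill_value)

print(fill_value, filled.isnull().sum())
```

Drawing a separate random mode for each missing row would be another reasonable design; which one fits depends on how much the imputation should preserve the tie.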
