78

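I have pandas data with some columns of text type. These text columns contain some NaN values. What I'm trying to do is impute those NaNs with sklearn.preprocessing.Imputer (replacing NaN with the most frequent value). The problem is in the implementation. Suppose there is a Pandas dataframe df with 30 columns, 10 of which are categorical in nature. Once I run: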

from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
imp.fit(df) 

Python generates an error: 'could not convert string to float: 'run1'', where 'run1' is an ordinary (non-missing) value from the first column with categorical data.

Any help would be very welcome.

4

11 Answers

106

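To use the mean for numeric columns and the most frequent value for non-numeric columns, you could do something like this. You could further distinguish between integers and floats. I guess it might make sense to use the median for integer columns instead.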

import pandas as pd
import numpy as np

from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        Columns of dtype object are imputed with the most frequent value 
        in column.

        Columns of other types are imputed with mean of column.

        """
    def fit(self, X, y=None):

        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)

        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

data = [
    ['a', 1, 2],
    ['b', 1, 1],
    ['b', 2, 2],
    [np.nan, np.nan, np.nan]
]

X = pd.DataFrame(data)
xt = DataFrameImputer().fit_transform(X)

print('before...')
print(X)
print('after...')
print(xt)

which prints:

before...
     0   1   2
0    a   1   2
1    b   1   1
2    b   2   2
3  NaN NaN NaN
after...
   0         1         2
0  a  1.000000  2.000000
1  b  1.000000  1.000000
2  b  2.000000  2.000000
3  b  1.333333  1.666667
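
The integer/float distinction suggested above could be sketched roughly as follows. This is only an illustration, not part of the original answer: the helper name compute_fill_values and the sample frames are hypothetical. It also assumes the fill values are learned from data whose integer columns contain no NaN, because pandas stores an integer column with missing values as float64 unless a nullable dtype is used.

```python
import numpy as np
import pandas as pd


def compute_fill_values(train):
    """Pick one fill value per column based on its dtype (hypothetical helper)."""
    fill = {}
    for name in train.columns:
        col = train[name]
        if pd.api.types.is_integer_dtype(col):
            fill[name] = col.median()                 # integer column: median
        elif pd.api.types.is_float_dtype(col):
            fill[name] = col.mean()                   # float column: mean
        else:
            fill[name] = col.value_counts().index[0]  # anything else: most frequent value
    return fill


# Made-up training data with no missing values, so "count" keeps an integer dtype.
train = pd.DataFrame({
    "kind": ["a", "b", "b"],
    "count": [1, 1, 2],
    "score": [2.0, 1.0, 3.0],
})
fill_values = compute_fill_values(train)

# Made-up new rows with gaps, filled with the values learned above.
new_rows = pd.DataFrame({
    "kind": ["a", np.nan],
    "count": [3, np.nan],
    "score": [np.nan, 4.0],
})
imputed = new_rows.fillna(fill_values)
print(imputed)
```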
answered 2014-08-29T06:44:58.287
13

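You can use sklearn_pandas.CategoricalImputer for the categorical columns. Details:

First, (from the book Hands-On Machine Learning with Scikit-Learn and TensorFlow) you can have sub-pipelines for numerical and string/categorical features, where each sub-pipeline's first transformer is a selector that takes a list of column names (and full_pipeline.fit_transform() takes a pandas dataframe):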

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

You can then combine these sub-pipelines with sklearn.pipeline.FeatureUnion, for example:

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline)
])

Now, in num_pipeline you can simply use sklearn.preprocessing.Imputer(), but in cat_pipeline you can use CategoricalImputer() from the sklearn_pandas package.

Note: the sklearn-pandas package can be installed with pip install sklearn-pandas, but it is imported as import sklearn_pandas.
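
In scikit-learn 0.20 and later, the same two-branch layout can be sketched with sklearn.compose.ColumnTransformer and sklearn.impute.SimpleImputer. They stand in here for the FeatureUnion/DataFrameSelector pair and for Imputer/CategoricalImputer, so this is a swapped-in technique rather than the book's code. The column names num and cat and the data are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical frame with one numeric and one categorical column.
df = pd.DataFrame({
    "num": [1.0, 2.0, np.nan, 3.0],
    "cat": pd.Series(["a", "b", "b", np.nan], dtype=object),
})

# Each branch names the columns it handles, so no separate selector class is needed.
full_pipeline = ColumnTransformer(transformers=[
    ("num_pipeline", SimpleImputer(strategy="mean"), ["num"]),
    ("cat_pipeline", SimpleImputer(strategy="most_frequent"), ["cat"]),
])

result = full_pipeline.fit_transform(df)  # NumPy array, columns in transformer order
imputed = pd.DataFrame(result, columns=["num", "cat"])
print(imputed)
```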

answered 2018-01-09T06:51:35.993
8

There is a package, sklearn-pandas, that can impute categorical variables: https://github.com/scikit-learn-contrib/sklearn-pandas#categoricalimputer

>>> from sklearn_pandas import CategoricalImputer
>>> data = np.array(['a', 'b', 'b', np.nan], dtype=object)
>>> imputer = CategoricalImputer()
>>> imputer.fit_transform(data)
array(['a', 'b', 'b', 'b'], dtype=object)
answered 2018-11-15T10:21:45.593
5
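  • With sklearn.preprocessing.Imputer, strategy = 'most_frequent' can only be used with quantitative features, not qualitative ones. The custom imputer below can be used for both qualitative and quantitative features. Also, with the scikit-learn imputer we can either apply it to the whole data frame (if all features are quantitative) or use a for loop over a list of features/columns of similar type (see the example below). The custom imputer, by contrast, can be used with any combination.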

        from sklearn.preprocessing import Imputer
        impute = Imputer(strategy='mean')
        for cols in ['quantitative_column', 'quant']:  # here both are quantitative features.
              xx[cols] = impute.fit_transform(xx[[cols]])
    
  • Custom imputer:

       import numpy as np
       # Imputer was removed in scikit-learn 0.22; sklearn.impute.SimpleImputer is its replacement
       from sklearn.preprocessing import Imputer
       from sklearn.base import TransformerMixin
    
       class CustomImputer(TransformerMixin):
             def __init__(self, cols=None, strategy='mean'):
                   self.cols = cols
                   self.strategy = strategy
    
             def transform(self, df):
                   X = df.copy()
                   impute = Imputer(strategy=self.strategy)
                   if self.cols is None:
                          self.cols = list(X.columns)
                   for col in self.cols:
                          if X[col].dtype == np.dtype('O'):
                                 # object column: fill with the most frequent value
                                 X[col] = X[col].fillna(X[col].value_counts().index[0])
                          else:
                                 # numeric column: fill with the chosen statistic
                                 X[col] = impute.fit_transform(X[[col]])
    
                   return X
    
             def fit(self, *_):
                   return self
    
  • Data frame:

          X = pd.DataFrame({'city':['tokyo', np.nan, 'london', 'seattle',
                                    'san francisco', 'tokyo'], 
              'boolean':['yes', 'no', np.nan, 'no', 'no', 'yes'], 
              'ordinal_column':['somewhat like', 'like', 'somewhat like', 'like', 
                                'somewhat like', 'dislike'], 
              'quantitative_column':[1, 11, -.5, 10, np.nan, 20]})
    
    
                city              boolean   ordinal_column  quantitative_column
            0   tokyo             yes       somewhat like   1.0
            1   NaN               no        like            11.0
            2   london            NaN       somewhat like   -0.5
            3   seattle           no        like            10.0
            4   san francisco     no        somewhat like   NaN
            5   tokyo             yes       dislike         20.0
    
  • 1) Can be used with a list of features of similar type.

     cci = CustomImputer(cols=['city', 'boolean']) # here default strategy = mean
     cci.fit_transform(X)
    
  • 2) Can be used with strategy = 'median'.

     sd = CustomImputer(['quantitative_column'], strategy = 'median')
     sd.fit_transform(X)
    
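  • 3) Can be used with the whole data frame. It uses the default mean (or we can change it to median). For qualitative features it uses strategy = 'most_frequent', and for quantitative features the mean/median.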

     call = CustomImputer()
     call.fit_transform(X)   
    
answered 2018-11-13T05:48:50.397
4

Copying and modifying sveitser's answer, I made an imputer for pandas.Series objects:

import numpy
import pandas 

from sklearn.base import TransformerMixin

class SeriesImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        If the Series is of dtype Object, then impute with the most frequent object.
        If the Series is not of dtype Object, then impute with the mean.  

        """
    def fit(self, X, y=None):
        if X.dtype == numpy.dtype('O'):
            self.fill = X.value_counts().index[0]
        else:
            self.fill = X.mean()
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

To use it, you would do:

# Make a series
s1 = pandas.Series(['k', 'i', 't', 't', 'e', numpy.nan])


a  = SeriesImputer()   # Initialize the imputer
a.fit(s1)              # Fit the imputer
s2 = a.transform(s1)   # Get a new series
answered 2017-03-17T13:54:40.150
2

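Inspired by the answers here, and for want of a go-to imputer for all use cases, I ended up writing this. It supports four imputation strategies (mean, mode, median, fill) and works on both pd.DataFrame and pd.Series.

mean and median work only for numeric data; mode and fill work for both numeric and categorical data.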

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class CustomImputer(BaseEstimator, TransformerMixin):
    def __init__(self, strategy='mean', filler='NA'):
        self.strategy = strategy
        self.filler = filler  # kept under the parameter name so get_params() works
        self.fill = filler

    def fit(self, X, y=None):
        if self.strategy in ['mean', 'median']:
            # a DataFrame has one dtype per column, a Series has a single dtype
            dtypes = X.dtypes if isinstance(X, pd.DataFrame) else [X.dtype]
            if not all(pd.api.types.is_numeric_dtype(t) for t in dtypes):
                raise ValueError('dtypes mismatch: a numeric dtype is '
                                 'required for ' + self.strategy)
        if self.strategy == 'mean':
            self.fill = X.mean()
        elif self.strategy == 'median':
            self.fill = X.median()
        elif self.strategy == 'mode':
            self.fill = X.mode().iloc[0]
        elif self.strategy == 'fill':
            if type(self.fill) is list and type(X) is pd.DataFrame:
                self.fill = dict([(cname, v) for cname, v in zip(X.columns, self.fill)])
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

Usage

>> df   
    MasVnrArea  FireplaceQu
Id  
1   196.0   NaN
974 196.0   NaN
21  380.0   Gd
5   350.0   TA
651 NaN     Gd


>> CustomImputer(strategy='mode').fit_transform(df)
MasVnrArea  FireplaceQu
Id      
1   196.0   Gd
974 196.0   Gd
21  380.0   Gd
5   350.0   TA
651 196.0   Gd

>> CustomImputer(strategy='fill', filler=[0, 'NA']).fit_transform(df)
MasVnrArea  FireplaceQu
Id      
1   196.0   NA
974 196.0   NA
21  380.0   Gd
5   350.0   TA
651 0.0     Gd 
answered 2017-11-07T20:58:41.693
1

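Using sklearn.impute.SimpleImputer instead of Imputer easily solves this problem, since it can handle categorical variables.

According to the sklearn documentation: if "most_frequent", then replace missing using the most frequent value along each column. Can be used with strings or numeric data.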

https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

from sklearn.impute import SimpleImputer

impute_size = SimpleImputer(strategy="most_frequent")
data['Outlet_Size'] = impute_size.fit_transform(data[['Outlet_Size']]).ravel()
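
A minimal sketch of how this might look end to end, assuming a scikit-learn version that ships sklearn.impute. The train and test frames and their Outlet_Size values are invented for illustration. The imputer is fitted once, and the learned value (exposed as statistics_) is then reused on unseen rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: only the column name comes from the answer above.
train = pd.DataFrame({
    "Outlet_Size": pd.Series(["Medium", "Small", np.nan, "Medium"], dtype=object),
})
test = pd.DataFrame({
    "Outlet_Size": pd.Series([np.nan, "High"], dtype=object),
})

impute_size = SimpleImputer(strategy="most_frequent")
impute_size.fit(train[["Outlet_Size"]])  # learns the most frequent value per column

# transform() returns a 2-D array, so flatten it before assigning to one column.
train["Outlet_Size"] = impute_size.transform(train[["Outlet_Size"]]).ravel()
test["Outlet_Size"] = impute_size.transform(test[["Outlet_Size"]]).ravel()

print(impute_size.statistics_)
print(test)
```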
answered 2020-09-23T18:33:41.003
1

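MissForest can be used to impute missing values in categorical variables together with other categorical features. It works iteratively, similar to IterativeImputer with a random forest as the base model.

Below is code that label-encodes the features along with the target variable, fits the model to impute the NaN values, and decodes the features back: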

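import sys

import pandas as pd
import sklearn.neighbors._base
from sklearn.preprocessing import LabelEncoder

# missingpy imports the old module path sklearn.neighbors.base, so alias it first
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
from missingpy import MissForest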

def label_encoding(df, columns):
    """
    Label encodes the set of the features to be used for imputation
    Args:
        df: data frame (processed data)
        columns: list (features to be encoded)
    Returns: dictionary
    """
    encoders = dict()
    for col_name in columns:
        series = df[col_name]
        label_encoder = LabelEncoder()
        df[col_name] = pd.Series(
            label_encoder.fit_transform(series[series.notnull()]),
            index=series[series.notnull()].index
        )
        encoders[col_name] = label_encoder
    return encoders

# adding to be imputed global category along with features
features = ['feature_1', 'feature_2', 'target_variable']
# label encoding features
encoders = label_encoding(data, features)
# categorical imputation using random forest 
# parameters can be tuned accordingly
imp_cat = MissForest(n_estimators=50, max_depth=80)
data[features] = imp_cat.fit_transform(data[features], cat_vars=[0, 1, 2])
# decoding features
for variable in features:
    data[variable] = encoders[variable].inverse_transform(data[variable].astype(int))
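
The encode/decode round trip around the imputer can be tried in isolation. The sketch below uses only LabelEncoder and a made-up colour column. The fillna(0) call is a stand-in for whatever code the imputer would actually produce, so it illustrates the mechanics, not the MissForest result.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column with one missing entry.
series = pd.Series(["red", np.nan, "blue", "red"], dtype=object)
mask = series.notnull()

# Encode only the non-missing entries; reindexing puts NaN back at the gaps.
encoder = LabelEncoder()
encoded = pd.Series(encoder.fit_transform(series[mask]), index=series[mask].index)
encoded = encoded.reindex(series.index)

# Stand-in for the imputation step: pretend the model predicted code 0.
encoded_filled = encoded.fillna(0).astype(int)

# Decode the integer codes back to the original labels.
decoded = pd.Series(encoder.inverse_transform(encoded_filled.to_numpy()),
                    index=series.index)
print(decoded)
```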
answered 2021-08-11T16:37:52.427
1

This code fills a series with the most frequent category:

import pandas as pd
import numpy as np

# create fake data 
m = pd.Series(list('abca'))
m.iloc[1] = np.nan #artificially introduce nan

print('m = ')
print(m)

#make dummy variables, count and sort descending:
most_common = pd.get_dummies(m).sum().sort_values(ascending=False).index[0] 

def replace_most_common(x):
    if pd.isnull(x):
        return most_common
    else:
        return x

new_m = m.map(replace_most_common) #apply function to original data

print('new_m = ')
print(new_m)

Output:

m =
0      a
1    NaN
2      c
3      a
dtype: object

new_m =
0    a
1    a
2    c
3    a
dtype: object
answered 2016-06-13T21:37:47.953
0

Similarly, modify Imputer for strategy='most_frequent':

import pandas as pd
from sklearn.preprocessing import Imputer

class GeneralImputer(Imputer):
    def __init__(self, **kwargs):
        Imputer.__init__(self, **kwargs)

    def fit(self, X, y=None):
        if self.strategy == 'most_frequent':
            self.fills = pd.DataFrame(X).mode(axis=0).squeeze()
            self.statistics_ = self.fills.values
            return self
        else:
            return Imputer.fit(self, X, y=y)

    def transform(self, X):
        if hasattr(self, 'fills'):
            return pd.DataFrame(X).fillna(self.fills).values.astype(str)
        else:
            return Imputer.transform(self, X)

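where pandas.DataFrame.mode() finds the most frequent value of each column, and pandas.DataFrame.fillna() then fills the missing values with those values. Other strategy values are still handled by Imputer in the usual way.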
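
The mode-then-fillna step can be sketched on its own with plain pandas. The frame below is made up. One detail worth noting: when a column has tied modes, mode() returns more than one row, and .squeeze() then no longer yields a Series. This sketch takes the first row with .iloc[0] instead, which is an assumption about the desired tie-breaking.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; "size" has a tie between "M" and "S".
df = pd.DataFrame({
    "color": pd.Series(["red", "blue", np.nan, "red"], dtype=object),
    "size": pd.Series(["S", "M", "M", "S"], dtype=object),
    "qty": [1.0, 1.0, 2.0, np.nan],
})

modes = df.mode(axis=0)    # one row per mode; tied modes add extra rows
fills = modes.iloc[0]      # first mode of every column
filled = df.fillna(fills)  # the Series index is matched against column names
print(filled)
```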

answered 2017-07-21T03:01:05.437
0

You can try the following:

replace = df['<yourcolumn>'].value_counts().idxmax()

df['<yourcolumn>'] = df['<yourcolumn>'].fillna(replace)
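
A small worked instance of the same idea, with a made-up city column. It uses idxmax() to get the label of the largest count, since in current pandas argmax() returns an integer position instead of the label.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
df = pd.DataFrame({
    "city": pd.Series(["tokyo", np.nan, "london", "tokyo"], dtype=object),
})

counts = df["city"].value_counts()  # NaN is excluded from the counts by default
replace = counts.idxmax()           # label of the most frequent value
df["city"] = df["city"].fillna(replace)
print(df)
```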

answered 2020-02-17T15:50:42.230