
I have CSV data that I need to analyze with Python. Some of the values are missing. A sample of the data is shown below:

Sample

ID,ID_TYPE,OB_DATE,VERSION_NUM,MET_DOMAIN_NAME,OB_END_CTIME,OB_DAY_CNT,SRC_ID,REC_ST_IND,PRCP_AMT,OB_DAY_CNT_Q,PRCP_AMT_Q,METO_STMP_TIME,MIDAS_STMP_ETIME,PRCP_AMT_J
90, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,24109,1011,0,0,6, 2006-01-17 09:04,0,
150, RAIN, 2006-01-01 00:00,1, DLY3208,900,1,30747,1011,0,0,6, 2006-01-09 13:21,3,
174, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,24775,1011,0.2,0,6, 2006-01-17 09:04,0,
498, RAIN, 2006-01-01 00:00,0, WADRAIN,900,1,1622,1012,0.1,0,1, 2006-01-17 09:04,0,
498, RAIN,,1, WADRAIN,900,31,1622,1022,58.3,0,22576, 2006-03-15 11:41,0,
898, RAIN, 2006-01-01 00:00,0, WADRAIN,900,6,1624,1012,18.5,0,20001,,0,
898, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,1624,1022,0.4,0,2576, 2006-03-15 11:41,0,
996, RAIN, 2006-01-01 00:00,1, WAMRAIN,900,31,24953,1011,53.5,0,6, 2006-01-31 13:51,0,
997, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,24953,1011,1.6,0,6, 2006-02-02 12:28,0,
1045, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,1628,1011,1.1,0,6, 2006-01-17 09:04,0,
1103, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,24772,1011,2.5,0,6, 2006-01-17 09:04,0,
1358, RAIN, 2006-01-01 00:00,0, WADRAIN,900,11,1633,1012,17.7,0,20001,,0,
1358, RAIN,,1, WADRAIN,900,31,1633,1022,42.5,0,22576, 2006-03-15 11:41,0,
1545, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,1636,1011,2,0,6, 2006-01-17 09:04,0,
1584, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,315,1014,2.4,0,2306, 2006-03-15 11:41,0,
1858, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,1645,1011,0.2,0,6, 2006-01-17 09:04,0,
2247, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,24781,1011,0.5,0,6, 2006-01-17 09:04,0,
3066, RAIN,,1, WADRAIN,900,1,1655,1011,0.6,0,6, 2006-02-02 12:28,0,
3067, RAIN, 2006-01-01 00:00,0, WADRAIN,900,7,1655,1012,11,0,20001, 2006-01-26 15:08,0,
3067, RAIN, 2006-01-01 00:00,1, WADRAIN,900,31,1655,1022,57.5,0,22576, 2006-03-15 11:41,0,
3507, RAIN, 2006-01-01 00:00,0, WADRAIN,900,2,1657,1012,15.8,0,20001,,0,
3507, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,1657,1022,0.9,0,2576, 2006-04-13 13:28,0,
4802, RAIN,,0, WADRAIN,900,6,1663,1012,18,0,20001, 2006-01-17 09:04,0,
4802, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,1663,1022,0.9,0,2576, 2006-03-15 11:41,0,
4941, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,1664,1011,0.5,0,6, 2006-01-17 09:04,1,
4942, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,1664,1011,1.2,0,6, 2006-02-02 12:28,0,

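Some rows are missing OB_DATE or METO_STMP_TIME, and I want to impute the missing values in these fields.

The basic questions here are:

  1. What is imputation of missing values? Which methods can we use for it?

I have searched a lot about this, but the concept of imputation is still not clear to me.

  2. How can we do this in Python without using any external libraries?

Using an external library is fine, but I would also like to know the possible ways to implement this without one.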


1 Answer


I'm a beginner, but I hope this helps!

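import numpy as np
import pandas as pd
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer

# skipinitialspace strips the blank after each comma; empty fields are read as NaN
dataset = pd.read_csv('filename/path', skipinitialspace=True)
# column 2 is OB_DATE and the third-last column is METO_STMP_TIME
cols = [dataset.columns[2], dataset.columns[-3]]
# both columns hold timestamp strings, so 'mean' cannot be applied; use the most frequent value instead
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# the imputer expects 2-D input, so both columns are passed together as a DataFrame
dataset[cols] = imputer.fit_transform(dataset[cols])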
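
For the second part of the question (no external libraries), here is a minimal sketch using only the standard library. Imputation simply means replacing a missing value with an estimate derived from the values that are present, for example the mean, the median, the most frequent value, or the previous row's value. The sketch below fills each empty field with the most frequent value of its column. The function name `impute_most_frequent` is made up for illustration, and the inline sample is a handful of rows from the question trimmed to three columns; whether the most frequent value is a sensible estimate for your timestamps is an assumption you should check against the full data.

```python
import csv
import io
from collections import Counter


def impute_most_frequent(rows, columns):
    """Fill empty strings in the given columns with each column's most common value."""
    for col in columns:
        # Values that are actually present in this column
        observed = [row[col] for row in rows if row[col] != ""]
        if not observed:
            continue  # nothing to estimate from; leave the column untouched
        fill_value = Counter(observed).most_common(1)[0][0]
        for row in rows:
            if row[col] == "":
                row[col] = fill_value
    return rows


# A few rows from the question's sample, trimmed to the relevant columns
sample = """ID,OB_DATE,METO_STMP_TIME
90, 2006-01-01 00:00, 2006-01-17 09:04
498,, 2006-03-15 11:41
898, 2006-01-01 00:00,
997, 2006-01-01 00:00, 2006-02-02 12:28
1045, 2006-01-01 00:00, 2006-01-17 09:04
"""

# skipinitialspace drops the blank that follows each comma in this file
reader = csv.DictReader(io.StringIO(sample), skipinitialspace=True)
rows = impute_most_frequent(list(reader), ["OB_DATE", "METO_STMP_TIME"])
for row in rows:
    print(row["ID"], row["OB_DATE"], row["METO_STMP_TIME"])
```

To run this on the real file, replace `io.StringIO(sample)` with a file object opened via `open(path, newline="")`. Swapping in a different estimate (say, the value from the previous row) only requires changing how `fill_value` is chosen.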
answered 2018-09-18T20:59:13.997