Using the Rain in Australia dataset, I am trying to predict RainTomorrow. My code is given below:
Download the dataset directly from Kaggle using the opendatasets library
import opendatasets as od
dataset_url = 'https://www.kaggle.com/jsphyg/weather-dataset-rattle-package'
od.download(dataset_url)
Import the necessary libraries
import os
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10,6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
Load the dataset
data_dir = './weather-dataset-rattle-package'
os.listdir(data_dir)
train_csv = data_dir + '/weatherAUS.csv'
raw_df = pd.read_csv(train_csv)
Explore the WindGustDir variable
print('WindGustDir contains', len(raw_df['WindGustDir'].unique()), 'labels')
raw_df['WindGustDir'].unique()
raw_df.WindGustDir.value_counts()
pd.get_dummies(raw_df.WindGustDir, drop_first=True, dummy_na=True).head()
pd.get_dummies(raw_df.WindGustDir, drop_first=True, dummy_na=True).sum(axis=0)
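To make the effect of `drop_first=True` and `dummy_na=True` easier to see, here is a minimal sketch on a toy Series. The values are made up for illustration and are not taken from the dataset; only the column name mirrors the post.

```python
import numpy as np
import pandas as pd

# Toy stand-in for raw_df.WindGustDir (illustrative values only)
wind = pd.Series(['N', 'S', np.nan, 'N', 'W'], name='WindGustDir')

# dummy_na=True adds an extra indicator column for missing values;
# drop_first=True drops the first category level ('N' here)
dummies = pd.get_dummies(wind, drop_first=True, dummy_na=True)

print(dummies.columns.tolist())  # remaining levels plus the NaN indicator
print(dummies.sum(axis=0))       # number of rows flagged in each column
```

The column sums count how many rows fall into each remaining level, and the last column counts the missing entries.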
Handle missing values in the categorical variables
from sklearn.impute import SimpleImputer
cat_imputer = SimpleImputer(strategy='constant', fill_value='not available')
Before performing the imputation, let's check the number of missing values in each categorical column.
raw_df[categorical].isna().sum()
Output
Location 0
WindGustDir 9163
WindDir9am 9660
WindDir3pm 3670
RainToday 0
dtype: int64
X_train[categorical].isna().sum() # count missing values per column in the training data
Output
Location 0
WindGustDir 7285
WindDir9am 7754
WindDir3pm 2916
RainToday 0
dtype: int64
X_test[categorical].isna().sum() # count missing values per column in the test data
Output
Location 0
WindGustDir 1878
WindDir9am 1906
WindDir3pm 754
RainToday 0
dtype: int64
The first step in imputation is to fit the imputer to the data, i.e. to compute the chosen statistic for each column in the dataset.
cat_imputer.fit(raw_df[categorical])
SimpleImputer(fill_value='not available', strategy='constant')
categorical
['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']
list(cat_imputer.statistics_)
['not available',
'not available',
'not available',
'not available',
'not available']
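The fit and transform steps can be sketched on a small toy frame. This is only an illustration under assumptions: the DataFrame `toy` and its values are invented here, while the imputer settings match the ones used in the post. With `strategy='constant'`, fitting does not learn anything from the data; `statistics_` simply holds the fill value once per column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with missing categorical values (illustrative, not from the dataset)
toy = pd.DataFrame({
    'WindGustDir': ['N', np.nan, 'SW'],
    'RainToday': ['No', 'Yes', np.nan],
}, dtype=object)

imputer = SimpleImputer(strategy='constant', fill_value='not available')

# fit() records one statistic per column; for 'constant' it is the fill value
imputer.fit(toy)
print(list(imputer.statistics_))

# transform() returns a NumPy array with the missing entries replaced
filled = imputer.transform(toy)
print(filled)
```

Because the constant strategy ignores the data during fitting, fitting on the full frame or only on the training split gives the same result here. That would not hold for strategies such as `most_frequent`.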
cat_imputer.transform(X_train[categorical])
array([['GoldCoast', 'NNE', 'SW', 'NNW', 'No'],
['Darwin', 'NW', 'NE', 'N', 'No'],
['Wollongong', 'SSE', 'SSW', 'SSE', 'No'],
...,
['MountGambier', 'E', 'E', 'NE', 'No'],
['Perth', 'W', 'WSW', 'WNW', 'Yes'],
['Wollongong', 'ESE', 'ESE', 'E', 'No']], dtype=object)
X_train[categorical] = cat_imputer.transform(X_train[categorical])
X_test[categorical] = cat_imputer.transform(X_test[categorical])
X_train[categorical].isna().sum() # now no missing values
Location 0
WindGustDir 0
WindDir9am 0
WindDir3pm 0
RainToday 0
dtype: int64
X_test[categorical].isna().sum() # now no missing values
Location 0
WindGustDir 0
WindDir9am 0
WindDir3pm 0
RainToday 0
dtype: int64
Encode the categorical variables
categorical
['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']
Encode the RainToday variable
import category_encoders as ce
encoder = ce.BinaryEncoder(cols=['RainToday'])
X_train = encoder.fit_transform(X_train)
FutureWarning message:
---------------------------------------------------------------------------
c:\python 3.9\lib\site-packages\category_encoders\utils.py:21: FutureWarning:
is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead
Why does this warning appear? If I ignore it, will it cause any problems when I run my code in the future?
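For context, the traceback points at `category_encoders\utils.py`, so the deprecated call sits inside the library rather than in the code above. If the goal is only to keep the output clean, one possible approach is to filter `FutureWarning` with the standard `warnings` module. The sketch below uses a hypothetical function `legacy_call` as a stand-in for the library call, and `record=True` is used only so the effect of the filter can be inspected.

```python
import warnings

def legacy_call():
    # Hypothetical stand-in for library code that emits a FutureWarning
    warnings.warn("is_categorical is deprecated", FutureWarning)
    return "encoded"

# Without an ignore filter, the warning is raised and captured
with warnings.catch_warnings(record=True) as caught_default:
    warnings.simplefilter("always")
    legacy_call()

# With an ignore filter for FutureWarning, nothing is captured
with warnings.catch_warnings(record=True) as caught_filtered:
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", category=FutureWarning)
    result = legacy_call()

print(len(caught_default), len(caught_filtered), result)
```

Filtering only hides the message. The deprecated function may still be removed in a later release of the dependency, so checking for a newer version of the library that no longer makes the deprecated call is likely the more durable option.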