Using the Rain in Australia dataset, I am trying to predict RainTomorrow. My code is given below:

Download the dataset directly from Kaggle using the opendatasets library

import opendatasets as od  
dataset_url = 'https://www.kaggle.com/jsphyg/weather-dataset-rattle-package'
od.download(dataset_url)

Import the necessary libraries

import os
import pandas as pd
import numpy as np

import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10,6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

Load the dataset

data_dir = './weather-dataset-rattle-package'
os.listdir(data_dir)
train_csv = data_dir + '/weatherAUS.csv'
raw_df = pd.read_csv(train_csv)

Explore the WindGustDir variable

print('WindGustDir contains', len(raw_df['WindGustDir'].unique()), 'labels')
raw_df['WindGustDir'].unique()
raw_df.WindGustDir.value_counts()
pd.get_dummies(raw_df.WindGustDir, drop_first=True, dummy_na=True).head()
pd.get_dummies(raw_df.WindGustDir, drop_first=True, dummy_na=True).sum(axis=0)
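
To show what the two get_dummies calls above report, here is a minimal sketch on a toy Series. The values in `wind` are made up and only stand in for raw_df['WindGustDir']; the point is that dummy_na=True adds a NaN column whose sum is the number of missing values, while drop_first=True drops the first label.

```python
import pandas as pd

# Toy stand-in for raw_df['WindGustDir'] (hypothetical values, one missing entry)
wind = pd.Series(['N', 'S', None, 'N', 'W'], name='WindGustDir')

# dummy_na=True appends an indicator column for missing values;
# drop_first=True removes the first label ('N' here, since labels are sorted)
dummies = pd.get_dummies(wind, drop_first=True, dummy_na=True)

# Summing each column counts how often each remaining label occurs
label_counts = dummies.sum(axis=0)

# The last column is the NaN indicator, so its sum is the missing-value count
missing_count = int(dummies.iloc[:, -1].sum())
print(label_counts)
print('missing values:', missing_count)
```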

Handle missing values in categorical variables

from sklearn.impute import SimpleImputer 
cat_imputer = SimpleImputer(strategy='constant', fill_value='not available')

Before performing the imputation, let's check the number of missing values in each categorical column.

raw_df[categorical].isna().sum()
Output
Location          0
WindGustDir    9163
WindDir9am     9660
WindDir3pm     3670
RainToday         0
dtype: int64
X_train[categorical].isna().sum()  # count missing values per categorical column in the training data
Output
Location          0
WindGustDir    7285
WindDir9am     7754
WindDir3pm     2916
RainToday         0
dtype: int64
X_test[categorical].isna().sum()  # count missing values per categorical column in the test data
Output
Location          0
WindGustDir    1878
WindDir9am     1906
WindDir3pm      754
RainToday         0
dtype: int64

The first step of imputation is to fit the imputer to the data, i.e. to compute the chosen statistic for each column in the dataset.

cat_imputer.fit(raw_df[categorical])

SimpleImputer(fill_value='not available', strategy='constant')

categorical

['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']

list(cat_imputer.statistics_)
['not available',
 'not available',
 'not available',
 'not available',
 'not available']
cat_imputer.transform(X_train[categorical])
array([['GoldCoast', 'NNE', 'SW', 'NNW', 'No'],
       ['Darwin', 'NW', 'NE', 'N', 'No'],
       ['Wollongong', 'SSE', 'SSW', 'SSE', 'No'],
       ...,
       ['MountGambier', 'E', 'E', 'NE', 'No'],
       ['Perth', 'W', 'WSW', 'WNW', 'Yes'],
       ['Wollongong', 'ESE', 'ESE', 'E', 'No']], dtype=object)
X_train[categorical] = cat_imputer.transform(X_train[categorical]) 
X_test[categorical] = cat_imputer.transform(X_test[categorical])

X_train[categorical].isna().sum() # now no missing values
Location       0
WindGustDir    0
WindDir9am     0
WindDir3pm     0
RainToday      0
dtype: int64

X_test[categorical].isna().sum() # now no missing values
Location       0
WindGustDir    0
WindDir9am     0
WindDir3pm     0
RainToday      0
dtype: int64
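
The imputation steps above can be reproduced in isolation with the following minimal sketch. The frames `toy_train` and `toy_test` and their values are hypothetical stand-ins for X_train[categorical] and X_test[categorical]; only the SimpleImputer settings are taken from my code.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for X_train[categorical] / X_test[categorical]
cols = ['WindGustDir', 'RainToday']
toy_train = pd.DataFrame({'WindGustDir': ['N', np.nan, 'SW'],
                          'RainToday': ['No', 'Yes', np.nan]}, dtype=object)
toy_test = pd.DataFrame({'WindGustDir': [np.nan, 'E'],
                         'RainToday': ['No', 'No']}, dtype=object)

# Same settings as cat_imputer: every missing entry becomes a constant label
toy_imputer = SimpleImputer(strategy='constant', fill_value='not available')
toy_imputer.fit(toy_train)

# With strategy='constant' the stored statistic is the fill value for each column
print(list(toy_imputer.statistics_))

# transform returns a NumPy array, so it is assigned back column-wise
toy_train[cols] = toy_imputer.transform(toy_train[cols])
toy_test[cols] = toy_imputer.transform(toy_test[cols])

print(toy_train.isna().sum())
print(toy_test.isna().sum())
```

With a constant fill value the fitted statistic does not depend on the data, so fitting on the full frame or on the training split gives the same result here; that would not hold for a strategy such as most_frequent.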

Encode categorical variables

categorical

['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']

Encode the RainToday variable

import category_encoders as ce
encoder = ce.BinaryEncoder(cols=['RainToday'])

X_train = encoder.fit_transform(X_train)
  

FutureWarning message:

---------------------------------------------------------------------------
c:\python 3.9\lib\site-packages\category_encoders\utils.py:21: FutureWarning:

is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead

Why does this warning appear? If I simply ignore or suppress it, will that cause any problems when I run my code in the future?
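
To make the "suppress it" part of the question concrete, here is a minimal sketch using the standard-library warnings module. The function `legacy_helper` is hypothetical and only imitates a library call that emits a FutureWarning; it is not the category_encoders code. As far as I understand, filtering the warning only hides the message and does not change what the call does.

```python
import warnings


def legacy_helper():
    # Hypothetical stand-in for a library function that emits a FutureWarning
    warnings.warn(
        "is_categorical is deprecated and will be removed in a future version.",
        FutureWarning,
    )
    return "encoded"


# Without a filter, the warning is emitted (recorded here instead of printed)
with warnings.catch_warnings(record=True) as caught_default:
    warnings.simplefilter("always")
    legacy_helper()

# With an 'ignore' filter for FutureWarning, the same call runs silently
with warnings.catch_warnings(record=True) as caught_filtered:
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", category=FutureWarning)
    result = legacy_helper()

print(len(caught_default), len(caught_filtered), result)
```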
