I am trying to detect multicollinearity in my dataset. I tried the code below, but I get the following error:
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
# encoding the target column y as 0/1
a2['y'] = a2['y'].map({'no':0, 'yes':1})
X=a2.iloc[:,0:20]
# VIF dataframe
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
# calculating VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]
print(vif_data)
The "y" column is the dependent variable; I mapped no to 0 and yes to 1. The dataset contains the following columns:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             41188 non-null  int64
 1   job             41188 non-null  object
 2   marital         41188 non-null  object
 3   education       41188 non-null  object
 4   default         41188 non-null  object
 5   housing         41188 non-null  object
 6   loan            41188 non-null  object
 7   contact         41188 non-null  object
 8   month           41188 non-null  object
 9   day_of_week     41188 non-null  object
 10  duration        41188 non-null  int64
 11  campaign        41188 non-null  int64
 12  pdays           41188 non-null  int64
 13  previous        41188 non-null  int64
 14  poutcome        41188 non-null  object
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null  float64
 18  euribor3m       41188 non-null  float64
 19  nr.employed     7763 non-null   float64
 20  y               41188 non-null  int64
dtypes: float64(5), int64(6), object(10)
memory usage: 6.6+ MB
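nr.employed has been dropped because it contains NaN values. I am new to Python and machine learning, so it would be great if someone could help. I need to compute the variance inflation factors.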