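I am trying to use backward difference encoding to convert all the categorical variables in my dataset for analysis.
I tried it and it runs, but some columns that I expected to see are missing. Why are they missing, and how can I fix it?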
covid_df.info()
The output is:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100463 entries, 0 to 100462
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 _id 100463 non-null int64
1 Assigned_ID 100463 non-null int64
2 Outbreak Associated 100463 non-null object
3 Age Group 100463 non-null object
4 Neighbourhood Name 100463 non-null object
5 FSA 100463 non-null object
6 Source of Infection 100463 non-null object
7 Classification 100463 non-null object
8 Episode Date 100463 non-null object
9 Reported Date 100463 non-null object
10 Client Gender 100463 non-null object
11 Outcome 100463 non-null object
12 Currently Hospitalized 100463 non-null object
13 Currently in ICU 100463 non-null object
14 Currently Intubated 100463 non-null object
15 Ever Hospitalized 100463 non-null object
16 Ever in ICU 100463 non-null object
17 Ever Intubated 100463 non-null object
dtypes: int64(2), object(16)
memory usage: 13.8+ MB
Then:
import category_encoders as ce
encoder = ce.BackwardDifferenceEncoder(cols=['Source_Of_Infection'])
df_bd = encoder.fit_transform(covid_df)
df_bd.info()
The output is:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100463 entries, 0 to 100462
Data columns (total 59 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 intercept 100463 non-null int64
1 Assigned_Id 100463 non-null int64
2 Outbreak_Associated 100463 non-null object
3 Age_Group 100463 non-null object
4 Neighbourhood_Name 100463 non-null object
5 FSA 100463 non-null object
6 Source_Of_Infection_0 100463 non-null float64
7 Source_Of_Infection_1 100463 non-null float64
8 Source_Of_Infection_2 100463 non-null float64
9 Source_Of_Infection_3 100463 non-null float64
10 Source_Of_Infection_4 100463 non-null float64
11 Source_Of_Infection_5 100463 non-null float64
12 Source_Of_Infection_6 100463 non-null float64
13 Source_Of_Infection_7 100463 non-null float64
14 Classification 100463 non-null object
15 Episode_Date 100463 non-null datetime64[ns]
16 Reported_Date 100463 non-null datetime64[ns]
17 Gender 100463 non-null object
18 Outcome 100463 non-null object
19 Currently_Hospitalized 100463 non-null object
20 Currently_ICU 100463 non-null object
21 Currently_Intubated 100463 non-null object
22 Ever_Hospitalized 100463 non-null object
23 Ever_ICU 100463 non-null object
24 Ever_Intubated 100463 non-null object
25 Outbreak_Outbreak_Associated 100463 non-null uint8
26 Outbreak_Sporadic 100463 non-null uint8
27 Source_Close_Contact 100463 non-null uint8
28 Source_Community 100463 non-null uint8
29 Source_Household_Contact 100463 non-null uint8
30 Source_No_Information 100463 non-null uint8
31 Source_Congregate_Settings 100463 non-null uint8
32 Source_Healthcare_Institutions 100463 non-null uint8
33 Source_Other_Settings 100463 non-null uint8
34 Source_Pending 100463 non-null uint8
35 Source_Travel 100463 non-null uint8
36 Classification_Confirmed 100463 non-null uint8
37 Classification_Probable 100463 non-null uint8
38 Gender_Female 100463 non-null uint8
39 Gender_Male 100463 non-null uint8
40 NON-BINARY 100463 non-null uint8
41 Gender_Other 100463 non-null uint8
42 Gender_Transgender 100463 non-null uint8
43 Gender_Unknown 100463 non-null uint8
44 Outcome_Active 100463 non-null uint8
45 Outcome_Fatal 100463 non-null uint8
46 Outcome_Resolved 100463 non-null uint8
47 Currently_Hospitalized_No 100463 non-null uint8
48 Currently_Hospitalized_Yes 100463 non-null uint8
49 Currently_ICU_No 100463 non-null uint8
50 Currently_ICU_Yes 100463 non-null uint8
51 Currently_Intubated_No 100463 non-null uint8
52 Currently_Intubated_Yes 100463 non-null uint8
53 Ever_Hospitalized_No 100463 non-null uint8
54 Ever_Hospitalized_Yes 100463 non-null uint8
55 Ever_ICU_No 100463 non-null uint8
56 Ever_ICU_Yes 100463 non-null uint8
57 Ever_Intubated_No 100463 non-null uint8
58 Ever_Intubated_Yes 100463 non-null uint8
dtypes: datetime64[ns](2), float64(8), int64(2), object(13), uint8(34)
memory usage: 22.4+ MB
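The Outcome column originally had three values: RESOLVED, FATAL, and RECOVERED. After I applied backward difference encoding and looked at the result, the Outcome column had become 2 columns instead of 3. Why does this happen, and how can I fix it?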
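
For reference, here is a minimal sketch of why a three-level column can end up as two columns. It assumes the textbook backward difference contrast matrix, in which k levels are represented by k - 1 contrast columns (plus a separate intercept). The helper `backward_difference_matrix` is a hypothetical name written for this illustration in plain Python; it is not the code of `category_encoders`, and the level order is only an assumption.

```python
from fractions import Fraction


def backward_difference_matrix(k):
    """Build the k x (k - 1) backward difference contrast matrix.

    Row i describes level i; column j compares the mean of level j + 1
    with the mean of level j (textbook definition, assumed here).
    """
    rows = []
    for i in range(1, k + 1):        # level index, 1-based
        row = []
        for j in range(1, k):        # contrast index, 1-based
            if i <= j:
                row.append(Fraction(-(k - j), k))
            else:
                row.append(Fraction(j, k))
        rows.append(row)
    return rows


# The three Outcome values named in the post; the ordering is an assumption.
levels = ["RESOLVED", "FATAL", "RECOVERED"]
matrix = backward_difference_matrix(len(levels))

for level, row in zip(levels, matrix):
    print(level, [str(value) for value in row])
```

Each level is printed with two numbers, not three. By the same rule a column with nine categories would yield eight contrast columns, which matches `Source_Of_Infection_0` through `Source_Of_Infection_7` in the output above.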