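I am trying to use the backward difference encoding technique to convert all the categorical variables in my dataset for analysis.

I tried it and it works, but I noticed that some columns that should be there are missing. Why are they missing, and how can I fix it?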

covid_df.info()

The output is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100463 entries, 0 to 100462
Data columns (total 18 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   _id                     100463 non-null  int64 
 1   Assigned_ID             100463 non-null  int64 
 2   Outbreak Associated     100463 non-null  object
 3   Age Group               100463 non-null  object
 4   Neighbourhood Name      100463 non-null  object
 5   FSA                     100463 non-null  object
 6   Source of Infection     100463 non-null  object
 7   Classification          100463 non-null  object
 8   Episode Date            100463 non-null  object
 9   Reported Date           100463 non-null  object
 10  Client Gender           100463 non-null  object
 11  Outcome                 100463 non-null  object
 12  Currently Hospitalized  100463 non-null  object
 13  Currently in ICU        100463 non-null  object
 14  Currently Intubated     100463 non-null  object
 15  Ever Hospitalized       100463 non-null  object
 16  Ever in ICU             100463 non-null  object
 17  Ever Intubated          100463 non-null  object
dtypes: int64(2), object(16)
memory usage: 13.8+ MB

Then:

import category_encoders as ce

encoder = ce.BackwardDifferenceEncoder(cols=['Source_Of_Infection'])
df_bd = encoder.fit_transform(covid_df)

df_bd.info()

The output is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100463 entries, 0 to 100462
Data columns (total 59 columns):
 #   Column                          Non-Null Count   Dtype         
---  ------                          --------------   -----         
 0   intercept                       100463 non-null  int64         
 1   Assigned_Id                     100463 non-null  int64         
 2   Outbreak_Associated             100463 non-null  object        
 3   Age_Group                       100463 non-null  object        
 4   Neighbourhood_Name              100463 non-null  object        
 5   FSA                             100463 non-null  object        
 6   Source_Of_Infection_0           100463 non-null  float64       
 7   Source_Of_Infection_1           100463 non-null  float64       
 8   Source_Of_Infection_2           100463 non-null  float64       
 9   Source_Of_Infection_3           100463 non-null  float64       
 10  Source_Of_Infection_4           100463 non-null  float64       
 11  Source_Of_Infection_5           100463 non-null  float64       
 12  Source_Of_Infection_6           100463 non-null  float64       
 13  Source_Of_Infection_7           100463 non-null  float64       
 14  Classification                  100463 non-null  object        
 15  Episode_Date                    100463 non-null  datetime64[ns]
 16  Reported_Date                   100463 non-null  datetime64[ns]
 17  Gender                          100463 non-null  object        
 18  Outcome                         100463 non-null  object        
 19  Currently_Hospitalized          100463 non-null  object        
 20  Currently_ICU                   100463 non-null  object        
 21  Currently_Intubated             100463 non-null  object        
 22  Ever_Hospitalized               100463 non-null  object        
 23  Ever_ICU                        100463 non-null  object        
 24  Ever_Intubated                  100463 non-null  object        
 25  Outbreak_Outbreak_Associated    100463 non-null  uint8         
 26  Outbreak_Sporadic               100463 non-null  uint8         
 27  Source_Close_Contact            100463 non-null  uint8         
 28  Source_Community                100463 non-null  uint8         
 29  Source_Household_Contact        100463 non-null  uint8         
 30  Source_No_Information           100463 non-null  uint8         
 31  Source_Congregate_Settings      100463 non-null  uint8         
 32  Source_Healthcare_Institutions  100463 non-null  uint8         
 33  Source_Other_Settings           100463 non-null  uint8         
 34  Source_Pending                  100463 non-null  uint8         
 35  Source_Travel                   100463 non-null  uint8         
 36  Classification_Confirmed        100463 non-null  uint8         
 37  Classification_Probable         100463 non-null  uint8         
 38  Gender_Female                   100463 non-null  uint8         
 39  Gender_Male                     100463 non-null  uint8         
 40  NON-BINARY                      100463 non-null  uint8         
 41  Gender_Other                    100463 non-null  uint8         
 42  Gender_Transgender              100463 non-null  uint8         
 43  Gender_Unknown                  100463 non-null  uint8         
 44  Outcome_Active                  100463 non-null  uint8         
 45  Outcome_Fatal                   100463 non-null  uint8         
 46  Outcome_Resolved                100463 non-null  uint8         
 47  Currently_Hospitalized_No       100463 non-null  uint8         
 48  Currently_Hospitalized_Yes      100463 non-null  uint8         
 49  Currently_ICU_No                100463 non-null  uint8         
 50  Currently_ICU_Yes               100463 non-null  uint8         
 51  Currently_Intubated_No          100463 non-null  uint8         
 52  Currently_Intubated_Yes         100463 non-null  uint8         
 53  Ever_Hospitalized_No            100463 non-null  uint8         
 54  Ever_Hospitalized_Yes           100463 non-null  uint8         
 55  Ever_ICU_No                     100463 non-null  uint8         
 56  Ever_ICU_Yes                    100463 non-null  uint8         
 57  Ever_Intubated_No               100463 non-null  uint8         
 58  Ever_Intubated_Yes              100463 non-null  uint8         
dtypes: datetime64[ns](2), float64(8), int64(2), object(13), uint8(34)
memory usage: 22.4+ MB

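The Outcome column originally had 3 values: RESOLVED, FATAL, RECOVERED. After I applied the backward difference technique and looked at the result, the Outcome column became 2 columns instead of 3. Why does this happen, and how can I fix it?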
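
For context, here is a minimal sketch of the textbook backward difference contrast matrix, written in plain Python rather than with category_encoders. The helper name `backward_difference_matrix` is hypothetical, and the level order is assumed to follow the order in which the values are listed above. It suggests that getting k-1 contrast columns from a k-level variable is expected for this coding scheme rather than a sign of lost data.

```python
from fractions import Fraction


def backward_difference_matrix(k):
    """Return the k x (k-1) backward difference contrast matrix as a list of rows."""
    rows = []
    for i in range(1, k + 1):      # level index, 1-based
        row = []
        for j in range(1, k):      # contrast column index, 1-based
            # Levels up to and including j get -(k - j)/k; later levels get j/k.
            row.append(Fraction(-(k - j), k) if i <= j else Fraction(j, k))
        rows.append(row)
    return rows


# The three values named in the question (order assumed for illustration).
levels = ["RESOLVED", "FATAL", "RECOVERED"]
matrix = backward_difference_matrix(len(levels))

for level, row in zip(levels, matrix):
    print(level, [str(x) for x in row])
```

Each level is represented by one row of two numbers, so three levels produce two contrast columns. In the output above, the extra `intercept` column appears alongside the `Source_Of_Infection_0` to `Source_Of_Infection_7` contrast columns.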
