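I like Featuretools, but I have a hard time working it into my data science workflow because I am worried about data leakage.

I think the way to prevent this is to run deep feature synthesis on the training set, join the appropriate values onto the test set, and compute features only for the categorical groups that do not exist in the training set.

Is there a more appropriate way to handle leakage?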

1 Answer

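Featuretools is particularly focused on helping users avoid data leakage and label leakage. There are two ways to handle leakage, depending on whether your data has timestamps.

Data without timestamps

If you have no timestamps, you can create an EntitySet using only the training data and then run ft.dfs. This builds a feature matrix from the training data alone and also returns a list of feature definitions. Next, you can create a second EntitySet from the test data and recompute the same features by passing that list of feature definitions to ft.calculate_feature_matrix. Here is what the flow looks like: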

In [1]: import featuretools as ft

In [2]: es_train = ft.demo.load_mock_customer(return_entityset=True)

In [3]: feature_matrix, feature_defs = ft.dfs(entityset=es_train,
   ...:                                       target_entity="customers")
   ...: 

In [4]: feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)

In [5]: feature_matrix_enc
Out[5]: 
             zip_code = 02139  zip_code = 60091  zip_code = unknown  COUNT(transactions)  COUNT(sessions)  SUM(transactions.amount)  MODE(sessions.device) = desktop  MODE(sessions.device) = tablet  MODE(sessions.device) = mobile  MODE(sessions.device) = unknown                   ...                     SUM(sessions.MIN(transactions.amount))  MAX(sessions.SKEW(transactions.amount))  MAX(sessions.MIN(transactions.amount))  SUM(sessions.MEAN(transactions.amount))  STD(sessions.SUM(transactions.amount))  STD(sessions.MEAN(transactions.amount))  SKEW(sessions.MEAN(transactions.amount))  STD(sessions.MAX(transactions.amount))  NUM_UNIQUE(sessions.DAY(session_start))  MIN(sessions.SKEW(transactions.amount))
customer_id                                                                                                                                                                                                                                                                             ...                                                                                                                                                                                                                                                                                                                                                                                                                                          
1                           0                 1                   0                  131               10                  10236.77                                1                               0                               0                                0                   ...                                                     169.77                                 0.610052                                   41.95                               791.976505                              175.939423                                 9.299023                                 -0.377150                                5.857976                                        1                                -0.395358
2                           1                 0                   0                  122                8                   9118.81                                0                               0                               1                                0                   ...                                                     114.85                                 0.492531                                   42.96                               596.243506                              230.333502                                10.925037                                  0.962350                                7.420480                                        1                                -0.470007
3                           1                 0                   0                   78                5                   5758.24                                1                               0                               0                                0                   ...                                                      64.98                                 0.645728                                   21.77                               369.770121                              471.048551                                 9.819148                                 -0.244976                               12.537259                                        1                                -0.630425
4                           0                 1                   0                  111                8                   8205.28                                1                               0                               0                                0                   ...                                                      83.53                                 0.516262                                   17.27                               584.673126                              322.883448                                13.065436                                 -0.548969                               12.738488                                        1                                -0.497169
5                           1                 0                   0                   58                4                   4571.37                                0                               1                               0                                0                   ...                                                      73.09                                 0.830112                                   27.46                               313.448942                              198.522508                                 8.950528                                  0.098885                                5.599228                                        1                                -0.396571

[5 rows x 102 columns]

In [6]: es_test = ft.demo.load_mock_customer(return_entityset=True, random_seed=33)

In [7]: feature_matrix = ft.calculate_feature_matrix(features=features_enc, 
   ...:                                              entityset=es_test)

In [8]: feature_matrix
Out[8]: 
             zip_code = 02139  zip_code = 60091  zip_code = unknown  COUNT(transactions)  COUNT(sessions)  SUM(transactions.amount)  MODE(sessions.device) = desktop  MODE(sessions.device) = tablet  MODE(sessions.device) = mobile  MODE(sessions.device) = unknown                   ...                     SUM(sessions.MIN(transactions.amount))  MAX(sessions.SKEW(transactions.amount))  MAX(sessions.MIN(transactions.amount))  SUM(sessions.MEAN(transactions.amount))  STD(sessions.SUM(transactions.amount))  STD(sessions.MEAN(transactions.amount))  SKEW(sessions.MEAN(transactions.amount))  STD(sessions.MAX(transactions.amount))  NUM_UNIQUE(sessions.DAY(session_start))  MIN(sessions.SKEW(transactions.amount))
customer_id                                                                                                                                                                                                                                                                             ...                                                                                                                                                                                                                                                                                                                                                                                                                                          
1                       False              True               False                  108                7                   8298.18                            False                           False                            True                            False                   ...                                                     145.67                                 0.888409                                   40.48                               541.452307                              264.820242                                11.560551                                 -0.989418                               11.336633                                        1                                -0.193705
2                        True             False               False                   73                5                   5615.36                             True                           False                           False                            False                   ...                                                     106.27                                 0.471924                                   34.93                               380.553253                              420.418805                                 3.513896                                  1.030220                                7.908124                                        1                                -0.191482
3                       False              True               False                   96                7                   8135.65                            False                            True                           False                            False                   ...                                                     160.04                                 0.114599                                   48.71                               581.583008                              377.210618                                12.120119                                  0.130497                               12.869592                                        1                                -0.655836
4                       False              True               False                  140                9                  11240.85                             True                           False                           False                            False                   ...                                                     159.64                                 0.129480                                   29.87                               731.382339                              211.918894                                11.642241                                 -0.271928                                7.969242                                        1                                -0.652966
5                       False              True               False                   83                7                   6781.33                            False                           False                            True                            False                   ...                                                     149.95                                 0.587567                                   60.29                               527.818923                              535.839994                                19.134789                                 -1.195453                               26.460616                                        1                                -0.435026

[5 rows x 102 columns]
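
To make the idea behind this flow concrete, here is a minimal plain-Python sketch of "learn the feature definitions on the training data, then apply them unchanged to the test data". It is not how Featuretools implements ft.encode_features; the helper names fit_categories and encode, the top_n parameter, and the sample zip codes are all assumptions made for illustration. The point it shows is that the one-hot columns are fixed by the training set, so a category that appears only in the test set falls into the "unknown" column instead of creating a new feature.

```python
from collections import Counter


def fit_categories(train_values, top_n=10):
    """Learn the category list from TRAINING values only (most frequent first).

    Hypothetical helper: stands in for the feature definitions returned on the
    training EntitySet.
    """
    counts = Counter(train_values)
    return [value for value, _ in counts.most_common(top_n)]


def encode(values, categories, name="zip_code"):
    """One-hot encode values against a fixed category list.

    Values not seen during training are counted in the '<name> = unknown' column.
    """
    rows = []
    for value in values:
        row = {f"{name} = {c}": int(value == c) for c in categories}
        row[f"{name} = unknown"] = int(value not in categories)
        rows.append(row)
    return rows


# Made-up sample data: "10001" appears only in the test set.
train_zip = ["60091", "02139", "02139", "60091", "02139"]
test_zip = ["60091", "02139", "10001"]

categories = fit_categories(train_zip)       # learned from training data only
encoded_test = encode(test_zip, categories)  # same columns reused on test data

for row in encoded_test:
    print(row)
```

Because the column set is decided before the test data is ever seen, nothing about the test set can influence which features exist.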

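Data with timestamps

If your data has timestamps, the best way to prevent leakage is to provide a list of "cutoff times", which specify, for each row of the resulting feature matrix, the last point in time at which data may be used. To use cutoff times, you need to set a time index for each time-sensitive entity in your entity set.

Tip: even if your data does not have timestamps, you can add a column of dummy timestamps that Featuretools can use as a time index.

When you call ft.dfs, you can provide a dataframe of cutoff times like this: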

In [1]: import pandas as pd

In [2]: cutoff_times = pd.DataFrame({"customer_id": [1, 2, 3, 4, 5],
   ...:                             "time": pd.date_range('2014-01-01 01:41:50', periods=5, freq='25min')})
   ...: 

In [3]: cutoff_times
Out[3]: 
   customer_id                time
0            1 2014-01-01 01:41:50
1            2 2014-01-01 02:06:50
2            3 2014-01-01 02:31:50
3            4 2014-01-01 02:56:50
4            5 2014-01-01 03:21:50

In [8]: feature_matrix, features = ft.dfs(entityset=es,
   ...:                                  target_entity="customers",
   ...:                                  cutoff_time=cutoff_times,
   ...:                                  cutoff_time_in_index=True)
   ...: 

In [9]: feature_matrix
Out[9]: 
                                zip_code  COUNT(transactions)  COUNT(sessions)  SUM(transactions.amount) MODE(sessions.device)  MIN(transactions.amount)  MAX(transactions.amount)  YEAR(join_date)  SKEW(transactions.amount)  DAY(join_date)                   ...                     SUM(sessions.MIN(transactions.amount))  MAX(sessions.SKEW(transactions.amount))  MAX(sessions.MIN(transactions.amount))  SUM(sessions.MEAN(transactions.amount))  STD(sessions.SUM(transactions.amount))  STD(sessions.MEAN(transactions.amount))  SKEW(sessions.MEAN(transactions.amount))  STD(sessions.MAX(transactions.amount))  NUM_UNIQUE(sessions.DAY(session_start))  MIN(sessions.SKEW(transactions.amount))
customer_id time                                                                                                                                                                                                                                                 ...                                                                                                                                                                                                                                                                                                                                                                                                                                          
1           2014-01-01 01:41:50    60091                   43                3                   3342.76               desktop                      5.60                    148.14             2008                  -0.024647               1                   ...                                                      22.99                                 0.219871                                    8.72                               238.078662                              155.824474                                 7.762885                              2.850032e-01                                5.224602                                      1.0                                -0.395358
2           2014-01-01 02:06:50    02139                   36                3                   2558.77               desktop                      6.29                    139.23             2008                   0.212373              20                   ...                                                      39.00                                 0.509707                                   25.28                               213.211299                              114.675523                                 4.898920                             -2.392117e-02                                7.035723                                      1.0                                 0.102851
3           2014-01-01 02:31:50    02139                   25                1                   2054.32                mobile                      8.70                    147.73             2008                  -0.215072              10                   ...                                                       8.70                                -0.215072                                    8.70                                82.172800                                0.000000                                 0.000000                              0.000000e+00                                0.000000                                      1.0                                -0.215072
4           2014-01-01 02:56:50    60091                    0                0                       NaN                   NaN                       NaN                       NaN             2008                        NaN              30                   ...                                                        NaN                                      NaN                                     NaN                                      NaN                                     NaN                                      NaN                                       NaN                                     NaN                                      NaN                                      NaN
5           2014-01-01 03:21:50    02139                   29                2                   2296.42                mobile                     20.91                    141.66             2008                   0.167792              19                   ...                                                      48.37                                 0.830112                                   27.46                               157.570000                              208.390000                                11.655000                              1.795202e-15                                4.470000                                      1.0                                -0.396571

[5 rows x 69 columns]

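As you can see, each row of the resulting feature matrix is calculated at its specified cutoff time! The concepts of cutoff times and time indexes are a unique and powerful aspect of Featuretools. For more information, read Handling Time in the documentation.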
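
As a rough illustration of what a cutoff time means, the following plain-Python sketch computes two aggregates per customer using only transactions at or before that customer's cutoff. It is a simplified stand-in, not Featuretools' implementation: the function features_at_cutoff, the sample transactions, and the choice to include rows exactly at the cutoff are assumptions made for this example.

```python
from datetime import datetime

# Made-up transactions; each row carries the time at which it became known.
transactions = [
    {"customer_id": 1, "time": datetime(2014, 1, 1, 0, 30), "amount": 10.0},
    {"customer_id": 1, "time": datetime(2014, 1, 1, 1, 0), "amount": 20.0},
    {"customer_id": 1, "time": datetime(2014, 1, 1, 5, 0), "amount": 99.0},  # after cutoff
    {"customer_id": 2, "time": datetime(2014, 1, 1, 3, 0), "amount": 50.0},  # after cutoff
]

# One cutoff time per customer, analogous to the cutoff_times dataframe.
cutoff_times = {
    1: datetime(2014, 1, 1, 1, 41, 50),
    2: datetime(2014, 1, 1, 2, 6, 50),
}


def features_at_cutoff(customer_id, cutoff):
    """Aggregate only the rows visible at the cutoff (assumed inclusive here)."""
    visible = [
        t["amount"]
        for t in transactions
        if t["customer_id"] == customer_id and t["time"] <= cutoff
    ]
    return {
        "COUNT(transactions)": len(visible),
        # No visible rows means the sum is undefined, shown as None here.
        "SUM(transactions.amount)": sum(visible) if visible else None,
    }


features = {cid: features_at_cutoff(cid, cutoff) for cid, cutoff in cutoff_times.items()}
for cid, row in features.items():
    print(cid, row)
```

Customer 2 has no transactions before its cutoff, so its count is 0 and its sum is undefined, which mirrors the row for customer 4 in the feature matrix above.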

Answered 2018-04-09T15:32:45.860