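I am generating historical features for the current row with featuretools. For example, the number of transactions made in the last hour within a session.
The featuretools package has a cutoff_time parameter that excludes all rows that occur after cutoff_time.
I set cutoff_time to the time_index value - 1 second, so I expect the features to be based on historical data minus the current row. This makes it possible to include the response variable from historical rows.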
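To make the intended feature concrete, here is a minimal sketch in plain pandas (not featuretools) of the values I expect for each transaction: the count and sum of earlier transactions in the same session within the preceding hour, with the current row excluded. The toy data, the function name history_features, the column names prev_count and prev_sum, and the exact window boundaries (strictly earlier than the current row, no older than one hour) are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy data standing in for the transactions table.
df = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4, 5],
    "session_id": [1, 1, 1, 2, 2],
    "transaction_time": pd.to_datetime([
        "2014-01-01 00:00:00",
        "2014-01-01 00:10:00",
        "2014-01-01 01:30:00",
        "2014-01-01 00:05:00",
        "2014-01-01 00:20:00",
    ]),
    "amount": [10.0, 20.0, 30.0, 5.0, 7.0],
})


def history_features(frame, window=pd.Timedelta(hours=1)):
    """For each row, aggregate earlier rows of the same session.

    A row counts as history if it is strictly earlier than the current
    row and at most `window` older, so the current row is never included.
    This is a quadratic loop, fine for a sketch but slow on large data.
    """
    rows = []
    for t in frame.itertuples(index=False):
        same_session = frame["session_id"] == t.session_id
        earlier = frame["transaction_time"] < t.transaction_time
        recent = frame["transaction_time"] >= t.transaction_time - window
        hist = frame[same_session & earlier & recent]
        rows.append({
            "transaction_id": t.transaction_id,
            "prev_count": len(hist),
            "prev_sum": hist["amount"].sum(),  # 0.0 when there is no history
        })
    return pd.DataFrame(rows).set_index("transaction_id")


result = history_features(df)
print(result)
```

In this sketch, rows without any history get a count of 0 and a sum of 0.0 rather than NaN, which is the behaviour I would expect from the generated features as well.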
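The problem is that when this parameter is not equal to the time_index variable, I get a bunch of NaNs in both the original and the generated features.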
Example:
#!/usr/bin/env python3
import featuretools as ft
import pandas as pd
from featuretools import primitives, variable_types

data = ft.demo.load_mock_customer()
transactions_df = data['transactions']
transactions_df['cutoff_time'] = transactions_df['transaction_time'] - pd.Timedelta(seconds=1)

es = ft.EntitySet('transactions_set')
es.entity_from_dataframe(
    entity_id='transactions',
    dataframe=transactions_df,
    variable_types={
        'transaction_id': variable_types.Index,
        'session_id': variable_types.Id,
        'transaction_time': variable_types.DatetimeTimeIndex,
        'product_id': variable_types.Id,
        'amount': variable_types.Numeric,
        'cutoff_time': variable_types.Datetime
    },
    index='transaction_id',
    time_index='transaction_time'
)
es.normalize_entity(
    base_entity_id='transactions',
    new_entity_id='sessions',
    index='session_id'
)
es.add_last_time_indexes()

fm, features = ft.dfs(
    entityset=es,
    target_entity='transactions',
    agg_primitives=[primitives.Sum, primitives.Count],
    trans_primitives=[primitives.Day],
    cutoff_time=transactions_df[['transaction_id', 'cutoff_time']].
        rename(index=str, columns={'transaction_id': 'transaction_id', 'cutoff_time': 'time'}),
    training_window='1 hours',
    verbose=True
)
print(fm)
Output (excerpt):
                DAY(cutoff_time)  sessions.SUM(transactions.amount)  \
transaction_id
352                          NaN                                NaN
186                          NaN                                NaN
319                          NaN                                NaN
256                          NaN                                NaN
449                          NaN                                NaN
40                           NaN                                NaN
13                           NaN                                NaN
127                          NaN                                NaN
21                           NaN                                NaN
309                          NaN                                NaN
The column sessions.SUM(transactions.amount) should be >= 0. The original features session_id, product_id, and amount are NaN as well.
If transactions_df['cutoff_time'] = transactions_df['transaction_time'] (no time delta), this code works, but it includes the current row.
What is the correct way to calculate aggregations and transformations that exclude the current row from the calculation?