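I am trying to predict the EUR/USD closing price one day ahead, and I have built a basic model to get a pipeline up and running. However, the results are too good to be true, so I am sure I have data leakage somewhere, but I can't find it.
Here is the code that builds the pipeline and runs the model: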
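from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline

# Preprocessing steps (custom transformers)
estimators = []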
estimators.append(("strings_to_floats", StringToFloat(string_features)))
estimators.append(("series_supervised", SeriesToSupervised(n_in)))
estimators.append(("current_feature_remove",RemoveCurrentFeatures(features=current_features_remove)))
# Model pipeline
estimators.append(("SGD", SGDRegressor(max_iter=50000, tol=1e-3)))
model = Pipeline(estimators)
# Evaluate Pipeline
model.fit(train_X_, train_y_)
predictions = model.predict(test_X_)
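One sanity check that may help judge whether the results really are "too good" is to compare the model's error with a naive persistence forecast, which predicts each day's price as the previous day's price. The sketch below is only an illustration: the helper names `persistence_baseline_mae` and `mae` and the toy prices are assumptions, not part of the pipeline above. If the model's error is far below this baseline, leakage becomes more plausible; if it is close, the model may simply be reproducing yesterday's price.

```python
import numpy as np


def persistence_baseline_mae(prices):
    """MAE of predicting each day's price with the previous day's price."""
    prices = np.asarray(prices, dtype=float)
    # The prediction for day t is the observed price on day t-1.
    return float(np.mean(np.abs(prices[1:] - prices[:-1])))


def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Made-up prices for illustration only (not real EUR/USD data).
toy_prices = [1.10, 1.12, 1.11, 1.15, 1.14]
baseline = persistence_baseline_mae(toy_prices)
print("persistence MAE:", baseline)
```

With the real data, `mae(test_y, predictions)` would be the number to set against the baseline computed on the same test period.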
The code for SeriesToSupervised is here:
def series_to_supervised(self, data, n_in=5, n_out=1, dropnan=True):
    """
    Frame a time series as a supervised learning dataset.
    Arguments:
        data: Sequence of observations as a list or NumPy array.
        n_in: Number of lag observations as input (X).
        n_out: Number of observations as output (y).
        dropnan: Boolean whether or not to drop rows with NaN values.
    Returns:
        Pandas DataFrame of series framed for supervised learning.
    """
    old_names = data.columns
    n_vars = 1 if type(data) is list else data.shape[1]
    df = pd.DataFrame(data, columns=old_names)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('%s(t-%d)' % (old_names[j], i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('%s' % (old_names[j])) for j in range(n_vars)]
        else:
            names += [('%s' % (old_names[j])) for j in range(n_vars)]
    # Remove spaces in names
    for i in range(len(names)):
        names[i-1] = names[i-1].strip()
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg
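To make the framing concrete, here is a minimal sketch of what the function above produces on a tiny made-up frame. The standalone name `frame_supervised`, the fixed `n_out=1`, and the toy values are assumptions for illustration; the logic is meant to mirror the method, not replace it.

```python
import pandas as pd


def frame_supervised(data, n_in=2):
    """Standalone re-statement of the lag framing, with n_out fixed to 1."""
    cols, names = [], []
    for i in range(n_in, 0, -1):
        cols.append(data.shift(i))  # row t receives the values from row t-i
        names += ["%s(t-%d)" % (c, i) for c in data.columns]
    cols.append(data)  # unshifted copy: the current-day columns
    names += [str(c) for c in data.columns]
    names = [name.strip() for name in names]
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    return agg.dropna()  # the first n_in rows contain NaN and are dropped


# Made-up values, oldest row first.
toy = pd.DataFrame({"Open": [1.0, 2.0, 3.0, 4.0],
                    "Price": [10.0, 20.0, 30.0, 40.0]})
framed = frame_supervised(toy, n_in=2)
print(framed)
```

Note that the current-day columns ("Open", "Price") are still present in the framed output, so whether they reach the regressor depends entirely on the later removal step.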
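RemoveCurrentFeatures simply iterates over this list: ["Open", "High", "Low", "Change %", "Price"] and drops those columns.
The dataset starts with the columns in the list above plus "Date". After data preparation, the dataframe has columns of the form "Price(t-n_in)", where n_in is the number of days of lagged data.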
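Since the removal step is only described in words, here is a hedged sketch of what such a transformer might look like, together with a check for feature columns that are identical to the target. `DropCurrentColumns` and `columns_equal_to_target` are hypothetical names, the class is a plain stand-in rather than the actual RemoveCurrentFeatures, and the check assumes all columns are already numeric. For use inside a scikit-learn Pipeline one would normally subclass `BaseEstimator` and `TransformerMixin`.

```python
import numpy as np
import pandas as pd


class DropCurrentColumns:
    """Hypothetical stand-in for RemoveCurrentFeatures: drops the listed columns."""

    def __init__(self, features):
        self.features = features

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X.drop(columns=self.features)


def columns_equal_to_target(X, y):
    """Names of feature columns whose values match the target row for row."""
    target = np.asarray(y, dtype=float)
    return [c for c in X.columns
            if np.allclose(X[c].to_numpy(dtype=float), target)]


# Made-up framed data: one lag column plus two current-day columns.
framed = pd.DataFrame({
    "Price(t-1)": [10.0, 20.0, 30.0],
    "Open": [1.5, 2.5, 3.5],
    "Price": [20.0, 30.0, 40.0],
})
target = [20.0, 30.0, 40.0]

leaky_before = columns_equal_to_target(framed, target)
cleaned = DropCurrentColumns(["Open", "Price"]).fit(framed).transform(framed)
leaky_after = columns_equal_to_target(cleaned, target)
print(leaky_before, leaky_after)
```

Running the same check on the real matrix that reaches the regressor would show whether any surviving column is a copy of the target.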
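Any help would be greatly appreciated. I have been stuck on this for a while, and I am sure something is wrong here.
Edit: this is how I do the train/test split: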
# Invert dataframe
data = data.iloc[::-1]
# Split each set into train and test sets
names = data.columns.values
dataFrame_train = pd.DataFrame(data[:int(data.shape[0]*train_test_split)], columns=names)
train_X = dataFrame_train#.iloc[:, 0:-1]
train_y = dataFrame_train["Price"]
train_y = train_y.tail(train_y.shape[0] - n_in)
dataFrame_test = pd.DataFrame(data[int(data.shape[0]*train_test_split):], columns=names)
test_X = dataFrame_test#.iloc[:, 0:-1]
test_y = dataFrame_test["Price"]
test_y = test_y.tail(test_y.shape[0] - n_in)
dataFrame_test = dataFrame_test.tail(dataFrame_test.shape[0] - n_in)
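Two things in this split may be worth verifying. One is that the rows kept by `tail(... - n_in)` are the same rows that survive the NaN drop in the lag framing. The other is that the frame really is oldest-first after the inversion, because the lags only look backward in that order. The sketch below checks both on a made-up frame; the `Date` values, the `n_in` value, and the assumption that the raw file is newest-first are illustrative.

```python
import pandas as pd

n_in = 2  # number of lagged days (assumed value for the illustration)

# Made-up frame, newest row first, standing in for the raw download.
raw = pd.DataFrame({
    "Date": ["2020-01-04", "2020-01-03", "2020-01-02", "2020-01-01"],
    "Price": [4.0, 3.0, 2.0, 1.0],
})
data = raw.iloc[::-1]  # same inversion as in the split code: oldest row first

# Target trimmed the same way as train_y / test_y.
y = data["Price"].tail(data.shape[0] - n_in)

# Lag columns built with positional shifts, then NaN rows dropped.
lags = pd.concat(
    [data["Price"].shift(i).rename("Price(t-%d)" % i)
     for i in range(n_in, 0, -1)],
    axis=1,
).dropna()

aligned = lags.index.equals(y.index)  # same rows, same order
chronological = pd.to_datetime(data["Date"]).is_monotonic_increasing
print(aligned, chronological)
```

If `chronological` came out False on the real data, `shift(i)` would be pulling in later days rather than earlier ones, which would be one possible source of leakage.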