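I am trying to predict the EUR/USD closing price one day ahead, and I have built a basic model with a pipeline to get started. However, the results are too good to be true. I am sure I have data leakage somewhere, but I can't find it.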

Here is the code that builds the pipeline and runs the model:

from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline

# Preprocessing steps (custom transformers)
estimators = []
estimators.append(("strings_to_floats", StringToFloat(string_features)))
estimators.append(("series_supervised", SeriesToSupervised(n_in)))
estimators.append(("current_feature_remove", RemoveCurrentFeatures(features=current_features_remove)))

# Model pipeline
estimators.append(("SGD", SGDRegressor(max_iter=50000, tol=1e-3)))
model = Pipeline(estimators)

# Evaluate pipeline
model.fit(train_X_, train_y_)
predictions = model.predict(test_X_)

The code for SeriesToSupervised is here:

import pandas as pd

def series_to_supervised(self, data, n_in=5, n_out=1, dropnan=True):
    """
    Frame a time series as a supervised learning dataset.
    Arguments:
        data: Pandas DataFrame of observations, one column per variable.
        n_in: Number of lag observations as input (X).
        n_out: Number of observations as output (y).
        dropnan: Boolean whether or not to drop rows with NaN values.
    Returns:
        Pandas DataFrame of series framed for supervised learning.
    """
    old_names = data.columns
    n_vars = data.shape[1]
    df = pd.DataFrame(data, columns=old_names)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('%s(t-%d)' % (old_names[j], i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n); every step reuses the original column names
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        names += [('%s' % (old_names[j])) for j in range(n_vars)]
    # Remove leading and trailing spaces in names
    names = [name.strip() for name in names]
    # put it all together
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg

RemoveCurrentFeatures simply iterates over this list: ["Open","High","Low","Change %","Price"] and drops those columns.
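
For reference, a transformer matching that description might look roughly like the sketch below. This is not the actual class from my project: the `fit`/`transform` bodies, the `errors="ignore"` choice, and the `demo` frame are assumptions made purely for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class RemoveCurrentFeatures(BaseEstimator, TransformerMixin):
    """Drop the same-day (time t) columns so only lagged columns remain."""

    def __init__(self, features=None):
        self.features = features

    def fit(self, X, y=None):
        # Nothing to learn; present only to satisfy the transformer interface.
        return self

    def transform(self, X):
        # errors="ignore" skips names that are absent (an assumption of this sketch).
        return X.drop(columns=list(self.features or []), errors="ignore")


current_features_remove = ["Open", "High", "Low", "Change %", "Price"]

# Hypothetical two-row frame with one lagged and two same-day columns.
demo = pd.DataFrame(
    {"Price(t-1)": [1.10, 1.11], "Open": [1.09, 1.10], "Price": [1.11, 1.12]}
)
remaining = RemoveCurrentFeatures(current_features_remove).fit_transform(demo)
print(list(remaining.columns))  # ['Price(t-1)']
```

Only exact name matches are dropped, so a lagged column such as "Price(t-1)" survives while "Price" does not.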

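The dataset starts out with the columns in the list above plus "Date". After data preparation, the dataframe has columns of the form "Price(t-n_in)", where n_in is the number of days of lag.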
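
To make the column naming concrete, here is a small standalone sketch of the lag framing on a single "Price" column. The prices and the value of `n_in` are made up for illustration; it only mirrors the shift-and-rename idea, not the full transformer.

```python
import pandas as pd

n_in = 2  # number of lagged days (illustrative value)

# Hypothetical prices in chronological order (oldest first).
prices = pd.DataFrame({"Price": [1.10, 1.11, 1.12, 1.13, 1.14]})

# Build Price(t-2) and Price(t-1) next to the same-day Price column,
# then drop the first n_in rows, which have no complete lag history.
lagged = pd.concat(
    [prices["Price"].shift(i).rename(f"Price(t-{i})") for i in range(n_in, 0, -1)]
    + [prices["Price"]],
    axis=1,
).dropna()

print(list(lagged.columns))  # ['Price(t-2)', 'Price(t-1)', 'Price']
print(len(prices) - len(lagged))  # 2
```

The first `n_in` rows disappear because their lag columns contain NaN, which is why the target has to be trimmed by the same number of rows.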

Any help would be greatly appreciated. I have been stuck on this for a while, and I am sure something is wrong here.

Edit: this is how I do the train/test split:

# Invert dataframe
data = data.iloc[::-1]

# Split each set into train and test sets
names = data.columns.values

dataFrame_train = pd.DataFrame(data[:int(data.shape[0]*train_test_split)], columns=names)
train_X = dataFrame_train#.iloc[:, 0:-1]
train_y = dataFrame_train["Price"]
train_y = train_y.tail(train_y.shape[0] - n_in)

dataFrame_test = pd.DataFrame(data[int(data.shape[0]*train_test_split):], columns=names)
test_X = dataFrame_test#.iloc[:, 0:-1]
test_y = dataFrame_test["Price"]
test_y = test_y.tail(test_y.shape[0] - n_in)

dataFrame_test = dataFrame_test.tail(dataFrame_test.shape[0] - n_in)
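
One sanity check that could be run on a split like this is sketched below: confirm that the trimmed target covers exactly the rows left after the lag step, and that no feature column is an exact copy of the target. The data, the values of `n_in` and `train_test_split`, and the helper names `rows_aligned` and `leaks` are all hypothetical; the lag features are rebuilt inline rather than through the real pipeline.

```python
import pandas as pd

n_in = 2                # lag window (illustrative value)
train_test_split = 0.6  # fraction of rows used for training (illustrative value)

# Hypothetical prices, already in chronological order (oldest first).
data = pd.DataFrame({"Price": [1.10, 1.11, 1.12, 1.13, 1.14,
                               1.15, 1.16, 1.17, 1.18, 1.19]})

split_at = int(data.shape[0] * train_test_split)
train = data.iloc[:split_at]

# Target: same-day Price with the first n_in rows trimmed, as in the split above.
train_y = train["Price"].tail(train.shape[0] - n_in)

# Features: lagged prices only, mimicking the lag and column-removal steps.
train_X = pd.concat(
    [train["Price"].shift(i).rename(f"Price(t-{i})") for i in range(n_in, 0, -1)],
    axis=1,
).dropna()

# Check 1: features and target cover exactly the same rows.
rows_aligned = train_X.index.equals(train_y.index)

# Check 2: no feature column is an exact copy of the target.
leaks = [c for c in train_X.columns
         if (train_X[c].to_numpy() == train_y.to_numpy()).all()]

print(rows_aligned, leaks)
```

If the second check ever returned a non-empty list on the real data, that column would be a direct copy of the value being predicted.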