python - How to normalize time series data

Question

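I have a dataset of time series examples. I want to compute the similarity between the various time series examples, but I do not want differences caused by scaling to be taken into account (that is, I want to look at the similarity in the shape of the time series, not their absolute values). For this, I need a way to normalize the data, so that all time series examples fall within a certain range, for example [0,100]. Can anyone tell me how to do this in Python?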

score 11 · Accepted Answer

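The min-max solution given in the other answer works for series that are stationary (neither trending upward nor downward). For financial time series (or any other series with a drift or bias), that formula is not appropriate. The series should first be detrended, or the scaling should be computed from the most recent 100-200 samples.
If the time series does not come from a normal distribution (as is the case in finance), it is advisable to apply a non-linear function (for example the standard normal CDF) to compress the outliers.
The book by Aronson and Masters (Statistically Sound Machine Learning for Algorithmic Trading) uses the following formula (on 200-day chunks):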

V = 100 * N(0.5 * (X - F50) / (F75 - F25)) - 50

Where:
X: the data point
F50: the median (50th percentile) of the most recent 200 points
F75: the 75th percentile
F25: the 25th percentile
N: the normal CDF
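
As a rough sketch of the formula above, the following NumPy function applies it point by point. The function name `rolling_robust_normalize`, the choice to compute the statistics from the `window` points preceding each sample (excluding the sample itself), the use of the median for F50, and leaving the first `window` outputs as NaN are all assumptions made for illustration, not details taken from the book.

```python
import math

import numpy as np


def rolling_robust_normalize(x, window=200):
    """Apply V = 100 * N(0.5 * (X - F50) / (F75 - F25)) - 50 over a trailing window.

    Assumption: the quartiles come from the `window` points before each sample.
    Samples without a full window, or with a zero interquartile range, stay NaN.
    """
    x = np.asarray(x, dtype=float)
    out = np.full(x.shape, np.nan)
    for i in range(window, len(x)):
        past = x[i - window:i]
        f25, f50, f75 = np.percentile(past, [25, 50, 75])
        iqr = f75 - f25
        if iqr == 0:
            continue  # flat window: the formula is undefined, keep NaN
        z = 0.5 * (x[i] - f50) / iqr
        # Standard normal CDF written with the error function
        cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
        out[i] = 100.0 * cdf - 50.0
    return out
```

Because the CDF lies strictly between 0 and 1, every defined output falls inside (-50, 50), which is how the formula compresses outliers.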

score 10 · Accepted Answer

Assuming your time series is an array, try something like this:

(timeseries-timeseries.min())/(timeseries.max()-timeseries.min())

This rescales your values to lie between 0 and 1. Multiply the result by 100 to get the range [0, 100].
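
A minimal sketch of the same min-max idea, generalized to an arbitrary target range such as the [0, 100] mentioned in the question. The function name `minmax_scale_to`, its `low`/`high` parameters, and the choice to map a constant series to the midpoint of the range are assumptions for illustration.

```python
import numpy as np


def minmax_scale_to(ts, low=0.0, high=100.0):
    """Linearly rescale ts so its minimum maps to low and its maximum to high."""
    ts = np.asarray(ts, dtype=float)
    span = ts.max() - ts.min()
    if span == 0:
        # Assumption: a constant series has no shape, so map it to the midpoint
        return np.full_like(ts, (low + high) / 2.0)
    return low + (ts - ts.min()) / span * (high - low)


scaled = minmax_scale_to([2, 4, 6])  # array([  0.,  50., 100.])
```

The explicit check for a zero span avoids the division by zero that the one-line expression would hit on a flat series.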

score 7 · Accepted Answer

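Following up on my previous comment, here is an (unoptimized) Python function that performs scaling and/or normalization. It expects a pandas DataFrame as input and does not validate the type, so it raises an error if given another kind of object. If you need to use a list or numpy.array, you will have to modify it, or you can convert those objects with pandas.DataFrame() first.

This function is slow, so it is advisable to run it only once and store the result.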

    from scipy.stats import norm
    import numpy as np
    import pandas as pd

    def get_NormArray(df, n, mode='total', linear=False):
        '''
        Normalize each value using the statistics of the n preceding values
        (modes: 'total' or 'scale'), using the formulas from the book
        "Statistically sound machine learning..." (Aronson and Masters).
        The decision to apply non-linear scaling is left to the user.
        The output is rescaled relative to the book: the default mode
        returns values in (-0.5, 0.5) instead of (-50, 50).
        df is an input DataFrame (only its first column is used).
        A DataFrame is returned, but it could return a list.
        n is the number of data points used for the median and the quartiles.
        modes: 'scale': scale without centering. 'total': center and scale.
        '''
        if len(df) <= n:
            # Without at least one full window F50, F75 and F25 are never computed
            raise ValueError('df must contain more than n rows')
        temp = []

        for i in range(len(df))[::-1]:

            if i >= n:  # rolling statistics until we reach the initial n values;
                        # those are normalized with the last computed F50, F75 and F25
                window = df.iloc[i-n:i]
                F50 = window.quantile(0.5)
                F75 = window.quantile(0.75)
                F25 = window.quantile(0.25)

            if linear and mode == 'total':
                v = 0.5 * ((df.iloc[i] - F50) / (F75 - F25)) - 0.5
            elif linear and mode == 'scale':
                v = 0.25 * df.iloc[i] / (F75 - F25) - 0.5
            elif not linear and mode == 'scale':
                v = 0.5 * norm.cdf(0.25 * df.iloc[i] / (F75 - F25)) - 0.5

            else:  # any other arguments fall back to full normalization with compression
                v = norm.cdf(0.5 * (df.iloc[i] - F50) / (F75 - F25)) - 0.5

            # v is a Series or a NumPy array depending on the branch; take the first column
            temp.append(np.asarray(v)[0])
        return pd.DataFrame(temp[::-1])

score 0 · Accepted Answer

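I am not going to give Python code, but the definition of normalizing is that for every value (data point) you calculate "(value - mean) / stdev". Your values will not fall between 0 and 1 (or 0 and 100), but I don't think that is what you want. You want to compare the variations, which is what you are left with if you do this.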
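
For illustration, the "(value - mean) / stdev" definition could be sketched in NumPy as follows. The function name `zscore` and the decision to return zeros for a constant series are assumptions, not part of the answer above.

```python
import numpy as np


def zscore(ts):
    """Return (value - mean) / stdev for every point, using the population stdev."""
    ts = np.asarray(ts, dtype=float)
    std = ts.std()
    if std == 0:
        # Assumption: a constant series has no variation, so return all zeros
        return np.zeros_like(ts)
    return (ts - ts.mean()) / std


a = np.array([1.0, 2.0, 3.0, 4.0])
b = 10.0 * a + 5.0  # same shape as a, different scale and offset

# Both series collapse to the same normalized values
same_shape = np.allclose(zscore(a), zscore(b))
```

Two series that differ only by a positive scale factor and an offset produce identical z-scores, which is the scale-independent shape comparison the question asks for.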

score 0 · Accepted Answer

from sklearn import preprocessing
normalized_data = preprocessing.minmax_scale(data)

You can take a look at normalize-standardize-time-series-data-python and sklearn.preprocessing.minmax_scale.
