5

Pandas 新手,正在寻找最有效的方法。

我有一系列数据框。每个 DataFrame 具有相同的列但不同的索引,并且它们按日期索引。该系列由股票代码索引。因此,序列中的每一项都代表每只股票表现的单个时间序列。

我需要随机生成一个包含 n 个数据框的列表,其中每个数据框都是可用股票历史的一些随机分类的子集。如果有重叠也没关系,只要开始结束日期不同。

下面的代码可以做到,但它真的很慢,我想知道是否有更好的方法来解决它:

代码

def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
    if type(data) != pd.Series:
        return None

    if subset=='validate':
        offset = 0
    elif subset=='test':
        offset = 200
    elif subset=='train':
        offset = 400

    tickers = np.random.randint(0, len(data), size=len(data))

    ret_data = []
    while len(ret_data) != batch_size:
        for t in tickers:
            data_t = data[t]
            max_len = len(data_t)-timesteps-1
            if len(ret_data)==batch_size: break
            if max_len-offset < 0: continue

            index = np.random.randint(offset, max_len)
            d = data_t[index:index+timesteps]
            if len(d)==timesteps: ret_data.append(d)

    return ret_data

配置文件输出:

Timer unit: 1e-06 s

File: finance.py
Function: random_sample at line 137
Total time: 0.016142 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   137                                           @profile
   138                                           def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
   139         1            5      5.0      0.0      if type(data) != pd.Series:
   140                                                   return None
   141
   142         1            1      1.0      0.0      if subset=='validate':
   143                                                   offset = 0
   144         1            1      1.0      0.0      elif subset=='test':
   145                                                   offset = 200
   146         1            0      0.0      0.0      elif subset=='train':
   147         1            1      1.0      0.0          offset = 400
   148
   149         1         1835   1835.0     11.4      tickers = np.random.randint(0, len(data), size=len(data))
   150
   151         1            2      2.0      0.0      ret_data = []
   152         2            3      1.5      0.0      while len(ret_data) != batch_size:
   153       116          148      1.3      0.9          for t in tickers:
   154       116         2497     21.5     15.5              data_t = data[t]
   155       116          317      2.7      2.0              max_len = len(data_t)-timesteps-1
   156       116           80      0.7      0.5              if len(ret_data)==batch_size: break
   157       115           69      0.6      0.4              if max_len-offset < 0: continue
   158
   159       100          101      1.0      0.6              index = np.random.randint(offset, max_len)
   160       100        10840    108.4     67.2              d = data_t[index:index+timesteps]
   161       100          241      2.4      1.5              if len(d)==timesteps: ret_data.append(d)
   162
   163         1            1      1.0      0.0      return ret_data
4

1 回答 1

1

您确定需要找到更快的方法吗?您当前的方法并没有那么慢。以下更改可能会简化,但不一定会更快:

第 1 步:从数据框列表中抽取一个随机样本(带替换)

rand_stocks = np.random.randint(0, len(data), size=batch_size) 

您可以将此数组rand_stocks视为要应用于您的数据帧系列的索引列表。大小已经是批量大小,因此无需使用 while 循环和第 156 行的比较。

也就是说,您可以像这样迭代rand_stocks并访问股票:

for idx in rand_stocks: 
  stock = data.ix[idx] 
  # Get a sample from this stock. 

第 2 步:为您随机选择的每只股票获取一个随机数据范围。

start_idx = np.random.randint(offset, len(stock)-timesteps)
d = data_t[start_idx:start_idx+timesteps]

我没有你的数据,但我是这样整理的:

def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
    if subset=='train': offset = 0  #you can obviously change this back
    rand_stocks = np.random.randint(0, len(data), size=batch_size)
    ret_data = []
    for idx in rand_stocks:
        stock = data[idx]
        start_idx = np.random.randint(offset, len(stock)-timesteps)
        d = stock[start_idx:start_idx+timesteps]
        ret_data.append(d)
    return ret_data

创建数据集:

In [22]: import numpy as np
In [23]: import pandas as pd

In [24]: rndrange = pd.DateRange('1/1/2012', periods=72, freq='H')
In [25]: rndseries = pd.Series(np.random.randn(len(rndrange)), index=rndrange)
In [26]: rndseries.head()
Out[26]:
2012-01-02    2.025795
2012-01-03    1.731667
2012-01-04    0.092725
2012-01-05   -0.489804
2012-01-06   -0.090041

In [27]: data = [rndseries,rndseries,rndseries,rndseries,rndseries,rndseries]

测试功能:

In [42]: random_sample(data, timesteps=2, batch_size = 2)
Out[42]:
[2012-01-23    1.464576
2012-01-24   -1.052048,
 2012-01-23    1.464576
2012-01-24   -1.052048]
于 2012-11-05T21:10:18.563 回答