I have 1-minute equity data that looks like this:

                      bidopen    bidhigh    bidlow  bidclose bidvolume
currencypair
2007-03-30 16:01:00    1.9687    1.96900    1.9686    1.9686    877.40
2007-03-30 16:02:00    1.9686    1.96905    1.9686    1.9686    897.20
2007-03-30 16:03:00    1.9686    1.96900    1.9686    1.9690    1076.11
2007-03-30 16:04:00    1.9689    1.96910    1.9688    1.9690    849.70
2007-03-30 16:05:00    1.9690    1.96900    1.9688    1.9689    1402.80

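I want to add an extra column. This column will:

  • take 15 records starting from the current time (inclusive)
  • get the maximum bidhigh and the minimum bidlow across those 15 records
  • compute the difference between that high and that low, and store the value in the new column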

Here is what I tried. First, I read in the data.

usecols = ['datetime','bidopen','bidhigh','bidlow','bidclose','bidvolume']
df=pd.read_csv(path,parse_dates=('datetime'),index_col=0, usecols = usecols )

Then I defined a function:

def lookaheadmaxmin(df):
    df2=df[:15]
    high=df2['bidhigh'].max(axis=1)
    low=df2['bidlow'].min(axis=1)
    return high-low

Then:

df['newcolumn'] = map( lookaheadmaxmin, df[:15])

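This raises an error. I'm fairly sure the 'df[:15]' in the map call is the problem, because I don't know how to pass a slice of the current and future records to the function.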

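Basically, what I'm trying to do is determine how much the price moved within a moving 15-minute window, like this:

16:00 - 16:15 - how much did the price move? Put this in the column for the 16:00 record

16:01 - 16:16 - how much did the price move? Put this in the column for the 16:01 record

16:02 - 16:17 - how much did the price move? Put this in the column for the 16:02 record

16:03 - 16:18 - how much did the price move? Put this in the column for the 16:03 record

16:04 - 16:19 - how much did the price move? Put this in the column for the 16:04 record

16:05 - 16:20 - how much did the price move? Put this in the column for the 16:05 record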


Additional information:

I am using Enthought Canopy version 1.1.0 (64-bit) for Mac. Pandas version: 0.12.0-1 (with numpy 1.7.1).

Sample of the source data:

    currencypair,datetime,bidopen,bidhigh,bidlow,bidclose,askopen,askhigh,asklow,askclose,bidvolume,askvolume
    GBPUSD,2007-03-30 16:01:00,1.96870,1.96900,1.96860,1.96860,1.96850,1.96880,1.96845,1.96850,877.40,1386.70
    GBPUSD,2007-03-30 16:02:00,1.96860,1.96905,1.96860,1.96860,1.96850,1.96890,1.96840,1.96840,897.20,1272.30
    GBPUSD,2007-03-30 16:03:00,1.96860,1.96900,1.96860,1.96900,1.96850,1.96890,1.96840,1.96880,1076.11,1333.30
    GBPUSD,2007-03-30 16:04:00,1.96890,1.96910,1.96880,1.96900,1.96880,1.96890,1.96865,1.96880,849.70,765.10
    GBPUSD,2007-03-30 16:05:00,1.96900,1.96900,1.96880,1.96890,1.96875,1.96890,1.96860,1.96870,1402.80,1240.90
    GBPUSD,2007-03-30 16:06:00,1.96890,1.96890,1.96840,1.96860,1.96870,1.96870,1.96820,1.96850,769.50,1727.30
    GBPUSD,2007-03-30 16:07:00,1.96860,1.96880,1.96820,1.96830,1.96850,1.96870,1.96810,1.96820,842.00,1865.60
    GBPUSD,2007-03-30 16:08:00,1.96830,1.96930,1.96830,1.96910,1.96820,1.96920,1.96820,1.96890,1096.60,1197.70
    GBPUSD,2007-03-30 16:09:00,1.96910,1.96920,1.96880,1.96890,1.96895,1.96910,1.96865,1.96880,368.60,432.10

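As a side note, the display of the records is a bit odd (I am using IPython notebook). Even though I left out the 'currencypair' column, it strangely shows up as a column header. (I mention it because I don't know whether it is related to the other things not working.)

Importing the data (using the read_csv call above; note that no 'currencypair' column is named):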

usecols = ['datetime','bidopen','bidhigh','bidlow','bidclose','bidvolume']
df=pd.read_csv(path,parse_dates=('datetime'),index_col=0, usecols = usecols )

Then doing

    df[:5]

shows the following (note that 'currencypair' appears as a column header, while the df.info() output below reports the index only as 'Index'):

                           bidopen    bidhigh    bidlow    bidclose    bidvolume
    currencypair                    
    2007-03-30 16:01:00     1.9687     1.96900     1.9686     1.9686     877.40
    2007-03-30 16:02:00     1.9686     1.96905     1.9686     1.9686     897.20
    2007-03-30 16:03:00     1.9686     1.96900     1.9686     1.9690     1076.11
    2007-03-30 16:04:00     1.9689     1.96910     1.9688     1.9690     849.70
    2007-03-30 16:05:00     1.9690     1.96900     1.9688     1.9689     1402.80

df.info() shows:

    <class 'pandas.core.frame.DataFrame'>
    Index: 2362159 entries, 2007-03-30 16:01:00 to 2013-09-02 18:59:00
    Data columns (total 5 columns):
    bidopen      2362159  non-null values
    bidhigh      2362159  non-null values
    bidlow       2362159  non-null values
    bidclose     2362159  non-null values
    bidvolume    2362159  non-null values
    dtypes: float64(5)

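Importing the data another way

Import and then drop the currency pair column (note that 'currencypair' is added to usecols and the column is dropped afterwards):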

    usecols = ['currencypair','datetime','bidopen','bidhigh','bidlow','bidclose','bidvolume']
    df=pd.read_csv(path,parse_dates=('datetime'),index_col=1, usecols = usecols )
    df=df.drop('currencypair',1)

This displays:

                           bidopen    bidhigh    bidlow    bidclose    bidvolume
    datetime                    
    2007-03-30 16:01:00     1.9687     1.96900     1.9686     1.9686     877.40
    2007-03-30 16:02:00     1.9686     1.96905     1.9686     1.9686     897.20
    2007-03-30 16:03:00     1.9686     1.96900     1.9686     1.9690     1076.11
    2007-03-30 16:04:00     1.9689     1.96910     1.9688     1.9690     849.70
    2007-03-30 16:05:00     1.9690     1.96900     1.9688     1.9689     1402.80

df.info() shows (note that the index is now reported as 'DatetimeIndex'):

    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 2362159 entries, 2007-03-30 16:01:00 to 2013-09-02 18:59:00
    Data columns (total 5 columns):
    bidopen      2362159  non-null values
    bidhigh      2362159  non-null values
    bidlow       2362159  non-null values
    bidclose     2362159  non-null values
    bidvolume    2362159  non-null values
    dtypes: float64(5)

1 Answer

This is quite simple when you only need a couple of columns, e.g. the max of a and the min of b.

In [65]: df = DataFrame(randn(100,4),columns=list('abcd'),
        index=date_range('20130101 16:00',periods=100,freq='T'))

In [66]: df.head(20)
Out[66]: 
                            a         b         c         d
2013-01-01 16:00:00  0.404056  0.115774 -0.202356  0.998315
2013-01-01 16:01:00 -0.231966  0.262609  1.192302 -0.702163
2013-01-01 16:02:00 -0.467005  0.744724 -0.871782 -0.308637
2013-01-01 16:03:00 -0.175704  0.036244  1.404604 -0.106320
2013-01-01 16:04:00  0.046306 -0.098140  0.535573 -0.306300
2013-01-01 16:05:00 -0.115620 -1.069991  0.790965 -0.504283
2013-01-01 16:06:00  1.496555  0.373582  1.028092 -0.816990
2013-01-01 16:07:00  0.432081  0.182106  0.115107  1.239192
2013-01-01 16:08:00 -0.245789 -2.030840  0.118330 -1.922616
2013-01-01 16:09:00 -0.358188 -0.121750  1.768505 -2.096908
2013-01-01 16:10:00 -1.634722 -0.808355 -0.773417  0.095078
2013-01-01 16:11:00 -0.396295  0.168568 -0.901945 -0.073811
2013-01-01 16:12:00 -1.364391  2.052481 -0.175291  0.927363
2013-01-01 16:13:00 -0.523331  0.042475  0.361593 -2.239468
2013-01-01 16:14:00  1.573967 -0.709043  0.551812  0.452311
2013-01-01 16:15:00  0.180578  0.846856 -2.304107 -1.283507
2013-01-01 16:16:00  0.065386  0.356015 -0.174369  1.167562
2013-01-01 16:17:00 -1.747416  1.279114  0.559075  0.200927
2013-01-01 16:18:00 -2.041764 -0.085398  2.032789  0.195671
2013-01-01 16:19:00 -0.639329  0.268832  0.394621 -0.271260

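The rolling functions label each window at its end point (they look back from the current row), so we time-shift the series first (this only changes the index) so that each result lines up with the start of its window rather than the end.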

In [67]: df['max_a'] = pd.rolling_max(df['a'].tshift(-14),15)

In [68]: df['min_b'] = pd.rolling_min(df['b'].tshift(-14),15)

In [69]: df.head(20)
Out[69]: 
                            a         b         c         d     max_a     min_b
2013-01-01 16:00:00  0.404056  0.115774 -0.202356  0.998315  1.573967 -2.030840
2013-01-01 16:01:00 -0.231966  0.262609  1.192302 -0.702163  1.573967 -2.030840
2013-01-01 16:02:00 -0.467005  0.744724 -0.871782 -0.308637  1.573967 -2.030840
2013-01-01 16:03:00 -0.175704  0.036244  1.404604 -0.106320  1.573967 -2.030840
2013-01-01 16:04:00  0.046306 -0.098140  0.535573 -0.306300  1.573967 -2.030840
2013-01-01 16:05:00 -0.115620 -1.069991  0.790965 -0.504283  1.573967 -2.030840
2013-01-01 16:06:00  1.496555  0.373582  1.028092 -0.816990  1.573967 -2.030840
2013-01-01 16:07:00  0.432081  0.182106  0.115107  1.239192  1.573967 -2.030840
2013-01-01 16:08:00 -0.245789 -2.030840  0.118330 -1.922616  1.573967 -2.030840
2013-01-01 16:09:00 -0.358188 -0.121750  1.768505 -2.096908  1.573967 -1.185540
2013-01-01 16:10:00 -1.634722 -0.808355 -0.773417  0.095078  1.573967 -1.185540
2013-01-01 16:11:00 -0.396295  0.168568 -0.901945 -0.073811  1.573967 -1.185540
2013-01-01 16:12:00 -1.364391  2.052481 -0.175291  0.927363  1.573967 -1.185540
2013-01-01 16:13:00 -0.523331  0.042475  0.361593 -2.239468  1.573967 -1.185540
2013-01-01 16:14:00  1.573967 -0.709043  0.551812  0.452311  1.573967 -1.185540
2013-01-01 16:15:00  0.180578  0.846856 -2.304107 -1.283507  1.266667 -1.185540
2013-01-01 16:16:00  0.065386  0.356015 -0.174369  1.167562  1.266667 -1.563288
2013-01-01 16:17:00 -1.747416  1.279114  0.559075  0.200927  1.266667 -1.563288
2013-01-01 16:18:00 -2.041764 -0.085398  2.032789  0.195671  1.266667 -1.810085
2013-01-01 16:19:00 -0.639329  0.268832  0.394621 -0.271260  1.266667 -1.810085

The hi-low difference is then just

df['max_a'] - df['min_b']
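
For readers on a more recent pandas, where pd.rolling_max, pd.rolling_min and tshift are no longer available, the same idea can be sketched with the .rolling() method. This is only an illustration, not the answer's original code: the sample prices, the frame name bars and the column name range15 are made up, and the forward-looking window is obtained by reversing the series, rolling, and reversing back. The window here counts 15 rows, which matches 15 minutes only when the series has no gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 30 one-minute bars with made-up prices.
idx = pd.date_range("2007-03-30 16:01", periods=30, freq="min")
rng = np.random.default_rng(0)
low = 1.9680 + rng.random(30) * 0.001
high = low + rng.random(30) * 0.0005
bars = pd.DataFrame({"bidhigh": high, "bidlow": low}, index=idx)

window = 15  # the current row plus the next 14 rows

# rolling() looks backwards, so reverse the series, roll, and reverse again
# to get an aggregate over the current row and the rows that follow it.
fwd_high = (bars["bidhigh"].iloc[::-1]
            .rolling(window, min_periods=window).max()
            .iloc[::-1])
fwd_low = (bars["bidlow"].iloc[::-1]
           .rolling(window, min_periods=window).min()
           .iloc[::-1])

# Rows without 15 bars ahead of them (the last 14 rows) stay NaN.
bars["range15"] = fwd_high - fwd_low
print(bars.head())
```

With min_periods=window, rows near the end that cannot fill a complete window are left as NaN instead of being computed from a shorter window; lowering min_periods would change that.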

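It seems you have gaps in your series. Use asfreq to put it on a regular 1-minute frequency; the missing minutes are filled with NaN: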

In [16]: df = DataFrame(randn(10,2),columns=list('ab'),index=date_range('20130101 9:00',freq='T',periods=10))

In [17]: df
Out[17]: 
                            a         b
2013-01-01 09:00:00  0.516518 -1.497564
2013-01-01 09:01:00  1.747399  1.100530
2013-01-01 09:02:00 -0.223476 -0.682712
2013-01-01 09:03:00  0.343172 -0.341965
2013-01-01 09:04:00 -1.380057 -1.565732
2013-01-01 09:05:00 -2.156675  1.043532
2013-01-01 09:06:00 -1.237155 -0.219086
2013-01-01 09:07:00  1.626510 -0.596204
2013-01-01 09:08:00 -0.767588  0.496110
2013-01-01 09:09:00 -0.014556  0.012049

In [18]: df.index
Out[18]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 09:00:00, ..., 2013-01-01 09:09:00]
Length: 10, Freq: T, Timezone: None

In [19]: df.append(Series(name=[Timestamp('20130101 09:15')]))
Out[19]: 
                            a         b
2013-01-01 09:00:00  0.516518 -1.497564
2013-01-01 09:01:00  1.747399  1.100530
2013-01-01 09:02:00 -0.223476 -0.682712
2013-01-01 09:03:00  0.343172 -0.341965
2013-01-01 09:04:00 -1.380057 -1.565732
2013-01-01 09:05:00 -2.156675  1.043532
2013-01-01 09:06:00 -1.237155 -0.219086
2013-01-01 09:07:00  1.626510 -0.596204
2013-01-01 09:08:00 -0.767588  0.496110
2013-01-01 09:09:00 -0.014556  0.012049
2013-01-01 09:15:00       NaN       NaN

In [20]: df.append(Series(name=[Timestamp('20130101 09:15')])).index
Out[20]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 09:00:00, ..., 2013-01-01 09:15:00]
Length: 11, Freq: None, Timezone: None

In [21]: df.append(Series(name=[Timestamp('20130101 09:15')])).asfreq('T')
Out[21]: 
                            a         b
2013-01-01 09:00:00  0.516518 -1.497564
2013-01-01 09:01:00  1.747399  1.100530
2013-01-01 09:02:00 -0.223476 -0.682712
2013-01-01 09:03:00  0.343172 -0.341965
2013-01-01 09:04:00 -1.380057 -1.565732
2013-01-01 09:05:00 -2.156675  1.043532
2013-01-01 09:06:00 -1.237155 -0.219086
2013-01-01 09:07:00  1.626510 -0.596204
2013-01-01 09:08:00 -0.767588  0.496110
2013-01-01 09:09:00 -0.014556  0.012049
2013-01-01 09:10:00       NaN       NaN
2013-01-01 09:11:00       NaN       NaN
2013-01-01 09:12:00       NaN       NaN
2013-01-01 09:13:00       NaN       NaN
2013-01-01 09:14:00       NaN       NaN
2013-01-01 09:15:00       NaN       NaN

In [22]: df.append(Series(name=[Timestamp('20130101 09:15')])).asfreq('T').index
Out[22]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 09:00:00, ..., 2013-01-01 09:15:00]
Length: 16, Freq: T, Timezone: None
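
The session above relies on DataFrame.append, which has been removed in pandas 2.0. A rough equivalent for newer versions, assuming pd.concat is used to build the gappy frame and with made-up values throughout, might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical regular data: ten one-minute rows starting at 09:00.
idx = pd.date_range("2013-01-01 09:00", periods=10, freq="min")
df = pd.DataFrame({"a": np.arange(10.0), "b": np.arange(10.0) * 2}, index=idx)

# Simulate a gap: the next row arrives at 09:15, nothing from 09:10 to 09:14.
late = pd.DataFrame({"a": [10.0], "b": [20.0]},
                    index=pd.DatetimeIndex(["2013-01-01 09:15"]))
gappy = pd.concat([df, late])

# asfreq re-indexes onto a regular one-minute grid; missing minutes become NaN.
regular = gappy.asfreq("min")
print(len(gappy), len(regular))
print(regular.tail(7))
```

Once the index is regular, a fixed window of 15 rows again corresponds to 15 minutes, although any window that touches the inserted NaN rows needs a decision about how missing values should be treated.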
Answered 2013-09-08T00:28:40.533