I need to process a sequence of historical tick data with millisecond timestamps. I need to be able to pick out the opening ticks of given timespans (hourly, per-minute, etc.). The sequence may have gaps longer than the span, in which case the first tick after the gap must be picked as the opening one; otherwise the opening tick is the one that most closely follows the calendar start of the corresponding timespan.

The first thing that comes to mind is the following stateful filtering function opensTimespan:Timespan->(Timestamp->bool), which captures the timespan ID of each gap-opening or interval-opening tick in a closure so that it persists between invocations:

let opensTimespan (interval: Timespan) =
    let lastTakenId = ref -1L  // Timestamps are positive, so -1L never matches a real ID
    fun (tickAt: Timestamp) ->
        let tickId = tickAt / interval
        if tickId <> !lastTakenId then
            // First tick of a new timespan (or first tick after a gap): remember its ID
            lastTakenId := tickId
            true
        else
            false

and can be applied like this:

let hourlyTicks = readTicks @"EURUSD-history.zip" "EURUSD-2012-04.csv"
                  |> Seq.filter (opensTimespan HOUR) |> Seq.toList

This works fine, but opensTimespan relies on a side effect, which is definitely not idiomatic.
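For experimenting without the tick files, here is a minimal self-contained sketch of the same filter. The aliases Timestamp and Timespan, the value of HOUR (int64 milliseconds) and the sample timestamps are assumptions; readTicks is replaced by a literal list, and the ref cell is accessed through .Value rather than ! and :=.

```fsharp
// Assumptions: timestamps and timespans are int64 milliseconds.
type Timestamp = int64
type Timespan = int64
let HOUR : Timespan = 3600000L

let opensTimespan (interval: Timespan) =
    let lastTakenId = ref (-1L)  // timestamps are positive
    fun (tickAt: Timestamp) ->
        let tickId = tickAt / interval
        if tickId <> lastTakenId.Value then
            lastTakenId.Value <- tickId
            true
        else
            false

// Hypothetical sample: two ticks in hour 1, one in hour 2, then a gap to hour 5.
let sampleTicks : Timestamp list =
    [ 3600005L; 3600900L; 7200001L; 18000500L; 18000600L ]

let sampleOpenings =
    sampleTicks |> Seq.filter (opensTimespan HOUR) |> Seq.toList
// sampleOpenings = [3600005L; 7200001L; 18000500L]
```

One practical consequence of the hidden state: a predicate returned by opensTimespan HOUR should not be shared between two sequences, because the second one would start from the last ID seen in the first.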

One alternative uses the fact that deciding whether a tick is an opening one requires only a pair of timestamps: its own and that of the previous tick. This leads to the following stateless filtering function opensTimespanF:Timespan->Timestamp*Timestamp->bool:

let opensTimespanF interval (ticksPair: Timestamp * Timestamp) =
    fst ticksPair / interval <> snd ticksPair / interval

that can be applied as:

let hourlyTicks =
    seq {
        yield 0L  // sentinel acting as the "previous tick" of the first real tick
        yield! readTicks @"EURUSD-history.zip" "EURUSD-2012-04.csv"
    }
    |> Seq.pairwise |> Seq.filter (opensTimespanF HOUR)
    |> Seq.map snd
    |> Seq.toList

This approach is purely functional and produces equivalent results with only a slight (~11%) performance penalty.
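The equivalence can be checked on a small hand-made sequence. The two versions agree because lastTakenId in the stateful version always equals the ID of the immediately preceding tick: a tick that is not taken has the same ID as the stored one. The only difference is the sentinel (0L here versus -1L), which would matter only if the first tick fell into interval 0. In the sketch below the type aliases, the value of HOUR and the sample data are assumptions.

```fsharp
// Assumptions: timestamps and timespans are int64 milliseconds.
type Timestamp = int64
type Timespan = int64
let HOUR : Timespan = 3600000L

// Stateful predicate (closure over a ref cell), as in the first version.
let opensTimespan (interval: Timespan) =
    let lastTakenId = ref (-1L)
    fun (tickAt: Timestamp) ->
        let tickId = tickAt / interval
        let opens = tickId <> lastTakenId.Value
        if opens then lastTakenId.Value <- tickId
        opens

// Stateless predicate over (previous, current) pairs.
let opensTimespanF (interval: Timespan) (ticksPair: Timestamp * Timestamp) =
    fst ticksPair / interval <> snd ticksPair / interval

// Hypothetical data: hour IDs 1, 1, 2, then a gap to 5, 5, 5, then 6.
let gappyTicks : Timestamp list =
    [ 3600005L; 3600900L; 7200001L; 18000500L; 18000600L; 18000700L; 21600000L ]

let viaClosure =
    gappyTicks |> Seq.filter (opensTimespan HOUR) |> Seq.toList

let viaPairwise =
    Seq.append [ 0L ] gappyTicks
    |> Seq.pairwise
    |> Seq.filter (opensTimespanF HOUR)
    |> Seq.map snd
    |> Seq.toList

// Both are [3600005L; 7200001L; 18000500L; 21600000L]
```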

What other purely functional approaches to this task might I be missing?

Thank you.

2 Answers

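Just like Tomas's solution (in fact, I used his as my starting point, comments and all), except that using Seq.scan lets you avoid List.rev and yields results on demand (so we can handle an infinite stream of ticks, for example).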

let hourlyTicks = 
  readTicks @"EURUSD-history.zip" "EURUSD-2012-04.csv" 
  |> Seq.scan (fun (lastTakenId,_) tickAt ->
      // Similar to the body of your stateful function - 'lastTakenId' is the last state
      // and 'tickAt' is the current value.
      let tickId = tickAt / HOUR 
      if tickId <> lastTakenId then  
        // We return new state for 'lastTakenId' and yield current 
        // element to the "scan stream"
        (tickId, Some(tickAt))
      else 
        // Here, we skip element, so we return the original tick id and 
        // yield None to the "scan stream"
        (lastTakenId, None) ) (-1L, None) // Initial state: -1 and None

  //yield all the snd elements of the "scan stream" where Option.isSome
  |> Seq.choose snd

(Disclaimer: I haven't tested this, since I don't have all the dependencies assumed in your question.)
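To illustrate the on-demand point, here is a small sketch that runs the same scan over an endless sequence and takes only the first three openings. The function name openingTicks, the helper step, the value of HOUR and the synthetic stream are all assumptions made for the example, not part of the question's code.

```fsharp
// Assumption: timestamps are int64 milliseconds.
type Timestamp = int64
let HOUR = 3600000L

// One scan step: the state is (last taken ID, tick to emit if any).
let step (interval: int64) (lastTakenId, _) (tickAt: Timestamp) =
    let tickId = tickAt / interval
    if tickId <> lastTakenId then (tickId, Some tickAt)
    else (lastTakenId, None)

let openingTicks (interval: int64) (ticks: seq<Timestamp>) =
    ticks
    |> Seq.scan (step interval) (-1L, None)
    |> Seq.choose snd

// Hypothetical endless stream: one tick every 25 minutes, starting at 1 ms.
let endlessTicks : seq<Timestamp> =
    Seq.initInfinite (fun i -> 1L + int64 i * 1500000L)

// Only as many ticks as needed are pulled from the stream.
let firstThree =
    endlessTicks |> openingTicks HOUR |> Seq.take 3 |> Seq.toList
// firstThree = [1L; 4500001L; 7500001L]
```

A Seq.fold over the same stream would never return, since fold has to consume the whole input before producing its result.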

Update in response to comments

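I wonder whether the performance penalty you are seeing is due to boxing/unboxing of the values in the accumulator. I'd be curious to know whether the following shows an improvement: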

open System
open System.Collections.Generic
let hourlyTicks3 = 
  readTicks @"EURUSD-history.zip" "EURUSD-2012-04.csv" 
  |> Seq.scan (fun (kvp:KeyValuePair<_,_>) tickAt ->
      let lastTakenId = kvp.Key
      // Similar to the body of your stateful function - 'lastTakenId' is the last state
      // and 'tickAt' is the current value.
      let tickId = tickAt / HOUR 
      if tickId <> lastTakenId then  
        // We return new state for 'lastTakenId' and yield current 
        // element to the "scan stream"
        KeyValuePair<_,_>(tickId, Nullable<_>(tickAt))
      else 
        // Here, we skip element, so we return the original tick id and 
        // yield "null" to the "scan stream"
        KeyValuePair<_,_>(lastTakenId, Nullable<_>()) ) (KeyValuePair<_,_>(-1L, Nullable<_>())) // Initial state: -1 and "null"
  // Yield the Value of every KeyValuePair in the "scan stream" whose Nullable has a value
  |> Seq.filter (fun kvp -> kvp.Value.HasValue)
  |> Seq.map (fun kvp -> kvp.Value.Value)
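On newer compilers the same idea can be written with struct tuples and ValueOption instead of KeyValuePair and Nullable. This is a swapped-in technique, not what was measured above: it assumes a recent F# and FSharp.Core (struct tuples and ValueOption are not available in older versions), the names step and openingTicksStruct and the sample data are made up for the example, and whether it is actually faster would need to be benchmarked.

```fsharp
// Assumption: timestamps are int64 milliseconds.
let HOUR = 3600000L

// One scan step; the state is a struct tuple holding a ValueOption,
// both of which are value types.
let step (interval: int64) (struct (lastTakenId, _)) (tickAt: int64) =
    let tickId = tickAt / interval
    if tickId <> lastTakenId then struct (tickId, ValueSome tickAt)
    else struct (lastTakenId, ValueNone)

let openingTicksStruct (interval: int64) (ticks: seq<int64>) =
    ticks
    |> Seq.scan (step interval) (struct (-1L, ValueNone))
    // Keep only the states that carry a tick, then unwrap it
    |> Seq.filter (fun (struct (_, taken)) -> taken.IsSome)
    |> Seq.map (fun (struct (_, taken)) -> taken.Value)

// Hypothetical sample: hour IDs 1, 1, 2, 5.
let structOpenings =
    openingTicksStruct HOUR [ 3600005L; 3600900L; 7200001L; 18000500L ]
    |> Seq.toList
// structOpenings = [3600005L; 7200001L; 18000500L]
```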
Answered 2012-05-22T16:25:19.913

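A purely functional solution is to use the fold function. The fold function processes a sequence (or list) while accumulating some state. In your example, the state is lastTakenId together with the list of elements you want to return, so you can use a state of type Timestamp * (Timestamp list):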

let hourlyTicks = 
  readTicks @"EURUSD-history.zip" "EURUSD-2012-04.csv" 
  |> Seq.fold (fun (lastTakenId, res) tickAt ->
      // Similar to the body of your stateful function - 'lastTakenId' is the last
      // state and 'tickAt' is the current value. The 'res' list stores 
      // all returned elements
      let tickId = tickAt / HOUR 
      if tickId <> lastTakenId then  
        // We return new state for 'lastTakenId' and append current element to result
        (tickId, tickAt::res)
      else 
        // Here, we skip element, so we return the original state and original list
        (lastTakenId, res) ) (-1L, []) // Initial state: -1 and empty list of results

  // Take the second part of the state (the result list) and
  // reverse it, because it was accumulated in the opposite order
  |> snd |> List.rev
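The same fold can be wrapped in a reusable function and tried on a literal list. This is only a sketch: the name openingTicksFold, the Timestamp alias, the interval values and the sample ticks are assumptions, since the question does not show how they are defined.

```fsharp
// Assumption: timestamps are int64 milliseconds.
type Timestamp = int64
let HOUR = 3600000L
let MINUTE = 60000L

let openingTicksFold (interval: int64) (ticks: seq<Timestamp>) : Timestamp list =
    // State: (last taken ID, results accumulated in reverse order)
    let step (lastTakenId, res) tickAt =
        let tickId = tickAt / interval
        if tickId <> lastTakenId then (tickId, tickAt :: res)
        else (lastTakenId, res)
    ticks |> Seq.fold step (-1L, []) |> snd |> List.rev

// Hypothetical sample: minute IDs 1, 1, 1, 2, then a gap to 10.
let sample : Timestamp list = [ 60001L; 60500L; 119999L; 120000L; 600000L ]

let minuteOpenings = openingTicksFold MINUTE sample
// minuteOpenings = [60001L; 120000L; 600000L]

let hourOpenings = openingTicksFold HOUR sample
// hourOpenings = [60001L]  (all sample ticks fall into hour 0)
```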

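Also, I'm not entirely sure about your other pure solution. I don't think it does exactly the same thing as the first one (but I don't have data to test with), because you only compare two adjacent elements (so perhaps, in the first one, you could skip multiple items?).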

Answered 2012-05-22T16:01:28.847