Sorry for the uninformative title of this post. I have the following data set in SAS:

time Add    time_delete
5    3.00   5
5    3.15   11
5    3.11   11
8    4.21   8
8    3.42   8
8    4.20   11
11   3.12   .

Here time is when a new price (Add) was added in an auction, observed every 3 minutes. A price can be deleted within the same time interval or later, as shown in time_delete. My objective is to compute the average of the Add prices still standing at each time. For instance, the average price at time=5 is (3.15+3.11)/2, since 3.00 gets deleted within the interval. The average price standing at time=8 is then (4.20+3.15+3.11)/3. As you can see, at each time I have to look back and see which prices are still valid. I would also like a field that gives, for every time, the highest available price that has not been deleted. Any help?

1 Answer

You have a variant of a rolling sum here. There's no single straightforward solution (especially as you undoubtedly have a few complications not mentioned), but here are a few pointers.

First, you may want to change the format of your data. This is actually a relatively easy problem to solve if you have one row for each time point at which a price is standing, rather than a single row per price.

data have;
input time Add    time_delete;
datalines;
5    3.00   5
5    3.15   11
5    3.11   11
8    4.21   8
8    3.42   8
8    4.20   11
11   3.12   .
;;;;
run;

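/* Last observed time, used for prices that are never deleted. */
proc sql noprint;
  select max(time) into :maxtime from have;
quit;

data want;
  set have;
  /* A price deleted within its own interval never stands. */
  if time = time_delete then delete;
  /* Assumption: a missing time_delete means the price stands through
     the last observed time. Without this, the DO loop below would
     get a missing TO value for that row. */
  if missing(time_delete) then time_delete = &maxtime + 1;
  /* One row per time unit while the price stands (steps by 1, so
     times between the 3-minute marks appear as well). */
  do time = time to time_delete - 1;
    output;
  end;
  keep time add;
run;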

proc means data=want mean max n;
class time;
var add;
run;

You could output the proc means results to a dataset, which gives you the maximum value plus the average value for each time, and then either merge that back onto the main dataset or use it however you need.
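As a minimal sketch of that idea, the step below writes the statistics to a dataset and merges them back onto the original rows. The names stats, avg_add, max_add, n_add and have_stats are made up for illustration, and the merge assumes have is already sorted by time, as it is in the sample.

```sas
/* Write mean, max and count per time to a dataset instead of printing. */
proc means data=want noprint nway;
  class time;
  var add;
  output out=stats (drop=_type_ _freq_)
         mean=avg_add max=max_add n=n_add;
run;

/* Attach the statistics to the original rows.
   Assumption: HAVE is sorted by time; sort it first if it is not. */
data have_stats;
  merge have (in=in_have) stats;
  by time;
  if in_have;  /* drop the intermediate times that exist only in STATS */
run;
```

The in= flag keeps only the times present in the original data, since the expanded dataset also contains the time points in between.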

The main downside is that the expanded dataset is much larger, so if you're looking at hundreds of thousands of data points, this is probably not your best option.

You can also perform this in SQL without the extra rows, although this is where those "other complications" would potentially throw a wrench in things.

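proc sql;
  select H.time,
         mean(V.add) as mean_add,
         max(V.add)  as max_add
  from (select distinct time from have) H
  left join have V
    /* prices added at or before this time ... */
    on  V.time le H.time
    /* ... and still standing: deleted later, or never deleted */
    and (V.time_delete gt H.time or V.time_delete is missing)
  group by H.time;
quit;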

Fairly straightforward and quick query, except that if you have a lot of time values it might take some time to execute the join.

Other options:

  • Read the data into an array, with a second array tracking the delete points. This can get a bit complex, as you probably need to keep the array sorted by delete point: rather than just adding a new record at the end, you need to move a bunch of records down. SAS isn't quite as friendly to this sort of operation as a C-like language would be.

  • Use a hash table solution. Somewhat less messy than an array, particularly as you can sort a hash table more easily than two separate arrays.

  • Use IML and vectors. Similar to the array solution but with more powerful manipulation techniques available.
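To give a rough idea of the array route, here is a simplified sketch. It is not the sorted-by-delete-point version described above: it simply loads everything into temporary arrays and rescans the earlier rows for each distinct time, which is easier to follow but does the same amount of work as the SQL join. The array names, the capacity of 1000, and the output names (want_array, avg_add, max_add, n_add) are assumptions for illustration. It also assumes have is sorted by time and that a missing time_delete means the price is never deleted.

```sas
data want_array (keep=time avg_add max_add n_add);
  /* Hypothetical capacity: raise 1000 if HAVE has more rows. */
  array a_time [1000] _temporary_;
  array a_add  [1000] _temporary_;
  array a_del  [1000] _temporary_;

  /* Load every row of HAVE into the arrays in one pass. */
  do n_rows = 1 by 1 until (eof);
    set have end=eof;
    a_time[n_rows] = time;
    a_add[n_rows]  = add;
    a_del[n_rows]  = time_delete;
  end;

  /* Emit one row per distinct time, at the last row of each time group. */
  do i = 1 to n_rows;
    time = a_time[i];
    is_last = (i = n_rows);
    if not is_last then is_last = (a_time[i + 1] ne time);
    if is_last then do;
      sum_add = 0;
      n_add   = 0;
      max_add = .;
      /* Rows 1..i were all added at or before this time. */
      do j = 1 to i;
        /* Standing: deleted later, or never deleted (missing). */
        if missing(a_del[j]) or a_del[j] > time then do;
          sum_add = sum_add + a_add[j];
          n_add   = n_add + 1;
          max_add = max(max_add, a_add[j]);
        end;
      end;
      if n_add > 0 then avg_add = sum_add / n_add;
      else avg_add = .;
      output;
    end;
  end;
  stop;
run;
```

For the sample data, the row for time=5 averages 3.15 and 3.11, matching the example in the question. Keeping the arrays ordered by delete point, or moving to a hash object, would only start to pay off once the rescan becomes too slow.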

Answered 2013-03-30T06:08:01.843