
I'm looking for ideas/functions to improve a task that I have coded in a very inefficient way.

My original data frame looks like this:

        CUSIP       Date        Pclose  TRI
1       30161N101   2011-01-03  38.8581 2011-01-03
2       06738G878   2011-01-03  48.4040 2011-01-03
3       74339G101   2011-01-03  24.0880 2011-01-03
4       74348A590   2011-01-03  81.7200 2011-01-03
5       26922W109   2011-01-03  87.8700 2011-01-03
...

The "TRI" column contains the dates in Date format.

What I would like to obtain looks like this:

    Date        233052109   126650100   251566105   149123101 ...
1   2011-01-03  22.8031     34.3034     11.2645      91.6178
2   2011-01-04  22.6843     34.2740     11.1862      91.1897
3   2011-01-05  22.7933     34.6362     10.9948      91.9779
4   2011-01-06  22.8034     34.2838     11.0470      91.0242
5   2011-01-07  22.6248     34.3034     11.0644      91.2091
...

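In the second data frame, every column (except Date) is named after a CUSIP from the first data frame and is filled with the corresponding Pclose values.

I'm building my second data frame with regular loops, but I'm sure there is a way to do much better using compiled functions (maybe subset).

My functions are:

To build the second data frame: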

function(cusippresent,cusiplist){
    workinglist=list()
    workinglist[1]=as.character('Date')
    position = 2
    for (i in 2:length(cusippresent)) {
        if(cusippresent[i] %in% cusiplist) {
            workinglist[position]=as.character(cusippresent[i]);
            position=position+1
        }
    }

    rm(position)

    Data=data.frame()

    #Fill the first row of the data frame
    #with elements of the right type, otherwise there are problems

    Data[1,1]=as.character('1987-11-12')

    Data[1,2]=as.numeric(1)
    for (i in 2:length(workinglist)){Data[1,i]=as.numeric(1)}
    for (i in 1:length(workinglist)){colnames(Data)[i]=workinglist[i]}

    return(Data)    
}

To fill the data frame:

function(DATA11C,TCusipC){
    nloop = 1
    positionorigine = 1
    positioncible = 1

    #copy of dates
    datelist = DATA11C[,"Date"]
    datelist = unique(datelist)
    for (i in 1:length(datelist)) {
        TCusipC[i,"Date"]=as.character(datelist[i])
    }

    #creation of needed columns
    TCusipC[,ncol(TCusipC)+1] = as.Date(TCusipC[,"Date"])
    colnames(TCusipC)[ncol(TCusipC)] = 'TRI'

    #ordering of tables
    DATA11C=DATA11C[with(DATA11C,order(TRI)),]
    TCusipC=TCusipC[with(TCusipC,order(TRI)),]

    longueur = nrow(TCusipC)

    #filling of the table 
    while(nloop<longueur) {
        while(DATA11C[positionorigine,"TRI"]==TCusipC[positioncible,"TRI"]){
            nom = as.character(DATA11C[positionorigine,"CUSIP"]);
            TCusipC[positioncible,nom]=as.numeric(DATA11C[positionorigine,"Pclose"]);
            positionorigine=positionorigine+1
        };
        nloop=nloop+1;
        positioncible=positioncible+1
    }

    return(TCusipC)
}

Any suggestions on which functions to look into? Potential improvements?

Many thanks,

Vincent


2 Answers


This is exactly what the reshape package does:

library(reshape)
cast(Date ~ CUSIP, data = DATA11C, value = "Pclose")
answered 2012-06-20T07:51:11.237

Here is a solution in base R using reshape():

dat = read.table(header=TRUE, text="        CUSIP       Date        Pclose  TRI
1       30161N101   2011-01-03  38.8581 2011-01-03
2       06738G878   2011-01-03  48.4040 2011-01-03
3       74339G101   2011-01-03  24.0880 2011-01-03
4       74348A590   2011-01-03  81.7200 2011-01-03
5       26922W109   2011-01-03  87.8700 2011-01-03")
reshaped.dat = reshape(dat, direction="wide", 
                       timevar="CUSIP", idvar="Date", 
                       drop="TRI")
# Anchor and escape the pattern so only the literal "Pclose." prefix is removed
names(reshaped.dat) = sub("^Pclose\\.", "", names(reshaped.dat))

Output:

reshaped.dat
#         Date 30161N101 06738G878 74339G101 74348A590 26922W109
# 1 2011-01-03   38.8581    48.404    24.088     81.72     87.87
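
The sample above covers a single date, so the result has only one row. As a minimal sketch of the multi-date case, the same reshape() call is applied below to a hypothetical `toy` data frame (the name and the prices are made up for illustration, not taken from the question); it should give one row per date and one column per CUSIP.

```r
# Hypothetical toy data: two CUSIPs observed on two dates (prices are made up)
toy <- data.frame(
  CUSIP  = c("30161N101", "06738G878", "30161N101", "06738G878"),
  Date   = c("2011-01-03", "2011-01-03", "2011-01-04", "2011-01-04"),
  Pclose = c(38.8581, 48.4040, 39.0000, 48.0000),
  stringsAsFactors = FALSE
)

# Long -> wide: one row per Date, one column per CUSIP
wide <- reshape(toy, direction = "wide", timevar = "CUSIP", idvar = "Date")

# Strip the "Pclose." prefix that reshape() adds to the new column names
names(wide) <- sub("^Pclose\\.", "", names(wide))

wide
```

The columns appear in the order in which each CUSIP is first seen in the data, which differs from the alphabetical order xtabs() uses.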

Update: xtabs() approach

This is also easy to do with xtabs():

xtabs(Pclose ~ Date + CUSIP, dat)
#             CUSIP
# Date         06738G878 26922W109 30161N101 74339G101 74348A590
#   2011-01-03   48.4040   87.8700   38.8581   24.0880   81.7200

as.data.frame.matrix(xtabs(Pclose ~ Date + CUSIP, dat))
#            06738G878 26922W109 30161N101 74339G101 74348A590
# 2011-01-03    48.404     87.87   38.8581    24.088     81.72
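
One caveat with xtabs(): it sums Pclose within each Date/CUSIP cell, so a combination that is absent from the data shows up as 0 rather than NA, and duplicated rows are added together. If that matters, tapply() is one base R alternative that leaves empty cells as NA. A minimal sketch, assuming a hypothetical `toy` data frame with made-up prices in which one CUSIP has no quote on the second date:

```r
# Hypothetical toy data: 06738G878 has no row for 2011-01-04 (prices are made up)
toy <- data.frame(
  CUSIP  = c("30161N101", "06738G878", "30161N101"),
  Date   = c("2011-01-03", "2011-01-03", "2011-01-04"),
  Pclose = c(38.8581, 48.4040, 39.0000),
  stringsAsFactors = FALSE
)

# xtabs() sums Pclose per cell, so the missing combination becomes 0
xt <- xtabs(Pclose ~ Date + CUSIP, toy)

# tapply() leaves cells with no data as NA; here FUN keeps the first value per cell
tp <- tapply(toy$Pclose, list(Date = toy$Date, CUSIP = toy$CUSIP),
             function(x) x[1])

# Turn the matrix into a data frame with Date as an ordinary column
wide <- as.data.frame.matrix(tp)
wide <- cbind(Date = rownames(wide), wide)
rownames(wide) <- NULL

wide
```

Keeping the first value per cell is only one possible choice; if a Date/CUSIP pair can occur more than once, a different summary function may be more appropriate.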
answered 2012-06-20T08:46:43.240