r - Long to wide and duplicate columns when rows have data

Question

I'm wondering how others would approach this challenge.

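Background

The data is for vegetation monitoring. It includes basic information about each plot and identifies the species present and each species' percent cover.

There are several rows of plot-specific information - date, location, distance - followed by the species rows. In a species row, each value is the percent cover of that species in the plot the column represents.

A simplified view would be a grid like this: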

plot        1           4           5
date        5/3/2016    6/20/2016   6/22/2016
location    A           F           K
sp1                     15          30
sp2         5                       100
sp3         T           3           5

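What I'm hoping to get to is a grid like this, which lends itself to CSV import into a database (species % cover needs to reference the plot information in the RMDB). Leftmost column = table field names.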

plot        1          1          4          4          5          5          5
date        5/3/2016   5/3/2016   6/20/2016  6/20/2016  6/22/2016  6/22/2016  6/22/2016
location    A          A          F          F          K          K          K
species     sp2        sp3        sp1        sp3        sp1        sp2        sp3
cover %     5          T          15         3          30         100        5

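This wide format can be easily "digested" by the database and correctly populate two tables (Plot and CoverPercent).

Approach?

I've thought of a few approaches, but I think I'm missing a better one.

Here's what I've come up with so far: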

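Flip the data from long to wide
Add a species row and a cover row
Count the number of species for a given plot
Repeat the plot's column once per species
Populate the plot's "species" and "cover" rows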
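
The steps listed above could be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the sample grid is typed in by hand as a character matrix, empty cells are empty strings, and the names grid, plot_fields, species_rows and expand_plot are hypothetical, not part of the original data.

```r
# Assumed input: the grid from the question, one row per field and one
# column per plot, held as a character matrix with "" for empty cells.
grid <- matrix(
  c("1",        "4",         "5",
    "5/3/2016", "6/20/2016", "6/22/2016",
    "A",        "F",         "K",
    "",         "15",        "30",
    "5",        "",          "100",
    "T",        "3",         "5"),
  nrow = 6, byrow = TRUE,
  dimnames = list(c("plot", "date", "location", "sp1", "sp2", "sp3"), NULL)
)

plot_fields  <- c("plot", "date", "location")
species_rows <- setdiff(rownames(grid), plot_fields)

# Build the block of output columns for plot column j.
expand_plot <- function(j) {
  cover   <- grid[species_rows, j]
  present <- cover != ""          # species recorded in this plot
  n       <- sum(present)         # number of species for the plot
  if (n == 0) return(NULL)        # a plot without species adds no columns
  rbind(
    # repeat the plot's column once per species
    matrix(rep(grid[plot_fields, j], n), nrow = length(plot_fields),
           dimnames = list(plot_fields, NULL)),
    # fill in the "species" and "cover %" rows
    species   = species_rows[present],
    `cover %` = unname(cover[present])
  )
}

result <- do.call(cbind, lapply(seq_len(ncol(grid)), expand_plot))

# Print as CSV with the field names in the leftmost column.
write.table(result, sep = ",", col.names = FALSE, quote = FALSE)
```

Passing a file name as the second argument of write.table() would write the same rows to disk instead of the console.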

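Initially I thought I could do this in VBA, but R seems like a better/faster/cleaner approach. The question is "how"?

I've done some R work recently with the table package, but after a year of VBA/SQL projects I'm definitely rusty.

I'm curious how others would approach this transformation. Any ideas?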

score 1 · Accepted Answer

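I would take an object-oriented approach to this. Define a simple class that holds the information about the plot and its data, with a dictionary of species and percent cover: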

'Plot.cls
'Requires a project reference to Microsoft Scripting Runtime (for Scripting.Dictionary)
Option Explicit

Private Type PlotMembers
    PlotId As Long
    DataDate As Date
    Location As String
End Type

Private this As PlotMembers
Private mCover As Scripting.Dictionary

Private Sub Class_Initialize()
    Set mCover = New Scripting.Dictionary
End Sub

Public Property Get PlotId() As Long
    PlotId = this.PlotId
End Property

Public Property Let PlotId(inValue As Long)
    this.PlotId = inValue
End Property

Public Property Get DataDate() As Date
    DataDate = this.DataDate
End Property

Public Property Let DataDate(inValue As Date)
    this.DataDate = inValue
End Property

Public Property Get Location() As String
    Location = this.Location
End Property

Public Property Let Location(inValue As String)
    this.Location = inValue
End Property

Public Sub AddSpeciesCover(species As String, cover As String)
    mCover.Add species, cover
End Sub

Then give it a property that outputs a list of CSV data rows:

'Also in Plot.cls
Public Property Get CsvRows() As String
    Dim key As Variant
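    Dim output() As String
    'A plot with no species has no rows; ReDim output(-1) would raise an error
    If mCover.Count = 0 Then Exit Property
    ReDim output(mCover.Count - 1)
    Dim i As Long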
    For Each key In mCover.Keys
        Dim temp(4) As String
        temp(0) = this.PlotId
        temp(1) = this.DataDate
        temp(2) = this.Location
        temp(3) = key
        temp(4) = mCover(key)
        output(i) = Join(temp, ",")
        i = i + 1
    Next key
    CsvRows = Join(output, vbCrLf)
End Property

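Then all you need to do is populate them from your input data. Note that the sample usage here assumes the active worksheet looks basically like the top grid in your question, with its top-left corner at A1. It should be fairly easy to change it to match how you're collecting the data: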

Public Sub SampleUsage()
    Dim plots As New Collection

    With ActiveSheet
        Dim col As Long
        For col = 2 To 4
            Dim current As Plot
            Set current = New Plot
            current.PlotId = .Cells(1, col).Value
            current.DataDate = .Cells(2, col).Value
            current.Location = .Cells(3, col).Value
            Dim r As Long
            For r = 4 To 6
                Dim cover As String
                cover = .Cells(r, col).Value
                If cover <> vbNullString Then
                    current.AddSpeciesCover .Cells(r, 1).Value, cover
                End If
            Next
            plots.Add current
        Next

    End With

    For Each current In plots
        Debug.Print current.CsvRows
    Next
End Sub

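Note that this is just a skeleton to demonstrate the gist of the approach - it would need error handling, more robust formatting, etc. before it is ready for production.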

score 1 · Accepted Answer

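Simply use the melt() function from the reshape2 package to reshape your data frame in R. The code below assumes that the transposed view of your posted data is the actual format, as you mentioned in the comments: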

library(reshape2)

data <- 'plot,date,location,sp1,sp2,sp3
1,5/3/2016,A,,5,T
4,6/20/2016,F,15,,3
5,6/22/2016,K,30,100,5'

# Comma-separated so the empty cells survive copy and paste; blank numeric fields are read as NA
df <- read.table(text=data, header=TRUE, sep=",", stringsAsFactors = FALSE)
df    
#   plot      date location sp1 sp2 sp3
# 1    1  5/3/2016        A  NA   5   T
# 2    4 6/20/2016        F  15  NA   3
# 3    5 6/22/2016        K  30 100   5

mdf <- melt(df, id.vars=c("plot", "date", "location"),
            variable.name="species", na.rm = TRUE, value.name="cover %")
mdf <- mdf[with(mdf, order(date)),]               # ORDER BY DATE
rownames(mdf) <- seq_len(nrow(mdf))               # RESET ROW NAMES
mdf

#   plot      date location species cover %
# 1    1  5/3/2016        A     sp2       5
# 2    1  5/3/2016        A     sp3       T
# 3    4 6/20/2016        F     sp1      15
# 4    4 6/20/2016        F     sp3       3
# 5    5 6/22/2016        K     sp1      30
# 6    5 6/22/2016        K     sp2     100
# 7    5 6/22/2016        K     sp3       5
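
The melt() call above starts from data that has already been transposed to one row per plot. If the data instead arrives in the question's original orientation (one row per field, one column per plot), a minimal base R sketch of the transposition could look like this. The CSV text and the names raw and grid are assumptions for illustration, not the asker's actual file.

```r
# Assumed input: CSV text laid out as in the question, with the field name
# in the first column and one column per plot.
raw <- 'plot,1,4,5
date,5/3/2016,6/20/2016,6/22/2016
location,A,F,K
sp1,,15,30
sp2,5,,100
sp3,T,3,5'

# Read everything as character so mixed values such as "T" and "3" are kept
# as typed; empty cells become NA.
grid <- read.csv(text = raw, header = FALSE, row.names = 1,
                 colClasses = "character", na.strings = "")

# Transpose: the field names become column names, each plot becomes a row.
df <- as.data.frame(t(grid), stringsAsFactors = FALSE)
rownames(df) <- NULL
df
```

The resulting df has the same columns as the data frame in the answer and can be passed to melt() in the same way. Every column is character here, so plot can be converted with as.integer() if numeric IDs are needed.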
