
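I have a semantic tag / tag category field, along with Source, Date, and ID variables. I want to break the semantic tag field out into its tag category / tag pairs and then transpose the dataset. I have most of the code working, but I'm stuck getting the ID/Date/Source variables to line up with the matrix I created from the tag categories/tags. A sample of the (tab-delimited) data I'm starting with: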

ID  Source  Date  Semantic Tags
1 thestate  2013-01-18  Person:elizabeth colbert-busch, Organization:congress
2 abcnews4  2013-04-03  PoliticalEvent:congressional race, Person:colbert busch, topicname:politics
3 Politics  2013-04-02  Person:mark sanford, Person:elizabeth colbert busch, Person:colbert busch, Organization:republican party

I would like the data to end up in a database-style format (also tab-delimited):

ID  Source  Date  Tag Type  Tag
1 thestate  2013-01-18  Person  elizabeth colbert-busch
1 thestate  2013-01-18  Organization  congress
2 abcnews4  2013-04-03  PoliticalEvent  congressional race
2 abcnews4  2013-04-03  Person  colbert busch
2 abcnews4  2013-04-03  topicname  politics
3 Politics  2013-04-02  Person  mark sanford
3 Politics  2013-04-02  Person  elizabeth colbert busch
3 Politics  2013-04-02  Person  colbert busch
3 Politics  2013-04-02  Organization  republican party

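I have no trouble separating the tag type from the tag (thanks to @Tyler Rinker for the help), but I'm stuck repeating the ID, Source, and Date variables so that they match the list of tag type/tag matrices I created. Can anyone help? My code so far: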

# break out semantic tags/tag type by comma, trimming surrounding whitespace
et3 <- lapply(strsplit(as.character(et$Semantic.Tags), ","), function(x) gsub("^\\s+|\\s+$", "", x))

et3 <- lapply(et3, strsplit, ":(?!/)", perl=TRUE) # break on colon

The following lines, where I try to replicate the other three variables, are where I run into trouble:

Date <- rep(et$Date, seq_along(et3), sapply(et3, length))

ID <- rep(et$ID, seq_along(et3), sapply(et3, length)) # Note that if I don't use "et$ID", the IDs replicate without issue...

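...and likewise for the Source variable. The warning I get is: In rep(et$Date, seq_along(et3), sapply(et3, length)): first element used of 'length.out' argument, and only the first value appears in the output. The same thing happens if I first bind the et3 list into a matrix. Can anyone help with repeating the variables alongside the matrix/list? I also tried a transpose, but I don't know how to handle the tags I turned into a list.

Thanks for any help.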


1 Answer

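# 1. create a matrix containing the expanded information for each row
#    (et3 is the list of split tag type/tag pairs from the question)
et3 <- lapply(et3, function(x) {
  xx <- do.call(rbind, x)
  colnames(xx) <- c('tag', 'value')
  xx
})

# 2. cycle through each row and recombine with ID, Source and Date
#    (et is the data frame from the question; columns 1:3 are ID, Source, Date)
do.call(rbind, lapply(seq_len(nrow(et)),
    function(x) cbind(et[x, 1:3, drop = FALSE], et3[[x]])))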
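
On the warning in the question: as far as I can tell, rep() matches the second and third unnamed arguments to times and length.out, which would explain the "first element used of 'length.out' argument" message. A minimal sketch of the per-row replication, passing the counts as times explicitly (ids and n_tags below are made-up stand-ins for et$ID and sapply(et3, length)):

```r
# Hypothetical stand-ins: one ID per original row, and the number of
# tags found in each row.
ids <- c(1, 2, 3)
n_tags <- c(2L, 3L, 4L)

# times = a vector the same length as x repeats each element
# by its own count.
id_long <- rep(ids, times = n_tags)
id_long
# [1] 1 1 2 2 2 3 3 3 3
```

The same call should work for Source and Date, since the counts only depend on the tags.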

data.table approach

# an alternative is to use data.table
library(data.table)
EDT <- data.table(et)

# string processing: split on commas and trim whitespace, stored as a list column
EDT[, et3 := lapply(strsplit(as.character(Semantic.Tags), ","), function(x) gsub("^\\s+|\\s+$", "", x))]
# split each tag type/tag pair on the colon
EDT[, et3 := lapply(et3, strsplit, ":(?!/)", perl = TRUE)]

# rapply and by to create the long data.table
EDT[, list(tag = rapply(et3, classes = 'character', function(x) x[1]),
           value = rapply(et3, classes = 'character', function(x) x[2])),
      by = list(ID, Source, Date)]



   ID   Source       Date            tag                   value
1:  1 thestate 2013-01-18         Person elizabeth colbert-busch
2:  1 thestate 2013-01-18   Organization                congress
3:  2 abcnews4 2013-04-03 PoliticalEvent      congressional race
4:  2 abcnews4 2013-04-03         Person           colbert busch
5:  2 abcnews4 2013-04-03      topicname                politics
6:  3 Politics 2013-04-02         Person            mark sanford
7:  3 Politics 2013-04-02         Person elizabeth colbert busch
8:  3 Politics 2013-04-02         Person           colbert busch
9:  3 Politics 2013-04-02   Organization        republican party
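
For comparison, here is a rough, self-contained base R sketch of the same reshaping without the per-row loop. It is only an illustration under some assumptions: the sample data frame is rebuilt by hand from the question, the column names (ID, Source, Date, Semantic.Tags) are assumed, and each pair is split on its first colon rather than with the ":(?!/)" pattern used above.

```r
# Sample data rebuilt from the question (column names are assumed).
et <- data.frame(
  ID = 1:3,
  Source = c("thestate", "abcnews4", "Politics"),
  Date = c("2013-01-18", "2013-04-03", "2013-04-02"),
  Semantic.Tags = c(
    "Person:elizabeth colbert-busch, Organization:congress",
    "PoliticalEvent:congressional race, Person:colbert busch, topicname:politics",
    "Person:mark sanford, Person:elizabeth colbert busch, Person:colbert busch, Organization:republican party"
  ),
  stringsAsFactors = FALSE
)

# Split on commas and trim surrounding whitespace.
pairs <- lapply(strsplit(et$Semantic.Tags, ","),
                function(x) gsub("^\\s+|\\s+$", "", x))
n <- sapply(pairs, length)   # number of tags per original row
flat <- unlist(pairs)        # all "type:tag" strings, in row order

# Repeat each original row once per tag, then attach the split pieces.
long <- data.frame(
  et[rep(seq_len(nrow(et)), times = n), c("ID", "Source", "Date")],
  tag = sub(":.*$", "", flat),       # text before the first colon
  value = sub("^[^:]*:", "", flat),  # text after the first colon
  stringsAsFactors = FALSE
)
rownames(long) <- NULL
long
```

Indexing the rows with rep(..., times = n) keeps ID, Source and Date aligned with the tags without building a separate vector for each variable.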
answered 2013-04-22T04:12:14.443