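I've been doing text analysis of data. Typically the analysis involves coding transcripts on paper and then entering the information into R as numeric codes. I'd like to output a transcript of the words with the word numbers above them, wrapped to a certain line width (let's use an arbitrary 80 characters).

A minimal visual example: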

#what we start with:

   person   text word.num
1    greg    The        1
2    greg    dog        2
3    greg   went        3
4    greg     to        4
5    greg    the        5
6    greg   zoo,        6
7    greg    but        7
8    greg    ate        8
9    greg first.        9
10  sally     He       10
11  sally  likes       11
12  sally  water       12
13  sally      a       13
14  sally    bit       14
15  sally   too.       15

#what I'd like:

1   2   3    4  5   6
The dog went to the zoo, 

7   8   9      10 11     
but ate first. He likes   

12    13  14  15
water a   bit too.  

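An additional problem arises as the numbers get larger: a word number can be wider than a short word, so that word needs extra space after it to keep the columns aligned. I think this is easy to handle during the paste step by finding the number of characters (digits) in the largest number and padding any word shorter than that with spaces.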
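The padding idea can be sketched in base R as follows. The data here are hypothetical (three short words with deliberately large word numbers), and for simplicity the sketch uses one common column width rather than a width per word:

```r
# Hypothetical data: short words paired with deliberately large word numbers
words <- c("a", "to", "zoo,")
nums  <- 998:1000

# Common column width: the longer of the widest word and the widest number
w <- max(nchar(words), nchar(nums))

# formatC() with flag = "-" left-justifies and pads with trailing spaces
padded.nums  <- formatC(as.character(nums), width = w, flag = "-")
padded.words <- formatC(words, width = w, flag = "-")

# Print the numbers above the words, one space between columns
cat(paste(padded.nums, collapse = " "), "\n", sep = "")
cat(paste(padded.words, collapse = " "), "\n", sep = "")
```

Every column ends up `w` characters wide, so a four-digit number can no longer overrun a one-letter word.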

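My thoughts so far on how to approach this:

  1. Create a 1-column matrix of character vectors, each row having a certain maximum length (strwrap may be useful here)
  2. Add the extra spaces after short words as described above (nchar and gsub may be useful here)
  3. Determine the values of a companion matrix by using a word-count function, then cumsum and seq, to create a companion matrix of numbers (actually characters) that is also 1 column. This will match the character (word) matrix row by row.
  4. Now the two matrices need to be interleaved row by row (not sure how to do this)
  5. Align the numbers above the words (not sure how to do this, but nchar may be useful here)

I'd like to keep this in base tools; although I'm sure Hadley's stringr would be useful, I'd like to avoid that dependency.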
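One possible base-R reading of the five steps is sketched below. The function name `number_words` and its greedy line-filling loop are assumptions made for illustration, not something proposed in the thread. Each column is made as wide as the wider of the word and its number, columns are assigned to lines so that no line exceeds `width`, and the number lines and word lines are then interleaved with `rbind()`.

```r
# Hypothetical helper: returns a character vector of interleaved
# number lines, word lines, and blank separator lines.
number_words <- function(words, width = 80) {
  words <- as.character(words)
  nums  <- as.character(seq_along(words))

  # Step 2: each column is as wide as the wider of the word and its number
  w   <- pmax(nchar(words), nchar(nums))
  pad <- function(x) paste0(x, strrep(" ", w - nchar(x)))

  # Steps 1 and 3: greedily assign columns to lines of at most `width` chars
  line.id <- integer(length(words))
  line <- 1L
  used <- 0L
  for (i in seq_along(words)) {
    need <- w[i] + (used > 0L)   # one separating space except at line start
    if (used > 0L && used + need > width) {
      line <- line + 1L
      used <- 0L
      need <- w[i]
    }
    used <- used + need
    line.id[i] <- line
  }

  # Steps 4 and 5: paste each line together, then interleave the two sets
  collapse   <- function(x) vapply(split(x, line.id), paste, character(1),
                                   collapse = " ")
  num.lines  <- sub(" +$", "", collapse(pad(nums)))   # drop trailing padding
  word.lines <- sub(" +$", "", collapse(pad(words)))
  as.vector(rbind(num.lines, word.lines, ""))
}

words <- c("The", "dog", "went", "to", "the", "zoo,", "but", "ate",
           "first.", "He", "likes", "water", "a", "bit", "too.")
writeLines(number_words(words, width = 25))
```

With `width = 25` the first two lines come out as `1   2   3    4  5   6` over `The dog went to the zoo,`. `strrep()` requires R 3.3.0 or later.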

dput of the above data:

 dat <- structure(list(person = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,                           
     1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("greg", "sally"), class = "factor"),             
         text = structure(c(10L, 5L, 14L, 11L, 9L, 15L, 4L, 2L, 6L,                               
         7L, 8L, 13L, 1L, 3L, 12L), .Label = c("a", "ate", "bit",                                 
         "but", "dog", "first.", "He", "likes", "the", "The", "to",                               
         "too.", "water", "went", "zoo,"), class = "factor"), word.num = 1:15), row.names = c(NA, 
     -15L), .Names = c("person", "text", "word.num"), class = "data.frame")  

I couldn't come up with a title that I feel captures the idea while still being searchable by future SO users. Please suggest edits...

3 Answers

> datmat <- matrix(c(1:length(dat$text), as.character(dat$text) ), nrow=2, byrow=TRUE)
> datmat
     [,1]  [,2]  [,3]   [,4] [,5]  [,6]   [,7]  [,8]  [,9]     [,10] [,11]   [,12]   [,13] [,14] [,15] 
[1,] "1"   "2"   "3"    "4"  "5"   "6"    "7"   "8"   "9"      "10"  "11"    "12"    "13"  "14"  "15"  
[2,] "The" "dog" "went" "to" "the" "zoo," "but" "ate" "first." "He"  "likes" "water" "a"   "bit" "too."
> options(width=30)
> datmat
     [,1]  [,2]  [,3]   [,4]
[1,] "1"   "2"   "3"    "4" 
[2,] "The" "dog" "went" "to"
     [,5]  [,6]   [,7]  [,8] 
[1,] "5"   "6"    "7"   "8"  
[2,] "the" "zoo," "but" "ate"
     [,9]     [,10] [,11]  
[1,] "9"      "10"  "11"   
[2,] "first." "He"  "likes"
     [,12]   [,13] [,14]
[1,] "12"    "13"  "14" 
[2,] "water" "a"   "bit"
     [,15] 
[1,] "15"  
[2,] "too."

The quotes can be removed by coercing to a table-class object and using print.table:

> class(datmat) <- "table"
> datmat
     [,1] [,2] [,3] [,4] [,5]
[1,] 1    2    3    4    5   
[2,] The  dog  went to   the 
     [,6] [,7] [,8] [,9]  
[1,] 6    7    8    9     
[2,] zoo, but  ate  first.
     [,10] [,11] [,12] [,13]
[1,] 10    11    12    13   
[2,] He    likes water a    
     [,14] [,15]
[1,] 14    15   
[2,] bit   too. 

You could also do something with this. It fixes the left-justification issue Gavin mentioned:

> gsub("\\[.*\\,.*\\]", "", capture.output( print(datmat, quote=FALSE) ) )
 [1] "     "                    
 [2] " 1    2    3    4    5   "
 [3] " The  dog  went to   the "
 [4] "       "                  
 [5] " 6    7    8    9     "   
 [6] " zoo, but  ate  first."   
 [7] "     "                    
 [8] " 10    11    12    13   " 
 [9] " He    likes water a    " 
[10] "     "                    
[11] " 14    15   "             
[12] " bit   too. " 

And a further refinement:

datlines <- gsub("\\[.*\\,.*\\]", "", capture.output( print(datmat, quote=FALSE) ) )
for( i in seq_along(datlines)){ cat(datlines[i], "\n") }
 #----------------------------------#
 1    2    3    4    5    
 The  dog  went to   the  

 6    7    8    9      
 zoo, but  ate  first. 

 10    11    12    13    
 He    likes water a     

 14    15    
 bit   too. 
answered 2012-11-06T19:19:27.017

What about:

> tmp <- setNames(as.character(dat$text), dat$word.num)
> print(tmp, quote=FALSE)
     1      2      3      4      5      6      7      8      9     10     11     12     13     14     15 
   The    dog   went     to    the   zoo,    but    ate first.     He  likes  water      a    bit   too. 
> options(width = 80)
> print(tmp, quote=FALSE)
     1      2      3      4      5      6      7      8      9     10     11 
   The    dog   went     to    the   zoo,    but    ate first.     He  likes 
    12     13     14     15 
 water      a    bit   too. 

You can put your own class on the object and add a print method:

class(tmp) <- "foo"
print.foo <- function(x, quote = FALSE, ...) {
  print(unclass(x), quote = quote, ...)
}
tmp

which gives

> tmp
     1      2      3      4      5      6      7      8      9     10     11 
   The    dog   went     to    the   zoo,    but    ate first.     He  likes 
    12     13     14     15 
 water      a    bit   too.

One way to dump this representation to a file is via capture.output(), which has a file argument:

capture.output(tmp, file = "foo.txt")

The resulting text file contains:

$ cat foo.txt 
     1      2      3      4      5      6      7      8      9     10     11 
   The    dog   went     to    the   zoo,    but    ate first.     He  likes 
    12     13     14     15 
 water      a    bit   too.

It's not quite what you asked for, since the word numbers are right-aligned, but it's close.
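When no file argument is given, capture.output() returns the printed lines as a character vector, so they can be inspected or edited before being written out. A minimal sketch, using a hypothetical three-word vector and a temporary file in place of foo.txt:

```r
# Hypothetical three-word example standing in for the full transcript
tmp <- setNames(c("The", "dog", "went"), 1:3)

# Without `file`, capture.output() returns the printed lines as strings
out <- capture.output(print(tmp, quote = FALSE))

# Write the lines to a temporary file (assumed location) and read them back
path <- tempfile(fileext = ".txt")
writeLines(out, path)
readLines(path)
```

This keeps the number line and the word line as separate strings, which is convenient if the alignment needs adjusting before the file is written.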

answered 2012-11-06T19:27:14.347

For completeness of the thread, I went with DWin's solution and a bit of Gavin's approach (as a function):

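numbtext <- function(text.var, width=80, txt.file = NULL) {
    # Row 1 holds the word numbers, row 2 the words themselves
    zz <- matrix(c(seq_along(text.var), as.character(text.var)), 
        nrow=2, byrow=TRUE)
    # Blank dimnames suppress the [1,] and [,1] labels when printing
    dimnames(zz) <- list(rep("", nrow(zz)), rep("", ncol(zz)))
    # Set the print width temporarily; restore it even if an error occurs
    OW <- options(width=width)
    on.exit(options(OW))
    print(zz, quote = FALSE)
    if (!is.null(txt.file)){
        # Append the same printed output to the text file
        sink(file=txt.file, append = TRUE) 
        print(zz, quote = FALSE)
        sink()
    }
    invisible(zz)
}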

numbtext(dat$text, 40, "foo.txt")

which produces:

 1   2   3    4  5   6    7   8  
 The dog went to the zoo, but ate

 9      10 11    12    13 14  15  
 first. He likes water a  bit too.
answered 2012-11-06T22:21:09.927