r - Creating a data frame from a CSV file with embedded lists

Question

I am still very new to R and may have completely misunderstood the concept of data frames.

But I have a csv file in the following format:

ID;Year;Title;Authors;Keywords;

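Authors and Keywords are meant to be lists of strings. For example:

1;2013;Towards Dynamic Non-obtrusive Health Monitoring Based on SOA and Cloud;Mohammed Serhani, Abdelghani Benharret, Erlabi Badidi;E-health, Diseases, Monitoring, Prevention, SOA, Cloud, Platform, m-tech;

Is there a way to read this csv file into R so that the Authors and Keywords columns of the data frame are built as lists of lists? Does this require me to format the csv file in a particular way?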

Reading the csv with the following options

articles <- read.csv(file="ls.csv", header=TRUE, sep=";", stringsAsFactors=FALSE)

produces an Authors column that holds one character string per row. What I want instead is a list of strings in each field of the Authors column.

score 3 · Accepted Answer

Are you saying that your file contains five variables (ID, year, title, authors, keywords) that are separated by semicolons? Then, by definition, it's not a csv file! Remember that csv stands for comma-separated values. Somebody screwed up by naming it as such.

You can read arbitrarily-delimited data using read.table:

articles <- read.table("ls.csv", header=TRUE, sep=";", stringsAsFactors=FALSE)

score 0 · Accepted Answer

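As Hong Ooi pointed out, your fields are separated by ";" rather than ",". The function read.csv has the default sep="," whereas read.csv2 has the default sep=";". If I understand correctly, the entries inside your Authors and Keywords fields are separated by "," and you want to split those apart as well.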
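As a minimal sketch of the read.csv2 route, the lines below read two of the sample rows from an inline string (via the text argument, so no file is needed). The variable name csv_text is only illustrative, and the column selection assumes that the trailing ";" on each line may create an extra empty column.

```r
# Sample lines from the question, inlined so the sketch is self-contained
csv_text <- "ID;Year;Title;Authors;Keywords;
1;2013;Towards Dynamic Non-obtrusive Health Monitoring Based on SOA and Cloud;Mohammed Serhani, Abdelghani Benharret, Erlabi Badidi;E-health, Diseases, Monitoring, Prevention, SOA, Cloud, Platform, m-tech;
2;1234;Title2;Author1, Author2;Key1, Key2, Key3;"

# read.csv2 defaults to sep = ";" (and dec = ",")
articles <- read.csv2(text = csv_text, stringsAsFactors = FALSE)

# The trailing ";" on each line may yield an extra empty column,
# so keep only the five named fields
articles <- articles[, c("ID", "Year", "Title", "Authors", "Keywords")]

# At this point each Authors entry is still one comma-separated string
str(articles$Authors)
```

The Authors and Keywords values still need a second split on "," before they can be treated as lists.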

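I don't think you can get list-type items into the Authors and Keywords columns simply by passing lists to data.frame(): when data.frame() is given a list, it breaks the list up into its component columns. In your case that will not work, because the number of authors and/or keywords differs from row to row: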

# Works
data.frame(a=list(first=1:3, second=letters[1:3]), b=list(first=4:6, second=LETTERS[1:3]))
#  a.first a.second b.first b.second
#1       1        a       4        A
#2       2        b       5        B
#3       3        c       6        C

# Does not work
data.frame(a=list(first=1:3, second=letters[1:2]), b=list(first=4:6, second=LETTERS[1:6]))
#Error in data.frame(first = 1:3, second = c("a", "b"), check.names = FALSE,  : 
#  arguments imply differing number of rows: 3, 2
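One caveat worth noting: base R does accept a list as a single column when the list is assigned to an existing data frame (or wrapped in I() inside the data.frame() call), as long as it has one element per row. The sketch below assumes the authors are separated by a comma followed by optional spaces; the name author_list is only illustrative.

```r
# A small data frame with comma-separated author strings (values from the example)
articles <- data.frame(
  ID      = 1:2,
  Authors = c("Mohammed Serhani, Abdelghani Benharret, Erlabi Badidi",
              "Author1, Author2"),
  stringsAsFactors = FALSE
)

# strsplit() returns a list with one character vector per row;
# the pattern ", *" also consumes the spaces after each comma
author_list <- strsplit(articles$Authors, ", *")

# Assigning a list with nrow(articles) elements stores it as a list column
articles$Authors <- author_list

articles$Authors[[1]]      # character vector holding the three names of row 1
lengths(articles$Authors)  # number of authors per row
```

Many functions that expect atomic columns (write.csv, for example) do not handle list columns well, so this layout is mainly useful for in-memory work.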

But since a list can contain other lists, you could try breaking your data down into that kind of structure instead. Contents of "example.txt":

ID;Year;Title;Authors;Keywords;
1;2013;Towards Dynamic Non-obtrusive Health Monitoring Based on SOA and Cloud;Mohammed Serhani, Abdelghani Benharret, Erlabi Badidi;E-health, Diseases, Monitoring, Prevention, SOA, Cloud, Platform, m-tech;
2;1234;Title2;Author1, Author2;Key1, Key2, Key3;
3;5678;Title3;Author3, Author4, Author5;Key1, Key2, Key4;

Here is an example of how to do this:

x <- scan("example.txt", what="", sep="\n", strip.white=TRUE)
y <- strsplit(x, ";")
# Leave out the header
dat <- y[-1]

# Apply a function to every record (one character vector of fields per line)
dat <- lapply(dat,
    FUN=function(fields) {
        # Split every field on commas; this turns Authors and Keywords into
        # vectors (a Title containing a comma would be split as well)
        ret <- strsplit(fields, ",")
        # Remove leading and trailing whitespace
        ret <- lapply(ret, FUN=function(z) gsub("(^ +)|( +$)", "", z))
        # Name the fields after the header line
        names(ret) <- unlist(y[1])
        ret
    }
)

Output:

> str(dat)
List of 3
 $ :List of 5
  ..$ ID      : chr "1"
  ..$ Year    : chr "2013"
  ..$ Title   : chr "Towards Dynamic Non-obtrusive Health Monitoring Based on SOA and Cloud"
  ..$ Authors : chr [1:3] "Mohammed Serhani" "Abdelghani Benharret" "Erlabi Badidi"
  ..$ Keywords: chr [1:8] "E-health" "Diseases" "Monitoring" "Prevention" ...
 $ :List of 5
  ..$ ID      : chr "2"
  ..$ Year    : chr "1234"
  ..$ Title   : chr "Title2"
  ..$ Authors : chr [1:2] "Author1" "Author2"
  ..$ Keywords: chr [1:3] "Key1" "Key2" "Key3"
 $ :List of 5
  ..$ ID      : chr "3"
  ..$ Year    : chr "5678"
  ..$ Title   : chr "Title3"
  ..$ Authors : chr [1:3] "Author3" "Author4" "Author5"
  ..$ Keywords: chr [1:3] "Key1" "Key2" "Key4"

# Keywords of first item
> dat[[1]]$Keywords
[1] "E-health"   "Diseases"   "Monitoring" "Prevention" "SOA"       
[6] "Cloud"      "Platform"   "m-tech"  

# Title of second item
> dat[[2]][[3]]
[1] "Title2"

# Traveling inside the list of lists, accessing the very last data element
> lastitem <- length(dat)
> lastfield <- length(dat[[lastitem]])
> lastkey <- length(dat[[lastitem]][[lastfield]])
> dat[[lastitem]][[lastfield]][[lastkey]]
[1] "Key4"
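Beyond indexing single elements, the same list-of-lists layout can be queried record by record. The following is a rough sketch using a hand-built stand-in for dat (two records with values from the example file); the names titles, has_key4 and ids_with_key4 are illustrative only.

```r
# A hand-built stand-in for `dat`, using values from the example file
dat <- list(
  list(ID = "2", Title = "Title2", Keywords = c("Key1", "Key2", "Key3")),
  list(ID = "3", Title = "Title3", Keywords = c("Key1", "Key2", "Key4"))
)

# Pull one field out of every record
titles <- vapply(dat, function(rec) rec$Title, character(1))

# Keep only the records whose Keywords contain "Key4"
has_key4 <- Filter(function(rec) "Key4" %in% rec$Keywords, dat)
ids_with_key4 <- vapply(has_key4, function(rec) rec$ID, character(1))
```

vapply() is used instead of sapply() so that the result type is fixed to one string per record.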

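Note that a list of lists can be an inefficient way to store data in R, so if you have a large amount of data you may want to move to a more efficient approach, for example a relational database structure in which the access key is your ID, assuming it is unique.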
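One way to approximate such a relational layout without leaving R is a "long" table with one row per (ID, Author) pair. This is only a sketch under the assumption that ID is unique; the table name article_authors and the reduced two-record dat are made up for illustration.

```r
# A reduced stand-in for `dat`, holding only the fields needed here
dat <- list(
  list(ID = "2", Authors = c("Author1", "Author2")),
  list(ID = "3", Authors = c("Author3", "Author4", "Author5"))
)

ids     <- vapply(dat, function(rec) rec$ID, character(1))
authors <- lapply(dat, function(rec) rec$Authors)

# One row per (ID, Author) pair: repeat each ID once per author
article_authors <- data.frame(
  ID     = rep(ids, times = lengths(authors)),
  Author = unlist(authors),
  stringsAsFactors = FALSE
)

# Look up all authors of article "3"
article_authors$Author[article_authors$ID == "3"]
```

A table in this shape has only atomic columns, so it could also be written to a database or joined back to the article table with merge() on ID.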
