r - Reading text data into R

Question

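I have a list of items in a file, itemlist, stored as a single vector in which each item is identified by a single integer. I also have metadata for each item. In this case each item is a book on Amazon.com, and the metadata consists of the various attributes listed below. For each book in my item list, I would like to get its title, group, sales rank, and a few other fields. The metadata also contains data for other groups, such as DVDs, but I do not need that data and want to skip it. In the metadata file, each item and its attributes begin with "Id:" and end with a blank line. I have tried a bunch of tools in R without much success, and I am hoping someone can help.

Here is an excerpt of the metadata file covering 2 books (Id: 9 and Id: 10).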

Id:   9
ASIN: 1859677800
  title: Making Bread: The Taste of Traditional Home-Baking
  group: Book
  salesrank: 949166
  similar: 0
  categories: 1
   |Books[283155]|Subjects[1000]|Cooking, Food & Wine[6]|Baking[4196]|Bread[4197]
  reviews: total: 0  downloaded: 0  avg rating: 0

Id:   10
ASIN: 0375709363
  title: The Edward Said Reader
  group: Book
  salesrank: 220379
  similar: 5  039474067X  0679730672  0679750541  1400030668  0896086704
  categories: 3
   |Books[283155]|Subjects[1000]|Literature & Fiction[17]|History & Criticism[10204]|Criticism & Theory[10207]|General[10213]
   |Books[283155]|Subjects[1000]|Nonfiction[53]|Politics[11079]|History & Theory[11086]
   |Books[283155]|Subjects[1000]|Nonfiction[53]|Social Sciences[11232]|Anthropology[11233]|Cultural[11235]
  reviews: total: 6  downloaded: 6  avg rating: 4
    2000-10-8  cutomer: A2RI73IFW2GWU1  rating: 4  votes:  12  helpful:   7
    2001-5-4  cutomer: A1GE54WF2WUZ2X  rating: 5  votes:  11  helpful:   8
    2001-8-27  cutomer: A36S399V1VC4DR  rating: 4  votes:   5  helpful:   3
    2002-1-26  cutomer: A280GY5UVUS2QH  rating: 3  votes:  12  helpful:   7
    2004-4-7  cutomer: A2YHZJIU4L4IOI  rating: 4  votes:  10  helpful:   2
    2004-4-27  cutomer: A1MB83EO48TRSC  rating: 4  votes:   5  helpful:   3

score 1 · Accepted Answer

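Assuming the posted data is in a text file named myfile.txt, reduce it to the lines you want to use and then parse those lines to produce long-form data. Add a grp column that identifies which fields come from the same Id. Optionally, use dcast from the reshape2 package to reshape the result from long form to wide form: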

library(reshape2)

L <- readLines("myfile.txt")

# keep only the lines of interest; add other fields to the regular expression as needed
ok <- grep("^Id:|^ *title:|^ *group:", L, value = TRUE)

# create data frame in long form: lab is the field name, value is the text after it
# (strip only up to the first colon so that titles containing ": " stay intact)
long <- data.frame(lab = gsub("^ *|:.*", "", ok),
                   value = sub("^ *[^:]*: *", "", ok),
                   stringsAsFactors = FALSE)
long$grp <- cumsum(long$lab == "Id")

# optionally reshape it into wide form
wide <- dcast(long, grp ~ lab, value.var = "value")

The last line gives:

> wide
  grp group Id                                              title
1   1  Book  9 Making Bread: The Taste of Traditional Home-Baking
2   2  Book 10                             The Edward Said Reader
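
The question also asks to keep only books that appear in the item list, which the steps above do not cover. A minimal sketch of that last step follows, assuming a data frame shaped like wide. The DVD row and the contents of itemlist are invented here purely for illustration, and in practice itemlist would be read from the item list file.

```r
# Stand-in for the `wide` data frame built above.
# The DVD row is hypothetical and exists only to show the filter.
wide <- data.frame(
  grp   = 1:3,
  group = c("Book", "Book", "DVD"),
  Id    = c("9", "10", "11"),
  title = c("Making Bread: The Taste of Traditional Home-Baking",
            "The Edward Said Reader",
            "A Made-Up DVD"),
  stringsAsFactors = FALSE
)

# Hypothetical item list: integer IDs of the items of interest.
itemlist <- c(10L, 42L)

# The parsed Id values are strings, so convert before comparing with integers.
wide$Id <- as.integer(wide$Id)

# Keep rows that are books AND whose Id appears in the item list.
books <- wide[wide$group %in% "Book" & wide$Id %in% itemlist, ]
print(books)
```

Using %in% rather than == for the group test means a record with a missing group is simply dropped instead of producing an NA row.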

score 0 · Accepted Answer

If you use readLines, you can read this data into R as a character vector with one string per line:

z <- readLines("example-text.txt")

You can then use this initial read to read in each record individually with scan, or to split each record into its lines. For example:

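# line numbers where each record starts (anchored so that "Id" inside a title is not matched)
idpos <- grep("^Id:", z)
# first record: from its "Id:" line up to the line before the next "Id:"
scan("example-text.txt", skip = idpos[1] - 1, nlines = idpos[2] - idpos[1], what = "character", sep = "\n")
# last record: from its "Id:" line through the end of the file
scan("example-text.txt", skip = idpos[2] - 1, nlines = length(z) - idpos[2] + 1, what = "character", sep = "\n")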

You can then parse these strings in various ways to convert them into another data structure.
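
One possible way to do that parsing, sketched under a few assumptions: the lines are already in a character vector such as z, every record starts with a line beginning "Id:", and only the four fields named below are wanted. The helper get_field is a hypothetical name introduced for this example, and the sample lines are abridged from the excerpt in the question.

```r
# Sample lines abridged from the excerpt in the question.
z <- c(
  "Id:   9",
  "ASIN: 1859677800",
  "  title: Making Bread: The Taste of Traditional Home-Baking",
  "  group: Book",
  "  salesrank: 949166",
  "",
  "Id:   10",
  "ASIN: 0375709363",
  "  title: The Edward Said Reader",
  "  group: Book",
  "  salesrank: 220379",
  ""
)

# Number the records: the counter increases at every line starting with "Id:".
rec_id <- cumsum(grepl("^Id:", z))

# Drop blank lines and anything before the first record, then split by record.
keep <- rec_id > 0 & nzchar(trimws(z))
recs <- split(z[keep], rec_id[keep])

# Hypothetical helper: text after "name:" in one record, or NA if absent.
get_field <- function(rec, name) {
  hit <- grep(paste0("^ *", name, ":"), rec, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  sub("^ *[^:]*: *", "", hit[1])
}

# Build one column per field, one row per record.
fields <- c("Id", "title", "group", "salesrank")
cols <- lapply(fields, function(f) {
  vapply(recs, get_field, character(1), name = f, USE.NAMES = FALSE)
})
meta <- as.data.frame(setNames(cols, fields), stringsAsFactors = FALSE)
meta$Id <- as.integer(meta$Id)
meta$salesrank <- as.integer(meta$salesrank)
print(meta)
```

Splitting on the "Id:" counter avoids re-reading the file once per record, which may matter if the metadata file is large.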
