
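I'm analyzing a database of student report cards from a school. My dataset contains about 3000 records, structured like the sample below. Each observation is one teacher's evaluation of one student and includes a three-sentence narrative comment.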

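To share the results of my analysis, I'd like to remove mentions of student names from the comments and replace them with other names. In an ideal world, I'd also like to share an anonymized version of the database for reproducibility.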

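The inconsistent use of student names (first name vs. nickname vs. full name) and their unstructured placement within the comments make this very tricky for an amateur like me. My attempt at solving it was to treat the comments as documents in a corpus and write a function that uses tm::removeWords, but it did not work for me. Thanks in advance!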

Example data (dput of the table here)

  Teacher Subject      Student.Name                                                         Comment
1   Black    Math    Richard (Dick) Dick is a terrible student-- why hasn't he been kicked out yet?
2   Black    Math Elizabeth (Betty)                       Betty procrastinates, but does good work.
3   Black    Math   Mary Grace (MG)                      As her teacher, I think MG is my favorite.
4   Brown English    Richard (Dick)                      Richard is terrible at turning in homework.
5   Brown English Elizabeth (Betty)                Elizabeth's work is interfering with her studies.
6   Brown English   Mary Grace (MG)                         Mary Grace should be a teacher someday.
7    Blue    P.E.    Richard (Dick)  Richard (Dick) kicked more field goals than any other student.
8    Blue    P.E. Elizabeth (Betty)    Elizabeth (Betty) needs to work to communicate on the field.
9    Blue    P.E.   Mary Grace (MG)             Mary Grace (MG) needs to stop insulting the teacher
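For anyone who wants to try this without the dput, the nine rows above can be rebuilt roughly as follows. This is a minimal sketch: the data frame name `reports` and the choice of `stringsAsFactors = TRUE` (so that `Student.Name` is a factor) are assumptions, not something guaranteed to match the real data.

```r
# Rebuild the sample table as a data frame called `reports` (assumed name).
reports <- data.frame(
  Teacher = rep(c("Black", "Brown", "Blue"), each = 3),
  Subject = rep(c("Math", "English", "P.E."), each = 3),
  Student.Name = rep(c("Richard (Dick)", "Elizabeth (Betty)", "Mary Grace (MG)"), times = 3),
  Comment = c(
    "Dick is a terrible student-- why hasn't he been kicked out yet?",
    "Betty procrastinates, but does good work.",
    "As her teacher, I think MG is my favorite.",
    "Richard is terrible at turning in homework.",
    "Elizabeth's work is interfering with her studies.",
    "Mary Grace should be a teacher someday.",
    "Richard (Dick) kicked more field goals than any other student.",
    "Elizabeth (Betty) needs to work to communicate on the field.",
    "Mary Grace (MG) needs to stop insulting the teacher"
  ),
  stringsAsFactors = TRUE  # assumption: text columns stored as factors
)
str(reports)
```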

Desired data

Teacher Subject Student Name    Comment
Black   Math    A   A is a terrible student-- why hasn't he been kicked out yet?
Black   Math    B   B procrastinates, but does good work.
Black   Math    C   As her teacher, I think C is my favorite.
Brown   English A   A is terrible at turning in homework
Brown   English B   B's work is interfering with her studies.
Brown   English C   C should be a teacher someday.
Blue    P.E.    A   A kicked more field goals than any other student.
Blue    P.E.    B   B needs to work to communicate on the field.
Blue    P.E.    C   C needs to stop insulting the teacher

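Note

I asked a version of this question four months ago and got no response. I thought it would help to show my attempted solution, but perhaps the tm package isn't widely used. So here's another shot.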

2 Answers

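I would use mgsub from the qdap package here. You could do something like this (though take care that each student always gets the same id; this may also be too specific to your example, in which every student has a nickname):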

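# Unique roster entries, e.g. "Richard (Dick)"
student_names <- unique(as.character(reports$Student.Name))
# One random numeric id per student
ids <- sample(100000, length(student_names))

# Strings to replace: full entry, nickname inside the parentheses, name without the nickname
tocheck <- c(
  student_names,
  unlist(regmatches(student_names, gregexpr("(?<=\\().*?(?=\\))", student_names, perl = TRUE))),
  gsub("\\s*\\([^\\)]+\\)", "", student_names)
)
# Look each row's id up by name instead of relying on row order
reports$Student.Name <- ids[match(as.character(reports$Student.Name), student_names)]
reports$Comment <- qdap::mgsub(tocheck, rep(ids, 3), reports$Comment)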

  Student.Name                                                          Comment
1        61034 61034 is a terrible student-- why hasn't he been kicked out yet?
2        45005                        45005 procrastinates, but does good work.
3        13699                    As her teacher, I think 13699 is my favorite.
4        61034                         61034 is terrible at turning in homework
5        45005                    45005's work is interfering with her studies.
6        13699                               13699 should be a teacher someday.
7        61034            61034 kicked more field goals than any other student.
8        45005                 45005 needs to work to communicate on the field.
9        13699                        13699 needs to stop insulting the teacher
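If some students have no nickname, the fixed three-way `rep(ids, 3)` pairing no longer lines up. A base R sketch of one way around that is below. `alias_pattern()` and `anonymize()` are hypothetical helpers (not qdap functions), the `S1`, `S2`, ... ids are arbitrary, and the code assumes names contain only letters and spaces so no regex escaping is needed.

```r
# Hypothetical helper: build one regex per roster entry such as "Mary Grace (MG)".
alias_pattern <- function(entry) {
  # nickname inside the parentheses, or character(0) if there is none
  nick <- regmatches(entry, regexpr("(?<=\\()[^)]+(?=\\))", entry, perl = TRUE))
  # name with the parenthesised nickname removed
  full <- trimws(sub("\\s*\\([^)]*\\)", "", entry))
  pat <- sprintf("\\b%s\\b", full)
  if (length(nick) == 1) {
    # matches "Full (Nick)", "Full", or "Nick" on its own
    pat <- sprintf("\\b%s\\b(?:\\s*\\(%s\\))?|\\b%s\\b", full, nick, nick)
  }
  pat
}

# Hypothetical helper: replace each student's aliases in that student's own comments.
anonymize <- function(df) {
  entries <- unique(as.character(df$Student.Name))
  ids <- paste0("S", seq_along(entries))
  row_id <- ids[match(as.character(df$Student.Name), entries)]
  comments <- as.character(df$Comment)
  for (i in seq_along(entries)) {
    rows <- row_id == ids[i]
    comments[rows] <- gsub(alias_pattern(entries[i]), ids[i], comments[rows], perl = TRUE)
  }
  df$Student.Name <- row_id
  df$Comment <- comments
  df
}

# Made-up rows for illustration, including a student without a nickname
demo <- data.frame(
  Student.Name = c("Richard (Dick)", "Mary Grace (MG)", "Ann", "Richard (Dick)"),
  Comment = c("Dick is late.", "Mary Grace (MG) is kind.", "Ann reads well.", "Richard improved."),
  stringsAsFactors = FALSE
)
anonymize(demo)
```

Because each pattern is applied only to that student's own rows, a comment that mentions a different student by name would still need separate handling.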
answered 2016-09-05T23:00:56.443

I don't think there is a simple one-size-fits-all solution. I would probably try regular expressions.

## load dput data
#eval(parse(text=paste0(readLines("http://pastebin.com/raw/MbghGybd", warn = F), collapse="\n")))

# anonymize:
lv <- levels(reports$Student.Name)
# capture first name, optional second name, and the nickname in parentheses
r <- regexec("(\\w+)\\s(?:(\\w+)\\s)?\\((\\w+)\\)", lv)
m <- regmatches(lv, r)
names(m) <- lv
m <- lapply(m, function(x) {
  # "Mary Grace", or just "Richard" when there is no second name
  full <- trimws(paste(x[2], x[3]))
  # longest alternatives first, so "Mary Grace" is matched before "Mary"
  alts <- c(sprintf("%s\\s*\\(%s\\)", full, x[4]), sprintf("\\b%s\\b", unique(c(full, x[2], x[4]))))
  paste(alts, collapse = "|")
})
by_student <- split(reports, reports$Student.Name)
for (i in seq_along(by_student)) {
  # replace every name variant with the student's index among the factor levels
  by_student[[i]]$Comment <- gsub(m[[names(by_student)[i]]], i, by_student[[i]]$Comment, perl = TRUE)
}
transform(do.call(rbind, by_student), Student.Name = as.integer(Student.Name))
#                     Teacher Subject Student.Name                                                      Comment
# Elizabeth (Betty).2   Black    Math            1                        1 procrastinates, but does good work.
# Elizabeth (Betty).5   Brown English            1                    1's work is interfering with her studies.
# Elizabeth (Betty).8    Blue    P.E.            1                 1 needs to work to communicate on the field.
# Mary Grace (MG).3     Black    Math            2                    As her teacher, I think 2 is my favorite.
# Mary Grace (MG).6     Brown English            2                               2 should be a teacher someday.
# Mary Grace (MG).9      Blue    P.E.            2                        2 needs to stop insulting the teacher
# Richard (Dick).1      Black    Math            3 3 is a terrible student-- why hasn't he been kicked out yet?
# Richard (Dick).4      Brown English            3                         3 is terrible at turning in homework
# Richard (Dick).7       Blue    P.E.            3            3 kicked more field goals than any other student.

But this will certainly need a lot of tweaking to fit your real dataset.
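While tweaking, it may help to check the cleaned comments for name fragments that slipped through, such as a leftover second name. A rough sketch of such a check is below; `find_leaks()` is a hypothetical helper, the two `cleaned` strings are made up for illustration, and the match is case-sensitive and limited to tokens that appear in the roster.

```r
# Hypothetical helper: return the indices of comments that still contain
# any name token taken from the roster entries.
find_leaks <- function(comments, roster) {
  # split entries like "Mary Grace (MG)" into "Mary", "Grace", "MG"
  tokens <- unique(unlist(strsplit(as.character(roster), "[^[:alpha:]]+")))
  # drop empty strings and single letters
  tokens <- tokens[nchar(tokens) > 1]
  pattern <- sprintf("\\b(%s)\\b", paste(tokens, collapse = "|"))
  which(grepl(pattern, as.character(comments), perl = TRUE))
}

roster <- c("Richard (Dick)", "Elizabeth (Betty)", "Mary Grace (MG)")
# made-up cleaned comments; the second one still contains "Grace"
cleaned <- c("3 is terrible at turning in homework",
             "2 Grace should be a teacher someday.")
find_leaks(cleaned, roster)
```

Any index it returns points at a comment worth inspecting by hand before sharing the data.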

answered 2016-09-05T21:51:48.207