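I am analyzing a database of student report cards from a school. My dataset contains roughly 3,000 records, structured like the sample below. Each observation is one teacher's evaluation of one student, and each includes a three-sentence narrative comment.
To share the results of my analysis, I would like to remove mentions of student names from the comments and replace them with other names. In an ideal world, I would also like to share an anonymized version of the database for reproducibility.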
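The inconsistent use of student names (first name vs. nickname vs. full name), together with the unstructured way the names appear in the comments, makes this very tricky for an amateur like me. My attempt so far has been to treat the comments as documents in a corpus and to write a function that uses tm::removeWords, but it has not worked for me. Thanks in advance!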
Sample data
  Teacher Subject Student.Name      Comment
1 Black   Math    Richard (Dick)    Dick is a terrible student-- why hasn't he been kicked out yet?
2 Black   Math    Elizabeth (Betty) Betty procrastinates, but does good work.
3 Black   Math    Mary Grace (MG)   As her teacher, I think MG is my favorite.
4 Brown   English Richard (Dick)    Richard is terrible at turning in homework.
5 Brown   English Elizabeth (Betty) Elizabeth's work is interfering with her studies.
6 Brown   English Mary Grace (MG)   Mary Grace should be a teacher someday.
7 Blue    P.E.    Richard (Dick)    Richard (Dick) kicked more field goals than any other student.
8 Blue    P.E.    Elizabeth (Betty) Elizabeth (Betty) needs to work to communicate on the field.
9 Blue    P.E.    Mary Grace (MG)   Mary Grace (MG) needs to stop insulting the teacher
Desired data
Teacher Subject Student.Name Comment
Black   Math    A            A is a terrible student-- why hasn't he been kicked out yet?
Black   Math    B            B procrastinates, but does good work.
Black   Math    C            As her teacher, I think C is my favorite.
Brown   English A            A is terrible at turning in homework.
Brown   English B            B's work is interfering with her studies.
Brown   English C            C should be a teacher someday.
Blue    P.E.    A            A kicked more field goals than any other student.
Blue    P.E.    B            B needs to work to communicate on the field.
Blue    P.E.    C            C needs to stop insulting the teacher
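To make the target concrete, the mapping from the sample data to the desired data could be sketched in base R roughly as follows. This is only an illustration under several assumptions, not the tm-based attempt mentioned above: the function name `anonymize_comments` is made up, every name is assumed to follow the `Formal (Nickname)` pattern of the sample, names are assumed to contain no regex metacharacters, and `LETTERS` limits the labels to 26 students. Each comment is only scanned for the name of the student it is about.

```r
# Sample data, typed in from the table above.
df <- data.frame(
  Teacher = rep(c("Black", "Brown", "Blue"), each = 3),
  Subject = rep(c("Math", "English", "P.E."), each = 3),
  Student.Name = rep(c("Richard (Dick)", "Elizabeth (Betty)", "Mary Grace (MG)"),
                     times = 3),
  Comment = c(
    "Dick is a terrible student-- why hasn't he been kicked out yet?",
    "Betty procrastinates, but does good work.",
    "As her teacher, I think MG is my favorite.",
    "Richard is terrible at turning in homework.",
    "Elizabeth's work is interfering with her studies.",
    "Mary Grace should be a teacher someday.",
    "Richard (Dick) kicked more field goals than any other student.",
    "Elizabeth (Betty) needs to work to communicate on the field.",
    "Mary Grace (MG) needs to stop insulting the teacher"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace each student's name variants with a letter code.
anonymize_comments <- function(df) {
  students <- unique(df$Student.Name)
  # One code per student, in order of first appearance. LETTERS only covers
  # 26 students; a larger roster would need another labelling scheme.
  codes <- LETTERS[seq_along(students)]

  out <- df
  for (i in seq_along(students)) {
    full   <- students[i]                          # e.g. "Richard (Dick)"
    formal <- trimws(sub("\\(.*\\)", "", full))    # e.g. "Richard"
    nick   <- if (grepl("\\(", full)) {
      sub("^.*\\((.*)\\).*$", "\\1", full)         # e.g. "Dick"
    } else {
      character(0)
    }
    rows <- df$Student.Name == full

    txt <- out$Comment[rows]
    # Replace the longest variant first, so "Richard (Dick)" does not
    # end up as "A (A)".
    txt <- gsub(full, codes[i], txt, fixed = TRUE)
    for (v in c(formal, nick)) {
      # \\b keeps the match to whole words only.
      txt <- gsub(paste0("\\b", v, "\\b"), codes[i], txt, perl = TRUE)
    }
    out$Comment[rows] <- txt
    out$Student.Name[rows] <- codes[i]
  }
  out
}

anon <- anonymize_comments(df)
```

Variants that cannot be derived from the `Student.Name` column (misspellings, other nicknames, surnames) would not be caught by this and would need an explicit lookup table.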
Note
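I asked a version of this question four months ago and got no response. I thought that showing my attempted solution would help, but perhaps the tm package is not widely used. So here is another shot.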