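I tried finding similar questions and working out a solution on my own. However, I was not very satisfied with the result, so I decided to ask here.
Objective:
I want to use regular expressions and gsub to remove some expressions (c(\" and \"a\")) that appear at the start and end of my strings.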
#test strings 1 and 2
string1<- "c(\"can't remember the last time\" \"\\a\")"
string2<- "c(\"can't remember the last time\" \"a\")"
#Attempted solution for string1
string1<- gsub("^.\\(","",string1)
string1<- gsub("\\\\.","",string1)
#Result
string1
> "\"can't remember the last time\" \"\")"
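Question 1: How do I remove the remaining backslashes without running into the trailing-backslash problem? I cannot use [[:punct:]] because it would remove the other punctuation as well.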
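A point that may be relevant to Question 1, sketched here under the assumption that the result is exactly the string printed above: the backslashes in the printed result are how R displays embedded double quotes, so the string may not contain any literal backslash at all. The names s and cleaned are illustrative only.

```r
# Result of the two gsub() calls on string1, copied from the output above
s <- "\"can't remember the last time\" \"\")"

print(s)       # escaped view: embedded quotes are displayed as \"
writeLines(s)  # raw view: the characters actually stored in the string

# Check for a literal backslash; fixed = TRUE avoids regex escaping
has_backslash <- grepl("\\", s, fixed = TRUE)
has_backslash  # FALSE

# Remove double quotes and parentheses only, then trim surrounding whitespace.
# The apostrophe in "can't" is untouched because it is not in the class.
cleaned <- trimws(gsub("[\"()]", "", s))
cleaned  # "can't remember the last time"
```

A narrow character class is used instead of [[:punct:]] so that other punctuation survives.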
#Attempted solution for string2
string2<- gsub("^.\\(","",string2)
string2<- gsub(".\\{1}","",string2)
#Result
string2
> "\"can't remember the last time\" \"a\")"
Question 2: How do I remove the 'a\' expression and the remaining backslashes?
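For reference, here is a minimal sketch of one way both test strings could be handled with the same two patterns, assuming the goal is the bare sentence with the wrapper removed. The helper name clean_string is hypothetical, and the patterns are tailored to these two strings only.

```r
string1 <- "c(\"can't remember the last time\" \"\\a\")"
string2 <- "c(\"can't remember the last time\" \"a\")"

# Hypothetical helper: strips the leading c(" and the trailing " "a") or " "\a")
clean_string <- function(x) {
  # Leading wrapper: a literal c, an opening parenthesis, and a double quote
  x <- sub("^c\\(\"", "", x)
  # Trailing wrapper: quote, optional whitespace, quote, optional literal
  # backslash (\\\\? in the R string), the letter a, quote, closing parenthesis
  sub("\"[[:space:]]*\"\\\\?a\"\\)$", "", x)
}

clean_string(string1)  # "can't remember the last time"
clean_string(string2)  # "can't remember the last time"
```

Anchoring with ^ and $ keeps the patterns from touching anything in the middle of the sentence.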
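PS. The strings were created by exporting data from tables in a Word document to text files using Java, and then importing the text files into R. But I just want to see how regular expressions can be used to clean up this mess, rather than find faults with the Java program that exported the data.
Thanks.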
Edit: Sorry for not making the question clear. This is how I want the final sentence to look:
"can't remember the last time"
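Second edit
The story of the weird strings: the strings shown above were selected from a corpus that I built with the tm package, using the DirSource command. The original text was stored in tables in MS Word. I used Java to export it, creating one text file per string, and imported those files into R. In case it helps, the dput output is below: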
structure(c("Can't remember the last time",
"\a"), Author = character(0), DateTimeStamp = structure(list(
sec = 40.6046140193939, min = 56L, hour = 13L, mday = 29L,
mon = 5L, year = 113L, wday = 6L, yday = 179L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "comment1.txt", Language = "english", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument",
"TextDocument", "character"))
"\a"), Author = character(0), DateTimeStamp = structure(list(
sec = 40.7186260223389, min = 56L, hour = 13L, mday = 29L,
mon = 5L, year = 113L, wday = 6L, yday = 179L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "comment99.txt", Language = "english", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument",
"TextDocument", "character"))
I can see the "c(" and "\a" in the output above.
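One reading of the dput output, offered as an assumption rather than a diagnosis: each document is a two-element character vector whose second element is "\a", which R parses as the single alert (BEL) control character rather than a backslash followed by the letter a. Under that assumption, the stray element could be dropped before the text is ever collapsed into one string. The vector doc below is a hypothetical stand-in for the content of one document.

```r
# Hypothetical stand-in for the content of one document in the corpus
doc <- c("Can't remember the last time", "\a")

# "\a" is one control character (BEL), not a backslash plus "a"
nchar(doc[2])  # 1

# Strip control characters, then keep only the elements that still have text
stripped <- gsub("[[:cntrl:]]", "", doc)
cleaned <- stripped[nzchar(stripped)]
cleaned  # "Can't remember the last time"
```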