I looked for similar questions and tried to work out a solution myself, but I was not satisfied with it, so I decided to ask here.

Goal: I want to use regular expressions and gsub to remove some expressions that appear at the beginning and end of a string (c(\" at the start and \"a\") at the end).

#test strings 1 and 2
string1<- "c(\"can't remember the last time\" \"\\a\")"
string2<- "c(\"can't remember the last time\" \"a\")"

#Attempted solution for string1
string1<- gsub("^.\\(","",string1)
string1<- gsub("\\\\.","",string1)

#Result
string1
> "\"can't remember the last time\" \"\")"

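Question 1: How do I remove the remaining backslashes without running into the trailing backslash problem? I can't use [[:punct:]] because it would also remove other punctuation.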

#Attempted solution for string2
string2<- gsub("^.\\(","",string2)
string2<- gsub(".\\{1}","",string2)

#Result
string2
> "\"can't remember the last time\" \"a\")"

Question 2: How do I remove the 'a\' expression and the remaining backslashes?

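PS. The strings were produced by using Java to export data from tables in a Word document to text files, and then importing the text files into R. I would just like to see how regular expressions can be used to clean up this mess, rather than find out what is wrong with the Java program that exported the data.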

Thanks.

Edit: Sorry for not making the question clear. This is what I would like the final sentence to look like:

"can't remember the last time"

Second edit

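The story of the strange strings: the strings shown above were selected from a corpus that I built with the tm package, using the DirSource command. The original text was stored in tables in MS Word. I used Java to export it, creating one text file per string, and then imported the files into R. In case it helps, the dput output is as follows: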

structure(c("Can't remember the last time", 
"\a"), Author = character(0), DateTimeStamp = structure(list(
    sec = 40.6046140193939, min = 56L, hour = 13L, mday = 29L, 
    mon = 5L, year = 113L, wday = 6L, yday = 179L, isdst = 0L), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "comment1.txt", Language = "english", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument", 
"TextDocument", "character")) 
"\a"), Author = character(0), DateTimeStamp = structure(list(
    sec = 40.7186260223389, min = 56L, hour = 13L, mday = 29L, 
    mon = 5L, year = 113L, wday = 6L, yday = 179L, isdst = 0L), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "comment99.txt", Language = "english", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument", 
"TextDocument", "character"))

I can see "c(" and "\a" in the code above.
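
A small sketch, assuming the document content really is the two-element character vector that the dput output suggests (doc and cleaned are illustrative names, not part of the original code). In R source, "\a" is the escape for the single alert (bell) character, so under that assumption the stray element can be dropped by comparison, without any regular expression:

```r
# Assumed shape of one document, taken from the dput output above.
doc <- c("Can't remember the last time", "\a")

# "\a" is one character (the alert/bell control character), not a backslash plus "a".
bell_length <- nchar("\a")

# Keep every element that is not the lone bell character.
cleaned <- doc[doc != "\a"]
print(cleaned)
```

This only applies if the cleaning happens before the vector is collapsed into a single "c(...)" string.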

2 Answers

If the two substrings at the beginning and end are fixed for all strings, you don't need a regular expression at all. Just use substr:

substr(string2, 4, nchar(string2) - 6)
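
To see where the offsets 4 and 6 come from, here is a quick check, assuming string2 is defined exactly as in the question (prefix_len and suffix_len are illustrative names):

```r
string2 <- "c(\"can't remember the last time\" \"a\")"

# The fixed prefix is  c("  and the fixed suffix is  " "a")
prefix_len <- nchar("c(\"")        # 3 characters
suffix_len <- nchar("\" \"a\")")   # 6 characters

# Keep everything between the prefix and the suffix.
cleaned <- substr(string2, prefix_len + 1, nchar(string2) - suffix_len)
print(cleaned)
```

Note that string1 ends in one extra character (the backslash before a), so its suffix is 7 characters long and the same offsets would leave a trailing double quote there.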

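If the substring at the end varies, but can only contain backslashes, spaces, double quotes, and the letter a, the regular expression is: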

"[\\\\ \"a]*)$"

So we can use sub as follows:

sub("[\\\\ \"a]*)$", "", substr(string1, 4, nchar(string1)))
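
As a rough sketch, the two steps can be wrapped in a helper and applied to both test strings at once, since substr, nchar and sub are all vectorized. The name strip_wrapper is hypothetical, and the closing parenthesis is escaped here as \\) so that the pattern should also work with perl = TRUE:

```r
string1 <- "c(\"can't remember the last time\" \"\\a\")"
string2 <- "c(\"can't remember the last time\" \"a\")"

# Hypothetical helper: drop the leading  c("  by position, then trim the
# variable tail made of backslashes, spaces, double quotes and "a".
strip_wrapper <- function(x) {
  body <- substr(x, 4, nchar(x))
  sub("[\\\\ \"a]*\\)$", "", body)
}

cleaned <- strip_wrapper(c(string1, string2))
print(cleaned)
```

One caveat of this character class: a sentence whose last character is a or a space would lose that character too, because it is indistinguishable from the tail.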

answered 2013-06-29T14:53:56.907

As @Mark Miller pointed out, your question is not very clear. But I guess

library( stringr )
str_replace_all( string1, '\\"', "" )

solves your first question, and

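# Remove the "a fragment first, then any remaining double quotes
string2 <- str_replace_all( string2, '\\"a', "" )
string2 <- str_replace_all( string2, '\\"', "" )
# Escape the parenthesis so it is matched literally
str_replace( string2, '\\)', "" )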

the second.
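
It may help to note that the backslashes in the question's printed results are how print() escapes double quotes; they are not characters in the string. A minimal base R sketch, reusing the question's own gsub calls (step1, step2, has_backslash and no_quotes are illustrative names):

```r
string1 <- "c(\"can't remember the last time\" \"\\a\")"

step1 <- gsub("^.\\(", "", string1)   # drop the leading  c(
step2 <- gsub("\\\\.", "", step1)     # drop the backslash and the character after it

print(step2)       # shows the string with its quotes escaped
writeLines(step2)  # shows the characters actually stored

# Check for a literal backslash: none is left at this point.
has_backslash <- grepl("\\", step2, fixed = TRUE)

# Removing the double quotes themselves needs no regular expression.
no_quotes <- gsub("\"", "", step2, fixed = TRUE)
```

With the quotes gone, only the trailing space and closing parenthesis remain to be trimmed.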

answered 2013-06-29T14:38:05.693