
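I have an AppleScript that finds and replaces roughly a hundred terms using regular expressions. I want to port this find-and-replace function to R. So, in Script Editor, I saved the AppleScript as a text file and imported it into R with readLines(). The dput() of that import looks like punct.out below. When I build my own data frame of patterns and replacements from raw vectors rather than from the import (see punct below), the find-and-replace on a test string (see test below) works fine. But when I run the same command with the imported data frame, it does not work: it returns NA.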

So somehow the imported text is not being interpreted as regular expressions or as a character vector... I can't figure it out.

#structure of my imported patterns and replacements
punct.out<-structure(list(replace = c(NA, NA, "good-bye[a-z]+|good-bye", 
"good bye[a-z]+|good bye", "good-", "ill at ease", "ill-", "-like", 
" well,", "- well,", ", well,", "as well", ".,", ".... well", 
"... well", ". Well,", ": well,", "well-", "well,", "well,", 
"well,", "Well,", "- okay,", ", okay,", "okay,", " okay,", ".... okay", 
"... okay", ". Okay,", ": okay,", "OK", "'okay,", "okay,", "Okay,", 
"Okay", ", too", "too /", "too,", "too.", "too?", "too:", "(No)(. )([0-    9]+)", 
"( [A-Z])(.)( )", "www.", "ain't", "let's", "won't", "can't", 
"n't", "cannot", "'d", "'ll", "'m", "'ve", "'re", "!", "?", ";", 
"", ",", "--", "-", "-", "é", "è", "à", "ç", "&", "%", "per cent", 
"_", "Que.", "Ont.", "Nfld.", "Alta.", "Man.", "Sask.", "St.", 
"Ste.", "i.e.", "Mr.", "Ms.", "Mrs.", "Prof.", ".com", "a. m.", 
"p. m.", "a.m.", "p.m.", "Jan.", "Feb.", "Mar.", "Apr.", "Jun.", 
"Jul.", "Aug.", "Sept.", "Oct.", "Nov.", "Dec.", "gen.", "Dr.", 
"e. coli", "(.)([A-Z])(.)", "([A-Z])(.)([A-Z])", "([A-Z])(.)([A-Z])", 
"([A-Z])(.)([A-Z])", "([A-Z])(.)([A-Z])", "([A-Z])(.)([A-Z])", 
"([0-9])(.)([0-9])", "()(S)", "([a-z]+)(')", "(')([a-z]+)", "bull ' s eye", 
"no man ' s land", "pandora ' s box", "....", "...", ".", ",", 
":", "", "", "", "", NA, NA), with = c("character(0)", "character(0)", 
"goodbye", "goodbye", "good x", "ill at xease", "ill x", " xlike", 
" xwell", " xwell", " xwell", "as xwell", " ", " xwell", " xwell", 
". xWell", ": xwell", "well x", "xwell", " xwell", "xwell", "xWell", 
" xokay", " xokay", " xokay", " xokay", " xokay", " xokay", ". xOkay", 
": xokay", "okay", "xokay", "xokay", "xOkay", "xOkay", " xtoo", 
"xtoo /", "xtoo", "xtoo.", "xtoo.", "xtoo", "#\\\\3", "\\\\1\\\\3", 
"www", "am not", "let us", "will not", "can not", " not", "can not", 
" would", " will", " am", " have", " are", ".", ".", "", "", 
"", " ", " ", " ", "e", "e", "a", "c", "and", "percent", "percent", 
" ", "Que", "Ont", "Nfld", "Alta", "Man", "Sask", "St", "Ste", 
"ie", "Mr", "Ms", "Mrs", "Prof", "com", "am", "pm", " am", " pm", 
"Jan", "Feb", "Mar", "Apr", "Jun", "Jul", "Aug", "Sept", "Oct", 
"Nov", "Dec", "gen", "Dr", "e coli", "\\\\1\\\\2 ", "\\\\1\\\\3", 
"\\\\1\\\\3", "\\\\1\\\\3", "\\\\1\\\\3", "\\\\1\\\\3", "\\\\1dot\\\\3", 
"\\\\1 \\\\2", "\\\\1 \\\\2", "\\\\1 \\\\2", "bull's eye", "no man's land", 
"pandora's box", "", "", " . ", " ,", "", " ", " ", " ", " ", 
"character(0)", "character(0)")), .Names = c("replace", "with"
), row.names = c(NA, -127L), class = "data.frame")

#library
library(stringi)
#test string
test <- c('Sept.', 'Mr.', 'Oct.', 'ill at ease', 'as well', 'Dr.', 'OK',
          'well,', '.com')
#data frame of patterns and replacements
punct <- data.frame(replace = c('ill at ease', 'Sept.', 'Mr.', 'Oct.', 'as well',
                                'Dr.', 'OK', 'well,', '.com'),
                    with = c('ill at xease', 'Sept', 'Mr', 'Oct', 'as xwell',
                             'Dr', 'okay', 'xwell', 'com'))
#This works
stri_replace_all_regex(test, punct$replace, punct$with, vectorize_all = FALSE)
#But this doesn't: it returns NA
stri_replace_all_regex(test, punct.out$replace, punct.out$with,
                       vectorize_all = FALSE)

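Second question: I solved the problem above based on the comments below. However, a few of the regular expressions still cause specific problems. Specifically, I don't know how to escape the backslashes so that the replacement prints the first and second groups matched by the regular expression, i.e. \1, \2, etc.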

#Define data
punct.out <- structure(list(replace = c("(\\.)([A-Z])(\\.)", "([A-Z])(\\.)([A-Z])",
"([0-9])(\\.)([0-9])", "([a-z]+)(')", "(')([a-z]+)"), with =
c("\\\\1\\\\2 ",
"\\\\1\\\\3", "\\\\1dot\\\\3", "\\\\1 \\\\2", "\\\\1 \\\\2")), .Names =
c("replace",
"with"), row.names = c(104L, 105L, 110L, 112L, 113L), class = "data.frame")
#Test string of characters that the above regexes are supposed to match
test <- c('.B.', 'B.B', '1.1', 'premier\'s')
#This sort of works, but I clearly haven't figured out how to properly escape
#the backslashes to capture the references
stri_replace_all_regex(test, punct.out$replace, punct.out$with,
                       vectorize_all = FALSE)
#Based on the help for stri_replace I also tried using $ to capture the
#references.
punct.out$with<-gsub('\\\\\\\\', '$', punct.out$with)
#And it did work.
stri_replace_all_regex(test, punct.out$replace, punct.out$with, vectorize_all = FALSE)
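For readers comparing the two reference styles, here is a minimal sketch under some assumptions. It takes two of the replacement strings as imported (two literal backslashes before each group number) and converts them either to the single-backslash form that base R's gsub() expects or to the $ form described in the stringi help. The names with_imported, with_base, with_icu, and out_base are made up for this illustration, and the stringi call is guarded so the sketch also runs where stringi is not installed.

```r
# Two of the patterns from above, and their replacements as imported
# (each group reference is preceded by two literal backslashes).
patterns      <- c("([A-Z])(\\.)([A-Z])", "([0-9])(\\.)([0-9])")
with_imported <- c("\\\\1\\\\3", "\\\\1dot\\\\3")

# Base R gsub() wants one literal backslash before the group number.
with_base <- gsub("\\\\\\\\", "\\\\", with_imported)

# stringi (ICU regex) wants a dollar sign before the group number.
with_icu <- gsub("\\\\\\\\", "$", with_imported)

test <- c("B.B", "1.1")

# Apply every pattern in turn to every string, using base R only.
out_base <- test
for (i in seq_along(patterns)) {
  out_base <- gsub(patterns[i], with_base[i], out_base)
}
print(out_base)

# Same idea with stringi, if it is available.
if (requireNamespace("stringi", quietly = TRUE)) {
  print(stringi::stri_replace_all_regex(test, patterns, with_icu,
                                        vectorize_all = FALSE))
}
```

The doubled backslashes in the import are consistent with the AppleScript source having been escaped once already and then escaped again when read in as text, though that is a guess about the import and not something the data confirms.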

1 Answer


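punct.out contains missing observations (NA). That is why you get NA in the output. You should first drop them, for example with na.omit. In addition, when you do regular-expression matching, some characters (e.g. .) should be escaped, i.e. preceded by a backslash. Also note that the first column contains some empty strings; those should be removed as well.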
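A minimal sketch of that cleanup, using a small made-up table (imported) with the same shape as punct.out instead of the full data. It drops rows whose pattern is NA or empty and then escapes the dots. Escaping every dot is only appropriate if each dot in the patterns is meant literally, which is an assumption here. The loop uses base R gsub() as a stand-in that roughly mirrors applying each pattern in turn with vectorize_all = FALSE.

```r
# Toy stand-in for the imported table (hypothetical values).
imported <- data.frame(
  replace = c(NA, "Sept.", "", "Mr.", NA),
  with    = c("character(0)", "Sept", " ", "Mr", "character(0)"),
  stringsAsFactors = FALSE
)

# Keep only rows whose pattern is neither NA nor the empty string.
keep    <- !is.na(imported$replace) & nzchar(imported$replace)
cleaned <- imported[keep, ]

# Escape literal dots so "." no longer matches any character.
cleaned$replace <- gsub("\\.", "\\\\.", cleaned$replace)

# Apply each pattern in turn to every test string.
test   <- c("Sept.", "Mr.", "Septa")
result <- test
for (i in seq_len(nrow(cleaned))) {
  result <- gsub(cleaned$replace[i], cleaned$with[i], result)
}
print(result)
```

Without the escaping step, the pattern Sept. would also match "Septa" and turn it into "Sept", because an unescaped dot matches any character.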

Answered 2016-05-19T14:28:19.913