我正在尝试使用正则表达式编写一个程序来清理一些数据。假设我有一个带有字母和数字的房间名称。在最终输出中,我需要使用“完整字符串(不包括字母和数字)+字母+数字”模式输出房间名称,如下例所示。但是,到目前为止,使用我编写的正则表达式,我得到了非常混乱的结果,这些结果在我的消息的底部。出于某种原因,它会将字母和字符放在某些行上,即使输入数据中可能没有。谢谢你。
已编辑:我对输入数据进行了编辑。我想概括代码以采用任意数量的字符串,而不仅仅是单个单词“ROOM”。
# the pattern should be "the full string (excluding letter & number) + letter + number". For example:
ATLANTA ROOM
ATLANTA ROOM 3
NEW YORK ROOM A 2
ROOM A 4
THE BIG AWESOME ROOM B
ROOM B 4
GEORGETOWN ROOM B 2
NEW YORK ROOM C 2
NEW YORK ROOM C
LOS ANGELES ROOM E 2
# program to clean with regular expressions. there could be multiple spaces between words
dd <- c("ATLANTA ROOM ",
" ATLANTA ROOM 3",
"NEW YORK A ROOM 2",
"4 ROOM A",
"THE BIG AWESOME ROOM B",
" ROOM 4 B",
"GEORGETOWN B 2 ROOM ",
" C NEW YORK ROOM 2",
"NEW YORK ROOM C",
"LOS ANGELES ROOM 2 E")
m_char_num <- regexpr("(\\<A|B|C|D|E|1|2|3|4\\>)", dd)
m_char <- regexpr("(\\<A|B|C|D|E\\>)", dd)
m_num <- regexpr("(\\<1|2|3|4\\>)", dd)
(dd2 <- paste(gsub("( +)", " ",
gsub("(^ +)|( +$)", "",
gsub("(\\<A|B|C|D|E|1|2|3|4\\>)", "", dd))),
regmatches(dd, m_char), regmatches(dd, m_num), sep = " "))
# actual output from the program
"TLANTA ROOMA3",
"TLANTA ROOMA2",
"NW YORK ROOMA4",
"ROOMA4",
"TH IG WSOM ROOME2",
"ROOMB2",
"GORGTOWN ROOMB2",
"NW YORK ROOMC3",
"NW YORK ROOMC2",
"LOS NGLS ROOMA4"