
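I'm trying to write a program that cleans up some data using regular expressions. Suppose I have room names that may contain a letter and a number. In the final output, I need each room name in the pattern "full string (excluding letter and number) + letter + number", as in the example below. However, with the regular expressions I've written so far I get very messy results, shown at the bottom of my post. For some reason it puts letters and characters on some rows even when the input data may not contain any. Thank you.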

Edited: I've edited the input data. I'd like to generalize the code to handle an arbitrary number of words, not just the single word "ROOM".

# the pattern should be "the full string (excluding letter & number) + letter + number". For example:
ATLANTA ROOM
ATLANTA ROOM 3
NEW YORK ROOM A 2
ROOM A 4
THE BIG AWESOME ROOM B
ROOM B 4
GEORGETOWN ROOM B 2
NEW YORK ROOM C 2
NEW YORK ROOM C
LOS ANGELES ROOM E 2

# program to clean with regular expressions. there could be multiple spaces between words
dd <- c("ATLANTA ROOM ",
    " ATLANTA ROOM  3",
    "NEW YORK A ROOM   2",
    "4 ROOM A",
    "THE BIG AWESOME ROOM B",
    " ROOM 4 B",
    "GEORGETOWN B 2 ROOM ",
    " C NEW YORK ROOM 2",
    "NEW YORK ROOM C",
    "LOS ANGELES ROOM 2  E")

m_char_num <- regexpr("(\\<A|B|C|D|E|1|2|3|4\\>)", dd)
m_char <- regexpr("(\\<A|B|C|D|E\\>)", dd)
m_num <- regexpr("(\\<1|2|3|4\\>)", dd)

(dd2 <- paste(gsub("( +)", " ",
                   gsub("(^ +)|( +$)", "",
                        gsub("(\\<A|B|C|D|E|1|2|3|4\\>)", "", dd))),
              regmatches(dd, m_char), regmatches(dd, m_num), sep = " "))

# actual output from the program
"TLANTA ROOMA3",
"TLANTA ROOMA2",
"NW YORK ROOMA4",
"ROOMA4", 
"TH IG WSOM ROOME2",
"ROOMB2",
"GORGTOWN ROOMB2",
"NW YORK ROOMC3", 
"NW YORK ROOMC2",
"LOS NGLS ROOMA4"

3 Answers


Here's an attempt:

sub(' $', '', # clean up spaces at the end
    gsub(' +', ' ', # clean up double spaces
         # rearrange letter and numbers
         sub('^([A-Z]?)([0-9]*)([A-Z]?)$', 'ROOM \\1\\3 \\2',
             gsub(' |ROOM', '', dd)    # remove spaces and ROOM
            )
        )
   )
#[1] "ROOM"     "ROOM 3"   "ROOM A 2" "ROOM A 4" "ROOM B"   "ROOM B 4" "ROOM B 2"
#[8] "ROOM C 2" "ROOM C"   "ROOM E 2"

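Here is the same logic for the edited OP and the comments (assuming the words of a room name have at least 3 letters each and the room letter has at most 2 letters):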

gsub('(^ | $)', '', # clean up spaces in front or end
     gsub(' +', ' ', # clean up double spaces
          # extract room name and put it in front of the letter and number
          paste(gsub('\\b([A-Z][A-Z]?|[0-9]+)\\b', '', dd, perl = T),
                sub('^([A-Z]+)?([0-9]*)([A-Z]+)?$', '\\1\\3 \\2',
                    gsub(' |\\w\\w\\w+', '', dd)    # remove spaces and words
                   )
               )
         )
    )
answered 2013-09-25T15:55:50.360

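What's happening is that, for example, your program finds only 8 letters, so instead of "" or NA being inserted for the rows with no match, the 8 values are recycled.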
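The effect can be reproduced on a tiny example. This is a minimal sketch with made-up input (`x` and `base` are hypothetical names, not the question's data): `regmatches()` returns only the matches that exist, so its result is shorter than the input, and `paste()` silently recycles it.

```r
# Hypothetical toy input: the second element contains no digit
x <- c("ROOM 3", "ROOM", "ROOM 4")
base <- c("first", "second", "third")

m <- regexpr("[0-9]", x)       # -1 marks elements without a match
found <- regmatches(x, m)      # only the matches are returned: "3" "4"

length(found)                  # 2, although x has 3 elements

# paste() recycles the shorter vector without a warning,
# so "second" receives a digit that was never in its input
paste(base, found)
```

Here "second" ends up paired with "4" and "third" with a recycled "3", which is the same kind of misalignment seen in the question's output.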

Here's a fix:

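# Group the alternatives so that \\< and \\> apply to every option,
# not only to the first and the last one
m_char_num <- regexpr("\\<(A|B|C|D|E|1|2|3|4)\\>", dd)
m_char <- regexpr("\\<(A|B|C|D|E)\\>", dd)
m_num <- regexpr("\\<(1|2|3|4)\\>", dd)

# Start from empty strings and fill in only the elements that matched,
# so both vectors stay aligned with dd
numbers <- rep("", length(dd))
numbers[m_num > 0] <- regmatches(dd, m_num)

letters <- rep("", length(dd))   # note: masks the built-in `letters` constant
letters[m_char > 0] <- regmatches(dd, m_char)

# trimws() is base R (3.2.0 and later); gsub() collapses the doubled space
# left behind when the letter is empty
output <- trimws(gsub(" +", " ", paste("ROOM", letters, numbers)))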

[1] "ROOM" "ROOM 3" "ROOM A 2" "ROOM A 4" "ROOM B" "ROOM B 4" "ROOM B 2" "ROOM C 2" "ROOM C"
[10] "ROOM E 2"
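The same fill-in-place idea can be carried over to the edited input, where the room name is no longer just "ROOM". The following is only a sketch under stated assumptions: the room letter is a standalone A to E, the number is a standalone run of digits, and `first_token` is a hypothetical helper introduced here, not part of the answer above. It uses `perl = TRUE` for `\\b` word boundaries and `trimws()`, which needs R 3.2.0 or later.

```r
dd <- c("ATLANTA ROOM ", " ATLANTA ROOM  3", "NEW YORK A ROOM   2",
        "4 ROOM A", "THE BIG AWESOME ROOM B", " ROOM 4 B",
        "GEORGETOWN B 2 ROOM ", " C NEW YORK ROOM 2",
        "NEW YORK ROOM C", "LOS ANGELES ROOM 2  E")

# Hypothetical helper: first match of `pattern` per element, "" when absent,
# so the result always has the same length as x
first_token <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep("", length(x))
  out[m > 0] <- regmatches(x, m)
  out
}

room_letter <- first_token(dd, "\\b[A-E]\\b")    # standalone letter A-E (assumed)
room_number <- first_token(dd, "\\b[0-9]+\\b")   # standalone number (assumed)

# Whatever remains after removing standalone letters and numbers is the name
room_name <- gsub("\\b([A-E]|[0-9]+)\\b", "", dd, perl = TRUE)

# Reassemble as name + letter + number, then normalise the whitespace
cleaned <- trimws(gsub(" +", " ", paste(room_name, room_letter, room_number)))
cleaned
```

Because a standalone "A" is always treated as the room letter, a room name that contains the article "A" as a word would be misread; that trade-off follows from the assumptions above.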

answered 2013-09-25T16:07:15.910

Try this:

library(gsubfn)

# extract numbers (num) and room letters (char)
num <- sapply(strapplyc(dd, "\\d|$"), paste, collapse = "")
char <- sapply(strapplyc(dd, "[A-E]|$"), paste, collapse = "")  # A-E, so room letter E is captured too

# put back together and sort
out <- sort(paste("ROOM", char, num))

# trim spaces (optional)
out <- gsub(" +", " ", sub(" *$", "", out))

> out
 [1] "ROOM"     "ROOM 3"   "ROOM A 2" "ROOM A 4" "ROOM B"   "ROOM B 2"
 [7] "ROOM B 4" "ROOM C"   "ROOM C 2" "ROOM E 2"

Update: minor improvements

answered 2013-09-25T16:13:11.860