regex - Where is this whitespace hiding?

Question

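I have a character vector that was scraped from some PDF files with pdftotext (the command-line tool).

Everything lines up (blissfully). However, the vector is riddled with a kind of whitespace that eludes my regular expressions: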

> test
[1] "Address:"              "Clinic Information:"   "Store "                "351 South Washburn"    "Aurora Quick Care"    
[6] "Info"                  "St. Oshkosh, WI 54904" "Phone: 920‐232‐0718"   "Pewaukee"  

> grepl("[0-9]+ [A-Za-z ]+",test)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

> dput(test)
c("Address:", "Clinic Information:", "Store ", "351 South Washburn", 
"Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718", 
"Pewaukee")

> test.pasted <- c("Address:", "Clinic Information:", "Store ", "351 South Washburn", 
+                  "Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718", 
+                  "Pewaukee")

> grepl("[0-9]+ [A-Za-z ]+",test.pasted)
[1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE

> Encoding(test)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"

> Encoding(test.pasted)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "UTF-8"   "unknown"

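Clearly, some character is not being carried over by dput, as in the question below:

How to properly dput internationalized text?

I can't copy/paste the whole vector.... How do I search for and destroy this non-whitespace whitespace?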

Edit

Clearly I didn't explain this well, because the answers are all over the place. Here is a simpler test case:

> grepl("Clinic Information:", test[2])
[1] FALSE
> grepl("Clinic Information:", "Clinic Information:") # Where the second phrase is copy/pasted from the screen
[1] TRUE

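There is a space between the words "Clinic" and "Information" both as printed on screen and in the dput output, but whatever is in the string is not a standard space. My goal is to eliminate it so that I can grep that element properly.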

score 5 · Accepted Answer

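Promoting my comment to an answer:

Your string contains a no-break space (U+00A0), which was converted to a normal space when you pasted it. Perl-style regular expressions make it easy to match all the odd space-like characters in Unicode: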

grepl("[0-9]+\\p{Zs}[A-Za-z ]+", test, perl=TRUE)

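The Perl regex syntax is \p{categoryName}; the extra backslash is part of the R string syntax for writing a literal backslash; and "Zs" is the Unicode "Separator" category, "space" subcategory. A simpler approach that targets only the U+00A0 character is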

grepl("[0-9]+[ \\xa0][A-Za-z ]+", test)
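If the goal is to remove the character rather than match around it, one option is to confirm it is there and then replace it. The following is a minimal sketch, not the asker's actual data: the vector `x` is a hypothetical stand-in built with "\u00a0" escapes, and `x_clean` is just an illustrative name.

```r
# Hypothetical sample: "\u00a0" is a no-break space, standing in for the pdftotext output.
x <- c("351\u00a0South\u00a0Washburn", "Clinic\u00a0Information:")

# Show the code point of every character; 160 (0xA0) marks a no-break space.
utf8ToInt(x[2])

# Replace each no-break space with an ordinary space.
x_clean <- gsub("\u00a0", " ", x, fixed = TRUE)

# The original patterns, written with a plain space, can now be tried again.
grepl("[0-9]+ [A-Za-z ]+", x_clean)
grepl("Clinic Information:", x_clean[2], fixed = TRUE)
```

Normalizing the data once up front means later patterns can keep using an ordinary space, at the cost of discarding the distinction between the two characters.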

score 1 · Answer

I don't see anything unusual about the spaces, but the dashes in the phone number are U+2010 (HYPHEN), not the ASCII hyphen (U+002D).
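A quick way to check for this kind of character is to list the code points above the ASCII range. This is only a sketch under assumptions: `phone` is a hypothetical string written with "\u2010" escapes, and `cp` and `phone_ascii` are illustrative names.

```r
# Hypothetical sample; "\u2010" is the Unicode HYPHEN character.
phone <- "Phone: 920\u2010232\u20100718"

# List the distinct non-ASCII code points found in the string.
cp <- utf8ToInt(phone)
sprintf("U+%04X", unique(cp[cp > 127L]))

# Swap the Unicode hyphen for the ASCII hyphen-minus.
phone_ascii <- gsub("\u2010", "-", phone, fixed = TRUE)

# A pattern written with ASCII hyphens can now be tried against the result.
grepl("[0-9]{3}-[0-9]{3}-[0-9]{4}", phone_ascii)
```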

score 1 · Answer

I think you are after trailing and leading whitespace. If so, this function may work:

Trim <- function (x) gsub("^\\s+|\\s+$", "", x)

Also keep an eye out for tabs and the like; this may be useful:

clean <- function(text) {
    gsub("\\s+", " ", gsub("\r|\n|\t", " ", text))
}

So use clean first, then Trim, like this:

Trim(clean(test))

Also watch out for en dashes (–) versus hyphens (-).
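Depending on the regex engine and locale, `\\s` may not match Unicode separators such as U+00A0, so the trim above could leave them in place. A possible variant, offered only as a sketch, adds the Unicode separator category with `perl = TRUE`; `trim_unicode` is a hypothetical helper name and the input string is made up for illustration.

```r
# Hypothetical helper: trims ASCII whitespace plus any Unicode separator (category Z).
trim_unicode <- function(x) {
  gsub("^[\\s\\p{Z}]+|[\\s\\p{Z}]+$", "", x, perl = TRUE)
}

# Hypothetical input with no-break spaces at both ends.
trimmed <- trim_unicode("\u00a0Store \u00a0")
trimmed
```

This only handles the ends of each string; separators inside a string would still need a replacement step.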

score 0 · Answer

test <- c("Address:", "Clinic Information:", "Store ", "351 South Washburn", 
"Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718", 
"Pewaukee")

> grepl("[0-9]+ [A-Za-z ]+",test)
[1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE


library(stringr)
test2 <- str_trim(test, side = "both")

> grepl("[0-9]+ [A-Za-z ]+",test2)
[1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
# So there were no spaces in the vector, just the screen output in this case.
