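I have a character vector that is the result of scraping some PDF files with pdftotext (a command-line tool).
Everything is (blissfully) nicely lined up. However, the vector is riddled with a type of whitespace that eludes my regular expressions: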
> test
[1] "Address:" "Clinic Information:" "Store " "351 South Washburn" "Aurora Quick Care"
[6] "Info" "St. Oshkosh, WI 54904" "Phone: 920‐232‐0718" "Pewaukee"
> grepl("[0-9]+ [A-Za-z ]+",test)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> dput(test)
c("Address:", "Clinic Information:", "Store ", "351 South Washburn",
"Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718",
"Pewaukee")
> test.pasted <- c("Address:", "Clinic Information:", "Store ", "351 South Washburn",
+ "Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718",
+ "Pewaukee")
> grepl("[0-9]+ [A-Za-z ]+",test.pasted)
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
> Encoding(test)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
> Encoding(test.pasted)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "UTF-8" "unknown"
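Clearly, some characters are not preserved in the dput output, as in a related question.
I can't copy/paste the entire vector.... How do I search and destroy this non-whitespace whitespace?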
Edit
Evidently I wasn't clear at all, because the answers are all over the place. Here is a simpler test case:
> grepl("Clinic Information:", test[2])
[1] FALSE
> grepl("Clinic Information:", "Clinic Information:") # Where the second phrase is copy/pasted from the screen
[1] TRUE
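There is a space between the words "Clinic" and "Information" both as printed on screen and in the dput output, but whatever is in the string at that position is not a standard space. My goal is to eliminate it so that I can properly grep that element.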
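To make the problem concrete, here is a minimal sketch of how the offending character could be inspected and replaced. It assumes, purely for illustration, that the character is a no-break space (U+00A0); the string x below is a hypothetical stand-in for test[2], and the actual code point in the scraped vector may well be different.

```r
# Hypothetical stand-in for test[2]: a no-break space (U+00A0) sits between
# the two words. This is an assumption, not the confirmed content of the vector.
x <- "Clinic\u00a0Information:"

# List the Unicode code point of every character.
# A standard space is 32; a no-break space is 160.
cp <- utf8ToInt(x)
print(cp)

# Swap the assumed offending code point for a standard space and rebuild the string.
cp[cp == 160L] <- 32L
clean <- intToUtf8(cp)

# The literal pattern should now match the cleaned string.
print(grepl("Clinic Information:", clean, fixed = TRUE))
```

Running utf8ToInt(test[2]) on the real data would show which code point actually appears where the space is printed, and the 160L above would need to be adjusted accordingly. Note that utf8ToInt only accepts a single string, so for the whole vector it would have to be applied element by element.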