如果您使用的是伪 XML,最好自己定义解析规则。我喜欢stringr
和dplyr
喜欢这样的东西。
这是一个二元素向量(在您的情况下不是 312):
vec <- c(
"<severity='4', hostname='computername125', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='234' >",
"<severity='5', hostname='computername126', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='235' >"
)
将其转换为data.frame
对象:
df <- data.frame(vec, stringsAsFactors = FALSE)
并根据他们的字符索引位置选择你的数据,相对于你感兴趣的变量的位置:
require(stringr)
require(dplyr)
df %>%
mutate(
severityStr = str_locate(vec, "severity")[, "start"],
hostnameStr = str_locate(vec, "hostname")[, "start"],
sourceStr = str_locate(vec, "source")[, "start"],
moduleStr = str_locate(vec, "module")[, "start"],
processStr = str_locate(vec, "process")[, "start"],
pidStr = str_locate(vec, "pid")[, "start"],
endStr = str_locate(vec, ">")[, "start"],
severity = substr(vec, severityStr + 10, hostnameStr - 4),
hostname = substr(vec, hostnameStr + 10, sourceStr - 4),
source = substr(vec, sourceStr + 8, moduleStr - 4),
module = substr(vec, moduleStr + 8, processStr - 4),
process = substr(vec, processStr + 9, pidStr - 4),
pid = substr(vec, pidStr + 5, endStr - 3)) %>%
select(severity, hostname, source, module, process, pid)
这是生成的数据框:
severity hostname source module process pid
1 4 computername125 PackageDownload herpderp.dll masterP.exe 234
2 5 computername126 PackageDownload herpderp.dll masterP.exe 235
该解决方案足够强大,可以处理不同长度的字符串输入。例如,pid
即使它是95
(两位而不是三位),它也会正确读取。