string - R: Function to search text strings and return dates and prices in two vectors

Question

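I was given the following prompt:

Given a vector of text strings V.text, write a function that extracts the possible dollar amounts and dates from each string and returns them as separate vector components of a list with the same length as V.text. The amounts and dates should be returned as text strings in exactly the same format as they were entered. For example, if one of the input strings is "Listed on 1/05/2009 for 180000 and sold for $150,250 on 3/1/2009", then the output for that element should be a list of two vectors, one for the amounts and one for the dates. The amounts should be "180000" and "$150,250", and the dates should be "1/05/2009" and "3/1/2009".

My attempted solution is: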

library(stringr)  # provides str_split() and word()

four <- function(x) {

  #split the data into individual observations
  lines <- str_split(x, "\n")

  n <- length(lines)
  list.date = NA; list.price = NA; sell.price = NA; sell.date = NA; temp = NA
  for (i in seq_len(n)) {
    list.date[i] <- word(x[i], 3)
    list.price[i] <- word(x[i], 5)
    sell.price[i] <- word(x[i], 9)
    sell.date[i] <- word(x[i], 11)
  }
  temp <- data.frame(list.date, list.price, sell.price, sell.date)
  temp
}

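This solution falls short for two reasons. First, it outputs a data frame rather than a list of two vectors. Second, if the format of the input strings changes, my solution is useless.

I would appreciate any help.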

score 4 · Accepted Answer

Here is an example illustrating the idea from the previous answer, using gregexpr and regmatches:

ll <- c("Listed on 1/05/2009 for 180000 and sold for $150,250 on 3/1/2009",
        "Listed on 1/05/2012 for $300,400  and sold 120 for on 145,25")
## extract dates
dates <- regmatches(ll,gregexpr("[0-9]+\\/[0-9]+\\/[0-9]+",ll))
## remove dates 
ll <- gsub("[0-9]+\\/[0-9]+\\/[0-9]+",'',ll)
## extract amounts like 120 or 120,1254 
amounts <- regmatches(ll,gregexpr("\\$?[0-9]+(,[0-9]+)?",ll))

> dates
[[1]]
[1] "1/05/2009" "3/1/2009"

[[2]]
[1] "1/05/2012"

> amounts
[[1]]
[1] "180000"   "$150,250"

[[2]]
[1] "$300,400" "120"      "145,25"

If you prefer the stringr package, you can use str_extract_all:

library(stringr)
## extract dates
str_extract_all(ll,"[0-9]+\\/[0-9]+\\/[0-9]+")
## remove dates
ll <- gsub("[0-9]+\\/[0-9]+\\/[0-9]+",'',ll)
## extract amounts
str_extract_all(ll,"\\$?[0-9]+(,[0-9]+)?")
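
The steps above can be wrapped into a single function that returns the structure the prompt asks for: a list as long as the input, where each element holds one vector of amounts and one vector of dates. This is only a minimal sketch in base R. The name `extract_amounts_dates` and the component names `amounts` and `dates` are made up for illustration, and the patterns are the ones used in this answer, so they inherit the same limits.

```r
# Hypothetical helper (not part of the original answer): for each input
# string, return a list with two character vectors, `amounts` and `dates`.
extract_amounts_dates <- function(v_text) {
  date_pattern   <- "[0-9]+/[0-9]+/[0-9]+"
  amount_pattern <- "\\$?[0-9]+(,[0-9]+)?"

  # Pull out the dates first, then blank them so their digits are not
  # picked up again as amounts.
  dates    <- regmatches(v_text, gregexpr(date_pattern, v_text))
  no_dates <- gsub(date_pattern, " ", v_text)
  amounts  <- regmatches(no_dates, gregexpr(amount_pattern, no_dates))

  # One list element per input string, each holding two vectors.
  Map(function(a, d) list(amounts = a, dates = d), amounts, dates)
}

v_text <- c("Listed on 1/05/2009 for 180000 and sold for $150,250 on 3/1/2009",
            "Listed on 1/05/2012 for $300,400  and sold 120 for on 145,25")
result <- extract_amounts_dates(v_text)
result[[1]]$amounts  # "180000" "$150,250"
result[[1]]$dates    # "1/05/2009" "3/1/2009"
```

Strings with no match yield `character(0)` for that component, so the result always has the same length as the input. The amount pattern allows only one comma group; changing `(,[0-9]+)?` to `(,[0-9]+)*` would let it match values such as "$1,250,000".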

score 2 · Accepted Answer

Without seeing many of the possible strings, I think it is hard to give a definitive answer. Here are some pointers.

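Read up on regular expressions. These are pattern-matching templates that you apply to strings to get matches. For example, a simple number match is something like "\s[0-9]+\s", which translates to a whitespace character, one or more digits, then another whitespace character. If you know the numbers have at least 3 digits, match "\s[0-9][0-9][0-9]+\s". With a bit of fiddling you can match cash amounts with dollar signs and embedded commas.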
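
As a rough illustration of the patterns just described, the sketch below applies them with base R's gregexpr and regmatches. The sample string `x` and the cash pattern are assumptions made up for this example. Note that inside an R string literal each backslash has to be doubled, so "\s" is written "\\s".

```r
# Made-up sample string for illustration only.
x <- "sold 12 units then 120 more for $150,250 in 2009 total"

# Whitespace, one or more digits, whitespace.
any_number <- regmatches(x, gregexpr("\\s[0-9]+\\s", x))[[1]]
any_number          # " 12 " " 120 " " 2009 "

# Same idea, but requiring at least three digits.
three_plus <- regmatches(x, gregexpr("\\s[0-9][0-9][0-9]+\\s", x))[[1]]
three_plus          # " 120 " " 2009 "

# One possible cash pattern: a dollar sign, digits, optional comma groups.
cash <- regmatches(x, gregexpr("\\$[0-9]+(,[0-9]+)*", x))[[1]]
cash                # "$150,250"

# The whitespace-delimited matches include the surrounding spaces.
trimws(any_number)  # "12" "120" "2009"
```

Because the whitespace is part of each match, a number at the very start or end of the string, or two numbers separated by a single space, can be missed by these patterns.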

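Your dates match something like "[0-9]+/[0-9]+/[0-9]+". Of course, if someone throws a string containing "01/Jan/2010" at you, you will need a regular expression that matches that as well.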
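
One way to cover both date forms mentioned here is an alternation in the middle field. This is a sketch under the assumption that month names are always three-letter abbreviations; the sample strings and the variable names are invented for the example.

```r
# Middle field is either digits or a three-letter month abbreviation (assumed).
date_pattern <- "[0-9]+/([0-9]+|[A-Za-z]{3})/[0-9]+"

# Made-up sample strings for illustration only.
x <- c("Listed on 1/05/2009 and sold on 3/1/2009",
       "Sold on 01/Jan/2010")

dates <- regmatches(x, gregexpr(date_pattern, x))
dates[[1]]  # "1/05/2009" "3/1/2009"
dates[[2]]  # "01/Jan/2010"
```

The pattern does not check that a match is a valid calendar date; it only recognizes the shape.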

So work out regular expressions for the forms that might appear, match them, and see how many matches you get.
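
Counting the matches per string is a quick way to check a pattern against sample data. A minimal sketch, assuming the simple numeric date pattern from above and a few made-up input strings:

```r
# Made-up sample strings for illustration only.
v_text <- c("Listed on 1/05/2009 for 180000 and sold for $150,250 on 3/1/2009",
            "Listed on 1/05/2012 for $300,400",
            "No dates here")

date_pattern <- "[0-9]+/[0-9]+/[0-9]+"

# lengths() gives the number of matches found in each string.
n_dates <- lengths(regmatches(v_text, gregexpr(date_pattern, v_text)))
n_dates  # 2 1 0
```

A count that differs from what you expect for a string suggests a format the pattern does not yet handle.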

help(regexp) in R will get you started.
