regex - How can I use a regular expression to extract a word from text elements based on its position?

Question

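I am looking for help extracting the model (using regex) from the text elements listed below.

2007 Honda CR-V LX CLEAN !!
2008 Honda Accord EX ROOF MAGS CLEAN 1 OWNER
2008 Honda Civic EX-L CUIR TOIT LEATHER
2009 Toyota Corolla S FULLY EQUIPPED

The one constant is that
the model is always the third word.

Thanks in advance for your help.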

score 1 · Accepted Answer

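I would match with the regex \d{4} to get the first 4-digit number (the year), then split on spaces (using whatever language you are working in), and take the second and third words from the result.

You could even skip the regex entirely and just split on spaces, for example in Ruby:

array = my_name.split(" ")
year  = array[0]
make  = array[1]
model = array[2]

Basically, I don't think a regex is the best solution here.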
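
The split-based approach from this answer can be sketched a little more fully. This is only an illustration: `parse_listing` is a hypothetical helper name, and the listings are the question's samples with their English wording assumed.

```ruby
# Sample listings from the question (English wording assumed).
LISTINGS = [
  "2007 Honda CR-V LX CLEAN !!",
  "2008 Honda Accord EX ROOF MAGS CLEAN 1 OWNER",
  "2008 Honda Civic EX-L CUIR TOIT LEATHER",
  "2009 Toyota Corolla S FULLY EQUIPPED"
].freeze

# parse_listing is a hypothetical helper, not part of any library.
# It splits on whitespace and picks the fields by position.
def parse_listing(text)
  year, make, model, *rest = text.split
  # The regex is only used to check that the first word is a 4-digit year.
  return nil unless year&.match?(/\A\d{4}\z/)

  { year: year, make: make, model: model, rest: rest.join(" ") }
end

MODELS = LISTINGS.map { |line| parse_listing(line)[:model] }
puts MODELS.inspect
# prints ["CR-V", "Accord", "Civic", "Corolla"]
```

Returning `nil` for a line that does not start with a year is a design choice made for this sketch; raising an error would work just as well.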

score 1 · Accepted Answer

Try this simple one:

(\d+)\s*(\w+)\s*(.+)

and read the capture groups.

Explanation:

\d+        digits (0-9)
           (1 or more times, matching as many as possible)

\s*        whitespace (\n, \r, \t, \f, and " ")
           (0 or more times, matching as many as possible)

\w+        word characters (a-z, A-Z, 0-9, _)
           (1 or more times, matching as many as possible)

.+         any character except \n
           (1 or more times, matching as many as possible)
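
As a rough Ruby sketch of how the groups come out: the pattern has three groups, and the third one holds the model together with everything after it, so the model still has to be taken as the first word of that group. `split_groups` is a hypothetical helper name used only for this illustration.

```ruby
# The pattern from the answer, unchanged.
GROUPS_PATTERN = /(\d+)\s*(\w+)\s*(.+)/

# split_groups is a hypothetical helper, not part of any library.
def split_groups(text)
  match = GROUPS_PATTERN.match(text)
  return nil if match.nil?

  year, make, remainder = match.captures
  # Group 3 is the model plus the rest of the line, so take its first word.
  { year: year, make: make, model: remainder.split.first, remainder: remainder }
end

result = split_groups("2007 Honda CR-V LX CLEAN !!")
puts result[:remainder]  # prints CR-V LX CLEAN !!
puts result[:model]      # prints CR-V
```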

score 1 · Accepted Answer

If you must use a regex, it would be

^(\d{4}) +([^ ]+) +([^ ]+) +(.*)$

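\1 is the year, \2 is the make, \3 is the model, and \4 is the rest. However, this will not work if any model has two words (e.g. Crown Victoria), unless you separate the words with something other than a space (e.g. Crown_Victoria).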
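
A minimal Ruby sketch of this pattern, including the two-word limitation. `anchored_fields` is a hypothetical helper name, and the Ford listings are made-up inputs for illustration, not data from the question.

```ruby
# The pattern from the answer. In Ruby, ^ and $ anchor at line boundaries,
# which is equivalent to the whole string for single-line input like this.
ANCHORED_PATTERN = /^(\d{4}) +([^ ]+) +([^ ]+) +(.*)$/

# anchored_fields is a hypothetical helper, not part of any library.
# Returns nil when the text does not match.
def anchored_fields(text)
  m = ANCHORED_PATTERN.match(text)
  m && { year: m[1], make: m[2], model: m[3], rest: m[4] }
end

p anchored_fields("2009 Toyota Corolla S FULLY EQUIPPED")[:model]
# prints "Corolla"

# Hypothetical two-word model: only the first word lands in group 3.
p anchored_fields("2008 Ford Crown Victoria LX")[:model]
# prints "Crown"

# Joining the words with an underscore keeps the model in one group.
p anchored_fields("2008 Ford Crown_Victoria LX")[:model]
# prints "Crown_Victoria"
```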

score 0 · Accepted Answer

Please check this link: regex implementation

([0-9]*).\b([a-zA-Z]*).\b([a-zA-Z.-]*).\b(.*)

The first three groups will be:

2007
Honda
CR-V

Edit

If you are using C#, this is how you would get the model:

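using System.Text.RegularExpressions;
using System.Windows.Forms; // for MessageBox

string page = "2007 Honda CR-V LX CLEAN !!";
// Named groups: year, make, model, rest.
Regex reg = new Regex(@"(?<year>[0-9]*).\b(?<make>[a-zA-Z]*).\b(?<model>[a-zA-Z.-]*).\b(?<rest>.*)");
MatchCollection mc = reg.Matches(page);

foreach (Match m in mc)
{
    // Groups[...] returns a Group; .Value is the matched text.
    MessageBox.Show(m.Groups["model"].Value);
}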
