java - Regular expression to avoid unnecessary backtracking in Java

Question

Hi, I am new to the world of regular expressions. I want to extract the timestamp, the location and the "id_str" field from my test string in Java.

20110302140010915|{"user":{"is_translator":false,"show_all_inline_media":false,"following":null,"geo_enabled":true,"profile_background_image_url":"http:\/\/a3.twimg.com\/a\/1298918947\/images\/themes\/theme1\/bg.png","listed_count":0,"favourites_count":2,"verified":false,"time_zone":"Mountain Time (US & Canada)","profile_text_color":"333333","contributors_enabled":false,"statuses_count":152,"profile_sidebar_fill_color":"DDEEF6","id_str":"207356721","profile_background_tile":false,"friends_count":14,"followers_count":13,"created_at":"Mon Oct 25 04:05:43 +0000 2010","description":null,"profile_link_color":"0084B4","location":"WaKeeney, KS","profile_sidebar_border_color":"C0DEED",

I tried this:

(\d*).*?"id_str":"(\d*)",.*"location":"([^"]*)"

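With the lazy quantifier .*? it does a lot of backtracking (3000 steps in RegexBuddy), but the number of characters between the anchors "id_str" and "location" is not always the same. It could also be catastrophic if location is not found in the string.

How can I:

1) avoid unnecessary backtracking?

and

2) find non-matching strings faster?

Thanks.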

score 5 · Accepted Answer

This looks like JSON and, trust me, it is easy to parse it that way.

import org.json.JSONObject; // assuming the org.json library

// String.split takes a regex, so the literal pipe must be escaped
String[] input = inputStr.split("\\|", 2);
System.out.println("Timestamp: " + input[0]); // 20110302140010915

JSONObject user = new JSONObject(input[1]).getJSONObject("user");

System.out.println("ID: " + user.getString("id_str")); // 207356721
System.out.println("Location: " + user.getString("location")); // WaKeeney, KS

Reference:
JSON Java API documentation
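A small side note on the split step, as a sketch rather than part of the answer itself: String.split treats its first argument as a regular expression, and an unescaped | is an empty alternation, so it does not split at the literal pipe. The record below is a shortened, hypothetical version of the test string, and splitRecord is a made-up helper that avoids regex entirely by using indexOf.

```java
import java.util.Arrays;

class PipeSplitDemo {
    // Hypothetical helper: splits "timestamp|json" at the first literal pipe, no regex involved.
    static String[] splitRecord(String line) {
        int bar = line.indexOf('|');
        if (bar < 0) {
            throw new IllegalArgumentException("no '|' separator in record");
        }
        return new String[] { line.substring(0, bar), line.substring(bar + 1) };
    }

    public static void main(String[] args) {
        // Shortened, made-up record in the same shape as the question's test string.
        String line = "20110302140010915|{\"user\":{\"id_str\":\"207356721\"}}";

        // Unescaped "|" matches the empty string, so this does not split at the pipe.
        System.out.println(Arrays.toString(line.split("|", 2)));

        // Escaping the pipe splits at the literal character.
        System.out.println(Arrays.toString(line.split("\\|", 2)));

        // Same result without any regex.
        System.out.println(Arrays.toString(splitRecord(line)));
    }
}
```

The indexOf variant is probably the cheapest option if the separator is always a single known character.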

score 3 · Accepted Answer

You can try this:

(\d*+)(?>[^"]++|"(?!id_str":))+"id_str":"(\d*+)",(?>[^"]++|"(?!location":))+"location":"([^"]*+)"

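The idea here is to eliminate backtracking as much as possible by using only possessive quantifiers and atomic groups with restricted character classes (as you did in your last capture group).

For example, to avoid the first lazy quantifier, I use this: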

(?>[^"]++|"(?!id_str":))+

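The regex engine takes as many characters that are not a double quote as possible (without registering a single backtracking position, since a possessive quantifier is used). When a double quote is found, a lookahead checks that it is not followed by the anchor id_str":. The whole thing is wrapped in an atomic group (no backtracking is possible inside it) that is repeated one or more times.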
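To make this concrete, here is a minimal sketch of how the pattern could be used from Java. The class name TweetFieldExtractor, the helper extract, and the shortened sample records are made up for illustration; only the pattern itself comes from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TweetFieldExtractor {
    // Pattern from the answer: possessive quantifiers plus atomic groups.
    private static final Pattern RECORD = Pattern.compile(
        "(\\d*+)(?>[^\"]++|\"(?!id_str\":))+\"id_str\":\"(\\d*+)\","
        + "(?>[^\"]++|\"(?!location\":))+\"location\":\"([^\"]*+)\"");

    // Hypothetical helper: returns {timestamp, id_str, location}, or null if there is no match.
    static String[] extract(String line) {
        Matcher m = RECORD.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // Shortened sample in the same shape as the question's test string.
        String ok = "20110302140010915|{\"user\":{\"verified\":false,"
            + "\"id_str\":\"207356721\",\"friends_count\":14,"
            + "\"location\":\"WaKeeney, KS\",\"profile_sidebar_border_color\":\"C0DEED\"}}";
        // Made-up record without a location field, to exercise the failure path.
        String noLocation = "20110302140010915|{\"user\":{\"id_str\":\"207356721\",\"friends_count\":14}}";

        String[] fields = extract(ok);
        System.out.println(fields[0] + " / " + fields[1] + " / " + fields[2]);
        System.out.println(extract(noLocation) == null ? "no match" : "match");
    }
}
```

Note that the group between "id_str" and "location" is repeated one or more times, so this sketch assumes at least one other field sits between the two anchors, as in the question's data.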

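Don't be afraid of the lookahead inside the group: it fails fast and is only tested when a double quote is found. However, you can try the same thing with the letter i if you are sure it is less frequent than the double quote (or with a rarer character that occurs earlier in the anchor, if you find one):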

(?>[^i]++|i(?!d_str":))+id_str":(...

EDIT: the best choice here seems to be the comma, which is less frequent (200 steps versus 422 steps with the double quote):

(\d*+)(?>[^,]++|,(?!"id_str":))+,"id_str":"(\d*+)",(?>[^,]++|,(?!"location":))+,"location":"([^"]*+)"

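For better performance, and if you have the possibility, try to add an anchor (^) in front of your pattern, if the match is at the start of the string or of a line (with multiline mode).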

^(\d*+)(?>[^"]++|"(?!id_str":))+"id_str":"(\d*+)",(?>[^"]++|"(?!location":))+"location":"([^"]*+)"
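As a rough sketch of how the anchored pattern might be applied to a file with one record per line: the example below matches each line separately instead of using multiline mode. That is a design choice of this sketch, not something from the answer: [^"] also matches line breaks, so in multiline mode a record without a location could run on into the next line. With a plain ^ on a single line, a non-matching record should be rejected after the attempt at offset 0. The class name AnchoredLineScan, the helper scan, and the second and third records are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchoredLineScan {
    // Anchored pattern from the answer; without MULTILINE, ^ only matches at the start of the input.
    private static final Pattern RECORD = Pattern.compile(
        "^(\\d*+)(?>[^\"]++|\"(?!id_str\":))+\"id_str\":\"(\\d*+)\","
        + "(?>[^\"]++|\"(?!location\":))+\"location\":\"([^\"]*+)\"");

    // Hypothetical helper: one {timestamp, id_str, location} row per matching line; other lines are skipped.
    static List<String[]> scan(String input) {
        List<String[]> rows = new ArrayList<>();
        for (String line : input.split("\n")) {
            Matcher m = RECORD.matcher(line);
            if (m.find()) {
                rows.add(new String[] { m.group(1), m.group(2), m.group(3) });
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // First record is a shortened version of the question's sample; the other two are invented.
        String input =
            "20110302140010915|{\"user\":{\"id_str\":\"207356721\",\"friends_count\":14,\"location\":\"WaKeeney, KS\"}}\n"
            + "20110302140010999|{\"user\":{\"id_str\":\"999\",\"friends_count\":1}}\n"
            + "20110302140011000|{\"user\":{\"id_str\":\"12345\",\"lang\":\"en\",\"location\":\"Denver, CO\"}}";

        for (String[] row : scan(input)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```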
