regex - Regular expression to clean up a numbered list

Question

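I have only just started using regular expressions and seem to be a bit stuck! I have written a batch find-and-replace in TextSoap using multiline mode. It is meant to clean up recipes that I have OCR'd. Because the recipes contain both ingredients and directions, I cannot simply change "1 " to "1. ", as that could rewrite "1 Tbsp" as "1. Tbsp".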

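So I use the following as the find patterns, to check whether the next two lines (possibly with extra blank lines in between) carry the next numbers in the sequence: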

^(1) (.*)\n?((\n))(^2 (.*)\n?(\n)^3 (.*)\n?(\n))
^(2) (.*)\n?((\n))(^3 (.*)\n?(\n)^4 (.*)\n?(\n))
^(3) (.*)\n?((\n))(^4 (.*)\n?(\n)^5 (.*)\n?(\n))
^(4) (.*)\n?((\n))(^5 (.*)\n?(\n)^6 (.*)\n?(\n))
^(5) (.*)\n?((\n))(^6 (.*)\n?(\n)^7 (.*)\n?(\n))

And the following as the replacement for each of the above:

$1. $2 $3 $4$5

My problem is that although it works the way I want, it never does the job for the last three numbers...

An example of the text I want to clean up:

1 This is the first step in the list

2 Second lot if instructions to run through
3 Doing more of the recipe instruction

4 Half way through cooking up a storm

5 almost finished the recipe

6 Serve and eat

What I want it to look like:

1. This is the first step in the list

2. Second lot if instructions to run through

3. Doing more of the recipe instruction

4. Half way through cooking up a storm

5. almost finished the recipe

6. Serve and eat

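Is there a way to check the previous line or two, so the match works backwards? I have looked at lookahead and lookbehind, but that is where I got confused. Does anyone have a way to clean up my numbered list, or can you help me with the regex I am after?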

score 2 · Accepted Answer

dan1111 is right. You may run into trouble with data that looks similar. But given the example you provided, this should work:

^(\d+)\s+([^\r\n]+)(?:[\r\n]*) // search

$1. $2\r\n\r\n                 // replace

If you are not on Windows, remove the \r characters from the replacement string.

Explanation:

^           // beginning of the line
(\d+)       // capture group 1. one or more digits
\s+         // any spaces after the digit. don't capture
([^\r\n]+)  // capture group 2. all characters up to any EOL
(?:[\r\n]*) // consume additional EOL, but do not capture

Replacement:

$1.       // group 1 (the digit), then period and a space
$2        // group 2
\r\n\r\n  // two EOLs, to create a blank line
          // (remove both \r for Linux)
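
For anyone who wants to try this outside TextSoap, here is a minimal sketch of the same search and replace in Python. It assumes "\n" line endings (so the \r characters are dropped from the replacement) and uses Python's \1 and \2 in place of $1 and $2. The function name number_steps is only illustrative.

```python
import re

# Search pattern from the answer above. re.MULTILINE makes ^ match at the
# start of every line, not only at the start of the whole text.
STEP = re.compile(r"^(\d+)\s+([^\r\n]+)(?:[\r\n]*)", re.MULTILINE)


def number_steps(text: str) -> str:
    """Rewrite '1 Step' lines as '1. Step', each followed by a blank line."""
    # \1 is the number, \2 is the rest of the line; "\n\n" leaves one blank
    # line after every step (assumption: Unix line endings, so no "\r").
    return STEP.sub(r"\1. \2\n\n", text)


sample = (
    "1 This is the first step in the list\n"
    "\n"
    "2 Second lot if instructions to run through\n"
    "3 Doing more of the recipe instruction\n"
    "\n"
    "4 Half way through cooking up a storm\n"
)

print(number_steps(sample))
```

Because every numbered line is matched on its own, the last lines of the list are handled the same way as the first ones; nothing depends on the lines that follow.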

score 1 · Accepted Answer

What about this?

1 Tbsp salt
2 Tsp sugar
3 Eggs

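You have run into a major limitation of regular expressions: they do not work well when your data cannot be strictly defined. You may know intuitively what is an ingredient and what is a step, but getting from that intuition to a reliable set of algorithmic rules is not easy.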
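
To make the ambiguity concrete, here is a small Python sketch that runs a "number, whitespace, text" pattern (the kind of rule a simple fix would use; the exact pattern is my choice for illustration) over the ingredient lines above. It assumes "\n" line endings.

```python
import re

# A pattern that treats any line of the form "<number> <text>" as a step.
STEP = re.compile(r"^(\d+)\s+([^\r\n]+)(?:[\r\n]*)", re.MULTILINE)

ingredients = "1 Tbsp salt\n2 Tsp sugar\n3 Eggs\n"

# Every ingredient line has exactly the same shape as a numbered step,
# so all three are matched and rewritten as if they were steps.
renumbered = STEP.sub(r"\1. \2\n\n", ingredients)

print(renumbered)
```

All three ingredient lines come out with a period after the quantity, which is exactly the "1. Tbsp" result the question wants to avoid.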

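I suggest you consider an approach based on position in the file. A given cookbook usually uses the same format for all of its recipes: for example, the ingredients come first, followed by the list of steps. That may be an easier way to tell the two apart.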
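
A rough Python sketch of the position-based idea, under strong assumptions: each recipe has a heading line that marks where the steps begin, and everything before it is left alone. The heading text "Directions" and the function name number_steps_after_heading are hypothetical; use whatever marker your OCR'd cookbook actually contains.

```python
import re

# One line of the form "<number> <text>".
STEP = re.compile(r"^(\d+)\s+(.+)$")


def number_steps_after_heading(text: str, heading: str = "Directions") -> str:
    """Add a period after leading numbers, but only below the heading line.

    `heading` is an assumed marker, not something taken from real data.
    """
    out = []
    in_steps = False
    for line in text.splitlines():
        if line.strip().lower() == heading.lower():
            in_steps = True          # everything after this line is a step
            out.append(line)
            continue
        match = STEP.match(line) if in_steps else None
        out.append(f"{match.group(1)}. {match.group(2)}" if match else line)
    return "\n".join(out)


recipe = "\n".join([
    "Ingredients",
    "1 Tbsp salt",
    "3 Eggs",
    "Directions",
    "1 Mix the salt and eggs",
    "2 Serve and eat",
])

print(number_steps_after_heading(recipe))
```

This version only adds the periods and does not insert blank lines between steps. If the book has no usable heading, some other positional cue (a blank-line gap, a page layout rule) would have to play the same role.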
