ruby-on-rails - Ruby/Rails: using a regex to scan/match from one tag to another within a text

Question

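In the content examples below I have wrapped the lines to make them easier to read on Stack Overflow (so you don't have to scroll right to see the examples).

Content A: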

"Lorem Ipsum\r\n
[img]http://example.org/first.jpg[/img]\r\n
[img]http://example.org/second.jpg[/img]\r\n
more lorem ipsum ..."

Content B:

"Lorem Ipsum\r\n
[img caption="Sample caption"]http://example.org/third.jpg[/img]
[img]http://example.org/fourth.jpg[/img]"

Content C:

"Lorem Ipsum [img]http://example.org/fifth.jpg[/img]\r\n
more lorem ipsum\r\n\r\n
[img caption="Some other caption"]http://example.org[/img]"

What I have tried:

content.match(/\[img\]([^<>]*)\[\/img\]/imu)
return example: "[img]...[/img]\r\n[img]...[/img]
content.scan(/\[img\]([^<>]*)\[\/img\]/imu)
return example: "...[/img]\r\n[img]..."

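What I want to accomplish when running a scan/match/regex solution over the 3 content examples above is to capture every occurrence of [img]...[/img] and [img caption="?"]...[/img] and put them into an array for later use.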

Array
  1 : A : [img]http://example.org/first.jpg[/img]
  2 : A : [img]http://example.org/second.jpg[/img]
  3 : B : [img caption="Sample caption"]http://example.org/third.jpg[/img]
  4 : B : [img]http://example.org/fourth.jpg[/img]
  5 : C : [img]http://example.org/fifth.jpg[/img]
  6 : C : [img caption="Some other caption"]http://example.org[/img]

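It would also be helpful to restrict the "stripped content" to places where both an opening and a closing tag are present, meaning that when there is an [img] or [img caption="?"] with no [/img] after it, it is ignored.

I have read http://www.ruby-doc.org/core-1.9.3/String.html up and down but could not find anything that seems useful for this.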

Update:

So I figured out that this:

\[img([^<>]*)\]([^<>]*)\[\/img\]

will find:

[img]something[/img]

and:

[img caption="something"]something[/img]

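Now I just need to know how to capture every occurrence across the different contents. At the moment the match always runs from the first [img] to the last [/img] tag, so any other Lorem Ipsum in between gets grabbed as well.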

score 2 · Accepted Answer

You can scan the document using /\[img(?:\s+caption=".+")?\].+?\[\/img\]/:

regex = /\[img(?:\s+caption=".+")?\].+?\[\/img\]/

text = <<EOT
Lorem Ipsum
[img]http://example.org/first.jpg[/img]
[img]http://example.org/second.jpg[/img]
more lorem ipsum ...

Content B:

Lorem Ipsum
[img caption="Sample caption"]http://example.org/third.jpg[/img]
[img]http://example.org/fourth.jpg[/img]

Content C:

Lorem Ipsum [img]http://example.org/fifth.jpg[/img]
more lorem ipsum

[img caption="Some other caption"]http://example.org[/img]
EOT

array = text.scan(regex)
puts array

Which produces:

[img]http://example.org/first.jpg[/img]
[img]http://example.org/second.jpg[/img]
[img caption="Sample caption"]http://example.org/third.jpg[/img]
[img]http://example.org/fourth.jpg[/img]
[img]http://example.org/fifth.jpg[/img]
[img caption="Some other caption"]http://example.org[/img]
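
One caveat worth checking against real input: the `.+` inside `caption=".+"` is greedy, so two captioned tags on the same line can be merged into a single match. The sketch below compares the pattern above with a stricter variant; the `[^"]*` part is an assumption (captions never contain a double quote), and the sample line is made up for illustration.

```ruby
# Original pattern (greedy caption) next to a stricter variant.
# Assumption: captions never contain a double quote, so [^"]* is safe.
greedy = /\[img(?:\s+caption=".+")?\].+?\[\/img\]/
strict = /\[img(?:\s+caption="[^"]*")?\].+?\[\/img\]/

# Hypothetical input: two captioned tags on a single line.
line = '[img caption="a"]http://example.org/1.jpg[/img] ' \
       '[img caption="b"]http://example.org/2.jpg[/img]'

greedy_matches = line.scan(greedy) # the greedy caption swallows the first tag
strict_matches = line.scan(strict) # one match per tag

puts greedy_matches.size
puts strict_matches.size
```

If every tag in your data sits on its own line, as in the question's samples, both patterns should behave the same.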

If you want to ignore the tags and grab only the contents, change the regex to:

regex = /\[img(?:\s+caption=".+")?\](.+?)\[\/img\]/

Running it again with that change returns:

http://example.org/first.jpg
http://example.org/second.jpg
http://example.org/third.jpg
http://example.org/fourth.jpg
http://example.org/fifth.jpg
http://example.org
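
Once a pattern contains a capture group, `String#scan` returns one array per match instead of plain strings; `puts` flattens nested arrays, which is why the output above still looks like a flat list. As a rough sketch of using that, the variant below captures the caption and the URL together. The second capture group and the `[^"]*` caption pattern are assumptions added for illustration, not part of the answer's regex.

```ruby
# Two capture groups: optional caption, then the tag body.
# Assumption: captions never contain a double quote.
regex = /\[img(?:\s+caption="([^"]*)")?\](.+?)\[\/img\]/

text = <<~EOT
  [img caption="Sample caption"]http://example.org/third.jpg[/img]
  [img]http://example.org/fourth.jpg[/img]
EOT

# scan yields [caption, url] pairs; caption is nil when the attribute is absent.
images = text.scan(regex).map { |caption, url| { caption: caption, url: url } }

images.each { |image| p image }
```

Keeping the pairs as hashes avoids relying on positional indexes later on.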

If you need to find different tags, you can easily generate an "OR" list:

Regexp.union(%w[foo img bar])
=> /foo|img|bar/

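Note that Regexp.union already escapes "magic" characters in the strings it is given, so no extra escaping step is needed there (escaping first would escape them twice). If you build the alternation by hand instead, escape each string before joining:

Regexp.new(%w[foo img bar].map { |s| Regexp.escape(s) }.join("|"))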
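
To show how such an "OR" list might be combined with the tag pattern, here is a minimal sketch. The tag list (including `url`), the attribute pattern `[^\]]*`, and the sample text are all hypothetical; the `\1` backreference is one way to require that the closing tag matches the opening one.

```ruby
tags  = %w[img url]          # hypothetical tag list; "url" is not in the question
names = Regexp.union(tags)   # strings are escaped by Regexp.union itself

# Group 1 is the tag name; \1 forces the closing tag to use the same name.
regex = /\[(#{names})(?:\s+[^\]]*)?\](.+?)\[\/\1\]/

text = '[img]http://example.org/a.jpg[/img] ' \
       '[url]http://example.org[/url] ' \
       '[img]broken[/url]'

pairs = text.scan(regex) # [tag, body] per properly closed tag

pairs.each { |tag, body| puts "#{tag}: #{body}" }
```

The mismatched `[img]broken[/url]` is skipped because no `[/img]` follows it.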

score 1 · Accepted Answer

Luckily, I have already solved this problem in my own application!

Given that @tags is an array of tags (e.g. ["img"]):

regex = /\[(#{@tags.join("|")})\s*(.*?)?\/?\](?:(.*?)\[\/\1\])?/
matches = content.scan(regex)

Full example:

require 'pp'

@tags = %w(img)
regex = /\[(#{@tags.join("|")})\s*(.*?)?\/?\](?:(.*?)\[\/\1\])?/

content = <<-EOF
  Lorem Ipsum\r\n
  [img]http://example.org/first.jpg[/img]\r\n
  [img]http://example.org/second.jpg[/img]\r\n
  more lorem ipsum ..."
  Content B:

  "Lorem Ipsum\r\n
  [img caption="Sample caption"]http://example.org/third.jpg[/img]
  [img]http://example.org/fourth.jpg[/img]"
  Content C:

  "Lorem Ipsum [img]http://example.org/fifth.jpg[/img]\r\n
  more lorem ipsum\r\n\r\n
  [img caption="Some other caption"]http://example.org[/img]"
EOF

matches = content.scan(regex)
pp matches

And the output:

[["img", "", "http://example.org/first.jpg"],
 ["img", "", "http://example.org/second.jpg"],
 ["img", "caption=\"Sample caption\"", "http://example.org/third.jpg"],
 ["img", "", "http://example.org/fourth.jpg"],
 ["img", "", "http://example.org/fifth.jpg"],
 ["img", "caption=\"Some other caption\"", "http://example.org"]]
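
Because the closing part of this pattern is optional, a tag without its `[/img]` still produces a match, with `nil` as the third element. A rough sketch of dropping those, as the question asks, might look like the following; the local `tags` variable stands in for `@tags`, and the sample content and hash keys are made up for illustration.

```ruby
require 'pp'

tags  = %w[img] # stands in for @tags from the answer
regex = /\[(#{tags.join("|")})\s*(.*?)?\/?\](?:(.*?)\[\/\1\])?/

# Hypothetical input: the tag on the first line is never closed.
content = "[img]http://example.org/lost.jpg\n" \
          "[img caption=\"Kept\"]http://example.org/ok.jpg[/img]\n"

images = content.scan(regex)
                .reject { |_tag, _attrs, body| body.nil? } # no closing tag: body is nil
                .map { |tag, attrs, body| { tag: tag, attrs: attrs, body: body } }

pp images
```

This only covers an unclosed tag on its own line; if a later tag on the same line is closed, the lazy `.*?` may still run from the unclosed opener to that closing tag.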
