java - Regex to capture all strings that start with “[[” and end with “]]” when splitting

Question

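Currently I am using str.toLowerCase.split("[\\s\\W]+") to get rid of whitespace and punctuation, but there is a special class of strings that I want to keep intact and exclude from this processing: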

[[...multiple words...]]

Example:

[[Genghis Khan]]

should stay as

[[Genghis Khan]]

What regex should I use?

score 8 · Accepted Answer

Your regex isn't far off:

def tokenize(s: String) = """\w+|(\[\[[^\]]+\]\])""".r.findAllIn(s).toList

And then:

scala> tokenize("[[Genghis Khan]] founded the [[Mongol Empire]].")
res1: List[String] = List([[Genghis Khan]], founded, the, [[Mongol Empire]])
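
Since the question lowercases its input, the same pattern can be combined with selective lowercasing. This is a minimal sketch, assuming bracketed tokens should keep their original case; the name tokenizeLower is made up for illustration and is not part of the answer above.

```scala
// Same alternation as above: a plain word, or a whole [[...]] token.
val Token = """\w+|\[\[[^\]]+\]\]""".r

// Hypothetical helper: lowercase plain words, leave [[...]] tokens untouched.
def tokenizeLower(s: String): List[String] =
  Token.findAllIn(s).toList.map { t =>
    if (t.startsWith("[[")) t else t.toLowerCase
  }

// tokenizeLower("[[Genghis Khan]] Founded THE [[Mongol Empire]].")
// List([[Genghis Khan]], founded, the, [[Mongol Empire]])
```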

That said, this is a nice use case for Scala's parser combinators:

import scala.util.parsing.combinator._

object Tokenizer extends RegexParsers {
  val punc = "[,;:\\.]*".r
  val word = "\\w+".r
  val multiWordToken = "[[" ~> "[^\\]]+".r <~ "]]"
  val token = (word | multiWordToken) <~ punc
  def apply(s: String) = parseAll(rep1(token), s)
}

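This gives us almost the same result; the surrounding brackets are dropped because ~> and <~ discard the delimiters they match: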

scala> Tokenizer("[[Genghis Khan]] founded the [[Mongol Empire]].").get
res2: List[String] = List(Genghis Khan, founded, the, Mongol Empire)

I'd personally prefer the parser combinator version: it's practically self-documenting, and it's much easier to extend and maintain.
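
As one illustration of extending it: if the brackets should be kept, as in the regex version, the bracketed parser's result can be mapped back with ^^. This is a sketch rather than part of the original answer. BracketKeepingTokenizer is a hypothetical name, and the code assumes the scala-parser-combinators module is on the classpath (it has been a separate library since Scala 2.11).

```scala
import scala.util.parsing.combinator._

// Hypothetical variant of the tokenizer that keeps the [[ ]] delimiters.
object BracketKeepingTokenizer extends RegexParsers {
  val punc = "[,;:\\.]*".r
  val word = "\\w+".r
  // ~> and <~ drop the delimiters; ^^ wraps the inner text in brackets again.
  val multiWordToken = ("[[" ~> "[^\\]]+".r <~ "]]") ^^ (inner => "[[" + inner + "]]")
  val token = (word | multiWordToken) <~ punc
  def apply(s: String) = parseAll(rep1(token), s)
}

// BracketKeepingTokenizer("[[Genghis Khan]] founded the [[Mongol Empire]].").get
// List([[Genghis Khan]], founded, the, [[Mongol Empire]])
```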

score 0 · Accepted Answer

Splitting is not the way to handle this, because it doesn't take context into account. You might write something like this:

str.toLowerCase.split("(?<!\\[\\[([^]]|\\][^]])*\\]?)[\\s\\W]+")

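This would split on any separator that is not preceded by [[ followed by anything other than ]], but Java doesn't like variable-length lookbehind.

In my opinion, the best way to handle this is to write a parser for it, unless you really need speed. If you do, use a regex like the one Travis Brown suggested (he also shows a parser in his answer).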
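
For comparison, java.util.regex does accept unbounded lookahead, so a split-based version can be sketched by looking forward instead of backward. This is only a sketch under loud assumptions: every [[ has a matching ]], brackets never nest, and the names Delimiter and splitKeepingBrackets are invented here. It breaks as soon as those assumptions fail, which is the context problem described above.

```scala
// Assumption: brackets are balanced and never nested.
// A delimiter is a run of non-word, non-bracket characters that is NOT
// followed by "]]" before any other bracket, i.e. not inside a [[...]] token.
val Delimiter = """[^\w\[\]]+(?![^\[\]]*\]\])"""

// Hypothetical helper: split on Delimiter and drop empty pieces.
def splitKeepingBrackets(s: String): List[String] =
  s.split(Delimiter).toList.filter(_.nonEmpty)

// splitKeepingBrackets("[[Genghis Khan]] founded the [[Mongol Empire]].")
// List([[Genghis Khan]], founded, the, [[Mongol Empire]])
```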

score 0 · Accepted Answer

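Here's a function that first splits on either [[ or ]]. Doing this ensures that the split items alternate between unquoted and quoted strings (i.e., the 2nd, 4th, etc. items are "quoted"). We can then traverse this list and split any unquoted items on whitespace while leaving the quoted items untouched.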

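def mySplit(s: String) =
  // Splitting on [[ or ]] leaves unquoted text at even indices, quoted text at odd ones.
  """(\[\[)|(\]\])""".r.split(s).zipWithIndex.flatMap {
    // Unquoted text: split further on whitespace (toList keeps both branches the same type).
    case (unquoted, i) if i%2==0 => unquoted.trim.split("\\s+").toList
    // Quoted text: keep as a single token.
    case (quoted, _) => List(quoted)
  }.toList.filter(_.nonEmpty)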

mySplit("this [[is]] the first [[test string]].") // List(this, is, the, first, test string, .)
mySplit("[[this]] and [[that]]")          // List(this, and, that)
mySplit("[[this]][[that]][[the other]]")  // List(this, that, the other)

If you want the [[ ]] in the final output, just change List(quoted) above to List("[[" + quoted + "]]").
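
Spelled out, that variant might look like the following sketch. The name mySplitKeepingBrackets is made up here, and the unquoted pieces are converted with .toList so that both branches return the same type.

```scala
// Hypothetical variant of mySplit that puts the brackets back on quoted items.
def mySplitKeepingBrackets(s: String): List[String] =
  """\[\[|\]\]""".r.split(s).toList.zipWithIndex.flatMap {
    // Even indices hold unquoted text: split it on whitespace.
    case (unquoted, i) if i % 2 == 0 => unquoted.trim.split("\\s+").toList
    // Odd indices hold the text that was between [[ and ]]: re-wrap it.
    case (quoted, _)                 => List("[[" + quoted + "]]")
  }.filter(_.nonEmpty)

// mySplitKeepingBrackets("[[this]] and [[that]]")  // List([[this]], and, [[that]])
```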
