java - 几乎 JSON 的正则表达式，但不完全

Question

大家好，我正在尝试将一个格式良好的字符串解析为它的组件。该字符串非常类似于 JSON，但严格来说它不是 JSON。它们是这样形成的：

createdAt=Fri Aug 24 09:48:51 EDT 2012, id=238996293417062401, text='Test Test', source="Region", entities=[foo, bar], user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}

输出就像文本块一样，此时不需要做任何特别的事情。

createdAt=Fri Aug 24 09:48:51 EDT 2012 
id=238996293417062401 
text='Test Test' 
source="Region"
entities=[foo, bar] 
user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}

使用以下表达式，我可以将大部分字段分开

,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))(?=(?:[^']*'[^']*')*(?![^']*'))

它将拆分所有逗号而不是任何类型的引号，但我似乎无法跳到它拆分逗号而不是括号或大括号的位置。

score 2 · Accepted Answer

因为您要处理嵌套的括号/括号，所以处理它们的“正确”方法是分别标记它们，并跟踪您的嵌套级别。因此，对于不同的令牌类型，您确实需要多个正则表达式，而不是单个正则表达式。

这是 Python，但转换为 Java 应该不会太难。

# just comma
sep_re = re.compile(r',')

# open paren or open bracket
inc_re = re.compile(r'[[(]')

# close paren or close bracket
dec_re = re.compile(r'[)\]]')

# string literal
# (I was lazy with the escaping. Add other escape sequences, or find an
# "official" regex to use.)
chunk_re = re.compile(r'''"(?:[^"\\]|\\")*"|'(?:[^'\\]|\\')*[']''')

# This class could've been just a generator function, but I couldn;'t
# find a way to manage the state in the match function that wasn't
# awkward.
class tokenizer:
  def __init__(self):
    self.pos = 0

  def _match(self, regex, s):
    m = regex.match(s, self.pos)
    if m:
      self.pos += len(m.group(0))
      self.token = m.group(0)
    else:
      self.token = ''
    return self.token

  def tokenize(self, s):
    field = '' # the field we're working on
    depth = 0  # how many parens/brackets deep we are
    while self.pos < len(s):
      if not depth and self._match(sep_re, s):
        # In Java, change the "yields" to append to a List, and you'll
        # have something roughly equivalent (but non-lazy).
        yield field
        field = ''
      else:
        if self._match(inc_re, s):
          depth += 1
        elif self._match(dec_re, s):
          depth -= 1
        elif self._match(chunk_re, s):
          pass
        else:
          # everything else we just consume one character at a time
          self.token = s[self.pos]
          self.pos += 1
        field += self.token
    yield field

用法：

>>> list(tokenizer().tokenize('foo=(3,(5+7),8),bar="hello,world",baz'))
['foo=(3,(5+7),8)', 'bar="hello,world"', 'baz']

这个实现有一些捷径：

字符串转义真的很懒惰：它只支持\"双引号字符串和\'单引号字符串。这很容易解决。
它只跟踪嵌套级别。它不验证括号是否与括号匹配（而不是括号）。如果您关心这一点，您可以更改depth为某种堆栈并将其推入/弹出括号/括号。

score 1 · Accepted Answer

您可以使用以下正则表达式来匹配您想要的块，而不是用逗号分割。

(?:^| )(.+?)=(\{.+?\}|\[.+?\]|.+?)(?=,|$)

Python：

import re
text = "createdAt=Fri Aug 24 09:48:51 EDT 2012, id=238996293417062401, text='Test Test', source=\"Region\", entities=[foo, bar], user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}"
re.findall(r'(?:^| )(.+?)=(\{.+?\}|\[.+?\]|.+?)(?=,|$)', text)

>> [
    ('createdAt', 'Fri Aug 24 09:48:51 EDT 2012'), 
    ('id', '238996293417062401'), 
    ('text', "'Test Test'"), 
    ('source', '"Region"'), 
    ('entities', '[foo, bar]'), 
    ('user', '{name=test, locations=[loc1,loc2], locations={comp1, comp2}}')
   ]

我已经设置了分组，因此它将“键”和“值”分开。它会在 Java 中做同样的事情 - 在这里查看它在 Java 中的工作：

http://www.regexplanet.com/cookbook/ahJzfnJlZ2V4cGxhbmV0LWhyZHNyDgsSBlJlY2lwZRj0jzQM/index.html

正则表达式解释：

(?:^| )匹配行首或空格的非捕获组
(.+?)匹配...之前的“key”
=等号
(\{.+?\}|\[.+?\]|.+?)匹配一组{字符}、[字符]或最后只是字符
(?=,|$)向前看，匹配 a,或行尾。

java - 几乎 JSON 的正则表达式，但不完全

2 回答 2

Related

Reference