
Given:

ABC
content 1
123
content 2
ABC
content 3
XYZ

Is it possible to create a regex that matches the shortest version of ABC[\W\w]+?XYZ?

Essentially, I'm looking for "ABC followed by any characters, ending at XYZ, but don't match if ABC occurs in between" (while treating ABC as potentially a regex itself, since it isn't always a fixed length... so ABC or ABcC could both match).

So, more generally: REGEX1 followed by any characters and terminated by REGEX2, not matching if REGEX1 occurs in between.

In this example, I don't want the first 4 lines.

(I suspect this explanation may need... further explanation, haha.)
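To see the problem concretely, here is how the lazy pattern behaves in Python's re module (a sketch; the question doesn't name a language, but the later edits mention Python):

```python
import re

s = """ABC
content 1
123
content 2
ABC
content 3
XYZ"""

# The lazy quantifier still anchors at the FIRST ABC, so the match
# spans the unwanted first four lines as well.
m = re.search(r'ABC[\W\w]+?XYZ', s)
print(m.group())  # matches from the first ABC all the way to XYZ
```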

Edit:

OK, I think further explanation is needed now! Thank you for the suggestions so far. As I start looking into how to apply each proposed solution to my problem, I'll at least give you more to think about.

Proposal 1: reverse the string contents and the regex.

This is certainly a very interesting hack, and it solves the problem as I explained it. In simplifying the problem, though, I failed to mention that the same thing can happen in reverse, since the ending signature may also appear again later (and does, in my actual data). That introduces a problem like this:

ABC
content 1
123
content 2
ABC
content 3
XYZ
content 4
MNO
content 5
XYZ

In this instance, I would search for something like "ABC through XYZ", meaning to catch [ABC, content 3, XYZ]... but accidentally catching [ABC, content 1, 123, content 2, ABC, content 3, XYZ]. Reversing the string would catch [ABC, content 3, XYZ, content 4, MNO, content 5, XYZ] instead of the [ABC, content 3, XYZ] that we want. The point is to make this as generalized as possible, because I will also be searching for things that could share the same starting signature (the regex ABC in this case) but have different ending signatures.

If there is a way to build the regexes so that they encapsulate this sort of limitation, it could prove much easier to just reference that any time I build a regex to search for in this type of string, rather than creating a custom search algorithm that deals with it.
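For reference, Proposal 1 can be sketched like this (my illustration of the hack, assuming literal ABC/XYZ signatures so the pattern can simply be written reversed); it works on the original sample, but fails once the suffix also repeats later:

```python
import re

s = """ABC
content 1
123
content 2
ABC
content 3
XYZ"""

# Reverse the string and search lazily for the reversed pattern;
# the first ABC seen from the end is the innermost one.
rev = s[::-1]
m = re.search(r'ZYX[\W\w]+?CBA', rev)
print(m.group()[::-1])  # 'ABC\ncontent 3\nXYZ'
```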

Proposal 2: A+B+C+[^A]+[^B]+[^C]+XYZ with IGNORECASE flag

This seems nice in the case that ABC is a fixed literal. Think of it as a regex in itself, though. For example:

Hello!GoodBye!Hello.Later.

A VERY simplified version of what I'm trying to do. Given the start regex Hello[!.] and the end regex Later[!.], I would want "Hello.Later.". Running something simple like Hello[!.].*Later[!.] would grab the entire string, but I'm looking to say: if the start regex Hello[!.] occurs between the first start-regex match found and the first end-regex match found, ignore that first match.

The convo below this proposal indicates that I might be limited by regular language limitations similar to the parentheses matching problem (Google it, it's fun to think about). The purpose of this post is to see if I do in fact have to resort to creating an underlying algorithm that handles the issue I'm encountering. I would very much like to avoid it if possible (in the simple example that I gave you above, it's pretty easy to design a finite state machine for...I hope that holds as it grows slightly more complex).

Proposal 3: ABC(?:(?!ABC).)*?XYZ with DOTALL flag

I like the idea of this if it actually allows ABC to be a regex. I'll have to explore it when I get into the office tomorrow. Nothing looks out of the ordinary at first glance, but I'm entirely new to Python regex (and new to actually applying regexes in code instead of just in theory homework).
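For instance, substituting the start regex for the literal in Proposal 3 would look something like this (a sketch of what I plan to try, using the Hello/Later example from above):

```python
import re

s = 'Hello!GoodBye!Hello.Later.'

# Tempered dot: consume a character only where another start-regex
# match does NOT begin, so the match cannot step over a later Hello.
pattern = re.compile(r'Hello[!.](?:(?!Hello[!.]).)*?Later[!.]', re.DOTALL)
print(pattern.search(s).group())  # 'Hello.Later.'
```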


3 Answers

7

A regex solution would be ABC(?:(?!ABC).)*?XYZ with the DOTALL flag.
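Applied to the sample from the question, a sketch of how this behaves (the tempered dot refuses to step over another ABC, so the match restarts at the last one):

```python
import re

s = """ABC
content 1
123
content 2
ABC
content 3
XYZ"""

# (?:(?!ABC).) consumes one character only if ABC does not start
# there, so the match can never contain an inner ABC.
m = re.search(r'ABC(?:(?!ABC).)*?XYZ', s, re.DOTALL)
print(m.group())  # 'ABC\ncontent 3\nXYZ'
```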

Answered 2012-06-12T01:06:22.447
1

Edit

So after reading your further explanations, I would say that my previous proposal, as well as MRAB's, are somewhat similar and won't be of any help here. Your problem is actually the problem of nested structures.

Think of your 'prefix' and 'suffix' as symbols. You could just as easily replace them with an opening and a closing parenthesis, and what you want is to match only the smallest (i.e. deepest) pair.

For example if your prefix is 'ABC.' and your suffix is 'XYZ.':

ABChello worldABCfooABCbarXYZ

You want to get only ABCbarXYZ.

It's the same as if the prefix were ( and the suffix were ); given the string:

(hello world(foo(bar)

it would ideally match only (bar).

You definitely have to use a context-free grammar (as programming languages do: the C grammar, the Python grammar) and a parser, or roll your own using regexes together with the iteration and storage mechanisms of your programming language.

But that's not possible with regular expressions alone. They will probably help in your algorithm, but they simply are not designed to handle this by themselves. They're not the right tool for the job... you cannot inflate tires with a screwdriver. So you will have to use some external mechanism, nothing complicated, to store the context: your position in the nesting stack. Using your regular expression within each single context may still be possible.

Finite state machines are finite, while nested structures can have arbitrary depth, which would require the automaton to grow without bound; hence such languages are not regular.

"Since recursion in a grammar allows the definition of nested syntactic structures, any language (including any programming language) which allows nested structures is a context-free language, not a regular language. For example, the set of strings consisting of balanced parentheses [like a LISP program with the alphanumerics removed] is a context-free language" (see here).
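The bookkeeping described above can stay very small. Below is a minimal sketch (my own illustration, not a definitive implementation; the helper name `innermost` is made up) that treats prefix and suffix matches like parentheses: it scans all matches of both regexes in order and pairs the first suffix with the last unmatched prefix before it, which is exactly the innermost pair:

```python
import re

def innermost(prefix, suffix, s):
    """Return the innermost prefix...suffix span, or None.

    Tracks the most recent prefix like the top of a parenthesis
    stack, and closes it at the first suffix that follows.
    """
    events = sorted(
        [(m.start(), m.end(), 'pre') for m in re.finditer(prefix, s)] +
        [(m.start(), m.end(), 'suf') for m in re.finditer(suffix, s)]
    )
    last_pre = None
    for start, end, kind in events:
        if kind == 'pre':
            last_pre = start          # deepest opening seen so far
        elif last_pre is not None:
            return s[last_pre:end]    # first closing after an opening
    return None

print(innermost(r'ABC', r'XYZ', 'ABChello worldABCfooABCbarXYZ'))
# 'ABCbarXYZ'
```

Because the prefix and suffix are compiled regexes, this works unchanged for patterns like `Hello[!.]` and `Later[!.]` from the question.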

Former proposal (not relevant anymore)

If I do:

>>> import re
>>> s = """ABC
content 1
123
content 2
ABC
content 3
XYZ"""
>>> r = re.compile(r'A+B+C+[^A]+[^B]+[^C]+XYZ', re.I)
>>> re.findall(r,s)

I get

['ABC\ncontent 3\nXYZ']

Is that what you want ?

Answered 2012-06-12T01:05:51.133
0

There is another method of solving this problem: not trying to do it in one regex. You could split the string by the first regex, and then use the second one on the last part.

Code is the best explanation:

import re

s = """ABC
content 1
123
content 2
ABC
content 3
XYZ
content 4
XYZ"""

# capturing groups to preserve the matched section
prefix = re.compile('(ABC)')
suffix = re.compile('(XYZ)')

# prefix.split(s) == ['', 'ABC', [..], 'ABC', '\ncontent 3\nXYZ\ncontent 4\nXYZ']
#                          prefixmatch ^^^^^  ^^^^^^^^^^^^ rest ^^^^^^^^^^^^^^^^
prefixmatch, rest = prefix.split(s)[-2:]

# suffix.split(rest,1) == ['\ncontent 3\n', 'XYZ', '\ncontent 4\nXYZ']
#                          ^^ interior ^^   ^^^^^ suffixmatch
interior, suffixmatch = suffix.split(rest,1)[:2]

# join the parts up.
result = '%s%s%s' % (prefixmatch, interior, suffixmatch)

# result == 'ABC\ncontent 3\nXYZ'

Some points:

  • there should be appropriate error handling (even just a try: ... except ValueError: around the whole thing) to handle the case where either regex doesn't match at all, so the list unpacking fails.
  • this assumes that the desired segment will occur immediately after the last occurrence of prefix, if not, then you can iterate through the results of prefix.split(s) two at a time (starting at index 1) and do the same splitting trick with suffix to find all the matches.
  • this is likely to be reasonably inefficient, since it creates quite a few intermediate data structures.
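The second bullet can be sketched as follows (my illustration, using the same capturing-group patterns as above): walk the result of prefix.split(s) two elements at a time (prefix match, then the text up to the next prefix) and apply the same suffix-splitting trick to each segment:

```python
import re

s = """ABC
content 1
XYZ
junk
ABC
content 3
XYZ"""

prefix = re.compile('(ABC)')
suffix = re.compile('(XYZ)')

matches = []
parts = prefix.split(s)
# parts[1], parts[3], ... are the prefix matches; the element after
# each one is the text running up to the next prefix (or the end).
for i in range(1, len(parts), 2):
    pieces = suffix.split(parts[i + 1], 1)
    if len(pieces) == 3:              # a suffix was found in this segment
        interior, sufmatch = pieces[0], pieces[1]
        matches.append(parts[i] + interior + sufmatch)

print(matches)  # ['ABC\ncontent 1\nXYZ', 'ABC\ncontent 3\nXYZ']
```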
Answered 2012-06-12T12:16:21.097