OK, I have a multi-line string that I'm trying to clean up.

Each line may or may not be part of a larger block of quoted text. Example:

This line is not quoted.
This part of the line is not quoted “but this is.”
This one is not quoted either.
“This entire line is quoted”
Not quoted.
“This line is quoted
and so is this one
and so is this one.”
This is not quoted “but this is
and so is this.”

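I need a regex replacement that unwraps the hard-wrapped quoted lines, i.e. replaces "\r\n" with a space, but only between the curly quotes.

Here's how it should look after the replacement: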

This line is not quoted.
This part of the line is not quoted “but this is.”
This one is not quoted either.
“This entire line is quoted”
Not quoted.
“This line is quoted and so is this one and so is this one.”
This is not quoted “but this is and so is this.”

(Note that each of the last two lines was several lines in the input text.)

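Constraints

  • Ideally this should be a single regex replace call
  • Uses the .NET RegEx library
  • The quotes are always opening/closing curly quotes, never the plain double quote ("), which should make this a bit easier.

Important constraint

This isn't direct .NET code. I'm populating a table of "searchfor/replacewith" strings, which are then run through RegEx.Replace. I can't add custom code such as match evaluators, loops over captured groups, etc.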
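To make the constraint concrete, the setup can be pictured roughly as follows. This is only a sketch: the table entries and the names `table`, `SearchFor`, and `ReplaceWith` are hypothetical, and the real host application may differ. The point is that each entry gets exactly one `Regex.Replace` call with a plain replacement string, so all of the logic has to live in the pattern. It is written with C# top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical search/replace table; the real entries are not shown in the question.
var table = new (string SearchFor, string ReplaceWith)[]
{
    (@"\t", " "),      // turn each tab into a space
    (@" {2,}", " "),   // collapse runs of spaces into one
};

string text = "a\t\tb   c";

// Each entry is applied once, in order, with a plain replacement string.
foreach (var (searchFor, replaceWith) in table)
{
    text = Regex.Replace(text, searchFor, replaceWith);
}

Console.WriteLine(text);
```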

The closest I've got so far is something along these lines:

r.Replace("(?<=“)\r\n(?=”)", " ")

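Obviously, I'm not even close yet.

The same logic could be applied to color-coding block comments in programming code: anything inside a block comment is treated differently from what's outside it. (Code is a bit trickier, because the start/end block-comment delimiters can also legitimately appear inside literal strings, which I don't have to deal with here.)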

5 Answers

4

Assuming all curly quotes are properly balanced, this regex should do what you want:

@"[\r\n]+(?=[^“”]*”)"

The [\r\n]+ will match one or more line separators of any type--Unix (\n), DOS (\r\n) or older Mac (\r). Then the lookahead asserts that there's a close-quote ahead and that there's no open-quote between here and there. Then your replacement text can be a simple space character.
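As a quick check, the pattern can be run against the sample from the question. This is a minimal sketch using the static `Regex.Replace` overload; the variable names are illustrative, and the input is assumed to use `\r\n` line breaks. It is written with C# top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Sample input from the question, joined with DOS line breaks.
string input = string.Join("\r\n",
    "This line is not quoted.",
    "This part of the line is not quoted “but this is.”",
    "This one is not quoted either.",
    "“This entire line is quoted”",
    "Not quoted.",
    "“This line is quoted",
    "and so is this one",
    "and so is this one.”",
    "This is not quoted “but this is",
    "and so is this.”");

// A run of line breaks matches only if a close quote follows
// with no quote character of either kind in between.
string pattern = @"[\r\n]+(?=[^“”]*”)";

string output = Regex.Replace(input, pattern, " ");
Console.WriteLine(output);
```

The lookahead relies on the quotes being balanced: a stray close quote with no open quote before it would cause the unquoted line breaks leading up to it to be joined as well.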

answered 2009-03-04T01:24:49.300
1

NB: For testing regexes I use http://gskinner.com/RegExr/ which is very useful.

I don't think you can write a single expression that will replace an undefined number of newlines. However, you can write an expression to replace one or several, and either repeatedly run it or write it to deal with the max number of newlines you'll have within one quoted section.

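First, you want single-line mode, which lets . match newline characters so that an expression can span line breaks instead of stopping at the end of each line. (The negated character classes used below match newlines with or without it.) Put this at the start of your expression to turn it on: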

(?s)

Then, you want a look-behind expression to match the start quote:

(?<=“)

And a look-ahead to match the end quote:

(?=”)

Now an expression to match some text, then a newline, then some text:

([^”\r]*)\r?([^”\r]*)

Note that there are two capturing groups for the bits of text around the newline, so you can include that text in your replace expression. This will match text that has just one newline within the quotes. To extend this to two newlines, just add another optional newline and optional following text:

(?s)(?<=“)([^”\r]*)\r?([^”\r]*)\r?([^”\r]*)(?=”)

You could extend this to match as many newlines as you think might occur. Not perfect, but perhaps sufficient. Or if you can repeatedly run the expression on your text then just replace a single one at a time.

Leaving your expression something like this:

r.Replace("(?s)(?<=“)([^”\r]*)\r?([^”\r]*)", "$1 $2")

(This isn't quite correct as it'll add a space after text even if group two doesn't match... but it's a start)
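If the replacement can be run repeatedly, removing one break per quoted block per pass is enough. The sketch below does not use the expression above verbatim: it assumes a variant that consumes `\r?\n` and excludes `\n` from the character classes, because with `\r\n` input the original would leave the `\n` behind in group two. The loop is custom code, so it only applies where the host allows repeated runs. The sample text and variable names are illustrative; it is written with C# top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Assumed variant of the answer's pattern: after an open quote, a run of text,
// exactly one line break, then another run of text, all before any close quote.
string pattern = @"(?<=“)([^”\r\n]*)\r?\n([^”\r\n]*)";

string text = "Not quoted.\r\n“one\r\ntwo\r\nthree”\r\nNot quoted.";

// Each pass joins one line break per quoted block; stop when nothing changes.
string previous;
do
{
    previous = text;
    text = Regex.Replace(text, pattern, "$1 $2");
} while (text != previous);

Console.WriteLine(text);
```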

answered 2009-03-03T23:46:03.827
0

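So what you want to do is find a string that begins with an open quote, followed by a run of characters containing no close quote and no \r or \n characters, followed by a series of one or more \r \n characters. Capture everything except the terminal \r \n characters, and replace the whole match with the captured part.

-- MarkusQ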
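One possible reading of that description as a pattern, offered as a sketch rather than the answerer's exact expression: capture the open quote and the text up to the first line break, match the line-break run after it, and put the capture back. The trailing space in the replacement is an addition here so that the joined words stay separated. A single pass removes only the first break in each quoted block, so several passes are needed for longer blocks. Variable names are illustrative; it is written with C# top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Group 1: an open quote plus text with no close quote and no line-break characters.
// After it: the run of line-break characters that gets dropped.
string pattern = @"(“[^”\r\n]*)[\r\n]+";

string input = "“one\r\ntwo\r\nthree”";

// The first pass joins only the first break; the second pass joins the next one.
string firstPass = Regex.Replace(input, pattern, "$1 ");
string secondPass = Regex.Replace(firstPass, pattern, "$1 ");

Console.WriteLine(firstPass);
Console.WriteLine(secondPass);
```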

answered 2009-03-03T21:43:56.930
0
answered 2009-03-03T22:21:12.423
0

You cannot do what you want within the limits you have described.

Proof:

  • Your fixed table of replacements will execute a fixed number of calls to replace (call this number n).
  • Each replace will only be able to eliminate a fixed number of line breaks (call this number m).

Therefore

  • A quoted block with m*n+1 line breaks will not be properly dealt with.

You either need to increase the power of your setup (e.g. by allowing more complex replacement, recursive replacements, an indefinite repetition flag, or...?) or accept the fact that this task can't be done by your engine.

-- MarkusQ

answered 2009-03-03T23:41:45.163