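Judging from your link, the text follows these rules:
- Each "line" really is a line in the strict sense, i.e. delimited by `\n`.
- Paragraphs are separated by two or more consecutive newlines, i.e. `\n\n+`.
- Only non-directional single quotes `'` are used to delimit speech.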
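As a rough illustration of those three rules, here is a minimal Python sketch. The sample string is made up for the example (it borrows a phrase from the matches further down, with `Bernard` as a placeholder speaker); it is not the text behind your link.

```python
import re

# Invented sample that follows the three rules; NOT the linked text.
sample = (
    "'I see the beetle,' said Bernard.  'It is black, I see;\n"
    "it is green, I see.'\n"
    "\n"
    "\n"
    "'We are edged with mist,' said Bernard, 'with phrases.'"
)

# Rule 2: paragraphs are separated by two or more consecutive newlines.
paragraphs = re.split(r"\n\n+", sample)

# Rule 1: inside a paragraph, every line ends at a single "\n".
lines_per_paragraph = [p.split("\n") for p in paragraphs]

# Rule 3: speech is delimited by the plain single quote, so a paragraph
# with properly closed speech contains an even number of them.
quote_counts = [p.count("'") for p in paragraphs]

print(len(paragraphs), [len(ls) for ls in lines_per_paragraph], quote_counts)
```

An odd quote count in a paragraph is a quick hint that the speech runs on into the next paragraph or that an apostrophe is present.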
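Here's a quick attempt (scroll all the way down to see the match groups). I'm sure it has flaws, but there is enough here to point you in the right direction. Note how, if you concatenate the three capture groups (conventionally called `$1`, `$2` and `$3`), you get each character's speech, including the punctuation on either side of the "said" delimiter. Note, however, how some quirks of the language throw this regex off. For example, we don't close the quote at the end of a paragraph, yet we open a new one if the speech continues into the next paragraph, which breaks the whole balanced-quote strategy. Apostrophes break it as well.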
    \n\n.*?'([^^]+?[?]?),?' said (?:[A-Z][a-z]+)(?:([.])  |, )'([^^]+?)'(?=[^']*(?:'[^']*')*[^']*\n\n.*'(?:[^^]+?[?]?),?' said (?:[A-Z][a-z]+)(?:[.]  |, ))
    |   |  | <----><--> <>|<-------------------><------------>| <----> |<--------------------------------------------------------------------------------->
    |   |  |   |    |   | ||                          |       |   |    ||
    |   |  |   |    |   | ||                          |       |   |    |assert that this end-quote is followed by a string of non-quote characters, then
    |   |  |   |    |   | ||                          |       |   |    |zero or more strings of quoted non-quote characters, then another string of non-
    |   |  |   |    |   | ||                          |       |   |    |quote characters, a new paragraph, and the next "said Bernard"; otherwise fail.
    |   |  |   |    |   | ||                          |       |   |    |
    |   |  |   |    |   | ||                          |       |   |    match an (end-)quote
    |   |  |   |    |   | ||                          |       |   |
    |   |  |   |    |   | ||                          |       |   match any character as needed (but no more than needed)
    |   |  |   |    |   | ||                          |       |
    |   |  |   |    |   | ||                          |       match a (start-)quote
    |   |  |   |    |   | ||                          |
    |   |  |   |    |   | ||                          match either a period followed by two spaces, or a comma followed by one space
    |   |  |   |    |   | ||
    |   |  |   |    |   | |match the "said Bernard"
    |   |  |   |    |   | |
    |   |  |   |    |   | match an (end-)quote
    |   |  |   |    |   |
    |   |  |   |    |   match a comma, optionally
    |   |  |   |    |
    |   |  |   |    match a question mark, optionally
    |   |  |   |
    |   |  |   match any character as needed (but no more than needed)
    |   |  |
    |   |  match a (start-)quote
    |   |
    |   match as many non-newline characters as needed (but no more than needed)
    |
    new paragraph
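To try the pattern outside Rubular, here is a hedged Python sketch. It assumes two spaces after the period alternative (as the breakdown states) and `[^']*` inside the quoted-string group of the lookahead. The sample text and the helper name `speeches` are invented for illustration, and joining `$1 + $2` to `$3` with a single space is my own choice, not something the regex does.

```python
import re

# The pattern from above, split across two raw strings for readability.
# Assumptions: two spaces after "[.]", and "[^']*" inside the quoted group.
SPEECH = re.compile(
    r"\n\n.*?'([^^]+?[?]?),?' said (?:[A-Z][a-z]+)(?:([.])  |, )'([^^]+?)'"
    r"(?=[^']*(?:'[^']*')*[^']*\n\n.*'(?:[^^]+?[?]?),?' said (?:[A-Z][a-z]+)(?:[.]  |, ))"
)


def speeches(text):
    """Return one string per match, built as $1 + $2 + ' ' + $3."""
    result = []
    for m in SPEECH.finditer(text):
        before_said = m.group(1)
        period = m.group(2) or ""  # $2 is unset when "said X" ends with a comma
        after_said = m.group(3)
        result.append(f"{before_said}{period} {after_said}")
    return result


# Invented sample; "Bernard" is only a placeholder speaker.
paragraphs = [
    "'I see the beetle,' said Bernard.  'It is black, I see.'",
    "'But when we sit together, close,' said Bernard, 'we melt into each other.'",
    "'Look,' said Bernard, 'the light.'",
]
# The pattern anchors on a paragraph break, so the text must start with one.
sample = "\n\n" + "\n\n".join(paragraphs) + "\n"

for line in speeches(sample):
    print(line)
```

The last paragraph of the sample is not returned: the lookahead only succeeds when another "said" paragraph follows, so the final speech in a text would need separate handling.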
Rubular matches (excerpt):
Match 3
1. But when we sit together, close
2.
3. we melt into each
other with phrases. We are edged with mist. We make an
unsubstantial territory.
Match 4
1. I see the beetle
2. .
3. It is black, I see; it is green,
I see; I am tied down with single words. But you wander off; you
slip away; you rise up higher, with words and words in phrases.