regex - Regular expression to capture dialogue in Virginia Woolf's novel The Waves?

Question

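A group of us English graduate students are studying the dialogue in Virginia Woolf's novel The Waves, and I've been trying to mark up the novel in TEI. To that end, it would be useful to write a regex that captures the dialogue. Thankfully, The Waves is very regular, and almost all of the dialogue takes this form: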

'Now they have all gone,' said Louis. 'I am alone. They have gone into the house for breakfast,'

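but a speech can continue for several paragraphs. I'm trying to write a regex that matches all of the paragraphs belonging to a given speaker.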

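This is briefly discussed in a blog post by Chris Foster, where he suggests something like /'([^,]+,)' said Louis, '(*)'/, although I think that would only match a single paragraph. This is how I'm thinking about it: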

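For every paragraph that contains the text "said Louis" (or any other character's name) in its first line, match every line until another character's speech is reached, i.e. "said Rhoda".

I could probably do this with a lot of clunky Python, but I'd love to know whether it can be done with a regex.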

score 1 · Accepted Answer

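From your link, the text appears to follow these rules.

Each "line" really is a line in the strict sense, i.e. delimited by \n.
Paragraphs are separated by two or more consecutive newlines, i.e. \n\n+.
Only undirected (straight) single quotes ' are used to delimit speech.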
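As an aside, the three rules above are enough for a plain paragraph-by-paragraph pass of the kind the question describes. The following is a minimal Python sketch, not part of the regex approach: the names `SAID` and `group_by_speaker` are hypothetical, and the sketch assumes that a paragraph whose first line contains `' said Name` starts a new speech while every other paragraph continues the previous one.

```python
import re

# Hypothetical helper pattern: an end-quote, " said ", then a capitalised name
# followed by a period or a comma.
SAID = re.compile(r"' said ([A-Z][a-z]+)(?:[.]|,)")

def group_by_speaker(text):
    """Return a list of (speaker, [paragraphs]) in reading order."""
    turns = []
    # Rule 2: paragraphs are separated by two or more newlines.
    for para in re.split(r"\n\n+", text.strip()):
        # Rule 1: lines are delimited by a single newline.
        first_line = para.split("\n", 1)[0]
        match = SAID.search(first_line)
        if match:
            # A new "said X" opens a new speech.
            turns.append((match.group(1), [para]))
        elif turns:
            # No speaker tag: assume the previous speaker is continuing.
            turns[-1][1].append(para)
    return turns

# Invented sample in the novel's style, for illustration only.
sample = (
    "'Now they have all gone,' said Louis.  'I am alone.\n"
    "They have gone into the house.\n\n"
    "'I am left standing by the wall.'\n\n"
    "'I hear a sound,' said Rhoda, 'going up and down.'"
)

for speaker, paragraphs in group_by_speaker(sample):
    print(speaker, len(paragraphs))
```

Under this assumption, any paragraph without a speaker tag (narration, for instance) would also be attached to the previous speaker, so the output would still need checking by hand.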

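Here is a quick attempt at a regex (scroll all the way down to see the match groups). I'm sure it has flaws, but there is enough here to point you in the right direction. Note that if you concatenate the three capture groups (conventionally called $1, $2 and $3), you get each character's speech, including the punctuation around the "said" separator. Note, however, how certain quirks of the language throw this regex off: for example, we don't close the quote at the end of a paragraph, yet we do open a new quote if the speech continues into the next paragraph, which defeats any balanced-quote strategy. Apostrophes do the same.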

\n\n.*?'([^^]+?[?]?),?' said (?:[A-Z][a-z]+)(?:([.])  |, )'([^^]+?)'(?=[^']*(?:'[^']')*[^']*\n\n.*'(?:[^^]+?[?]?),?' said (?:[A-Z][a-z]+)(?:[.]  |, ))
|   |  | <----><--> <>|<-------------------><------------>| <----> |<--------------------------------------------------------------------------------->
|   |  | |     |    | ||                    |             | |      ||
|   |  | |     |    | ||                    |             | |      |assert that this end-quote is followed by a string of non-quote characters, then
|   |  | |     |    | ||                    |             | |      |zero or more strings of quoted non-quote characters, then another string of non-
|   |  | |     |    | ||                    |             | |      |quote characters, a new paragraph, and the next "said Bernard"; otherwise fail.
|   |  | |     |    | ||                    |             | |      |
|   |  | |     |    | ||                    |             | |      match an (end-)quote
|   |  | |     |    | ||                    |             | |
|   |  | |     |    | ||                    |             | match any character as needed (but no more than needed)
|   |  | |     |    | ||                    |             |
|   |  | |     |    | ||                    |             match a (start-)quote
|   |  | |     |    | ||                    |
|   |  | |     |    | ||                    match either a period followed by two spaces, or a comma followed by one space
|   |  | |     |    | ||
|   |  | |     |    | |match the "said Bernard"
|   |  | |     |    | |
|   |  | |     |    | match an (end-)quote
|   |  | |     |    |
|   |  | |     |    match a comma, optionally
|   |  | |     |
|   |  | |     match a question mark, optionally
|   |  | |
|   |  | match any character as needed (but no more than needed)
|   |  |
|   |  match a (start-)quote
|   |
|   match as many non-newline characters as needed (but no more than needed)
|
new paragraph

Rubular matches (excerpt):

Match 3

1.  But when we sit together, close
2.   
3.  we melt into each
    other with phrases. We are edged with mist. We make an
    unsubstantial territory.

Match 4

1.  I see the beetle
2.  .
3.  It is black, I see; it is green,
    I see; I am tied down with single words. But you wander off; you
    slip away; you rise up higher, with words and words in phrases.
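
To try the pattern outside Rubular, it can also be run with Python's `re` module. This is a sketch under stated assumptions: the pattern is copied verbatim from above, the sample text is invented for illustration (it is not quoted from the novel's source file), and `PATTERN` and `speeches` are hypothetical names.

```python
import re

# The pattern from the answer, split across two raw strings for readability.
PATTERN = re.compile(
    r"\n\n.*?'([^^]+?[?]?),?' said (?:[A-Z][a-z]+)(?:([.])  |, )'([^^]+?)'"
    r"(?=[^']*(?:'[^']')*[^']*\n\n.*'(?:[^^]+?[?]?),?' said (?:[A-Z][a-z]+)(?:[.]  |, ))"
)

def speeches(text):
    """Return one (group 1, group 2, group 3) tuple per matched speech.

    re.findall yields an empty string for a group that did not take part
    in the match, so group 2 is "" when "said X" is followed by a comma.
    """
    return PATTERN.findall(text)

# Invented sample. The pattern expects a blank line before every paragraph,
# including the first, and two spaces after "said X.".
paragraphs = [
    "'I see a ring,' said Bernard, 'hanging above me.'",
    "'I see the beetle,' said Susan.  'It is black, I see.'",
    "'I hear a sound,' said Rhoda, 'going up and down.'",
]
text = "\n\n" + "\n\n".join(paragraphs)

for groups in speeches(text):
    print(groups)
```

On this sample the first two paragraphs match and the third does not: the lookahead requires a following paragraph containing another "said X", so the final speech in any text is skipped.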
