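A group of us English graduate students are studying the dialogue in Virginia Woolf's novel The Waves, and I have been trying to mark up the novel in TEI. To do that, it would be useful to write a regular expression that captures the dialogue. Thankfully, The Waves is extremely regular, and almost all of the dialogue takes this form: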

'Now they have all gone,' said Louis. 'I am alone. They have gone into the house for breakfast,'

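but a speech can continue for several paragraphs. I am trying to write a regular expression that matches all of the paragraphs belonging to a given speaker.

This is discussed briefly in Chris Foster's blog post, where he suggests something like /'([\^,]+,)' said Louis, '(*)'/, although I think that would only match a single paragraph. This is how I am thinking about it:

  • For every paragraph that contains the text "said Louis" (or any other character's name) in its first line, match every line until another character's speech begins, e.g. "said Rhoda".

I could probably do this with a lot of clumsy Python, but I would love to know whether it can be done with a regex.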


1 Answer

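From your link, the text appears to follow these rules.

  1. Each "line" really is a line in the strict sense, terminated by \n.
  2. Paragraphs are separated by two or more consecutive newlines, i.e. \n\n+.
  3. Only the undirected (straight) single quote ' is used to delimit speech.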

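Here is a quick attempt at a regex (scroll all the way down to see the match groups). It is flawed, I'm sure, but there is enough here to point you in the right direction. Note that if you concatenate the three capture groups (customarily called $1$2$3), you get each character's speech, including the punctuation that falls between the speech and the "said" delimiter. Note also that some quirks of the language throw this regex off. For example, a quote is not closed at the end of a paragraph when the speech continues into the next paragraph, yet a new quote is opened there, which upsets the whole balanced-quote strategy. Apostrophes do the same.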

\n\n.*?'([^^]+?[?]?),?' said (?:[A-Z][a-z]+)(?:([.])  |, )'([^^]+?)'(?=[^']*(?:'[^']')*[^']*\n\n.*'(?:[^^]+?[?]?),?' said (?:[A-Z][a-z]+)(?:[.]  |, ))
|   |  | <----><--> <>|<-------------------><------------>| <----> |<--------------------------------------------------------------------------------->
|   |  | |     |    | ||                    |             | |      ||
|   |  | |     |    | ||                    |             | |      |assert that this end-quote is followed by a string of non-quote characters, then
|   |  | |     |    | ||                    |             | |      |zero or more strings of quoted non-quote characters, then another string of non-
|   |  | |     |    | ||                    |             | |      |quote characters, a new paragraph, and the next "said Bernard"; otherwise fail.
|   |  | |     |    | ||                    |             | |      |
|   |  | |     |    | ||                    |             | |      match an (end-)quote
|   |  | |     |    | ||                    |             | |
|   |  | |     |    | ||                    |             | match any character as needed (but no more than needed)
|   |  | |     |    | ||                    |             |
|   |  | |     |    | ||                    |             match a (start-)quote
|   |  | |     |    | ||                    |
|   |  | |     |    | ||                    match either a period followed by two spaces, or a comma followed by one space
|   |  | |     |    | ||
|   |  | |     |    | |match the "said Bernard"
|   |  | |     |    | |
|   |  | |     |    | match an (end-)quote
|   |  | |     |    |
|   |  | |     |    match a comma, optionally
|   |  | |     |
|   |  | |     match a question mark, optionally
|   |  | |
|   |  | match any character as needed (but no more than needed)
|   |  |
|   |  match a (start-)quote
|   |
|   match as many non-newline characters as needed (but no more than needed)
|
new paragraph
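
As a rough check outside Rubular, the pattern above can be run in Python. This is only a sketch: the `sample` string is a shortened, illustrative stand-in for the novel's format (not text taken from your file), and the names `PATTERN` and `sample` are made up here. Python's `re` accepts the pattern unchanged, including the two literal spaces after `[.]`.

```python
import re

# The pattern from the answer, copied verbatim and split over two string
# literals for readability (note the two spaces after each "[.]").
PATTERN = re.compile(
    r"\n\n.*?'([^^]+?[?]?),?' said (?:[A-Z][a-z]+)(?:([.])  |, )'([^^]+?)'"
    r"(?=[^']*(?:'[^']')*[^']*\n\n.*'(?:[^^]+?[?]?),?' said (?:[A-Z][a-z]+)(?:[.]  |, ))"
)

# Illustrative sample in the novel's layout: paragraphs separated by a blank
# line, straight single quotes, "said Name" after the first clause.
sample = (
    "\n\n'I see a ring,' said Bernard, 'hanging above me.'"
    "\n\n'I see a slab,' said Susan, 'spreading away.'\n"
)

# findall returns one (group1, group2, group3) tuple per match; a group that
# did not take part in the match comes back as an empty string.
matches = PATTERN.findall(sample)
for before_said, stop, after_said in matches:
    print(repr(before_said), repr(stop), repr(after_said))
```

Only the first paragraph matches here, because the lookahead requires another "said Name" paragraph to follow. The last speech in a text is therefore never matched by this pattern.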

Rubular matches (excerpt):

Match 3

1.  But when we sit together, close
2.   
3.  we melt into each
    other with phrases. We are edged with mist. We make an
    unsubstantial territory.

Match 4

1.  I see the beetle
2.  .
3.  It is black, I see; it is green,
    I see; I am tied down with single words. But you wander off; you
    slip away; you rise up higher, with words and words in phrases.
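
Because unclosed quotes and apostrophes defeat quote balancing, it may be simpler to work at the paragraph level, as the question describes: start a new speech whenever the first line of a paragraph contains "said Name", and attach every other paragraph to the current speaker. The sketch below assumes the three rules listed earlier; `group_by_speaker`, `SPEAKER`, and the sample text are hypothetical names and data used only for illustration.

```python
import re

# "' said Louis." or "' said Louis," directly after a closing quote.
SPEAKER = re.compile(r"' said ([A-Z][a-z]+)[.,]")

def group_by_speaker(text):
    """Return a list of (speaker, [paragraphs]) pairs in reading order."""
    turns = []
    # Rule 2: paragraphs are separated by two or more newlines.
    for para in re.split(r"\n{2,}", text.strip()):
        first_line = para.split("\n", 1)[0]
        found = SPEAKER.search(first_line)
        if found:
            # A new speaker is announced in the first line of the paragraph.
            turns.append((found.group(1), [para]))
        elif turns:
            # No announcement: the paragraph continues the current speech.
            turns[-1][1].append(para)
    return turns

# Illustrative text only; the middle paragraph is invented to show a
# continuation that opens a quote without a preceding closing quote.
text = (
    "'Now they have all gone,' said Louis.  'I am alone.\n"
    "They have gone into the house for breakfast.\n"
    "\n"
    "'A continuation paragraph that opens a new quote.\n"
    "\n"
    "'I see a ring,' said Bernard, 'hanging above me.'\n"
)

turns = group_by_speaker(text)
for speaker, paragraphs in turns:
    print(speaker, len(paragraphs))
```

One limitation of this sketch: any paragraph without a "said Name" in its first line, including purely narrative passages, is attached to the previous speaker, so such passages would need to be filtered separately.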
Answered 2013-03-27T19:26:01.910