regex - R regular expression: extracting speakers from a script

Question

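I want to use R to extract the speakers from a script formatted as in the following example:

"Scene 6: Second Lord: Nay, good my lord, put him to't; let him have his way. First Lord: If your lordship find him not a hilding, hold me no more in your respect. Second Lord: On my life, my lord, a bubble. BERTRAM: Do you think I am so far deceived in him? Second Lord: Believe it, my lord, in mine own direct knowledge, without any malice, but to speak of him as my kinsman, he's a most notable coward, an infinite and endless liar, an hourly promise-breaker, the owner of no one good quality worthy your lordship's entertainment."

In this example, I want to extract: ("Second Lord", "First Lord", "Second Lord", "BERTRAM", "Second Lord"). The rule is obvious: a speaker is the group of words between the end of a sentence and a colon.

How can I write this in R?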

score 4 · Accepted Answer

Maybe something like this:

library(stringr)
body <- "Scene 6: Second Lord: Nay, good my lord, put him to't; let him have his way. First Lord: If your lordship find him not a hilding, hold me no more in your respect. Second Lord: On my life, my lord, a bubble. BERTRAM: Do you think I am so far deceived in him? Second Lord: Believe it, my lord, in mine own direct knowledge, without any malice, but to speak of him as my kinsman, he's a most notable coward, an infinite and endless liar, an hourly promise-breaker, the owner of no one good quality worthy your lordship's entertainment."
# [A-Za-z ] rather than [A-z ]: the range A-z also covers [ \ ] ^ _ `
p <- str_extract_all(body, "[:.?] [A-Za-z ]*:")

# and get rid of extra signs
p <- str_replace_all(p[[1]], "[?:.]", "")
# strip white spaces
p <- str_trim(p)
p
# [1] "Second Lord" "First Lord"  "Second Lord" "BERTRAM"     "Second Lord"

# unique players
unique(p)
# [1] "Second Lord" "First Lord"  "BERTRAM"

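Explanation of the regex (not perfect):

The pattern passed to str_extract_all starts with one of :, . or ? (the class [:.?]), followed by a space. After that, any run of letters and spaces is matched up to the next :. Note that a class written as [A-z] also admits the ASCII characters that sit between Z and a ([ \ ] ^ _ `), so [A-Za-z ] is the safer spelling.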
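As a variation on the explanation above, here is a minimal sketch that uses lookarounds so the punctuation and the colon are never part of the match, which makes the clean-up steps unnecessary. It is an alternative technique, not the pattern from this answer, and it assumes base R with perl = TRUE; the text is a shortened excerpt of the question's example.

```r
# Shortened excerpt of the example text from the question.
body <- paste(
  "Scene 6: Second Lord: Nay, good my lord, put him to't; let him have his way.",
  "First Lord: If your lordship find him not a hilding, hold me no more in your respect.",
  "Second Lord: On my life, my lord, a bubble.",
  "BERTRAM: Do you think I am so far deceived in him?",
  "Second Lord: Believe it, my lord, in mine own direct knowledge."
)

# (?<=[:.?] )  the name must be preceded by ':', '.' or '?' plus one space
# [A-Za-z ]+   the name itself: letters and spaces only
# (?=:)        the name must be followed by a colon (not consumed)
pattern <- "(?<=[:.?] )[A-Za-z ]+(?=:)"

speakers <- regmatches(body, gregexpr(pattern, body, perl = TRUE))[[1]]
print(speakers)
print(unique(speakers))
```

Because the delimiters are only asserted, not consumed, no str_replace_all or str_trim pass is needed afterwards. The same caveat applies as above: a name containing a digit, hyphen or apostrophe would be missed.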

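Getting the positions

You can use str_locate_all with the same regex. It returns a list with one matrix per input string, holding the start and end position of each match:

str_locate_all(body, "[:.?] [A-Za-z ]*:")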

score 3 · Accepted Answer

gsubfn/strapplyc

Try this, where x is the input string. Here strapplyc returns the parts matched by the parenthesized group:

> library(gsubfn)
> strapplyc(x, "[.?:] *([^:]+):", simplify = c)
[1] "Second Lord" "First Lord"  "Second Lord" "BERTRAM"     "Second Lord"

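gregexpr

Here is a second approach. It uses no external packages. We compute the start and end positions (start.pos and end.pos) of each name and then extract the substrings they delimit: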

> pos <- gregexpr("[.?:] [^:]+:", x)[[1]]
> start.pos <- pos + 2
> end.pos <- start.pos + attr(pos, "match.length") - 4
> substring(x, start.pos, end.pos)
[1] "Second Lord" "First Lord"  "Second Lord" "BERTRAM"     "Second Lord"
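One caveat worth sketching: [^:]+ accepts anything except a colon, so it works on the example because every speech there is a single sentence. The snippet below uses a made-up two-sentence speech (the text and the helper name extract_names are hypothetical, for illustration only) to compare it with a class restricted to letters and spaces.

```r
# Hypothetical input: the first speech contains two sentences.
x <- "Scene 1: KING: Stay. Who goes there? FOOL: A friend."

# Helper (hypothetical name): find all matches, then strip the leading
# punctuation-plus-space and the trailing colon.
extract_names <- function(pattern, text) {
  hits <- regmatches(text, gregexpr(pattern, text))[[1]]
  gsub("^[.?:] |:$", "", hits)
}

loose  <- extract_names("[.?:] [^:]+:", x)       # anything but a colon
strict <- extract_names("[.?:] [A-Za-z ]+:", x)  # letters and spaces only

print(loose)
print(strict)
```

With the loose class the second match starts at the first sentence end inside the speech and swallows the following sentence along with the name; the restricted class cannot cross the question mark, so it yields only the names.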

score 2 · Accepted Answer

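At least in this case, a better solution is to look for the text in a more structured form. Mining structured documents is almost always easier than mining unstructured ones. Since the source is Shakespeare, there are many copies floating around the internet.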

library(XML)  # provides htmlParse, xpathApply and xmlValue
script_url <- "http://www.opensourceshakespeare.org/views/plays/play_view.php?WorkID=allswell&Act=3&Scene=6&Scope=scene"
doc <- htmlParse(script_url)
character_links <- xpathApply(doc, '//li[@class="playtext"]/strong/a')
characters <- unique(sapply(character_links, xmlValue))
#[1] "Second Lord" "First Lord"  "Bertram"     "Parolles"

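Note that it makes a big difference which version of the text you use. Open Source Shakespeare is very nice, because its HTML pages are well structured and include classes. The Bartleby page, on the other hand, is not. Let's run the analysis again: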

script_url2 <- "http://www.bartleby.com/70/2236.html"
doc2 <- htmlParse(script_url2)
tbl <- xpathApply(doc2, '//table[@width="100%"]')[[1]]
italics <- xpathApply(tbl, '//tr/td/i')
characters2 <- unique(sapply(italics, xmlValue))
#[1] "First Lord." "Sec. Lord."  "Ber."        "Par."        "hic jacet."  "Exit."      
#[7] "Ber"         "Exeunt."

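In this case, you can't programmatically distinguish between characters, stage directions (short of compiling a list of possible stage directions and ignoring them), and emphasized speech. Choose your text source wisely!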
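To make the point concrete, here is a rough sketch of the "compile a list and ignore it" workaround, applied to the Bartleby output shown above. The vector stage_directions is a hypothetical, hand-made and incomplete list, and the input is copied from the printed result rather than scraped again.

```r
# Italic strings as printed above for the Bartleby page.
characters2 <- c("First Lord.", "Sec. Lord.", "Ber.", "Par.",
                 "hic jacet.", "Exit.", "Ber", "Exeunt.")

# Hypothetical, incomplete list of stage directions to drop.
stage_directions <- c("Exit.", "Exeunt.")

kept <- characters2[!characters2 %in% stage_directions]

# Drop a single trailing period so "Ber." and "Ber" collapse into one entry.
speakers <- unique(sub("\\.$", "", kept))
print(speakers)
```

Even with the stage directions filtered out, the emphasized phrase "hic jacet" is still in the result, which is exactly the kind of ambiguity a well-structured source avoids.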
