regex - Get Twitter @Username with a regular expression in R

Question

How can I use a regular expression in R to extract Twitter usernames from a text string?

I tried:

library(stringr)

theString <- '@foobar Foobar! and @foo (@bar) but not foo@bar.com'

str_extract_all(string=theString,pattern='(?:^|(?:[^-a-zA-Z0-9_]))@([A-Za-z]+[A-Za-z0-9_]+)')

But I end up with @foobar, @foo and (@bar, which includes an unwanted parenthesis.

How can I get just @foobar, @foo and @bar as output?

score 8 · Accepted Answer

Here is an approach that works in R:

theString <- '@foobar Foobar! and @foo (@bar) but not foo@bar.com'
theString1 <- unlist(strsplit(theString, " "))
regex <- "(^|[^@\\w])@(\\w{1,15})\\b"
idx <- grep(regex, theString1, perl = TRUE)
theString1[idx]
[1] "@foobar" "@foo"    "(@bar)"

If you want to use @Jerry's answer in R:

regex <- "@([A-Za-z]+[A-Za-z0-9_]+)(?![A-Za-z0-9_]*\\.)"
idx <- grep(regex, theString1, perl = TRUE)
theString1[idx]
[1] "@foobar" "@foo"    "(@bar)"

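However, both of these approaches keep the parentheses that you don't want.

Update: This returns the usernames with no parentheses or any other kind of punctuation anywhere in them (except underscores, since those are allowed in usernames):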

theString <- '@foobar Foobar! and @fo_o (@bar) but not foo@bar.com'
theString1 <- unlist(strsplit(theString, " "))
regex1 <- "(^|[^@\\w])@(\\w{1,15})\\b" # get strings with @
regex2 <- "[^[:alnum:]@_]"             # remove all punctuation except _ and @
users <- gsub(regex2, "", theString1[grep(regex1, theString1, perl = TRUE)])
users

[1] "@foobar" "@fo_o"   "@bar"
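
As a possible variant of the update above, the handle can be pulled out of each filtered token with a second pattern instead of stripping punctuation, so that a token such as @foo's would yield @foo rather than @foos. This is only a sketch: the extra token @foo's and the names tokens, hits and handle_pattern are illustrative additions, not part of the original answer.

```r
# Sketch only: the "@foo's" token is added here purely for illustration.
theString <- "@foobar Foobar! and @fo_o (@bar) but not foo@bar.com @foo's"
tokens <- unlist(strsplit(theString, " "))

# Same filter as in the answer: keep tokens that contain a handle
regex1 <- "(^|[^@\\w])@(\\w{1,15})\\b"
hits <- grep(regex1, tokens, perl = TRUE, value = TRUE)

# Extract the first "@name" from each surviving token
handle_pattern <- "@\\w{1,15}"
users <- regmatches(hits, regexpr(handle_pattern, hits, perl = TRUE))
users
```

With the input shown, users should hold "@foobar", "@fo_o", "@bar" and "@foo"; the email address is dropped by the first filter before the extraction step runs.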

score 2 · Accepted Answer

@[a-zA-Z0-9_]{0,15}

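where:

@ matches the character @ literally (case sensitive).
[a-zA-Z0-9_] matches a single character present in the list.
{0,15} is a quantifier: it matches between 0 and 15 times, as many times as possible, giving back as needed (greedy).

This works fine for picking Twitter usernames out of a mixed dataset.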
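
To see how this pattern behaves on the string from the question, here is a small base R check (a sketch; the variable names are illustrative). Because the pattern puts no constraint on what precedes the @, it also picks up the @bar inside foo@bar.com, and {0,15} would accept a bare @ as well.

```r
theString <- '@foobar Foobar! and @foo (@bar) but not foo@bar.com'
pattern <- "@[a-zA-Z0-9_]{0,15}"

# gregexpr() finds every match; regmatches() returns the matched text
matches <- regmatches(theString, gregexpr(pattern, theString, perl = TRUE))[[1]]
matches
# [1] "@foobar" "@foo"    "@bar"    "@bar"
```

Changing the quantifier to {1,15} would at least rule out a lone @; excluding email addresses needs a check on the preceding character, as in the other answers.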

score 1 · Accepted Answer

Try using a negative lookbehind, so that the preceding character isn't consumed in your match:

(?:^|(?<![-a-zA-Z0-9_]))@([A-Za-z]+[A-Za-z0-9_]+)
      ^^^

EDIT: Since it looks like lookbehinds don't work in R (I found here that lookbehinds work in R, but apparently not...), try this:

@([A-Za-z]+[A-Za-z0-9_]+)(?![A-Za-z0-9_]*\\.)

EDIT: Double-escaped the dot.

EDITv3 ...: Try turning on PCRE:

str_extract_all(string=theString,perl("(?:^|(?<![-a-zA-Z0-9_]))@([A-Za-z]+[A-Za-z0-9_]+)"))
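
For comparison, the same lookbehind pattern can be tried in base R, where perl = TRUE switches gregexpr() to PCRE. This is a sketch that assumes the input string from the question and does not rely on stringr; the names pattern and handles are illustrative.

```r
theString <- '@foobar Foobar! and @foo (@bar) but not foo@bar.com'

# The negative lookbehind rejects an @ that directly follows a word
# character or hyphen, so the email address is skipped and "(" is not consumed
pattern <- "(?:^|(?<![-a-zA-Z0-9_]))@([A-Za-z]+[A-Za-z0-9_]+)"

handles <- regmatches(theString, gregexpr(pattern, theString, perl = TRUE))[[1]]
handles
# [1] "@foobar" "@foo"    "@bar"
```

Since the lookbehind is zero-width, the match starts at the @ itself, which is why no leading parenthesis or space shows up in the result.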
