ruby - How do I extract hashtags from non-English strings?

Question

I am using this code to extract hashtags from posts in my Rails 3.2.13 application. I am also using Ruby 1.9.3.

hasy = /(?:\s|^)(?:#(?!(?:\d+|\w+?_|_\w+?)(?:\s|$)))(\w+)(?=\s|$)/i
tags = post.body.scan(hasy).flatten.map { |i| "#" + i }

The code works for English words, but not for other languages, Arabic in particular. Does anyone have an idea how to fix this? My site uses a lot of Arabic text.

score 2 · Accepted Answer

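I suggest looking at the Regexp documentation for POSIX character classes. Several of them may suit your needs. I would start with [:graph:] and narrow it down as needed.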

From the documentation:

/[[:alnum:]]/ - Alphabetic and numeric character
/[[:alpha:]]/ - Alphabetic character
/[[:blank:]]/ - Space or tab
/[[:cntrl:]]/ - Control character
/[[:digit:]]/ - Digit
/[[:graph:]]/ - Non-blank character (excludes spaces, control characters, and similar)
/[[:lower:]]/ - Lowercase alphabetical character
/[[:print:]]/ - Like [:graph:], but includes the space character
/[[:punct:]]/ - Punctuation character
/[[:space:]]/ - Whitespace character ([:blank:], newline, carriage return, etc.)
/[[:upper:]]/ - Uppercase alphabetical
/[[:xdigit:]]/ - Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)

Ruby also supports the following non-POSIX character class:

/[[:word:]]/ - A character in one of the following Unicode general categories Letter, Mark, Number, Connector_Punctuation

For your purposes, something like:

/\s(#[[:graph:]]+)/

will capture both of your example strings. The Rubular link above has examples.
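As a rough illustration of "narrowing down", the sketch below compares the [[:graph:]] pattern above with a stricter variant built on [[:word:]]. The helper names and the sample string are hypothetical and are not part of the answer. The magic comment matters only on Ruby 1.9, where source files are not UTF-8 by default.

```ruby
# encoding: utf-8

# Hypothetical helper: the answer's pattern, which accepts any visible character,
# so punctuation directly after a tag is captured along with it.
def tags_with_graph(text)
  text.scan(/\s(#[[:graph:]]+)/).flatten
end

# Hypothetical helper: a narrower variant. [[:word:]] covers Unicode letters,
# marks, numbers, and connector punctuation such as the underscore.
def tags_with_word(text)
  text.scan(/\s(#[[:word:]]+)/).flatten
end

sample = "intro #القاهرة, then #foobar! and #ruby_lang"

p tags_with_graph(sample) # => ["#القاهرة,", "#foobar!", "#ruby_lang"]
p tags_with_word(sample)  # => ["#القاهرة", "#foobar", "#ruby_lang"]
```

Both variants require whitespace before the `#`, so a hashtag at the very start of the string is skipped. Replacing the leading `\s` with `(?:\s|^)`, as in the pattern from the question, would cover that case.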

score 2 · Accepted Answer

\w matches only ASCII characters. You can use POSIX bracket expressions in your regular expression to match non-ASCII characters that Unicode treats as alphabetic.

str = "some text before #القاهرة more text here القاهرة #foobar"
str.scan(/#[[:alnum:]]+/)
# => ["#القاهرة", "#foobar"]
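If the extra rules from the pattern in the question should be kept (no purely numeric tags, no tags that start or end with an underscore), one possible adaptation is to swap \w for [[:word:]] and \d for [[:digit:]]. This is only a sketch under that assumption. The constant HASHTAG and the method extract_hashtags are hypothetical names and do not appear in the question or in this answer.

```ruby
# encoding: utf-8

# Hypothetical constant: the question's pattern with \w and \d replaced by
# Unicode-aware POSIX bracket expressions. The /i flag is dropped because
# nothing in the pattern depends on letter case.
HASHTAG = /(?:\s|^)#(?!(?:[[:digit:]]+|[[:word:]]+?_|_[[:word:]]+?)(?:\s|$))([[:word:]]+)(?=\s|$)/

# Hypothetical helper: returns every hashtag in body, "#" prefix included.
def extract_hashtags(body)
  body.scan(HASHTAG).flatten.map { |tag| "#" + tag }
end

str = "some text before #القاهرة more text here القاهرة #foobar #123 #_x"
p extract_hashtags(str) # => ["#القاهرة", "#foobar"]
```

In a Rails model this could be called as extract_hashtags(post.body), mirroring the call in the question.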

score 0 · Accepted Answer

[^\x20-\x7E]+ will match runs of characters outside the printable ASCII range, which includes non-ASCII characters.
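A small sketch of what this pattern matches on its own. The constant and method names are hypothetical. Note that the negated range also covers ASCII control characters such as tab and newline, and that it excludes `#`, so it would have to be combined with a literal `#` to pick out hashtags.

```ruby
# encoding: utf-8

# Hypothetical constant: everything outside the printable ASCII range 0x20..0x7E.
OUTSIDE_PRINTABLE_ASCII = /[^\x20-\x7E]+/

# Hypothetical helper: returns each run of characters matched by the pattern.
def non_ascii_runs(text)
  text.scan(OUTSIDE_PRINTABLE_ASCII)
end

# The tab after the Arabic word is part of the same run, because control
# characters also fall outside 0x20..0x7E.
p non_ascii_runs("tag #القاهرة\tend")
```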

Answered 2013-09-18T14:56:23.333
