ruby - How do I match a UTF-8 encoded hashtag with embedded punctuation characters?

Question

I want to extract #hashtags from a string, also those that have special characters such as #1+1.

Currently I'm using:

@hashtags ||= string.scan(/#\w+/)

But it doesn't work with those special characters. Also, I want it to be UTF-8 compatible.

How do I do this?

EDIT:
If the last character is a special character, it should be removed, as in #hashtag, #hashtag. #hashtag! #hashtag? and so on.

Also, the hash sign at the beginning should be removed.

score 1 · Accepted Answer

Solution

You probably want something like this:

'#hash+tag'.encode('UTF-8').scan /\b(?<=#)[^#[:punct:]]+\b/
=> ["hash+tag"]

Note that the zero-width assertion at the start is needed to avoid capturing the pound sign as part of the match.
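
To illustrate the point about the zero-width assertion, here is a minimal sketch comparing a plain match, a lookbehind, and a capture group. The sample string and variable names are assumptions made for illustration, and a simpler \w+ body is used instead of the pattern above so that only the handling of the # differs.

```ruby
# Hypothetical sample input, not taken from the original question.
text = "#ruby and #rails"

# Plain match: the '#' is part of every result.
plain = text.scan(/#\w+/)

# Lookbehind: '#' must precede the match but is not consumed by it.
lookbehind = text.scan(/(?<=#)\w+/)

# Capture group: scan returns one array per match, so flatten the result.
captured = text.scan(/#(\w+)/).flatten

p plain       # ["#ruby", "#rails"]
p lookbehind  # ["ruby", "rails"]
p captured    # ["ruby", "rails"]
```

Both the lookbehind and the capture group drop the leading #; which one to use is mostly a matter of taste.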

score 0 · Accepted Answer

How about this:

@hashtags ||=string.match(/(#[[:alpha:]]+)|#[\d\+-]+\d+/).to_s[1..-1]

This handles #alphabets as well as #2323+2323, #2323-2323 and #2323+65656-67676.

It also removes the leading #.

Or, if you want the result as an array:

@hashtags ||= string.scan(/#[[:alpha:]]+|#[\d\+-]+\d+/).collect { |x| x[1..-1] }

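Wow, this took a long time, and I still don't understand why scan(/#[[:alpha:]]+|#[\d\+-]+\d+/) works but scan(/(#[[:alpha:]]+)|#[\d\+-]+\d+/) does not on my machine. The only difference is the () in the second scan call. With the match method, the parentheses make no difference.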
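
The difference described above is most likely explained by how String#scan treats capture groups: when the pattern contains groups, scan returns the captures of each match rather than the matched text, whereas match(...).to_s always yields the whole match. A small sketch, with a sample string assumed for illustration:

```ruby
# Hypothetical sample input, not taken from the original answer.
string = "#ruby and #1+1"

# No groups: scan returns the full text of every match.
without_group = string.scan(/#[[:alpha:]]+|#[\d\+-]+\d+/)

# One capture group: scan returns an array of captures per match.
# The group takes no part in the second alternative, so its capture is nil.
with_group = string.scan(/(#[[:alpha:]]+)|#[\d\+-]+\d+/)

# Slicing each one-element capture array with [1..-1] leaves nothing.
sliced = with_group.collect { |x| x[1..-1] }

# A non-capturing group (?:...) behaves like the version without parentheses.
non_capturing = string.scan(/(?:#[[:alpha:]]+)|#[\d\+-]+\d+/)

p without_group  # ["#ruby", "#1+1"]
p with_group     # [["#ruby"], [nil]]
p sliced         # [[], []]
p non_capturing  # ["#ruby", "#1+1"]
```

Dropping the parentheses or making the group non-capturing restores the flat array of strings that the collect call expects.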

score 0 · Accepted Answer

This should work:

@hashtags = str.scan(/#([[:graph:]]*[[:alnum:]])/).flatten

Or, if you don't want hashtags to start with a special character:

@hashtags = str.scan(/#((?:[[:alnum:]][[:graph:]]*)?[[:alnum:]])/).flatten
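
As a rough check of the two patterns above, the sketch below runs them on a made-up string with trailing punctuation, a non-ASCII letter, and a tag that begins with a symbol. The input and the variable names loose and strict are assumptions for illustration only.

```ruby
# Hypothetical sample input; "\u00e9" is the letter e with an acute accent.
str = "Try #1+1, #hashtag! #caf\u00e9. and #+plus"

# First pattern: any run of visible characters that ends in a letter or digit.
loose = str.scan(/#([[:graph:]]*[[:alnum:]])/).flatten

# Second pattern: the tag must also begin with a letter or digit.
strict = str.scan(/#((?:[[:alnum:]][[:graph:]]*)?[[:alnum:]])/).flatten

p loose   # ["1+1", "hashtag", "café", "+plus"]
p strict  # ["1+1", "hashtag", "café"]
```

Trailing punctuation is dropped because the final character must match [[:alnum:]]. Note that [[:graph:]] also matches #, so adjacent tags written without a space, such as #a#b, would be returned as a single tag.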
