php - Can anyone crack this Twitter regex?

Question

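I want to use PHP to extract all the hashtags from http://search.twitter.com/search.atom?q=%23eu-jele%C4%A1%C4%A1i

The hashtags are in the content and title nodes of the RSS feed. They are prefixed with #.

The problem I'm having is with non-English letters (outside the a-zA-Z range).

If you look at the RSS feed and then view the HTML source, my struggle may be clearer.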

    <title>And more: #eu-jele&#289;&#289;i #eu-kiest #ue-wybiera #eu-eleger #ue-alege #eu-vyvolenej #eu-izvoli #eu-elegir #eu-v&#228;lja #eu-elect</title>

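Do I need to do something to the title node before looking for my regexp matches?

My end goal is to replace each hashtag with a Twitter search URL, e.g. http://search.twitter.com/search.atom?q=%23eu-jele%C4%A1%C4%A1i

Here is some sample code to help you.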


    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    </head>

    <body>
    <?php
    $title = "And more: #eu-jele&#289;&#289;i #eu-kiest #ue-wybiera #eu-eleger #ue-alege #eu-vyvolenej #eu-izvoli #eu-elegir #eu-v&#228;lja #eu-elect";

    // this is the regexp that hashtags.org use (http://twitter.pbwiki.com/Hashtags)
    // Backreferences are written "\\1": a bare "\1" in a double-quoted PHP string is an octal escape.
    $r = preg_replace("/(?:(?:^#|[\s\(\[]#(?!\d\s))(\w+(?:[_\-\.\+\/]\w+)*)+)/", " <a href=\"http://search.twitter.com/search?q=%23\\1\">\\1</a> ", $title);
    echo "<p>$r</p>";

    $r = preg_replace("/(#.+?)(?:(\s|$))/", " <a href=\"http://search.twitter.com/search?q=\\1\">\\1</a> ", $title);
    echo "<p>$r</p>";

    // This is my desired end result
    echo "<p><a href=\"http://search.twitter.com/search?q=%23eu-jeleġġi\">#eu-jeleġġi</a></p>";
    ?>

    </body>
    </html>

Any advice or solutions would be greatly appreciated.

score 9 · Accepted Answer

How about just:

(#\S+)
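The question also asks whether the title node needs any preparation before matching. A minimal sketch, assuming the title is already available as a string and the page is served as UTF-8: decoding the numeric entities first turns `&#289;` into `ġ`, after which the simple pattern above can be applied. The shortened `$title` and the variable names `$decoded` and `$hashtags` are illustrative, not from the original post.

```php
<?php
// Shortened title from the question; non-ASCII letters arrive as numeric entities.
$title = "And more: #eu-jele&#289;&#289;i #eu-kiest #eu-v&#228;lja #eu-elect";

// Decode entities to UTF-8 first, so "&#289;" becomes "ġ" and the only
// '#' characters left are the ones that start hashtags.
$decoded = html_entity_decode($title, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Grab '#' followed by every non-whitespace character.
preg_match_all('/(#\S+)/u', $decoded, $matches);
$hashtags = $matches[1];

print_r($hashtags);
```

Because the pattern only looks for whitespace, it does not care which letters a tag contains. The trade-off is that trailing punctuation (for example a comma right after a tag) would be captured as part of the tag.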

answered 2009-03-27T01:07:17.950

score 3 · Answer

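If you need the exact regular expression that Twitter uses to render hashtags, Twitter provides it, along with patterns for links, mentions, and so on, in this open-source library.

Hashtag matching pattern

(^|[^0-9A-Z&/]+)(#|\uFF03)([0-9A-Z_]*[A-Z_]+[a-z0-9_\\u00c0-\\u00d6\\u00d8-\\u00f6\\u00f8-\\u00ff]*)

The pattern above can be pieced together from this Java file. The validation tests for this pattern are around line 115 of this file.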
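The pattern above is written with Java escapes, so it cannot be pasted into `preg_match_all` unchanged. The following is a hypothetical PCRE translation, not Twitter's own PHP code: `\uXXXX` is rewritten as `\x{XXXX}`, and the `i` and `u` modifiers are assumptions about how the Java pattern is compiled. The sample `$text` is made up from tags in the question.

```php
<?php
// Hypothetical PCRE translation of the pattern above: \uXXXX escapes become
// \x{XXXX}; the "i" (case-insensitive) and "u" (UTF-8) flags are assumed.
$pattern = '/(^|[^0-9A-Z&\/]+)(#|\x{FF03})'
         . '([0-9A-Z_]*[A-Z_]+[a-z0-9_\x{00c0}-\x{00d6}\x{00d8}-\x{00f6}\x{00f8}-\x{00ff}]*)/iu';

$text = "And more: #eu-elect #ue-alege";
preg_match_all($pattern, $text, $m);

// $m[3] holds the tag text without the leading '#'.
// The character classes contain no hyphen, so each match stops at "-".
print_r($m[3]);
```

Note that under these assumptions the hyphenated tags from the question are cut at the first hyphen (`#eu-elect` is captured as `eu`), and letters outside the listed Latin-1 ranges, such as `ġ`, also end the match. That may or may not be what you want for this feed.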

score 1 · Answer

Grab a '#' plus all the characters that follow, until you hit a whitespace character:

(#.+?)(?:\s)

Or, a little more flexibly (allowing end of string):

(#.+?)(?:(\s|$))
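To get from a match to the link the question asks for, one option is `preg_replace_callback`, which lets each tag be URL-encoded separately. This is a sketch under a few assumptions: the input is the raw title string with numeric entities, the output page is UTF-8, and `link_hashtags` is a hypothetical helper name. The non-capturing wrapper from the pattern above is dropped so the trailing whitespace can be re-emitted.

```php
<?php
function link_hashtags(string $text): string
{
    // Decode numeric entities such as "&#289;" so they become UTF-8 letters.
    $decoded = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // '#' plus the shortest run of characters up to whitespace or end of string.
    return preg_replace_callback(
        '/(#.+?)(\s|$)/u',
        function (array $m): string {
            // rawurlencode turns '#' into %23 and non-ASCII bytes into %XX.
            $url = 'http://search.twitter.com/search?q=' . rawurlencode($m[1]);
            $label = htmlspecialchars($m[1], ENT_QUOTES, 'UTF-8');
            // Re-emit the captured whitespace so words stay separated.
            return '<a href="' . $url . '">' . $label . '</a>' . $m[2];
        },
        $decoded
    );
}

echo link_hashtags("And more: #eu-jele&#289;&#289;i #eu-elect"), "\n";
```

Only the tag text is escaped here. If the rest of the title can contain markup, it would need to be escaped as well before being echoed into a page.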

score 1 · Answer

Here's what I would use :)

(?<![^\s#])(#[^\s#]+)(?=(\s|$))

Example matching on this string:

#test #test#test #test-test test#test
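A quick way to see which parts of that string the pattern accepts is to run it through `preg_match_all`. This is only a sketch; the variable names are illustrative.

```php
<?php
$subject = '#test #test#test #test-test test#test';

// A tag must start at the beginning of the string, after whitespace, or after
// another '#', and must run up to whitespace or the end of the string.
preg_match_all('/(?<![^\s#])(#[^\s#]+)(?=(\s|$))/', $subject, $m);

// $m[1] holds the captured tags, including the leading '#'.
print_r($m[1]);
```

With this input only `#test` and `#test-test` are captured. `#test#test` is rejected because its first tag is followed by `#` rather than whitespace, and `test#test` is rejected because the `#` is preceded by a letter.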

Hope this is helpful.

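score 0 · Answer

Why are you using a regex? Strip everything that is not preceded by a hash, then explode on the hash. A regex seems unnecessarily complex and ill-suited to the problem.

Perhaps you could explain further why this needs to be done with a regex?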
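A rough sketch of that no-regex approach, assuming the entities have already been decoded (otherwise the `#` inside `&#289;` would split a tag) and that a hashtag ends at the first whitespace character. `extract_hashtags_without_regex` is a hypothetical helper name.

```php
<?php
function extract_hashtags_without_regex(string $text): array
{
    // Split on '#'; the first piece is the text before any hashtag.
    $pieces = explode('#', $text);
    array_shift($pieces);

    $tags = [];
    foreach ($pieces as $piece) {
        // Keep the leading run of non-whitespace bytes as the tag body.
        $length = strcspn($piece, " \t\r\n");
        if ($length > 0) {
            $tags[] = '#' . substr($piece, 0, $length);
        }
    }
    return $tags;
}

print_r(extract_hashtags_without_regex("And more: #eu-kiest #ue-wybiera"));
```

Unlike the lookaround pattern in the earlier answer, this treats a `#` in the middle of a word (`test#test`) as the start of a tag, so it is only a fit when that distinction does not matter.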
