0

I have a string like this:

src="http://www.google.com/calendar/embed?showTitle=0&mode=WEEK&height=600&wkst=1&bgcolor=%23FFFFFF&src=59flluvbaj110hp6ht5hrveof8%40group.calendar.google.com&color=%23B1365F&src=cnuvtn9nofljk5kq9381ic5odg%40group.calendar.google.com&color=%232952A3&ctz=America%2FNew_York" style="border-width:0" width="800" height="600" frameborder="0" scrolling="no"

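I want to extract the part in bold (the first src value, 59flluvbaj110hp6ht5hrveof8%40group.calendar.google.com). It always sits between a src= and the next &. Currently, I am doing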

"sample string above".match(/;src.*?&/)[0][5, length-5]

But this seems really inelegant. Is there a better way?

4 Answers

2
"sample string above"[/&src=(.*?)&/, 1]

The 1 refers to the first capture group.
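
To make the capture-group form concrete, here is a minimal sketch. SAMPLE_URL is a shortened, hypothetical stand-in for the string in the question, and first_src is a helper name made up for this illustration.

```ruby
# Hypothetical, shortened URL standing in for the string from the question.
SAMPLE_URL = 'http://www.google.com/calendar/embed?showTitle=0&src=first%40example.com&color=%23B1365F&src=second%40example.com&ctz=America%2FNew_York'

# String#[] with a regexp and an integer returns the text of that capture
# group, or nil when the pattern does not match.
def first_src(str)
  str[/&src=(.*?)&/, 1]
end

puts first_src(SAMPLE_URL)           # first%40example.com
p first_src('no query string here')  # nil
```

Note that the value comes back still percent-encoded (%40 rather than @), and that the pattern only matches a src that is followed by another & parameter.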

answered 2013-01-21T05:07:34.037
1

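You don't need a regex at all; you just need to understand what is going on. The problem is that the content of src has been HTML-entity encoded, so the & between the variables has been encoded to &amp;.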

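The fix is to decode the string first to reverse the encoding, and then split the string back into its components. You can do it like this: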

require 'cgi'
require 'uri'

uri = URI.parse(src)
hash = Hash[URI::decode_www_form(CGI::unescapeHTML(uri.query))]
hash['src'] # => "cnuvtn9nofljk5kq9381ic5odg@group.calendar.google.com"

An alternate way of decoding the query into a hash is:

hash = Hash[CGI::unescapeHTML(uri.query).split('&').map{ |q| q.split('=') }]

By splitting on & and then on = we get an array of arrays, which is easily converted to a hash, giving easy access to any variable in the string.

While these look like a longer path, they solve the problem and return the values to their original form.
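
As a rough sketch of the two decoding routes side by side: ENCODED_URL is a shortened, hypothetical URL with HTML-encoded ampersands and a single src parameter, and the method names query_params and raw_query_params are invented for this example.

```ruby
require 'cgi'
require 'uri'

# Hypothetical, shortened URL with &amp; between the parameters.
ENCODED_URL = 'http://www.google.com/calendar/embed?showTitle=0&amp;src=someone%40example.com&amp;ctz=America%2FNew_York'

# Route 1: undo the HTML encoding, then let URI split and percent-decode.
def query_params(url)
  query = CGI.unescapeHTML(URI.parse(url).query) # &amp; becomes &
  URI.decode_www_form(query).to_h                # %40 becomes @, %2F becomes /
end

# Route 2: undo the HTML encoding and split by hand.
# This leaves the percent-encoding in the values untouched.
def raw_query_params(url)
  query = CGI.unescapeHTML(URI.parse(url).query)
  query.split('&').map { |pair| pair.split('=', 2) }.to_h
end

p query_params(ENCODED_URL)['src']      # "someone@example.com"
p raw_query_params(ENCODED_URL)['src']  # "someone%40example.com"
```

The manual split is shorter to read, but only URI.decode_www_form reverses the percent-encoding, so the first route is the one that gives back the value in its original form.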

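Normally we'd want it as a hash, but in this case we can't do that, because there are two "src" parameters in the query and the second one stomps on the first. If you want the first one rather than the second, you need to get it without converting to a hash: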

URI::decode_www_form(CGI::unescapeHTML(uri.query)).select{ |k,v| k == 'src' }
=> [["src", "59flluvbaj110hp6ht5hrveof8@group.calendar.google.com"], ["src", "cnuvtn9nofljk5kq9381ic5odg@group.calendar.google.com"]]

URI::decode_www_form(CGI::unescapeHTML(uri.query)).select{ |k,v| k == 'src' }[0]
=> ["src", "59flluvbaj110hp6ht5hrveof8@group.calendar.google.com"]

URI::decode_www_form(CGI::unescapeHTML(uri.query)).select{ |k,v| k == 'src' }[1]
=> ["src", "cnuvtn9nofljk5kq9381ic5odg@group.calendar.google.com"]
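
If only the values are needed, the pairs can be reduced to a plain array. This is a small sketch under the assumption that the query has already been HTML-unescaped; REPEATED_QUERY and values_for are hypothetical names used for illustration.

```ruby
require 'uri'

# Hypothetical query string with the src key repeated, as in the question.
REPEATED_QUERY = 'showTitle=0&src=first%40example.com&color=%23B1365F&src=second%40example.com'

# Collect every value stored under the given key, in query order.
def values_for(query, key)
  URI.decode_www_form(query).select { |k, _| k == key }.map { |_, v| v }
end

p values_for(REPEATED_QUERY, 'src')
# ["first@example.com", "second@example.com"]

# Converting to a hash keeps only the last value for a repeated key.
p URI.decode_www_form(REPEATED_QUERY).to_h['src']
# "second@example.com"
```

values_for(REPEATED_QUERY, 'src').first then gives the first src without the surrounding ["src", ...] pair.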

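The string you showed doesn't look right, though; it looks like something you cut and pasted from HTML. If so, you should be using a parser to extract the content, not a regex. And, in this case, this is the right way to do it: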

require 'nokogiri'

html = '<img src="http://www.google.com/calendar/embed?showTitle=0&mode=WEEK&height=600&wkst=1&bgcolor=%23FFFFFF&src=59flluvbaj110hp6ht5hrveof8%40group.calendar.google.com&color=%23B1365F&src=cnuvtn9nofljk5kq9381ic5odg%40group.calendar.google.com&color=%232952A3&ctz=America%2FNew_York" style=" border-width:0 " width="800" height="600" frameborder="0" scrolling="no">'

doc = Nokogiri.HTML(html)
src = doc.at('img')['src']
=> "http://www.google.com/calendar/embed?showTitle=0&mode=WEEK&height=600&wkst=1&bgcolor=%23FFFFFF&src=59flluvbaj110hp6ht5hrveof8%40group.calendar.google.com&color=%23B1365F&src=cnuvtn9nofljk5kq9381ic5odg%40group.calendar.google.com&color=%232952A3&ctz=America%2FNew_York"

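The selector passed to Nokogiri's at method, doc.at('img'), may need to change depending on where the <img> tag sits in the document, but handling that is a separate issue.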

answered 2013-01-21T06:40:43.870
0

Fix Your Quote Delimiters

Your string, as originally posted, has quoting issues. Make sure you escape your string properly. For example, you might use this alternate syntax:

src = %q{http://www.google.com/calendar/embed?showTitle=0&mode=WEEK&height=600&wkst=1&bgcolor=%23FFFFFF&src=59flluvbaj110hp6ht5hrveof8%40group.calendar.google.com&color=%23B1365F&src=cnuvtn9nofljk5kq9381ic5odg%40group.calendar.google.com&color=%232952A3&ctz=America%2FNew_York" style=" border-width:0 " width="800" height="600" frameborder="0" scrolling="no"}

Using Positive Lookbehind

You can use a positive lookbehind assertion to scan your string for all matches, and then use an appropriate Array method to access the one you're interested in. For example:

src.scan(/(?<=src=)[^&]+/).first
# => "59flluvbaj110hp6ht5hrveof8%40group.calendar.google.com"
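
One caveat, sketched below under the assumption that the string still carries the attribute name (src="http://...) as in the question: a bare (?<=src=) also matches right after the attribute name itself. Requiring a ? or & before src is one possible way around that. ATTRIBUTE_STRING and src_values are hypothetical names for this illustration.

```ruby
# Hypothetical, shortened string that still includes the src= attribute name.
ATTRIBUTE_STRING = 'src="http://www.google.com/calendar/embed?showTitle=0&src=first%40example.com&color=%23B1365F&src=second%40example.com&ctz=America%2FNew_York" width="800"'

# Only accept src= when it is a query parameter, i.e. preceded by ? or &.
def src_values(str)
  str.scan(/(?<=[?&]src=)[^&]+/)
end

p src_values(ATTRIBUTE_STRING)
# ["first%40example.com", "second%40example.com"]

# The unanchored lookbehind picks up the attribute value first.
p ATTRIBUTE_STRING.scan(/(?<=src=)[^&]+/).first
# "\"http://www.google.com/calendar/embed?showTitle=0"
```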
answered 2013-01-21T07:18:43.143
0

You can do this with a capture group, like so:

"sample string above".sub(/^.*src=(.*?)&.*$/, '\1')
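
One thing to watch with this pattern: the leading .* is greedy, so when src appears more than once it settles on the last occurrence, not the first. The sketch below uses a shortened, hypothetical URL (GREEDY_URL) and made-up helper names to show the difference a lazy .*? makes.

```ruby
# Hypothetical, shortened URL with two src parameters.
GREEDY_URL = 'http://www.google.com/calendar/embed?showTitle=0&src=first%40example.com&color=%23B1365F&src=second%40example.com&ctz=America%2FNew_York'

# Greedy prefix: ^.* consumes as much as it can, so the LAST src= wins.
def last_src_via_sub(str)
  str.sub(/^.*src=(.*?)&.*$/, '\1')
end

# Lazy prefix: ^.*? stops at the FIRST src=.
def first_src_via_sub(str)
  str.sub(/^.*?src=(.*?)&.*$/, '\1')
end

puts last_src_via_sub(GREEDY_URL)   # second%40example.com
puts first_src_via_sub(GREEDY_URL)  # first%40example.com

# When nothing matches, sub returns the input unchanged rather than nil.
puts first_src_via_sub('nothing to see')  # nothing to see
```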
answered 2013-01-21T06:39:47.070