
I am trying to extract a URL from some content using Yahoo Pipes, but for that I need to match everything before the URL and everything after it:

<div class="medium mode player"><div class="info-header"><a rel="nofollow" target="_blank" 
href="http://i1.sndcdn.com/artworks-000059185212-dsb68g-crop.jpg?3eddc42" class="artwork" 
style="background:url(http://i1.sndcdn.com/artworks-000059185212-dsb68g-badge.jpg?
3eddc42);">Dream ft. Notorious BIG Artwork</a> <h3><a rel="nofollow" target="_blank" 
href="http://soundcloud.com/tom-misch/dream-ft-notorious-big">Dream ft. Notorious BIG</a>
</h3> <span class="subtitle"><span class="user tiny online"><a rel="nofollow" 
target="_blank" href="http://soundcloud.com/tom-misch" class="user-name">Tom Misch</a>

The URL I want is this one: http://soundcloud.com/tom-misch/dream-ft-notorious-big

I tried to learn a bit about regex, but whenever I think I understand it, nothing I try works.

I hope some of you can help me with this. Cheers!


1 Answer


This will probably do. It only matches SoundCloud URLs that use the http protocol and have no subdomain. The group captures the full URL so that you can use it, and a lazy quantifier matches up to the first quote:

(http://soundcloud.*?)"

Here is an alternative that does not use a lazy quantifier; instead it uses a negated character class to match anything but a quote:

(http://soundcloud[^"]+)

Keep in mind that both regexes will actually match both SoundCloud URLs in your sample. Depending on the library and the flags you use, you may get back only the first occurrence or both. You can either take the first one or check the results further for the correct format.
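To illustrate the difference between getting the first occurrence and getting all of them, here is a minimal sketch in Python's `re` module. Python is only used for demonstration, since the regex library behind Yahoo Pipes is not known here; the `html` string is a trimmed copy of the sample from the question, and the variable names are illustrative.

```python
import re

# Trimmed copy of the HTML from the question, joined into one string.
html = (
    '<h3><a rel="nofollow" target="_blank" '
    'href="http://soundcloud.com/tom-misch/dream-ft-notorious-big">'
    'Dream ft. Notorious BIG</a></h3> '
    '<span class="subtitle"><a rel="nofollow" target="_blank" '
    'href="http://soundcloud.com/tom-misch" class="user-name">Tom Misch</a>'
)

lazy = re.compile(r'(http://soundcloud.*?)"')       # lazy quantifier version
negated = re.compile(r'(http://soundcloud[^"]+)')   # negated class version

# search() stops at the first occurrence, which is the track URL here.
first = lazy.search(html).group(1)

# findall() returns every match, so the profile URL shows up as well.
all_urls = negated.findall(html)

print(first)
print(all_urls)
```

In this sample the track URL comes first, so taking only the first match is enough; that ordering is a property of this particular HTML, not a guarantee.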

If you really want to use just a regex and your regex library supports look-ahead, you can do this:

(http://soundcloud[^"]+)"(?!\s+class="user-name")

The negative look-ahead (?!...) makes the match fail when the closing quote of the URL is followed by class="user-name", so the link to the user profile is skipped.
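A sketch of the look-ahead idea in Python, using a variant of the pattern in which the negative look-ahead sits directly after the closing quote of the URL. The `html` string is again a trimmed copy of the question's sample, and the name `track_only` is made up for this example.

```python
import re

# Trimmed copy of the HTML from the question, joined into one string.
html = (
    '<h3><a rel="nofollow" target="_blank" '
    'href="http://soundcloud.com/tom-misch/dream-ft-notorious-big">'
    'Dream ft. Notorious BIG</a></h3> '
    '<span class="subtitle"><a rel="nofollow" target="_blank" '
    'href="http://soundcloud.com/tom-misch" class="user-name">Tom Misch</a>'
)

# The URL is captured up to its closing quote; the negative look-ahead then
# rejects the match if the tag continues with class="user-name".
track_only = re.compile(r'(http://soundcloud[^"]+)"(?!\s+class="user-name")')

matches = track_only.findall(html)
print(matches)
```

Because `[^"]+` cannot run past the closing quote, the engine has no way to satisfy the look-ahead by extending the match further into the markup, so the profile link is dropped rather than captured with trailing text.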


I couldn't find out which regex library Yahoo Pipes uses either. If you want to replace everything around the URL, you can change the regex to:

^.*?(http://soundcloud[^"]+).*$

And use $1 in the replacement string to get the URL back. Note that I mixed .*? with [^"]+ on purpose: I want to replace the whole string with the first URL and not the second one, so the leading .*? has to match only up to the start of the first URL and then stop, which is what the lazy quantifier is for.
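The replace-everything approach can be sketched in Python as follows. Two details are Python specifics rather than anything known about Yahoo Pipes: the back-reference is written `\1` instead of `$1`, and `re.DOTALL` is added so that `.` also matches line breaks (the sample string below deliberately contains one). Whether the Pipes regex module needs or offers an equivalent flag is an assumption to check.

```python
import re

# Trimmed copy of the HTML from the question, with a line break left in.
html = (
    '<h3><a rel="nofollow" target="_blank" \n'
    'href="http://soundcloud.com/tom-misch/dream-ft-notorious-big">'
    'Dream ft. Notorious BIG</a></h3> '
    '<span class="subtitle"><a rel="nofollow" target="_blank" '
    'href="http://soundcloud.com/tom-misch" class="user-name">Tom Misch</a>'
)

# Lazy prefix, captured URL, greedy suffix: the whole input is one match.
pattern = re.compile(r'^.*?(http://soundcloud[^"]+).*$', re.DOTALL)

# Replace the entire match with the captured group only.
url = pattern.sub(r'\1', html)
print(url)
```

If the lazy `.*?` were replaced with a greedy `.*`, the prefix would swallow as much as possible and the group would end up holding the last SoundCloud URL in the string instead of the first.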

Answered 2013-11-12T01:03:43.823