
I am trying to write a Python script to pull data from a site and place it into a JSON string.

The site is http://mtc.sri.com/live_data/attackers/.

I have Python pulling the source, but I can't quite figure out the regex portion.

When I use RegExr, this regex works:

</?table[^>]*>|</?tr[^>]*>|</?td[^>]*>|</?thead[^>]*>|</?tbody[^>]*>|</?font[^>]*>

But when I put it into the script, I get no match.

#!/usr/bin/python
import urllib2
import re

f = urllib2.urlopen("http://mtc.sri.com/live_data/attackers/")
out = f.read();

matchObj = re.match( r'</?table[^>]*>|</?tr[^>]*>|</?td[^>]*>|</?thead[^>]*>|</?tbody[^>]*>|</?font[^>]*>', out, re.M|re.I)

if matchObj:
   print "matchObj.group() : ", matchObj.group()
   print "matchObj.group(1) : ", matchObj.group(1)
   print "matchObj.group(2) : ", matchObj.group(2)
else:
   print "No match!!"

Any idea why I am not getting the appropriate response?

Edit:

Per a suggestion below, I used:

matchObj = re.findall( r'</?(?:table|t[dr]|thead|tbody|font)[^>]*>', out, re.M|re.I)

for i in matchObj.pop():
    print i

However, this simply outputs:

<
/
t
a
b
l
e
>

Edit 2:

I was using .pop() on the matchObj for some reason. I took that off. Now I am getting a lot more output, but I am only getting the tags, not the data inside them. In fact, I do not care about the tags; I would prefer just the data.

matchObj = re.findall( r'</?(?:table|t[dr]|thead|tbody|font)[^>]*>', out, re.M|re.I)

for i in matchObj:
    print i

Output:

<table class="attackers">
<tr>
</tr>
<tr>
<td>
</td>
<td>
</td>
...

1 Answer


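re.match only checks for a match at the beginning of the string (re.M does not change this), so it fails unless the page source starts with one of those tags. From the docs: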

Return None if the string does not match the pattern; note that this is different from a zero-length match.

Use re.search instead:

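Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.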
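A minimal sketch of the difference. The `html` string is a made-up stand-in for the page source, not the actual content of the site:

```python
import re

# Hypothetical stand-in for the downloaded page source.
html = '<html><body><table class="attackers"><tr><td>1.2.3.4</td></tr></table></body></html>'

pattern = r'</?(?:table|t[dr]|thead|tbody|font)[^>]*>'

# re.match only tries the pattern at position 0, where the text is "<html>".
at_start = re.match(pattern, html, re.I)

# re.search scans forward until the pattern matches somewhere in the string.
anywhere = re.search(pattern, html, re.I)

print(at_start)
print(anywhere.group())
```

Here `at_start` is None because the string begins with a tag the pattern does not cover, while `anywhere` finds the opening table tag further in.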

I think you can also shorten your regex a bit:

</?(?:table|t[dr]|thead|tbody|font)[^>]*>

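You should also expect only a single group, group(0): your regex has no capturing groups, so group(1) and group(2) will raise an IndexError, and the one match you get is the first place in the string where the pattern matches.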
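To illustrate the point about groups, here is a small sketch on a made-up input string (written with Python 3 print calls; the sample text is an assumption):

```python
import re

# (?:...) groups the alternatives without capturing anything.
m = re.search(r'</?(?:table|t[dr])[^>]*>', '<p><td align="left">x</td>')

print(m.group())   # group 0 is always the entire match
print(m.groups())  # empty tuple, since there are no capturing groups

try:
    m.group(1)
except IndexError as err:
    # Asking for a group that does not exist raises IndexError.
    print("no group 1:", err)
```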

If you want to get all of them, use re.findall; the result will be a list of the matches.
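Because findall returns a list of strings, iterating over the list gives you whole tags, whereas calling .pop() hands back a single string and iterating over that yields one character at a time. If you want the text between the tags rather than the tags themselves, one option is to add a capturing group: when the pattern has exactly one group, findall returns the group contents instead of the full match. A rough sketch, assuming the cells contain no nested tags (the sample `html` is made up):

```python
import json
import re

# Hypothetical sample; the real page layout may differ.
html = '''<table class="attackers">
<tr><td>10.0.0.1</td><td>42</td></tr>
<tr><td>10.0.0.2</td><td>7</td></tr>
</table>'''

# One capturing group, so findall returns only the captured cell text.
# re.S lets "." span newlines in case a cell is split across lines.
cells = re.findall(r'<td[^>]*>(.*?)</td>', html, re.I | re.S)

print(cells)
print(json.dumps(cells))
```

The non-greedy `.*?` stops at the first closing `</td>`, so each cell is captured separately rather than everything between the first and last cell.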

Answered on 2013-09-27T18:37:02.353