regex - Python Regex for parsing site

Question

I am trying to write a Python script to pull data from a site and place it into a JSON string.

The site is http://mtc.sri.com/live_data/attackers/.

I have Python pulling the source, but I can't quite figure out the regex portion.

When I use RegExr, this regex works:

But when I put it into the script, I get no match.

#!/usr/bin/python
import urllib2
import re

f = urllib2.urlopen("http://mtc.sri.com/live_data/attackers/")
out = f.read();

matchObj = re.match( r'</?table[^>]*>|</?tr[^>]*>|</?td[^>]*>|</?thead[^>]*>|</?tbody[^>]*>|</?font[^>]*>', out, re.M|re.I)

if matchObj:
   print "matchObj.group() : ", matchObj.group()
   print "matchObj.group(1) : ", matchObj.group(1)
   print "matchObj.group(2) : ", matchObj.group(2)
else:
   print "No match!!"

Any idea why I am not getting the appropriate response?

Edit:

Per a suggestion below, I used:

matchObj = re.findall( r'</?(?:table|t[dr]|thead|tbody|font)[^>]*>', out, re.M|re.I)

for i in matchObj.pop():
    print i

However, this simply outputs:

<
/
t
a
b
l
e
>

Edit 2:

I was using .pop() on the matchObj for some reason. I took that off. Now I am getting a lot more output, but I am only getting the tags, not the data inside them. I do not actually care about the tags; I would prefer just the data.

matchObj = re.findall( r'</?(?:table|t[dr]|thead|tbody|font)[^>]*>', out, re.M|re.I)

for i in matchObj:
    print i

Output:

<table class="attackers">
<tr>
</tr>
<tr>
<td>
</td>
<td>
</td>
...

score 3 · Accepted Answer

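re.match only tries to match at the beginning of the string.

Return None if the string does not match the pattern; note that this is different from a zero-length match.

Use re.search instead.

Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.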
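
To make the difference concrete, here is a minimal sketch. It is written for Python 3 (the question uses Python 2), and the `html` string is a made-up stand-in for the page source, which is not shown in the question.

```python
import re

# Hypothetical sample input; the real page source is not part of the question.
html = '<html><body><table class="attackers"><tr><td>1.2.3.4</td></tr></table></body></html>'
pattern = r'</?(?:table|t[dr]|thead|tbody|font)[^>]*>'

# re.match only tries the pattern at position 0, where the text is "<html>".
at_start = re.match(pattern, html, re.I)

# re.search scans forward and returns the first position that matches.
anywhere = re.search(pattern, html, re.I)

print(at_start)          # None
print(anywhere.group())  # <table class="attackers">
```

Since a page source normally starts with a doctype or an `<html>` tag rather than one of the listed tags, re.match returning None here is the expected behaviour.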

I think you can also shorten your regex a bit:

</?(?:table|t[dr]|thead|tbody|font)[^>]*>

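You should also expect only one group, because your regex contains no capturing groups: group() returns the whole match, which is the first tag the pattern finds, while group(1) and group(2) would raise an IndexError.

If you want to get all of the matches, use re.findall; the result will be a list of the matched strings.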
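
As a rough illustration (Python 3, with a made-up `html` sample rather than the real page), the sketch below shows what re.findall returns for this pattern, and why iterating over a single element prints one character per line. The second pattern, with a capturing group around the cell contents, is not part of the original answer; it is one possible way to get the data instead of the tags, and it assumes simple, non-nested `<td>` cells.

```python
import re

# Hypothetical sample input standing in for the downloaded page.
html = '<table class="attackers"><tr><td>1.2.3.4</td><td>42</td></tr></table>'

# Without capturing groups, findall returns each whole match as a string.
tags = re.findall(r'</?(?:table|t[dr]|thead|tbody|font)[^>]*>', html, re.I)

# A single element is a string, so looping over tags.pop() walks its characters.
last_tag_chars = list(tags[-1])

# Assumption: with one capturing group, findall returns only the group's text,
# which here is whatever sits between <td ...> and </td>.
cells = re.findall(r'<td[^>]*>(.*?)</td>', html, re.I | re.S)

print(tags)
print(last_tag_chars)
print(cells)
```

Regular expressions are fragile on real-world HTML, so for anything beyond a simple, regular table an HTML parser is usually the safer choice.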
