I am trying to write a Python script to pull data from a site and place it into a JSON string.
The site is http://mtc.sri.com/live_data/attackers/.
I have Python pulling the source, but I can't quite figure out the regex portion.
When I use RegExr, this regex works:
</?table[^>]*>|</?tr[^>]*>|</?td[^>]*>|</?thead[^>]*>|</?tbody[^>]*>|</?font[^>]*>
But when I put it into the script, I get no match.
#!/usr/bin/python
import urllib2
import re
f = urllib2.urlopen("http://mtc.sri.com/live_data/attackers/")
out = f.read();
matchObj = re.match( r'</?table[^>]*>|</?tr[^>]*>|</?td[^>]*>|</?thead[^>]*>|</?tbody[^>]*>|</?font[^>]*>', out, re.M|re.I)
if matchObj:
    print "matchObj.group() : ", matchObj.group()
    print "matchObj.group(1) : ", matchObj.group(1)
    print "matchObj.group(2) : ", matchObj.group(2)
else:
    print "No match!!"
Any idea why I am not getting the appropriate response?
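For reference, here is a minimal sketch of what I suspect is going on, using a made-up `sample` string rather than the real page source (an assumption, the actual markup differs). As far as I can tell, `re.match` only tries the pattern at the very start of the string, even with `re.M`, while `re.search` scans the whole string, which is closer to what RegExr does. The `print()` form is used so the snippet runs on both Python 2 and 3.

```python
import re

# Hypothetical stand-in for the downloaded page; the real source differs.
sample = (
    '<html>\n<body>\n'
    '<table class="attackers">\n'
    '<tr><td>cell</td></tr>\n'
    '</table>\n</body>\n</html>'
)

pattern = (r'</?table[^>]*>|</?tr[^>]*>|</?td[^>]*>|'
           r'</?thead[^>]*>|</?tbody[^>]*>|</?font[^>]*>')

# re.match only looks at position 0, and the text starts with <html>,
# which none of the alternatives match.
anchored = re.match(pattern, sample, re.M | re.I)

# re.search scans forward and stops at the first tag it recognises.
scanned = re.search(pattern, sample, re.M | re.I)

print(anchored)
print(scanned.group())
```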
Edit:
Per a suggestion below, I used:
matchObj = re.findall( r'</?(?:table|t[dr]|thead|tbody|font)[^>]*>', out, re.M|re.I)
for i in matchObj.pop():
    print i
However, this simply outputs:
<
/
t
a
b
l
e
>
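The one-character-per-line output seems to come from `.pop()`: `re.findall` returns a list of strings, `.pop()` removes and returns the last of them, and looping over a single string yields its characters. Here is a small sketch, with a hypothetical `tags` list standing in for the real `findall` result:

```python
# Hypothetical stand-in for the list that re.findall returned.
tags = ['<table class="attackers">', '<tr>', '</tr>', '</table>']

# pop() removes and returns the last element, which is one string.
last = tags.pop()

# Iterating over a string walks it one character at a time.
chars = [c for c in last]
print(chars)

# Iterating over the list itself yields whole tags instead.
for tag in tags:
    print(tag)
```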
Edit 2:
I was using .pop() on the matchObj for some reason, so I took that off. Now I am getting a lot more output, but it is only the tags, not the data inside them. In fact, I do not care about the tags at all; I just want the data.
matchObj = re.findall( r'</?(?:table|t[dr]|thead|tbody|font)[^>]*>', out, re.M|re.I)
for i in matchObj:
    print i
Output:
<table class="attackers">
<tr>
</tr>
<tr>
<td>
</td>
<td>
</td>
...
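To illustrate what I am after, here is a rough sketch of one possible approach: put a capture group between the opening and closing tags so that `re.findall` returns only the captured text. It runs against a made-up `sample` table (the addresses are documentation placeholders, and the real page may nest other tags inside cells), and the list-of-lists layout passed to `json.dumps` is just my own choice.

```python
import json
import re

# Hypothetical markup; the real page's structure is an assumption here.
sample = '''<table class="attackers">
<tr><td>192.0.2.1</td><td>first</td></tr>
<tr><td>198.51.100.7</td><td>second</td></tr>
</table>'''

rows = []
# Capture everything between <tr ...> and </tr>; re.S lets "." span newlines.
for row_html in re.findall(r'<tr[^>]*>(.*?)</tr>', sample, re.I | re.S):
    # Within each row, capture only the text between <td ...> and </td>.
    cells = re.findall(r'<td[^>]*>(.*?)</td>', row_html, re.I | re.S)
    rows.append([cell.strip() for cell in cells])

# Serialise the extracted cell text as a JSON string.
result = json.dumps(rows)
print(result)
```

This only handles flat cells; anything nested inside a `<td>` would come through as raw markup.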