I have the following code:
*** REST OF CODE OMITTED ***
try:
    fullURL = blitzurl + movie
    # Hit the base URL first with the opener, then install it globally
    opener.open(blitzurl)
    urllib2.install_opener(opener)
    request = urllib2.Request(fullURL)
    requestData = urllib2.urlopen(request)
    htmlText = BeautifulSoup(requestData.read())
    #panel = htmlText.find(match_class(["panelbox"]))
    #table = htmlText.find("table", {"id" : "scheduletbl"})
    print htmlText
    # blah....
except Exception, e:
    print str(e)
    print "ERROR: ERROR OCCURRED IN MAIN"
I am trying to get the content of a table with id "scheduletbl" (which is inside a div with a class named "panelbox").
The HTML looks like this:
*** REST OF CODE OMITTED ***
<div class="panelbox">
<!-- !!!! content here !!!!! -->
<table border="0" cellpadding="2" cellspacing="0" id="scheduletbl" width="95%">
<tr>
<td align="left" colspan="3">
VC = Special Cinema (Velvet Class)<br/>
VS = Special Cinema (Velvet Suite)<br>
DC = Special Cinema (Dining Cinema)<br/>
S = Special Cinema (Satin)<br/>
3D = in RealD 3D<br/>
4DX = 4DX Cinema
</br></td>
</tr>
<tr>
<td class="separator2" colspan="3"><strong>BLITZMEGAPLEX - PARIS VAN JAVA, BANDUNG</strong></td>
</tr>
<tr>
<td colspan="3"><img align="left" height="16" hspace="5" src="../img/ico_rss_schedule_white.gif" width="16"/><strong><a class="navlink" href="../rss/schedule.php">RSS- Paris van Java</a></strong></td>
</tr>
<tr>
<td class="separator">&nbsp;</td>
<td class="separator" colspan="2">TUESDAY, 24 SEPTEMBER 2013</td>
</tr>
<tr>
<td class="separator">&nbsp;</td>
<td class="separator" rel="2D" width="20%">
10:30&nbsp;&nbsp;&nbsp;
</td>
<td class="separator" width="30%">
<a class="navlink" href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-09-24&cinema=0100&movie=MOV1954&showtime=10:30&suite=N&movieformat=2D" target="_blank">Buy Tickets</a></td>
</tr></table></div></div>
<tr>
*** and more <tr> tags ***
*** REST OF CODE OMITTED ***
The problem I am having is that when I try to extract the content based on the div, it gets cut off in the middle (I am guessing because of an improper closing tag).
The same thing happens when I try to extract the content based on the table (using its id). It also gets cut off in the middle, because there is a closing </table> tag where it is not supposed to be.
What is the best way to solve this? I have no control over the data, since it is scraped from a website.
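To illustrate what I think is going on, here is a minimal sketch that uses only the standard library's HTMLParser rather than BeautifulSoup. The RowCounter class and the sample string are made up for illustration (the 12:45 and 15:00 rows are not from the real page). It only counts how many <tr> tags appear before and after the premature </table>, so it shows the symptom, not a fix.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class RowCounter(HTMLParser):
    """Counts <tr> tags seen inside and outside an open <table>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.table_depth = 0
        self.rows_inside = 0
        self.rows_outside = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
        elif tag == "tr":
            if self.table_depth > 0:
                self.rows_inside += 1
            else:
                self.rows_outside += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1


# Hypothetical, shortened version of the page: the table is closed
# too early, and more rows follow after the closing tags.
sample = (
    '<div class="panelbox">'
    '<table id="scheduletbl">'
    '<tr><td>10:30</td></tr></table></div>'
    '<tr><td>12:45</td></tr>'
    '<tr><td>15:00</td></tr>'
)

counter = RowCounter()
counter.feed(sample)
counter.close()
print("rows inside table: %d" % counter.rows_inside)
print("rows outside table: %d" % counter.rows_outside)
```

In this sample only the first row is counted as being inside the table. The other two come after the early </table>, which matches the truncated result I get when I look the table up by its id.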