I have the following HTML structure from a website I want to scrape:
<span class="blk">Society/Project: <b>Sai Sparsh</b></span>
<i class="blk">
Built-up Area: <b>1005 Sq.Ft.</b>
@ <i class="WebRupeesmall b mr_5 f14">Rs.</i>6109/sq.ft</i>
I am already scraping some of the data with the following code:
properties = soup.findAll('a', title=re.compile('Bedroom'))
for eachproperty in properties:
    print today,","+"http:/"+ eachproperty['href']+",", eachproperty.string+"," +",".join(re.findall("'([a-zA-Z0-9,\s]*)'", eachproperty['onclick']))
My output is:
2013-09-05 ,http://Residential-Apartment-Flat-in-Velachery-Chennai South-3-Bedroom-bhk-for-Sale-spid-E10766779, 3 Bedroom, Residential Apartment in Velachery,E10766779,9952946340,,Dealer,Bala
For the HTML structure shown above, I am trying to strip the markup and get the following output:
Sai Sparsh, 1005 Sq.Ft, 6109/sq.ft
and append it to the output I am already generating (shown above). I have been struggling to navigate down the tree and use a regex for it.
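To make the goal concrete, here is a minimal sketch of the extraction I have in mind, using only the standard `re` module on the snippet above. The `html` string and the `extract_fields` helper are hypothetical names for illustration, and the pattern assumes every listing uses the same labels ("Society/Project:", "Built-up Area:", "Rs.") as this one sample.

```python
import re

# Sample markup copied from the snippet above (assumed to be representative).
html = '''<span class="blk">Society/Project: <b>Sai Sparsh</b></span>
<i class="blk">
Built-up Area: <b>1005 Sq.Ft.</b>
@ <i class="WebRupeesmall b mr_5 f14">Rs.</i>6109/sq.ft</i>'''

def extract_fields(markup):
    """Return (project, area, rate) from one listing block, or None."""
    # Drop every tag, then collapse runs of whitespace into single spaces.
    text = re.sub(r'<[^>]+>', '', markup)
    text = re.sub(r'\s+', ' ', text).strip()
    # Capture the text between the fixed labels; the optional '\.' drops
    # the trailing dot of 'Sq.Ft.' from the area.
    match = re.search(
        r'Society/Project:\s*(.+?)\s*Built-up Area:\s*(.+?)\.?\s*@\s*Rs\.\s*(\S+)',
        text)
    return match.groups() if match else None

fields = extract_fields(html)
print(', '.join(fields))
```

With BeautifulSoup the tag stripping would presumably not be needed, since `.text` on the two `blk` elements already returns the text without markup.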
Update
Here is the code I tried:
cname = soup.findAll('span', {'class':'blk'})
pmoney = soup.findAll('i',{'class':'blk'})
for eachproperty in cname:
    for each in pmoney:
        tey = re.sub('(\s{2,})', ' ', eachproperty.text)[17:]
        ting = re.sub('([0-9,\s]*)', ' ', each.text)
        print tey + ting
And my output is:
Rams Jai Vignesh Built-up Area: 1050 Sq.Ft. @ Rs.5524/sq.ft
Shrudhi Homes Built-up Area: 1050 Sq.Ft. @ Rs.5524/sq.ft
Ashtalakshmi Homes Built-up Area: 1050 Sq.Ft. @ Rs.5524/sq.ft
Raj Flats Built-up Area: 1050 Sq.Ft. @ Rs.5524/sq.ft
But I want the output without 'Built-up Area:', ' @ ', and ' Rs.'. It should be just:
Rams Jai Vignesh ,1050 ,5524
Shrudhi Homes ,1050 , 5524
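The nested loops pair every name with every price block, which is consistent with the same area and price appearing next to each name. A minimal sketch of one possible alternative, assuming the two `findAll` results line up one-to-one in page order: walk them together with `zip` and capture only the wanted numbers. The sample strings stand in for `eachproperty.text` and `each.text` (the second price is made up), and `clean_row` is a hypothetical helper.

```python
import re

# Stand-ins for eachproperty.text and each.text (assumed sample values).
names = ['Society/Project:  Rams Jai Vignesh', 'Society/Project:  Shrudhi Homes']
prices = ['\nBuilt-up Area: 1050 Sq.Ft.\n@ Rs.5524/sq.ft',
          '\nBuilt-up Area: 1200 Sq.Ft.\n@ Rs.6000/sq.ft']

def clean_row(name_text, price_text):
    """Build 'name,area,rate' from one name string and one price string."""
    # Keep whatever follows the 'Society/Project:' label.
    name = re.sub(r'\s+', ' ', name_text).split(':', 1)[-1].strip()
    # Capture the two numbers instead of deleting the text around them.
    match = re.search(r'Built-up Area:\s*([\d,]+).*?Rs\.\s*([\d,]+)',
                      price_text, re.S)
    area, rate = match.groups() if match else ('', '')
    return ','.join([name, area, rate])

# zip pairs the i-th name with the i-th price block; nested loops would
# instead pair every name with every price block.
rows = [clean_row(n, p) for n, p in zip(names, prices)]
for row in rows:
    print(row)
```

Capturing the digits with groups avoids having to list every label ('Built-up Area:', '@', 'Rs.') that should be removed.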