I'm only interested in using BeautifulSoup to extract all of the 3-hr PSI readings from 12AM to 11.59PM, for example the latest value, 82 at 5PM, which the page shows in bold.

The page is http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours and a sample of its markup follows. Can anyone show me how to do this? Thanks in advance!

    <!-- start content -->
    <h1 class="title" id="top">
        PSI Readings over the last 24 Hours</h1>
    <script type="text/javascript">
        var baseUrl = '/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours';

        function changetime(ddl) {
            var strTime = ddl.options[ddl.selectedIndex].value;

            if (strTime != null) {
                var npage = baseUrl + "/time/" + strTime + "#psi24";
                window.location = npage;
            }
        }
    </script>
    <h1 id="psi24">
        24-hr PSI Readings on 24 Jun 2013
    </h1>
    <p>
        View reading for:
        <select class="default" id="ContentPlaceHolderContent_C001_DDLTime" name="ctl00$ContentPlaceHolderContent$C001$DDLTime" onchange="changetime(this);">
    <option value="0000">12AM</option>
    <option value="0100">1AM</option>
    <option value="0200">2AM</option>
    <option value="0300">3AM</option>
    <option value="0400">4AM</option>
    <option value="0500">5AM</option>
    <option value="0600">6AM</option>
    <option value="0700">7AM</option>
    <option value="0800">8AM</option>
    <option value="0900">9AM</option>
    <option value="1000">10AM</option>
    <option value="1100">11AM</option>
    <option value="1200">12PM</option>
    <option value="1300">1PM</option>
    <option value="1400">2PM</option>
    <option value="1500">3PM</option>
    <option value="1600">4PM</option>
    <option selected="selected" value="1700">5PM</option>
    </select>
    </p>
    <table border="0" cellpadding="4" cellspacing="1" class="text_psinormal" width="100%">
    <thead>
    <tr>
    <th width="33%">
    <center><strong>Region</strong></center>
    </th>
    <th width="33%">
    <center><strong>PSI</strong></center>
    </th>
    <th width="34%">
    <center><strong>24-hr PM2.5 Concentration (µg/m<sup>3</sup>)</strong></center>
    </th>
    </tr>
    </thead>
    <tr>
    <td align="center">North
            </td>
    <td align="center">
                61
            </td>
    <td align="center">
                47
            </td>
    </tr>
    <tr>
    <td align="center">South
            </td>
    <td align="center">
                62
            </td>
    <td align="center">
                46
            </td>
    </tr>
    <tr>
    <td align="center">East
            </td>
    <td align="center">
                55
            </td>
    <td align="center">
                39
            </td>
    </tr>
    <tr>
    <td align="center">West
            </td>
    <td align="center">
                87
            </td>
    <td align="center">
                83
            </td>
    </tr>
    <tr>
    <td align="center">Central
            </td>
    <td align="center">
                58
            </td>
    <td align="center">
                40
            </td>
    </tr>
    <tr>
    <td align="center">Overall Singapore
            </td>
    <td align="center">
                55-87
            </td>
    <td align="center">
                39-83
            </td>
    </tr>
    </table>
    <div>
    </div>
    <div>
    <h1>3-hr PSI Readings from 12AM to 11.59PM on
                            24 Jun 2013</h1>
    <table border="0" cellpadding="4" cellspacing="1" width="100%">
    <tr>
    <td align="center" width="16%">
    <strong>Time</strong>
    </td>
    <td align="center" width="7%"><strong>12AM</strong>
    </td>
    <td align="center" width="7%"><strong>1AM</strong>
    </td>
    <td align="center" width="7%"><strong>2AM</strong>
    </td>
    <td align="center" width="7%"><strong>3AM</strong>
    </td>
    <td align="center" width="7%"><strong>4AM</strong>
    </td>
    <td align="center" width="7%"><strong>5AM</strong>
    </td>
    <td align="center" width="7%"><strong>6AM</strong>
    </td>
    <td align="center" width="7%"><strong>7AM</strong>
    </td>
    <td align="center" width="7%"><strong>8AM</strong>
    </td>
    <td align="center" width="7%"><strong>9AM</strong>
    </td>
    <td align="center" width="7%"><strong>10AM</strong>
    </td>
    <td align="center" width="7%"><strong>11AM</strong>
    </td>
    </tr>
    <tr>
    <td align="center">
    <strong>3-hr PSI</strong>
    </td>
    <td align="center">
                        76
                    </td>
    <td align="center">
                        70
                    </td>
    <td align="center">
                        64
                    </td>
    <td align="center">
                        59
                    </td>
    <td align="center">
                        54
                    </td>
    <td align="center">
                        51
                    </td>
    <td align="center">
                        48
                    </td>
    <td align="center">
                        47
                    </td>
    <td align="center">
                        47
                    </td>
    <td align="center">
                        47
                    </td>
    <td align="center">
                        49
                    </td>
    <td align="center">
                        52
                    </td>
    </tr>
    <tr>
    <td align="center" width="16%">
    <strong>Time</strong>
    </td>
    <td align="center" width="7%"><strong>12PM</strong>
    </td>
    <td align="center" width="7%"><strong>1PM</strong>
    </td>
    <td align="center" width="7%"><strong>2PM</strong>
    </td>
    <td align="center" width="7%"><strong>3PM</strong>
    </td>
    <td align="center" width="7%"><strong>4PM</strong>
    </td>
    <td align="center" width="7%"><strong>5PM</strong>
    </td>
    <td align="center" width="7%"><strong>6PM</strong>
    </td>
    <td align="center" width="7%"><strong>7PM</strong>
    </td>
    <td align="center" width="7%"><strong>8PM</strong>
    </td>
    <td align="center" width="7%"><strong>9PM</strong>
    </td>
    <td align="center" width="7%"><strong>10PM</strong>
    </td>
    <td align="center" width="7%"><strong>11PM</strong>
    </td>
    </tr>
    <tr>
    <td align="center">
    <strong>3-hr PSI</strong>
    </td>
    <td align="center">
                        54
                    </td>
    <td align="center">
                        59
                    </td>
    <td align="center">
                        65
                    </td>
    <td align="center">
                        72
                    </td>
    <td align="center">
                        79
                    </td>
    <td align="center">
    <strong style="font-size:14px;">82</strong>
    </td>
    <td align="center">
                        -
                    </td>
    <td align="center">
                        -
                    </td>
    <td align="center">
                        -
                    </td>
    <td align="center">
                        -
                    </td>
    <td align="center">
                        -
                    </td>
    <td align="center">
                        -
                    </td>
    </tr>
    </table>
    </div>
    <div class="sfContentBlock">
    <p class="table-caption">Hourly updates of 3-hr PSI readings are provided from 12am to 11:59pm. The 3hr PSI readings are calculated based on PM10 concentrations only</p>
    </div>
    <div>
    </div>
    <div class="backToTop">
    <a href="#top">Back to Top</a>
    </div>
    </div>
    </div>
    <!-- end content -->

1 Answer


Although you should have shown that you tried this yourself first, here is the code:

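    from pprint import pprint
    from urllib.request import urlopen  # Python 3; on Python 2 use urllib2.urlopen

    from bs4 import BeautifulSoup as soup


    url = "http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours"
    # Name the parser explicitly so bs4 does not have to pick one itself.
    web_soup = soup(urlopen(url), "html.parser")

    # Positional lookup: first table inside the third descendant <div> of div.c1,
    # which matched the page layout at the time of writing.
    table = web_soup.find(name="div", attrs={'class': 'c1'}).find_all(name="div")[2].find_all('table')[0]

    # Collect the stripped text of every cell, row by row.
    table_rows = []
    for row in table.find_all('tr'):
        table_rows.append([td.text.strip() for td in row.find_all('td')])

    # Rows alternate between hour labels (even index) and readings (odd index),
    # so pair each label with the cell directly below it.
    data = {}
    for tr_index, tr in enumerate(table_rows):
        if tr_index % 2 == 0:
            for td_index, td in enumerate(tr):
                data[td] = table_rows[tr_index + 1][td_index]

    pprint(data)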

This prints:

    {'10AM': '49',
     '10PM': '-',
     '11AM': '52',
     '11PM': '-',
     '12AM': '76',
     '12PM': '54',
     '1AM': '70',
     '1PM': '59',
     '2AM': '64',
     '2PM': '65',
     '3AM': '59',
     '3PM': '72',
     '4AM': '54',
     '4PM': '79',
     '5AM': '51',
     '5PM': '82',
     '6AM': '48',
     '6PM': '79',
     '7AM': '47',
     '7PM': '-',
     '8AM': '47',
     '8PM': '-',
     '9AM': '47',
     '9PM': '-',
     'Time': '3-hr PSI'}
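
The positional `find_all(name="div")[2]` lookup depends on the page layout staying exactly as it was, and the result also carries a stray `'Time': '3-hr PSI'` entry from the label column. A minimal sketch of a less position-dependent variant is below, assuming the heading text "3-hr PSI Readings" stays stable. It runs against a trimmed copy of the markup from the question rather than the live site, and the names `SAMPLE_HTML`, `find_psi_table`, `parse_readings` and `latest_reading` are hypothetical helpers introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the 3-hr PSI table from the question (only a few columns kept).
SAMPLE_HTML = """
<div>
<h1>3-hr PSI Readings from 12AM to 11.59PM on
        24 Jun 2013</h1>
<table border="0" cellpadding="4" cellspacing="1" width="100%">
<tr>
<td align="center" width="16%"><strong>Time</strong></td>
<td align="center" width="7%"><strong>12AM</strong></td>
<td align="center" width="7%"><strong>1AM</strong></td>
</tr>
<tr>
<td align="center"><strong>3-hr PSI</strong></td>
<td align="center">
            76
        </td>
<td align="center">
            70
        </td>
</tr>
<tr>
<td align="center" width="16%"><strong>Time</strong></td>
<td align="center" width="7%"><strong>5PM</strong></td>
<td align="center" width="7%"><strong>6PM</strong></td>
</tr>
<tr>
<td align="center"><strong>3-hr PSI</strong></td>
<td align="center">
<strong style="font-size:14px;">82</strong>
</td>
<td align="center">
            -
        </td>
</tr>
</table>
</div>
"""


def find_psi_table(soup):
    """Locate the first table that follows the '3-hr PSI Readings' heading."""
    for heading in soup.find_all("h1"):
        if "3-hr PSI Readings" in heading.get_text():
            return heading.find_next("table")
    return None


def parse_readings(table):
    """Map each hour label to an int reading, or None where the page shows '-'."""
    rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in table.find_all("tr")]
    readings = {}
    # Rows come in pairs: a row of hour labels, then a row of values.
    for labels, values in zip(rows[0::2], rows[1::2]):
        # Skip the first cell of each row ('Time' / '3-hr PSI').
        for hour, value in zip(labels[1:], values[1:]):
            readings[hour] = int(value) if value.isdigit() else None
    return readings


def latest_reading(table):
    """Return the value the page highlights with an inline-styled <strong>."""
    bold = table.find("strong", style=True)
    return int(bold.get_text(strip=True)) if bold else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = find_psi_table(soup)
readings = parse_readings(table)
print(readings)
print(latest_reading(table))
```

To run this against the live page, the `SAMPLE_HTML` string would be replaced by the downloaded page body. Hours that have not been published yet come back as `None` instead of `'-'`, which makes them easy to filter out.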
Answered 2013-06-24T10:40:56.383