I am trying to parse the information in the table at the bottom right of the following link, which shows "Current schedule submissions":

dnedesign.us.to/tables/

So far I have been able to parse it down to this:

{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"14:30";s:7:"endTime";s:5:"16:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"14:30";s:7:"endTime";s:5:"15:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"16:30";s:7:"endTime";s:5:"18:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:0:"";s:7:"endTime";s:0:"";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:0:"";s:7:"endTime";s:0:"";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"12:30";s:7:"endTime";s:5:"16:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"12:30";s:7:"endTime";s:5:"16:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"12:30";s:7:"endTime";s:5:"14:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:7:"Tuesday";s:9:"startTime";s:5:"14:30";s:7:"endTime";s:5:"16:30";}
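These rows appear to be PHP serialized strings (the cfdb7 plugin stores Contact Form 7 submissions in that format). As a minimal sketch, assuming the `s:<length>:"<value>"` pattern holds throughout, one such row can be unpacked into key/value pairs with a regular expression:

```python
import re

# One of the serialized rows shown above (PHP 's:<len>:"<value>"' format).
row = ('{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";'
       's:9:"startTime";s:5:"14:30";s:7:"endTime";s:5:"16:30";}')

# Capture every quoted string, then pair them up as key/value.
parts = re.findall(r's:\d+:"(.*?)"', row)
record = dict(zip(parts[::2], parts[1::2]))

print(record["Day"], record["startTime"], record["endTime"])  # Sunday 14:30 16:30
```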

Here is the code that performs the parsing to get the output above:

try:
    from urllib.request import urlopen
except ImportError:
    from urllib2 import urlopen  # Python 2 fallback
from bs4 import BeautifulSoup
url = 'http://dnedesign.us.to/tables/'
page = urlopen(url)
soup = BeautifulSoup(page, "html.parser")
for rows in soup.find_all('tr'):
    for td in rows.find_all('td'):      
        if 'a:' in td.text:
            print(td.text[4:])

I am trying to parse it into the following:

Day:Tuesday    Starttime:14:30    Endtime:16:30
Day:Sunday     Starttime:12:30    Endtime:14:30
Day:Sunday     Starttime:12:30    Endtime:16:30
Day:Sunday     Starttime:12:30    Endtime:16:30
....
....

And so on for the rest of the table.

I am using Python 3.6.9, Httpie 0.9.8, and Linux Mint Cinnamon 19.1. This is my graduation project, so any help would be appreciated. Thanks, Neil M.

1 Answer

You can use a regular expression to parse the well-formatted table data, taking care to watch for empty strings:

try:
    from urllib.request import urlopen
except ImportError:
    from urllib2 import urlopen

import re
from bs4 import BeautifulSoup

url = 'http://dnedesign.us.to/tables/'
soup = BeautifulSoup(urlopen(url), "html.parser")
data = []

for rows in soup.find_all('tr'):
    for td in rows.find_all('td'):      
        if 'a:' in td.text:
            cols = re.findall(r"s:\d+:\"(.*?)\"", td.text)
            data.append({cols[x]: cols[x+1] for x in range(0, len(cols), 2)})

for row in data[::-1]:
    row = {
        k: re.sub(
            r"[a-zA-Z]+", lambda x: x.group().capitalize(), "%s:%s" % (k, v)
        ) for k, v in row.items()
    }
    print("    ".join([row["Day"], row["startTime"], row["endTime"]]))

Output:

Day:Tuesday    Starttime:14:30    Endtime:16:30
Day:Sunday    Starttime:12:30    Endtime:14:30
Day:Sunday    Starttime:12:30    Endtime:16:30
Day:Sunday    Starttime:12:30    Endtime:16:30
Day:Sunday    Starttime:    Endtime:
Day:Sunday    Starttime:    Endtime:
Day:Sunday    Starttime:16:30    Endtime:18:30
Day:Sunday    Starttime:14:30    Endtime:15:30
Day:Sunday    Starttime:14:30    Endtime:16:30

The second stage creates strings matching your format specification, but the intermediate step of building the `data` list, which stores the key-value pairs of each row's column data, is the heart of the work.
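The capitalization step in that second stage can be looked at in isolation: `re.sub` with a callable replacement capitalizes every alphabetic run, which is how `startTime` becomes `Starttime` in the output. A small standalone sketch:

```python
import re

def label(key, value):
    # Capitalize each alphabetic run, e.g. "startTime" -> "Starttime";
    # digits and punctuation such as ":" are left untouched.
    return re.sub(r"[a-zA-Z]+", lambda m: m.group().capitalize(),
                  "%s:%s" % (key, value))

print(label("startTime", "14:30"))  # Starttime:14:30
print(label("Day", "Sunday"))       # Day:Sunday
```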


Per your request to put the items into a class, you can create a `Schedule` instance and populate the relevant fields instead of using dictionaries:

try:
    from urllib.request import urlopen
except ImportError:
    from urllib2 import urlopen

import re
from bs4 import BeautifulSoup


class Schedule:
    def __init__(self, day, start, end):
        self.day = day
        self.start = start
        self.end = end


url = 'http://dnedesign.us.to/tables/'
soup = BeautifulSoup(urlopen(url), "html.parser")
schedules = []

for rows in soup.find_all('tr'):
    for td in rows.find_all('td'):      
        if 'a:' in td.text:
            cols = re.findall(r"s:\d+:\"(.*?)\"", td.text)
            data = {cols[x]: cols[x+1] for x in range(0, len(cols), 2)}
            schedules.append(Schedule(data["Day"], data["startTime"], data["endTime"]))

for schedule in schedules:
    print(schedule.day, schedule.start, schedule.end)
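If you also want the objects to render themselves in the earlier `Day:...    Starttime:...    Endtime:...` format, one option (a sketch extending the class above, not part of the original answer) is to give `Schedule` a `__str__` method:

```python
import re

class Schedule:
    def __init__(self, day, start, end):
        self.day = day
        self.start = start
        self.end = end

    def __str__(self):
        # Render as "Day:...    Starttime:...    Endtime:..." by
        # capitalizing each alphabetic run in every "key:value" pair.
        def fmt(key, value):
            return re.sub(r"[a-zA-Z]+", lambda m: m.group().capitalize(),
                          "%s:%s" % (key, value))
        return "    ".join([fmt("Day", self.day),
                            fmt("startTime", self.start),
                            fmt("endTime", self.end)])

print(Schedule("Sunday", "12:30", "14:30"))
# Day:Sunday    Starttime:12:30    Endtime:14:30
```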
Answered 2019-03-12T02:03:26.220