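I really couldn't think of a decent title for what I'm trying to do, but my example should explain it well. My company publishes our work schedule online, but they don't offer an API or any other way to extract it, so I'm using the Python framework Scrapy to scrape the data and then add it to my Google Calendar.
A girl gave me a regex line to process the data, and it worked great for several days, but I later realized it can't handle split shifts (most likely because I hadn't been scheduled for any at the time, so she never saw that one was possible).
My regex is: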
re.findall("""dow1'>(\w+)<\S+?>(\w+ \d+)</td>\s*<td class.*?tlHours'>(\d+).*?span>\s*(\d+)<span.*?ment'>(.*?)</spa.*?Meal: (.*?)</sp.*?start'>(\S+?)</spa.*?end'>(\S+?)<""", response.body)
Sample data:
Here is a normal 8-hour day with one meal, which is handled fine:
<tr>
<td class='dt'>
<span class='dow1'>Sunday</span>Dec 09
</td>
<td class='ScheduledDetails'valign='top'>
<div style="position:relative;">
<span class='tlHours'>8<span class='spart'> hrs</span> 0<span class='spart'> mins</span></span><span class='department'>Cashier</span><span class='meal'>Meal: 2pm - 3pm</span>
</div>
</td>
<td>
</td>
<td class='Schedunderlay'>
<div class='Sched'>
<div class='schedbar' style='left: 143px; width: 234px;'>
<div class='schedbar_l'></div>
<div class='schedbar_m' style='width: 226px;'>
<span class='start'>10am</span><span class='end'>7pm</span>
</div>
<div class='schedbar_r'></div>
</div>
<div class='availbar' style='left: 9px; width: 498px; display: none;'>
<div class='schedbar_l'></div>
<div class='schedbar_m' style='width: 490px;'>
<span class='start'><img src='/Images/Schedule/arrowLeft.gif' alt='' style='margin-left:5px; margin-top:2px;' /></span>
<div class='OTtext' align='center'>All Day</div>
<span class='end'></span>
</div>
<div class='schedbar_r'></div>
</div>
<div class='availbar' style='left: 508px; width: 216px; display: none;'>
<div class='schedbar_l_on'></div>
<div class='schedbar_m_on' style='width: 208px;'><span class='start'></span>
<div class='OTtext' align='center'>All Day</div>
<span class='end'><img src='/Images/Schedule/arrowRight.gif' alt='' style='margin-left:5px; margin-top:2px;' /></span>
</div>
<div class='schedbar_r_on'></div>
</div>
</div>
</td>
<td> </td>
<td class='rightColDetails'>
<div class='AvailDetails' align='left' style='display: table-cell;'>
<span class='iefix'><b>Avail - All Day</b></span><br/>
<span style='font-size: 11px;'>Pref - All Day</span>
</div>
</td>
</tr>
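Here is a split shift: two four-hour shifts separated by an empty one-hour gap (they do this to game the scoring system, since it counts as two covered shifts instead of one):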
<tr>
<td class='dt'>
<span class='dow1'>Thursday</span>Dec 13
</td>
<td class='ScheduledDetails' valign='top'>
<div style="position:relative;">
<span class='tlHours'>8<span class='spart'> hrs</span> 0<span class='spart'> mins</span></span><span class='department'>Cashier</span><span class='meal'>Meal: None</span>
</div>
</td>
<td> </td>
<td class='Schedunderlay'>
<div class='Sched'>
<div class='schedbar' style='left: 247px; width: 104px;'>
<div class='schedbar_l'></div>
<div class='schedbar_m' style='width: 96px;'>
<span class='start'>2pm</span><span class='end'>6pm</span>
</div><div class='schedbar_r'></div>
</div>
<div class='schedbar' style='left: 377px; width: 104px;'>
<div class='schedbar_l'></div>
<div class='schedbar_m' style='width: 96px;'>
<span class='start'>7pm</span> <span class='end'>11pm</span>
</div>
<div class='schedbar_r'></div>
</div>
<div class='availbar' style='left: 9px; width: 498px; display: none;'>
<div class='schedbar_l'></div><div class='schedbar_m' style='width: 490px;'>
<span class='start'><img src='/Images/Schedule/arrowLeft.gif' alt='' style='margin-left:5px; margin-top:2px;' /></span>
<div class='OTtext' align='center'>All Day</div>
<span class='end'></span>
</div>
<div class='schedbar_r'></div>
</div>
<div class='availbar' style='left: 508px; width: 216px; display: none;'>
<div class='schedbar_l_on'></div>
<div class='schedbar_m_on' style='width: 208px;'>
<span class='start'></span>
<div class='OTtext' align='center'>All Day</div>
<span class='end'><img src='/Images/Schedule/arrowRight.gif' alt='' style='margin-left:5px; margin-top:2px;' /></span>
</div>
<div class='schedbar_r_on'></div>
</div>
</div>
</td>
<td> </td>
<td class='rightColDetails'>
<div class='AvailDetails' align='left' style='display: table-cell;'>
<span class='iefix'><b>Avail - All Day</b></span><br/><span style='font-size: 11px;'>Pref - All Day</span>
</div>
</td>
</tr>
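The important difference is that a regular shift has one start time and one end time, while a split shift has a start, an end, then another start and another end.
I've been struggling with this for about five hours now and haven't made any progress. I suspect I'd have more luck if I actually understood regex. Any help would be greatly appreciated.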
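To make the goal concrete, here is a rough sketch of the kind of output I'm after: instead of matching a whole row with one pattern, split the page into rows and collect every start/end pair inside each row. Everything named here (`parse_schedule`, `ROW_RE`, `DAY_RE`, `SHIFT_RE`, and the trimmed-down `SAMPLE`) is made up for illustration. It assumes each day sits in its own `<tr>` and that real shift times look like `10am` or `10:30am`, which is what lets it skip the hidden `availbar` spans that hold an image or nothing at all.

```python
import re

# Assumption: every day is wrapped in its own <tr>...</tr>.
ROW_RE = re.compile(r"<tr>(.*?)</tr>", re.S)

# Day of week and date, e.g. "Sunday" and "Dec 09".
DAY_RE = re.compile(r"class='dow1'>(\w+)</span>\s*(\w+ \d+)")

# A start span immediately followed by an end span, both holding a time.
# Assumption: times look like "10am" or "10:30am".
SHIFT_RE = re.compile(
    r"<span class='start'>\s*(\d{1,2}(?::\d{2})?[ap]m)\s*</span>\s*"
    r"<span class='end'>\s*(\d{1,2}(?::\d{2})?[ap]m)\s*</span>"
)

# Trimmed-down stand-in for the real page (hypothetical, for the demo only).
SAMPLE = """
<tr>
<td class='dt'><span class='dow1'>Sunday</span>Dec 09</td>
<td><span class='start'>10am</span><span class='end'>7pm</span>
<span class='start'><img src='x.gif' /></span><span class='end'></span></td>
</tr>
<tr>
<td class='dt'><span class='dow1'>Thursday</span>Dec 13</td>
<td><span class='start'>2pm</span><span class='end'>6pm</span>
<span class='start'>7pm</span> <span class='end'>11pm</span></td>
</tr>
"""


def parse_schedule(html):
    """Return one dict per day, each with a list of (start, end) shifts."""
    days = []
    for row in ROW_RE.findall(html):
        day = DAY_RE.search(row)
        if day is None:
            continue  # not a schedule row
        days.append({
            "day": day.group(1),
            "date": day.group(2),
            # findall returns one tuple per shift, so a split shift
            # simply yields two tuples instead of one.
            "shifts": SHIFT_RE.findall(row),
        })
    return days


if __name__ == "__main__":
    for entry in parse_schedule(SAMPLE):
        print(entry["day"], entry["date"], entry["shifts"])
```

With a list of shifts per day, each `(start, end)` pair could become its own calendar event. In the Scrapy callback this would presumably be fed the decoded page text rather than the pattern above being run against `response.body` directly.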