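I'm trying to scrape a public GitHub repository (https://github.com/stlrda/redb_python/tree/master/python/DAGs) to get the name and datetime of each file. The code I've posted below works, but not every time. Sometimes I get an "Index out of range" error when it runs this line:
DAGs[counter]['age'] = x.find('.no-wrap')[0].attrs['datetime']
I'm confused about why this code sometimes works and other times can't find the datetime. Any ideas on how to fix this so that it finds the datetime on every run?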
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://github.com/stlrda/redb_python/tree/master/python/DAGs')
div = r.html.find('tbody', first=True)
title = div.find('.content')
DAGs = []
# Grab the names of each DAG in the repo
for x in range(len(title)):
    if x == 0:
        continue
    else:
        info = {"name": title[x].text}
        DAGs.append(info)
# Update the dictionary with the age of the DAG
gitTable = div.find('.js-navigation-item')
counter = 0
for x in gitTable:
    DAGs[counter]['age'] = x.find('.no-wrap')[0].attrs['datetime']
    # print(x.find('.no-wrap')[0].attrs['datetime'])
    counter += 1
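As a stopgap, the lookup can be guarded so that a row without a timestamp does not crash the loop. This is only a minimal sketch: `collect_ages` is a hypothetical helper name, and it assumes each row behaves like a requests_html element, where `find(selector, first=True)` returns `None` when nothing matches. It does not explain why the timestamp is sometimes missing; it only records `None` for those rows.

```python
def collect_ages(rows):
    """Return one datetime string per row, or None where no timestamp exists.

    `rows` is assumed to be the list returned by
    div.find('.js-navigation-item') in the code above.
    """
    ages = []
    for row in rows:
        # first=True yields a single element or None instead of a list,
        # so there is no [0] index that can go out of range.
        stamp = row.find('.no-wrap', first=True)
        if stamp is None:
            ages.append(None)
        else:
            # .get() also tolerates an element without a datetime attribute.
            ages.append(stamp.attrs.get('datetime'))
    return ages
```

With this, `collect_ages(gitTable)` gives a list the same length as `gitTable`, and any `None` entries mark the rows whose timestamp was not in the response.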
When the code fails, this is what the gitTable variable contains:
[<Element 'tr' class=('js-navigation-item',)>,
<Element 'tr' class=('js-navigation-item',)>,
<Element 'tr' class=('js-navigation-item',)>,
<Element 'tr' class=('js-navigation-item',)>]
And the HTML of one of the items in the gitTable list is:
>>> gitTable[0].html
'<tr class="js-navigation-item">\n<td class="icon">\n<svg aria-label="file" class="octicon octicon-file" height="16" role="img" version="1.1" viewbox="0 0 12 16" width="12"><path d="M6 5H2V4h4v1zM2 8h7V7H2v1zm0 2h7V9H2v1zm0 2h7v-1H2v1zm10-7.5V14c0 .55-.45 1-1 1H1c-.55 0-1-.45-1-1V2c0-.55.45-1 1-1h7.5L12 4.5zM11 5L8 2H1v12h10V5z" fill-rule="evenodd"/></svg>\n<img alt="" class="spinner" height="16" src="https://github.githubassets.com/images/spinners/octocat-spinner-32.gif" width="16"/>\n</td>\n<td class="content">\n<span class="css-truncate css-truncate-target"><a class="js-navigation-open" href="/stlrda/redb_python/blob/master/python/DAGs/MigratetoPG_DAG.py" id="5554cd417ad3b8097206c9a0e81566d0-7416c3966dc565eb1b0115b89fa72116e4cc3ee6" title="MigratetoPG_DAG.py">MigratetoPG_DAG.py</a></span>\n</td>\n<td class="message">\n<span class="css-truncate css-truncate-target">\n</span>\n</td>\n<td class="age">\n<span class="css-truncate css-truncate-target"/>\n</td>\n</tr>'
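To double-check what the failing response actually contains, the sketch below scans an HTML fragment for any `datetime` attribute using only the standard library. `DatetimeCollector` and `find_datetimes` are names made up for this illustration, and `FAILING_AGE_CELL` is the `<td class="age">` cell copied from the output above. In that cell the span is empty, so there is nothing for `.no-wrap` to match.

```python
from html.parser import HTMLParser


class DatetimeCollector(HTMLParser):
    """Collect the value of every `datetime` attribute in a fragment."""

    def __init__(self):
        super().__init__()
        self.datetimes = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        for name, value in attrs:
            if name == "datetime":
                self.datetimes.append(value)


def find_datetimes(fragment):
    """Return all datetime attribute values found in an HTML fragment."""
    parser = DatetimeCollector()
    parser.feed(fragment)
    parser.close()
    return parser.datetimes


# The "age" cell from the failing row shown above.
FAILING_AGE_CELL = (
    '<td class="age">\n'
    '<span class="css-truncate css-truncate-target"/>\n'
    '</td>'
)

print(find_datetimes(FAILING_AGE_CELL))  # prints []
```

The empty list suggests the timestamp is simply absent from the HTML in the failing runs, rather than the selector being wrong.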