Suppose I have:
<tr>
<td class="prodSpecAtribute">word</td>
<td colspan="5">
<a href="http://www.cmegroup.com/clearing/trading-practices/CMEblock-trade.html" target="_blank">another_word</a>
</td>
</tr>
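I want to extract the text in the two td cells (word and another_word), so I used BeautifulSoup.
Here is the code Matijn Pieters asked for. Basically, it reads the table in the HTML page and stores the values in a left-column list and a right-column list. I then build a dictionary from them, using the left-column list as keys and the right-column list as values: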
import re

# Assuming bs4; with BeautifulSoup 3 use: from BeautifulSoup import BeautifulSoup
from bs4 import BeautifulSoup


def get_data(page):
    soup = BeautifulSoup(page)
    left = []
    right = []
    # Obtain data from table and store into left and right columns
    # Iterate through each row
    for tr in soup.findAll('tr'):
        # Find all table data (cols) in that row
        tds = tr.findAll('td')
        # Make sure there are at least 2 cells: a label and a value
        if len(tds) >= 2:
            # Find each entry in a row -> convert to text
            right_col = []
            inp = []
            once = 0
            no_class = 0
            for td in tds:
                if once == 0:
                    # Check if of class 'prodSpecAtribute'
                    # (check() is a helper defined elsewhere, not shown here)
                    if check(td) == True:
                        left_col = td.findAll(text=True)
                        left_col_x = re.sub(r'&\w+;', '', str(left_col[0]))
                        once = 1
                    else:
                        # First cell is not a label cell: skip this row
                        no_class = 1
                        break
                else:
                    right_col = td.findAll(text=True)
                    right_col_x = ' '.join(text for text in right_col if text.strip())
                    right_col_x = re.sub(r'&\w+;', '', right_col_x)
                    inp.append(right_col_x)
            if no_class == 0:
                inps = '. '.join(inp)
                left.append(left_col_x)
                right.append(inps)
    # Create a dictionary for left and right cols
    item = dict(zip(left, right))
    return item
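
The last two steps (stripping entity references with the regex and pairing the two lists into a dictionary) can be checked on their own, without any HTML parsing. The following is only a minimal sketch of those steps: the helper names `clean` and `build_item` are made up for illustration and are not part of the code above, and the input strings are sample values based on the row shown at the top.

```python
import re

# Same pattern as in get_data(): matches entity references such as &nbsp; or &amp;
ENTITY = re.compile(r'&\w+;')


def clean(text):
    """Remove HTML entity references from a string (hypothetical helper)."""
    return ENTITY.sub('', text)


def build_item(left, right):
    """Pair left-column labels with right-column values (hypothetical helper)."""
    return dict(zip(left, right))


# Sample values standing in for what the table scan would collect
left = [clean('word&nbsp;')]
right = [clean('another_word')]

item = build_item(left, right)
print(item)  # {'word': 'another_word'}
```

Two things worth keeping in mind about `dict(zip(left, right))`: `zip` stops at the shorter list, so an extra label or value is dropped silently, and if the same label appears in two rows, the later value overwrites the earlier one.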