
Using Python 2.5, I'm reading an HTML file for three different pieces of information. The way I find each piece is by matching a regex*, then counting a fixed number of lines down from the matching line to reach the actual value I'm after. The problem is that I have to re-open the site three times (once per piece of information). That seems inefficient, and I'd like to look up all three things while opening the site only once. Does anyone have a better method or a suggestion?

*I will learn a better way, such as BeautifulSoup, but for now I need a quick fix.

Code:

def scrubdividata(ticker):
    try:
        f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Annual Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                AnnualDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Last Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                LastDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Last Ex-Dividend Date:" in line:
                s = str(lines[i+1])
                start = '>'
                end = '</td>'
                LastExDivDate = re.search('%s(.*)%s' % (start, end), s).group(1)
        divlist.append((ticker,LastDiv,AnnualDiv,LastExDivDate))
    except:
        if ticker not in errorlist:
            errorlist.append(ticker)

Thanks,

I found a working solution! I deleted the two extraneous urlopen and readlines calls, leaving just the one used for the loops (before, I had deleted only the urlopen calls but left the readlines). Here is my corrected code:

def scrubdividata(ticker):
    try:
        f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Annual Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                AnnualDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        #f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        #lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Last Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                LastDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        #f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        #lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Last Ex-Dividend Date:" in line:
                s = str(lines[i+1])
                start = '>'
                end = '</td>'
                LastExDivDate = re.search('%s(.*)%s' % (start, end), s).group(1)
        divlist.append((ticker,LastDiv,AnnualDiv,LastExDivDate))
        print '@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'
        print ticker,LastDiv,AnnualDiv,LastExDivDate
        print '@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'
    except:
        if ticker not in errorlist:
            errorlist.append(ticker)
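For reference, the three loops can even be collapsed into a single pass by keeping a table of label-to-pattern mappings. A minimal sketch of that idea (the `PATTERNS` table, the `scrape` helper, and the sample lines are invented for illustration, not taken from dividata.com):

```python
import re

# Each label maps to the regex that captures the value on the following line.
# The labels mirror those in the question; the sample HTML is a made-up stand-in.
PATTERNS = {
    "Annual Dividend:": r'>\$(.*)</td>',
    "Last Dividend:": r'>\$(.*)</td>',
    "Last Ex-Dividend Date:": r'>(.*)</td>',
}

def scrape(lines):
    """Scan the page once, collecting every labelled value."""
    results = {}
    for i, line in enumerate(lines):
        for label, pattern in PATTERNS.items():
            if label in line:
                m = re.search(pattern, lines[i + 1])
                if m:
                    results[label] = m.group(1)
    return results

sample = [
    '<td>Annual Dividend:</td>',
    '<td>$1.20</td>',
    '<td>Last Dividend:</td>',
    '<td>$0.30</td>',
    '<td>Last Ex-Dividend Date:</td>',
    '<td>2013-06-05</td>',
]
info = scrape(sample)
```

This keeps the "match a label, read the next line" approach from the question, but fetches and scans the page only once no matter how many fields are added.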

3 Answers


BeautifulSoup example for reference (Python 2 from memory: I only have Python 3 here, so some syntax may be a little off):

from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen

yoursite = "http://...."
f = urlopen(yoursite)  # Python 2's urlopen result is not a context manager
soup = BeautifulSoup(f)

for node in soup.findAll('td', attrs={'class': 'descrip'}):
    print node.text
    print node.nextSibling.nextSibling.text  # BeautifulSoup 3 spells it nextSibling

Output (for the sample input "GOOG"):

Last Close:
$910.68
Annual Dividend:
N/A
Pay Date:
N/A
Dividend Yield:
N/A
Ex-Dividend Date:
N/A
Years Paying:
N/A
52 Week Dividend:
$0.00
etc.

BeautifulSoup is easy to use on sites with a predictable layout.
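If installing BeautifulSoup isn't an option, the same label/value pairing can be done with the standard library's HTMLParser. A rough sketch, assuming every label cell carries class="descrip" as in the answer above (the sample HTML here is invented):

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2

class DescripParser(HTMLParser):
    """Collect <td> text, pairing each class="descrip" cell with the cell after it."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_td = False
        self.is_label = False
        self.pending_label = None
        self.pairs = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True
            self.is_label = dict(attrs).get('class') == 'descrip'
            self._buf = []

    def handle_data(self, data):
        if self.in_td:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == 'td':
            text = ''.join(self._buf).strip()
            if self.is_label:
                self.pending_label = text      # remember the label cell
            elif self.pending_label is not None:
                self.pairs.append((self.pending_label, text))  # pair with value cell
                self.pending_label = None
            self.in_td = False

sample = ('<tr><td class="descrip">Annual Dividend:</td><td>N/A</td></tr>'
          '<tr><td class="descrip">Last Close:</td><td>$910.68</td></tr>')
p = DescripParser()
p.feed(sample)
```

After `feed`, `p.pairs` holds the label/value tuples, much like the sibling walk in the BeautifulSoup version.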

answered 2013-07-22T20:00:34.777
def scrubdividata(ticker):
    try:
        end = '</td>'
        f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Annual Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                AnnualDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
            if "Last Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                LastDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
            if "Last Ex-Dividend Date:" in line:
                s = str(lines[i+1])
                start = '>'
                LastExDivDate = re.search('%s(.*)%s' % (start, end), s).group(1)
        divlist.append((ticker,LastDiv,AnnualDiv,LastExDivDate))
    except:
        if ticker not in errorlist:
            errorlist.append(ticker)
answered 2013-07-22T19:10:52.643

Note that `lines` will already contain the lines you need, so there is no need to call `f.readlines()` again. Simply reuse `lines`.

A small note: you can iterate with `for i, line in enumerate(lines)`, which keeps the index available for the look-ahead to `lines[i+1]`.

def scrubdividata(ticker):
    try:
        f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        lines = f.readlines()
        for i, line in enumerate(lines):
            if "Annual Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                AnnualDiv = re.search('%s(.*)%s' % (start, end), s).group(1)

        for i, line in enumerate(lines):
            if "Last Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                LastDiv = re.search('%s(.*)%s' % (start, end), s).group(1)

        for i, line in enumerate(lines):
            if "Last Ex-Dividend Date:" in line:
                s = str(lines[i+1])
                start = '>'
                end = '</td>'
                LastExDivDate = re.search('%s(.*)%s' % (start, end), s).group(1)
        divlist.append((ticker,LastDiv,AnnualDiv,LastExDivDate))
    except:
        if ticker not in errorlist:
            errorlist.append(ticker)
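The look-ahead pattern in miniature, with made-up lines in place of the real page, so it's clear why `enumerate` is enough here:

```python
import re

# Stand-in for f.readlines(); not real dividata.com markup.
lines = ['<td>Annual Dividend:</td>', '<td>$1.20</td>', '<td>footer</td>']

value = None
for i, line in enumerate(lines):
    if "Annual Dividend:" in line:
        # enumerate yields (index, line), so lines[i + 1] is still reachable
        value = re.search(r'>\$(.*)</td>', lines[i + 1]).group(1)
```

`value` ends up holding the dollar amount from the line after the label, exactly as in the full function.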
answered 2013-07-22T19:15:03.140