I need to parse
If you look at the source of the URL above, you will find
Expected output:
fvRequests= css
fvRequests=7
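The idea is to locate the script with BeautifulSoup, then use a regular expression
pattern to find the fvRequests.setValue() calls
and extract the value of the third argument from each: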
import re
from bs4 import BeautifulSoup
import requests
pattern = re.compile(r"fvRequests\.setValue\(\d+, \d+, '?(\w+)'?\);")
response = requests.get("http://www.webpagetest.org/breakdown.php?test=150325_34_0f581da87c16d5aac4ecb7cd07cda921&run=2&cached=0")
soup = BeautifulSoup(response.content, "html.parser")  # name the parser explicitly to avoid bs4's "no parser specified" warning
script = soup.find("script", text=lambda x: x and "fvRequests.setValue" in x).text
print(re.findall(pattern, script))
Prints:
[u'css', u'7', u'flash', u'0', u'font', u'0', u'html', u'14', u'image', u'80', u'js', u'35', u'other', u'14']
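To see what the pattern does without hitting the network, here is a minimal sketch that runs it against a made-up script excerpt. The `SAMPLE_SCRIPT` contents are an assumption about how the calls look on the page (inferred from the pattern itself), not a copy of the real source. The optional `'?` on either side of the group is what lets the same pattern capture both quoted names and bare numbers:

```python
import re

# Hypothetical excerpt of the page's inline script (an assumption for
# illustration; the real page source may be laid out differently).
SAMPLE_SCRIPT = """
fvRequests.setValue(0, 0, 'css');
fvRequests.setValue(0, 1, 7);
fvRequests.setValue(1, 0, 'flash');
fvRequests.setValue(1, 1, 0);
"""

# Same pattern as in the answer: skip the first two numeric arguments and
# capture the third, with or without surrounding single quotes.
pattern = re.compile(r"fvRequests\.setValue\(\d+, \d+, '?(\w+)'?\);")

matches = pattern.findall(SAMPLE_SCRIPT)
print(matches)  # ['css', '7', 'flash', '0']
```

Note that every captured value is a string, including the counts, so convert with `int()` if you need numbers.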
You can go a step further and pack the list into a dictionary (solution taken from here):
data = re.findall(pattern, script)
dict(zip(*([iter(data)] * 2)))
This produces:
{
'image': '80',
'flash': '0',
'js': '35',
'html': '14',
'font': '0',
'other': '14',
'css': '7'
}
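The `zip(*([iter(data)] * 2))` idiom can look cryptic, so here is a small sketch of why it pairs up neighbours. The `data` list below is a shortened stand-in for the real result. The trick is that `[iter(data)] * 2` holds the *same* iterator twice, so `zip` pulls two consecutive items for every pair:

```python
# Shortened stand-in for the flat [key, value, key, value, ...] list.
data = ['css', '7', 'flash', '0', 'font', '0']

# Spelled out: one iterator, consumed twice per pair.
it = iter(data)
pairs = list(zip(it, it))
print(pairs)  # [('css', '7'), ('flash', '0'), ('font', '0')]

# The compact form from the answer does exactly the same thing.
compact = dict(zip(*([iter(data)] * 2)))

# A slicing alternative that some find easier to read.
sliced = dict(zip(data[::2], data[1::2]))

print(compact == dict(pairs) == sliced)  # True
```

The slicing version makes two copies of half the list, which does not matter at this size; pick whichever reads better to you.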
import re
import urllib2

if __name__ == "__main__":
    url = 'http://www.webpagetest.org/breakdown.php?test=150325_34_0f581da87c16d5aac4ecb7cd07cda921&run=2&cached=0'

    # http request
    response = urllib2.urlopen(url)
    html = response.read()
    response.close()

    # finding values in html
    results = re.findall(r'fvRequests\.setValue\(\d+, \d+, \'?(.*?)\'?\);', html)
    keys = results[::2]
    values = results[1::2]

    # creating a dictionary
    output = dict(zip(keys, values))
    print output
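The script above targets Python 2 (`urllib2` and the `print` statement). If you are on Python 3, a rough equivalent might look like the sketch below. The function names `parse_requests` and `fetch` are my own, the sample string is a made-up excerpt rather than the real page source, and the `int()` conversion assumes every value is a whole-number count:

```python
import re
from urllib.request import urlopen

# Same regex as above: capture the third argument, quoted or not.
PATTERN = re.compile(r"fvRequests\.setValue\(\d+, \d+, '?(.*?)'?\);")


def parse_requests(html):
    """Return {resource_type: request_count} found in the page source."""
    results = PATTERN.findall(html)
    keys = results[::2]      # even positions: resource type names
    values = results[1::2]   # odd positions: request counts
    # Assumption: counts are always integers on this page.
    return {key: int(value) for key, value in zip(keys, values)}


def fetch(url):
    """Download a page and decode it to text (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


# Offline check against a made-up excerpt of the script.
sample = "fvRequests.setValue(0, 0, 'css');\nfvRequests.setValue(0, 1, 7);\n"
output = parse_requests(sample)
print(output)  # {'css': 7}
```

For the real page you would call `parse_requests(fetch(url))` with the URL from the question. Keeping the download separate from the parsing makes the regex part easy to test without network access.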