import requests
from bs4 import BeautifulSoup
import re
r = requests.get('http://www.reading.ac.uk/ready-to-study/study/subject-area/modern-languages-and-european-studies-ug/ba-spanish-and-history.aspx')
soup = BeautifulSoup(r.text, 'html.parser')
fees_div = soup.find('div', class_='Fees hiddenContent pad-around-large tabcontent')
m = re.search(r'£[\d,]+', fees_div.select('p:nth-of-type(2)')[0].get_text())
fee1 = m[0]
m = re.search(r'£[\d,]+', fees_div.select('p:nth-of-type(3)')[0].get_text())
fee2 = m[0]
print(fee1, fee2)
Prints:
£9,250 £17,320
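One caveat with the code above: `re.search` returns `None` when a paragraph contains no pound amount, so `m[0]` would raise a `TypeError` if the page layout changes. A minimal sketch of a guard follows; the helper name `first_fee` and the sample strings are hypothetical, made up for illustration, and are not taken from the page.

```python
import re
from typing import Optional

# Same pattern as in the answer: a pound sign followed by digits and commas.
FEE_PATTERN = re.compile(r'£[\d,]+')

def first_fee(text: str) -> Optional[str]:
    """Return the first pound amount found in text, or None if there is none."""
    m = FEE_PATTERN.search(text)
    return m[0] if m else None

# Hypothetical paragraph texts, for illustration only.
print(first_fee('Full-time fee: £9,250 per year'))  # £9,250
print(first_fee('Fees to be confirmed'))            # None
```

With this in place, you could call `first_fee(fees_div.select('p:nth-of-type(2)')[0].get_text())` and check the result for `None` before using it. Note that `soup.find` can also return `None` if the div's class string changes, so `fees_div` may need the same kind of check.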
Update
You can also use Selenium to scrape the page, although it offers no advantage in this case. For example (using Chrome):
from selenium import webdriver
from bs4 import BeautifulSoup
import re
options = webdriver.ChromeOptions()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)
driver.get('http://www.reading.ac.uk/ready-to-study/study/subject-area/modern-languages-and-european-studies-ug/ba-spanish-and-history.aspx')
soup = BeautifulSoup(driver.page_source, 'html.parser')
fees_div = soup.find('div', class_='Fees hiddenContent pad-around-large tabcontent')
m = re.search(r'£[\d,]+', fees_div.select('p:nth-of-type(2)')[0].get_text())
fee1 = m[0]
m = re.search(r'£[\d,]+', fees_div.select('p:nth-of-type(3)')[0].get_text())
fee2 = m[0]
print(fee1, fee2)
driver.quit()
Update
Consider skipping BeautifulSoup altogether and simply scanning the entire HTML source for the fees with a simple regular expression and re.findall:
import requests
import re
r = requests.get('http://www.reading.ac.uk/ready-to-study/study/subject-area/modern-languages-and-european-studies-ug/ba-spanish-and-history.aspx')
print(re.findall(r'£[\d,]+', r.text))
Prints:
['£9,250', '£17,320']
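If you need the fees as numbers rather than strings, the `findall` result can be converted by dropping the pound sign and the thousands separators. The sketch below is only an illustration: the function name `fees_as_ints` and the sample HTML are hypothetical, and the pattern is slightly tightened (it requires a digit right after `£`) so that every match converts cleanly.

```python
import re
from typing import List

def fees_as_ints(html: str) -> List[int]:
    """Find every pound amount in html and return the amounts as integers."""
    # Capture only the numeric part; require at least one digit after the pound sign.
    amounts = re.findall(r'£(\d[\d,]*)', html)
    # Remove thousands separators before converting, e.g. '17,320' -> 17320.
    return [int(a.replace(',', '')) for a in amounts]

# Hypothetical snippet standing in for r.text.
sample_html = '<p>UK: £9,250</p><p>International: £17,320</p>'
print(fees_as_ints(sample_html))  # [9250, 17320]
```

Keep in mind that scanning the whole page returns every pound amount it contains, so this works only as long as the fees are the only such amounts on the page.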