Given a typical keyword search in Google Scholar (see screenshot), I want to get a dictionary containing the title and url of each publication appearing on the page (eg. results = {'title': 'Cytosolic calcium regulates ion channels in the plasma membrane of Vicia faba guard cells', 'url': 'https://www.nature.com/articles/338427a0'
}.
To retrieve the results page from Google Scholar, I am using the following code:
from urllib import FancyURLopener, quote_plus
from bs4 import BeautifulSoup
class AppURLOpener(FancyURLopener):
version = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'
openurl = AppURLOpener().open
query = "Vicia faba"
url = 'https://scholar.google.com/scholar?q=' + quote_plus(query) + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
#print url
content = openurl(url).read()
page = BeautifulSoup(content, 'lxml')
print page
This code correctly returns the results page, in (very ugly) HTML format. However, I have not been been able to progress beyond this point, as I could not figure out how to use BeautifulSoup (to which I am not too much familiarized) to parse the results page and retrieve the data.
Notice that the issue is with the parsing of and extracting of data from the results page, not with Google Scholar itself, since the results page is correctly retrieved by the above code.
Could anyone please give a few hints? Thanks in advance!