It is possible to search for text inside a tag. See https://stackoverflow.com/a/19816840/128421 for an example.
But I'd do it something like this:
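require 'nokogiri'
require 'open-uri'

# Kernel#open no longer opens URLs as of Ruby 3.0; use URI.open from open-uri.
doc = Nokogiri::HTML(URI.open('http://en.wikipedia.org/wiki/Richard_Dreyfuss'))

# The table is the element right after the heading that holds id "Filmography".
table = doc.at('#Filmography').parent.next_element

# Skip the header row, then turn each remaining row into a hash.
films = table.search('tr')[1..-1].map{ |tr|
  tds = tr.search('td')
  year = tds.shift.text
  movie = tds.shift
  movie_url = movie.at('a')['href']
  movie_title = movie.at('a').text
  role = tds.shift.text
  {
    year: year,
    movie_url: movie_url,
    movie_title: movie_title,
    role: role
  }
}

films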
# => [{:year=>"1966",
# :movie_url=>"/wiki/Bewitched",
# :movie_title=>"Bewitched",
# :role=>"Rodney"},
# {:year=>"1966",
# :movie_url=>"/wiki/Gidget_(TV_series)",
# :movie_title=>"Gidget",
# :role=>"Durf the Drag"},
# {:year=>"1967",
# :movie_url=>"/wiki/Valley_of_the_Dolls_(film)",
# :movie_title=>"Valley of the Dolls",
# :role=>"Assistant stage manager"},
# {:year=>"1967",
# :movie_url=>"/wiki/The_Graduate",
# :movie_title=>"The Graduate",
# :role=>"Boarding House Resident"},
# {:year=>"1967",
# :movie_url=>"/wiki/The_Big_Valley",
# :movie_title=>"The Big Valley",
# :role=>"Lud Akley"},
# {:year=>"1968",
# :movie_url=>"/wiki/The_Young_Runaways",
# :movie_title=>"The Young Runaways",
# :role=>"Terry"},
# {:year=>"1969",
# :movie_url=>"/wiki/Hello_Down_There",
# :movie_title=>"Hello Down There",
# :role=>"Harold Webster"},
# {:year=>"1970",
# :movie_url=>"/wiki/The_Mod_Squad",
# :movie_title=>"The Mod Squad",
# :role=>"Curtis Bell"},
# {:year=>"1973",
# :movie_url=>"/wiki/American_Graffiti",
# :movie_title=>"American Graffiti",
# :role=>"Curt Henderson"},
# {:year=>"1973",
# :movie_url=>"/wiki/Dillinger_(1973_film)",
# :movie_title=>"Dillinger",
# :role=>"Baby Face Nelson"},
# {:year=>"1974",
# :movie_url=>"/wiki/The_Apprenticeship_of_Duddy_Kravitz_(film)",
# :movie_title=>"The Apprenticeship of Duddy Kravitz",
# :role=>"Duddy"},
# {:year=>"1974",
# :movie_url=>"/wiki/The_Second_Coming_of_Suzanne",
# :movie_title=>"The Second Coming of Suzanne",
# :role=>"Clavius"},
# {:year=>"1975",
# :movie_url=>"/wiki/Inserts_(film)",
# :movie_title=>"Inserts",
# :role=>"The Boy Wonder"},
# {:year=>"1975",
# :movie_url=>"/wiki/Jaws_(film)",
# :movie_title=>"Jaws",
# :role=>"Matt Hooper"},
# {:year=>"1976",
# :movie_url=>"/wiki/Victory_at_Entebbe",
# :movie_title=>"Victory at Entebbe",
# :role=>"Colonel Yonatan 'Yonni' Netanyahu"},
# {:year=>"1977",
# :movie_url=>"/wiki/Close_Encounters_of_the_Third_Kind",
# :movie_title=>"Close Encounters of the Third Kind",
# :role=>"Roy Neary"},
# {:year=>"1977",
# :movie_url=>"/wiki/The_Goodbye_Girl",
# :movie_title=>"The Goodbye Girl",
# :role=>"Elliott Garfield"},
# {:year=>"1978",
# :movie_url=>"/wiki/The_Big_Fix",
# :movie_title=>"The Big Fix",
# :role=>"Moses Wine"},
# {:year=>"1980",
# :movie_url=>"/wiki/The_Competition_(film)",
# :movie_title=>"The Competition",
# :role=>"Paul Dietrich"},
# {:year=>"1981",
# :movie_url=>"/wiki/Whose_Life_Is_It_Anyway%3F_(1981_film)",
# :movie_title=>"Whose Life Is It Anyway?",
# :role=>"Ken Harrison"},
# {:year=>"1984",
# :movie_url=>"/wiki/The_Buddy_System_(film)",
# :movie_title=>"The Buddy System",
# :role=>"Joe"},
# {:year=>"1986",
# :movie_url=>"/wiki/Down_and_Out_in_Beverly_Hills",
# :movie_title=>"Down and Out in Beverly Hills",
# :role=>"David 'Dave' Whiteman"},
# {:year=>"1986",
# :movie_url=>"/wiki/Stand_by_Me_(film)",
# :movie_title=>"Stand by Me",
# :role=>"Narrator/Gordie LaChance (adult)"},
# {:year=>"1987",
# :movie_url=>"/wiki/Tin_Men",
# :movie_title=>"Tin Men",
# :role=>"Bill 'BB' Babowsky"},
# {:year=>"1987",
# :movie_url=>"/wiki/Stakeout_(1987_film)",
# :movie_title=>"Stakeout",
# :role=>"Det. Chris Lecce"},
# {:year=>"1987",
# :movie_url=>"/wiki/Nuts_(film)",
# :movie_title=>"Nuts",
# :role=>"Aaron Levinsky"},
# {:year=>"1988",
# :movie_url=>"/wiki/Moon_Over_Parador",
# :movie_title=>"Moon Over Parador",
# :role=>"Jack Noah/President Alphonse Simms"},
# {:year=>"1989",
# :movie_url=>"/wiki/Let_It_Ride_(film)",
# :movie_title=>"Let It Ride",
# :role=>"Jay Trotter"},
# {:year=>"1989",
# :movie_url=>"/wiki/Always_(1989_film)",
# :movie_title=>"Always",
# :role=>"Pete Sandich"},
# {:year=>"1990",
# :movie_url=>"/wiki/Rosencrantz_%26_Guildenstern_Are_Dead_(film)",
# :movie_title=>"Rosencrantz & Guildenstern Are Dead",
# :role=>"The Player"},
# {:year=>"1990",
# :movie_url=>"/wiki/Postcards_from_the_Edge_(film)",
# :movie_title=>"Postcards from the Edge",
# :role=>"Doctor Frankenthal"},
# {:year=>"1991",
# :movie_url=>"/wiki/Once_Around",
# :movie_title=>"Once Around",
# :role=>"Sam Sharpe"},
# {:year=>"1991",
# :movie_url=>"/wiki/Prisoner_of_Honor",
# :movie_title=>"Prisoner of Honor",
# :role=>"Col. Picquart"},
# {:year=>"1991",
# :movie_url=>"/wiki/What_About_Bob%3F",
# :movie_title=>"What About Bob?",
# :role=>"Dr. Leo Marvin"},
# {:year=>"1993",
# :movie_url=>"/wiki/Lost_in_Yonkers_(film)",
# :movie_title=>"Lost in Yonkers",
# :role=>"Louie Kurnitz"},
# {:year=>"1993",
# :movie_url=>"/wiki/Another_Stakeout",
# :movie_title=>"Another Stakeout",
# :role=>"Detective Chris Lecce"},
# {:year=>"1994",
# :movie_url=>"/wiki/Silent_Fall",
# :movie_title=>"Silent Fall",
# :role=>"Dr. Jake Rainer"},
# {:year=>"1995",
# :movie_url=>
# "/w/index.php?title=The_Last_Word_(1995_film)&action=edit&redlink=1",
# :movie_title=>"The Last Word",
# :role=>"Larry"},
# {:year=>"1995",
# :movie_url=>"/wiki/The_American_President_(film)",
# :movie_title=>"The American President",
# :role=>"Senator Bob Rumson"},
# {:year=>"1995",
# :movie_url=>"/wiki/Mr._Holland%27s_Opus",
# :movie_title=>"Mr. Holland's Opus",
# :role=>"Glenn Holland"},
# {:year=>"1996",
# :movie_url=>"/wiki/James_and_the_Giant_Peach_(film)",
# :movie_title=>"James and the Giant Peach",
# :role=>"Centipede (voice)"},
# {:year=>"1996",
# :movie_url=>"/wiki/Mad_Dog_Time",
# :movie_title=>"Mad Dog Time",
# :role=>"Vic"},
# {:year=>"1997",
# :movie_url=>"/wiki/Night_Falls_on_Manhattan",
# :movie_title=>"Night Falls on Manhattan",
# :role=>"Sam Vigoda"},
# {:year=>"1997",
# :movie_url=>"/wiki/Oliver_Twist_(1997_film)",
# :movie_title=>"Oliver Twist",
# :role=>"Fagin"},
# {:year=>"1998",
# :movie_url=>"/wiki/Krippendorf%27s_Tribe",
# :movie_title=>"Krippendorf's Tribe",
# :role=>"Prof. James Krippendorf"},
# {:year=>"1999",
# :movie_url=>"/wiki/Lansky_(film)",
# :movie_title=>"Lansky",
# :role=>"Meyer Lansky"},
# {:year=>"2000",
# :movie_url=>"/wiki/The_Crew_(2000_film)",
# :movie_title=>"The Crew",
# :role=>"Bobby Bartellemeo/Narrator"},
# {:year=>"2000",
# :movie_url=>"/wiki/Fail_Safe_(2000_TV)",
# :movie_title=>"Fail Safe",
# :role=>"President of the United States"},
# {:year=>"2001",
# :movie_url=>"/wiki/The_Old_Man_Who_Read_Love_Stories",
# :movie_title=>"The Old Man Who Read Love Stories",
# :role=>"Antonio Bolivar"},
# {:year=>"2001",
# :movie_url=>"/wiki/Who_Is_Cletis_Tout%3F",
# :movie_title=>"Who Is Cletis Tout?",
# :role=>"Micah Donnelly"},
# {:year=>"2001",
# :movie_url=>"/wiki/The_Education_of_Max_Bickford",
# :movie_title=>"The Education of Max Bickford",
# :role=>"Max Bickford"},
# {:year=>"2001",
# :movie_url=>"/wiki/The_Day_Reagan_Was_Shot",
# :movie_title=>"The Day Reagan Was Shot",
# :role=>"Alexander Haig"},
# {:year=>"2003",
# :movie_url=>"/wiki/Coast_to_Coast_(TV_film)",
# :movie_title=>"Coast to Coast",
# :role=>"Barnaby Pierce"},
# {:year=>"2004",
# :movie_url=>"/wiki/Silver_City_(2004_film)",
# :movie_title=>"Silver City",
# :role=>"Chuck Raven"},
# {:year=>"2006",
# :movie_url=>"/wiki/Poseidon_(film)",
# :movie_title=>"Poseidon",
# :role=>"Richard Nelson"},
# {:year=>"2007",
# :movie_url=>"/wiki/Tin_Man_(TV_miniseries)",
# :movie_title=>"Tin Man",
# :role=>"Mystic Man"},
# {:year=>"2007",
# :movie_url=>"/wiki/Ocean_of_Fear",
# :movie_title=>"Ocean of Fear",
# :role=>"Narrator"},
# {:year=>"2008",
# :movie_url=>"/wiki/Signs_of_the_Time_(film)",
# :movie_title=>"Signs of the Time",
# :role=>"Narrator"},
# {:year=>"2008",
# :movie_url=>"/wiki/W._(film)",
# :movie_title=>"W.",
# :role=>"Dick Cheney"},
# {:year=>"2008",
# :movie_url=>"/w/index.php?title=America_Betrayed&action=edit&redlink=1",
# :movie_title=>"America Betrayed",
# :role=>"Narrator"},
# {:year=>"2009",
# :movie_url=>"/wiki/My_Life_in_Ruins",
# :movie_title=>"My Life in Ruins",
# :role=>"Irv"},
# {:year=>"2009",
# :movie_url=>"/wiki/Leaves_of_Grass_(film)",
# :movie_title=>"Leaves of Grass",
# :role=>"Pug Rothbaum"},
# {:year=>"2009",
# :movie_url=>"/wiki/The_Lightkeepers",
# :movie_title=>"The Lightkeepers",
# :role=>"Seth"},
# {:year=>"2010",
# :movie_url=>"/wiki/Piranha_3D",
# :movie_title=>"Piranha 3D",
# :role=>"Matthew Boyd"},
# {:year=>"2010",
# :movie_url=>"/wiki/Weeds_(TV_series)",
# :movie_title=>"Weeds",
# :role=>"Warren Schiff"},
# {:year=>"2010",
# :movie_url=>"/wiki/RED_(film)",
# :movie_title=>"RED",
# :role=>"Alexander Dunning"},
# {:year=>"2012",
# :movie_url=>"/wiki/Coma_(U.S._miniseries)",
# :movie_title=>"Coma",
# :role=>"Professor Hillside"},
# {:year=>"2013",
# :movie_url=>"/wiki/Very_Good_Girls",
# :movie_title=>"Very Good Girls",
# :role=>"Danny, Gerry's father"},
# {:year=>"2013",
# :movie_url=>"/wiki/Paranoia_(2013_film)",
# :movie_title=>"Paranoia",
# :role=>"Francis Cassidy"}]
To explain what it's doing:
The "Filmography" table is a good source for the information; it's organized logically, so writing code to walk through it is easy.
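`doc.at('#Filmography').parent.next_element` finds that table by way of the `<h2>` heading just above it: it locates the element with the id `Filmography` inside the heading, backs up to the heading itself with `parent`, then takes the next element, which is the table.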
`table.search('tr')[1..-1]` finds the `<tr>` rows inside the table, skips the first (the header row), then iterates (using `map`) over the remaining ones.
`tds = tr.search('td')` finds the cells in the row. From that point on it's a matter of peeling that NodeSet apart like an array, using `shift` to pull off the cells I want in order. The rest of the code should be pretty obvious. Once the parts of interest are retrieved, they're bundled into a hash, and `map` returns those hashes as an array.