
I am parsing the links on the Wikipedia pages of actors, trying to find links to the films they appeared in.

I have a basic method that searches the links and checks for the word "film" in each link. However, many of the links to films do not actually contain this word.

Within the paragraphs that contain the links, though, the word "film" does appear. For example:

    <p>Dreyfuss's first film part was a small, uncredited role in
    <i><a href="/wiki/The_Graduate" title="The Graduate">The Graduate

    // Paragraph goes on for a long time.

Here is the block from the method that checks all the links:

all_links = doca.search('//a[@href]')
all_links.each do |link|
  link_info = link['href']
  # Keep hrefs marked "(film)", skipping category pages and index.php links
  if link_info.include?("(film)") && !(link_info.include?("Category:") || link_info.include?("php"))
    out << link_info
  end
end
out.uniq.collect { |link| strip_out_name(link) }

Would there be a way of checking the text that comes before the link but after the <p> tag for the word "film", while being careful not to check other links (and perhaps also limiting the search to the 50 characters before the link)?

Thanks for any help or suggestions.

Click here, this is the main page that I am testing on


3 Answers


It is possible to search for text inside a tag. See https://stackoverflow.com/a/19816840/128421 for an example.

But I'd do it something like this:

require 'nokogiri'
require 'open-uri'

# URI.open comes from open-uri; a bare open() no longer accepts URLs in Ruby 3+
doc = Nokogiri::HTML(URI.open('http://en.wikipedia.org/wiki/Richard_Dreyfuss'))

table = doc.at('#Filmography').parent.next_element
films = table.search('tr')[1..-1].map{ |tr|
  tds = tr.search('td')
  year = tds.shift.text

  movie = tds.shift
  movie_url = movie.at('a')['href']
  movie_title = movie.at('a').text

  role = tds.shift.text

  {
    year: year,
    movie_url: movie_url,
    movie_title: movie_title,
    role: role
  }
}

films 
# => [{:year=>"1966",
#      :movie_url=>"/wiki/Bewitched",
#      :movie_title=>"Bewitched",
#      :role=>"Rodney"},
#     {:year=>"1966",
#      :movie_url=>"/wiki/Gidget_(TV_series)",
#      :movie_title=>"Gidget",
#      :role=>"Durf the Drag"},
#     {:year=>"1967",
#      :movie_url=>"/wiki/Valley_of_the_Dolls_(film)",
#      :movie_title=>"Valley of the Dolls",
#      :role=>"Assistant stage manager"},
#     {:year=>"1967",
#      :movie_url=>"/wiki/The_Graduate",
#      :movie_title=>"The Graduate",
#      :role=>"Boarding House Resident"},
#     {:year=>"1967",
#      :movie_url=>"/wiki/The_Big_Valley",
#      :movie_title=>"The Big Valley",
#      :role=>"Lud Akley"},
#     {:year=>"1968",
#      :movie_url=>"/wiki/The_Young_Runaways",
#      :movie_title=>"The Young Runaways",
#      :role=>"Terry"},
#     {:year=>"1969",
#      :movie_url=>"/wiki/Hello_Down_There",
#      :movie_title=>"Hello Down There",
#      :role=>"Harold Webster"},
#     {:year=>"1970",
#      :movie_url=>"/wiki/The_Mod_Squad",
#      :movie_title=>"The Mod Squad",
#      :role=>"Curtis Bell"},
#     {:year=>"1973",
#      :movie_url=>"/wiki/American_Graffiti",
#      :movie_title=>"American Graffiti",
#      :role=>"Curt Henderson"},
#     {:year=>"1973",
#      :movie_url=>"/wiki/Dillinger_(1973_film)",
#      :movie_title=>"Dillinger",
#      :role=>"Baby Face Nelson"},
#     {:year=>"1974",
#      :movie_url=>"/wiki/The_Apprenticeship_of_Duddy_Kravitz_(film)",
#      :movie_title=>"The Apprenticeship of Duddy Kravitz",
#      :role=>"Duddy"},
#     {:year=>"1974",
#      :movie_url=>"/wiki/The_Second_Coming_of_Suzanne",
#      :movie_title=>"The Second Coming of Suzanne",
#      :role=>"Clavius"},
#     {:year=>"1975",
#      :movie_url=>"/wiki/Inserts_(film)",
#      :movie_title=>"Inserts",
#      :role=>"The Boy Wonder"},
#     {:year=>"1975",
#      :movie_url=>"/wiki/Jaws_(film)",
#      :movie_title=>"Jaws",
#      :role=>"Matt Hooper"},
#     {:year=>"1976",
#      :movie_url=>"/wiki/Victory_at_Entebbe",
#      :movie_title=>"Victory at Entebbe",
#      :role=>"Colonel Yonatan 'Yonni' Netanyahu"},
#     {:year=>"1977",
#      :movie_url=>"/wiki/Close_Encounters_of_the_Third_Kind",
#      :movie_title=>"Close Encounters of the Third Kind",
#      :role=>"Roy Neary"},
#     {:year=>"1977",
#      :movie_url=>"/wiki/The_Goodbye_Girl",
#      :movie_title=>"The Goodbye Girl",
#      :role=>"Elliott Garfield"},
#     {:year=>"1978",
#      :movie_url=>"/wiki/The_Big_Fix",
#      :movie_title=>"The Big Fix",
#      :role=>"Moses Wine"},
#     {:year=>"1980",
#      :movie_url=>"/wiki/The_Competition_(film)",
#      :movie_title=>"The Competition",
#      :role=>"Paul Dietrich"},
#     {:year=>"1981",
#      :movie_url=>"/wiki/Whose_Life_Is_It_Anyway%3F_(1981_film)",
#      :movie_title=>"Whose Life Is It Anyway?",
#      :role=>"Ken Harrison"},
#     {:year=>"1984",
#      :movie_url=>"/wiki/The_Buddy_System_(film)",
#      :movie_title=>"The Buddy System",
#      :role=>"Joe"},
#     {:year=>"1986",
#      :movie_url=>"/wiki/Down_and_Out_in_Beverly_Hills",
#      :movie_title=>"Down and Out in Beverly Hills",
#      :role=>"David 'Dave' Whiteman"},
#     {:year=>"1986",
#      :movie_url=>"/wiki/Stand_by_Me_(film)",
#      :movie_title=>"Stand by Me",
#      :role=>"Narrator/Gordie LaChance (adult)"},
#     {:year=>"1987",
#      :movie_url=>"/wiki/Tin_Men",
#      :movie_title=>"Tin Men",
#      :role=>"Bill 'BB' Babowsky"},
#     {:year=>"1987",
#      :movie_url=>"/wiki/Stakeout_(1987_film)",
#      :movie_title=>"Stakeout",
#      :role=>"Det. Chris Lecce"},
#     {:year=>"1987",
#      :movie_url=>"/wiki/Nuts_(film)",
#      :movie_title=>"Nuts",
#      :role=>"Aaron Levinsky"},
#     {:year=>"1988",
#      :movie_url=>"/wiki/Moon_Over_Parador",
#      :movie_title=>"Moon Over Parador",
#      :role=>"Jack Noah/President Alphonse Simms"},
#     {:year=>"1989",
#      :movie_url=>"/wiki/Let_It_Ride_(film)",
#      :movie_title=>"Let It Ride",
#      :role=>"Jay Trotter"},
#     {:year=>"1989",
#      :movie_url=>"/wiki/Always_(1989_film)",
#      :movie_title=>"Always",
#      :role=>"Pete Sandich"},
#     {:year=>"1990",
#      :movie_url=>"/wiki/Rosencrantz_%26_Guildenstern_Are_Dead_(film)",
#      :movie_title=>"Rosencrantz & Guildenstern Are Dead",
#      :role=>"The Player"},
#     {:year=>"1990",
#      :movie_url=>"/wiki/Postcards_from_the_Edge_(film)",
#      :movie_title=>"Postcards from the Edge",
#      :role=>"Doctor Frankenthal"},
#     {:year=>"1991",
#      :movie_url=>"/wiki/Once_Around",
#      :movie_title=>"Once Around",
#      :role=>"Sam Sharpe"},
#     {:year=>"1991",
#      :movie_url=>"/wiki/Prisoner_of_Honor",
#      :movie_title=>"Prisoner of Honor",
#      :role=>"Col. Picquart"},
#     {:year=>"1991",
#      :movie_url=>"/wiki/What_About_Bob%3F",
#      :movie_title=>"What About Bob?",
#      :role=>"Dr. Leo Marvin"},
#     {:year=>"1993",
#      :movie_url=>"/wiki/Lost_in_Yonkers_(film)",
#      :movie_title=>"Lost in Yonkers",
#      :role=>"Louie Kurnitz"},
#     {:year=>"1993",
#      :movie_url=>"/wiki/Another_Stakeout",
#      :movie_title=>"Another Stakeout",
#      :role=>"Detective Chris Lecce"},
#     {:year=>"1994",
#      :movie_url=>"/wiki/Silent_Fall",
#      :movie_title=>"Silent Fall",
#      :role=>"Dr. Jake Rainer"},
#     {:year=>"1995",
#      :movie_url=>
#       "/w/index.php?title=The_Last_Word_(1995_film)&action=edit&redlink=1",
#      :movie_title=>"The Last Word",
#      :role=>"Larry"},
#     {:year=>"1995",
#      :movie_url=>"/wiki/The_American_President_(film)",
#      :movie_title=>"The American President",
#      :role=>"Senator Bob Rumson"},
#     {:year=>"1995",
#      :movie_url=>"/wiki/Mr._Holland%27s_Opus",
#      :movie_title=>"Mr. Holland's Opus",
#      :role=>"Glenn Holland"},
#     {:year=>"1996",
#      :movie_url=>"/wiki/James_and_the_Giant_Peach_(film)",
#      :movie_title=>"James and the Giant Peach",
#      :role=>"Centipede (voice)"},
#     {:year=>"1996",
#      :movie_url=>"/wiki/Mad_Dog_Time",
#      :movie_title=>"Mad Dog Time",
#      :role=>"Vic"},
#     {:year=>"1997",
#      :movie_url=>"/wiki/Night_Falls_on_Manhattan",
#      :movie_title=>"Night Falls on Manhattan",
#      :role=>"Sam Vigoda"},
#     {:year=>"1997",
#      :movie_url=>"/wiki/Oliver_Twist_(1997_film)",
#      :movie_title=>"Oliver Twist",
#      :role=>"Fagin"},
#     {:year=>"1998",
#      :movie_url=>"/wiki/Krippendorf%27s_Tribe",
#      :movie_title=>"Krippendorf's Tribe",
#      :role=>"Prof. James Krippendorf"},
#     {:year=>"1999",
#      :movie_url=>"/wiki/Lansky_(film)",
#      :movie_title=>"Lansky",
#      :role=>"Meyer Lansky"},
#     {:year=>"2000",
#      :movie_url=>"/wiki/The_Crew_(2000_film)",
#      :movie_title=>"The Crew",
#      :role=>"Bobby Bartellemeo/Narrator"},
#     {:year=>"2000",
#      :movie_url=>"/wiki/Fail_Safe_(2000_TV)",
#      :movie_title=>"Fail Safe",
#      :role=>"President of the United States"},
#     {:year=>"2001",
#      :movie_url=>"/wiki/The_Old_Man_Who_Read_Love_Stories",
#      :movie_title=>"The Old Man Who Read Love Stories",
#      :role=>"Antonio Bolivar"},
#     {:year=>"2001",
#      :movie_url=>"/wiki/Who_Is_Cletis_Tout%3F",
#      :movie_title=>"Who Is Cletis Tout?",
#      :role=>"Micah Donnelly"},
#     {:year=>"2001",
#      :movie_url=>"/wiki/The_Education_of_Max_Bickford",
#      :movie_title=>"The Education of Max Bickford",
#      :role=>"Max Bickford"},
#     {:year=>"2001",
#      :movie_url=>"/wiki/The_Day_Reagan_Was_Shot",
#      :movie_title=>"The Day Reagan Was Shot",
#      :role=>"Alexander Haig"},
#     {:year=>"2003",
#      :movie_url=>"/wiki/Coast_to_Coast_(TV_film)",
#      :movie_title=>"Coast to Coast",
#      :role=>"Barnaby Pierce"},
#     {:year=>"2004",
#      :movie_url=>"/wiki/Silver_City_(2004_film)",
#      :movie_title=>"Silver City",
#      :role=>"Chuck Raven"},
#     {:year=>"2006",
#      :movie_url=>"/wiki/Poseidon_(film)",
#      :movie_title=>"Poseidon",
#      :role=>"Richard Nelson"},
#     {:year=>"2007",
#      :movie_url=>"/wiki/Tin_Man_(TV_miniseries)",
#      :movie_title=>"Tin Man",
#      :role=>"Mystic Man"},
#     {:year=>"2007",
#      :movie_url=>"/wiki/Ocean_of_Fear",
#      :movie_title=>"Ocean of Fear",
#      :role=>"Narrator"},
#     {:year=>"2008",
#      :movie_url=>"/wiki/Signs_of_the_Time_(film)",
#      :movie_title=>"Signs of the Time",
#      :role=>"Narrator"},
#     {:year=>"2008",
#      :movie_url=>"/wiki/W._(film)",
#      :movie_title=>"W.",
#      :role=>"Dick Cheney"},
#     {:year=>"2008",
#      :movie_url=>"/w/index.php?title=America_Betrayed&action=edit&redlink=1",
#      :movie_title=>"America Betrayed",
#      :role=>"Narrator"},
#     {:year=>"2009",
#      :movie_url=>"/wiki/My_Life_in_Ruins",
#      :movie_title=>"My Life in Ruins",
#      :role=>"Irv"},
#     {:year=>"2009",
#      :movie_url=>"/wiki/Leaves_of_Grass_(film)",
#      :movie_title=>"Leaves of Grass",
#      :role=>"Pug Rothbaum"},
#     {:year=>"2009",
#      :movie_url=>"/wiki/The_Lightkeepers",
#      :movie_title=>"The Lightkeepers",
#      :role=>"Seth"},
#     {:year=>"2010",
#      :movie_url=>"/wiki/Piranha_3D",
#      :movie_title=>"Piranha 3D",
#      :role=>"Matthew Boyd"},
#     {:year=>"2010",
#      :movie_url=>"/wiki/Weeds_(TV_series)",
#      :movie_title=>"Weeds",
#      :role=>"Warren Schiff"},
#     {:year=>"2010",
#      :movie_url=>"/wiki/RED_(film)",
#      :movie_title=>"RED",
#      :role=>"Alexander Dunning"},
#     {:year=>"2012",
#      :movie_url=>"/wiki/Coma_(U.S._miniseries)",
#      :movie_title=>"Coma",
#      :role=>"Professor Hillside"},
#     {:year=>"2013",
#      :movie_url=>"/wiki/Very_Good_Girls",
#      :movie_title=>"Very Good Girls",
#      :role=>"Danny, Gerry's father"},
#     {:year=>"2013",
#      :movie_url=>"/wiki/Paranoia_(2013_film)",
#      :movie_title=>"Paranoia",
#      :role=>"Francis Cassidy"}]

To explain what it's doing:

The "Filmography" table is a good source for the information; it's organized logically, so writing code to walk through it is easy.

doc.at('#Filmography').parent.next_element

finds that table by locating the element with the id Filmography, which sits in the <h2> heading just above it, then backs up to that element's parent and takes the next element, which is the table itself.

table.search('tr')[1..-1] finds the <tr> rows inside the table, skips the first, then iterates (using map) over the remaining ones.

tds = tr.search('td') finds the cells in the row. From that point on it's a matter of peeling that NodeSet apart like an array, shifting off the elements I want one at a time. The rest of the code should be pretty obvious. Once the individual parts of interest are retrieved, they're bundled into a hash, and map returns the result as an array of hashes.

answered 2013-11-06T17:32:27.113

Why not try parsing out the filmography section of the Wikipedia article? It seems pretty standard across the few actors that I looked at, and it mentions whether or not an entry was a TV series, so you could filter those out easily.

<tr>
    <td>1966</td>
    <td><i><a href="/wiki/Gidget_(TV_series)" title="Gidget (TV series)">Gidget</a></i></td>
    <td>Durf the Drag</td>
    <td>TV series 1 episode</td>
</tr>
<tr>
    <td>1967</td>
    <td><i><a href="/wiki/Valley_of_the_Dolls_(film)" title="Valley of the Dolls (film)">Valley of the Dolls</a></i></td>
    <td>Assistant stage manager</td>
    <td>Uncredited</td>
</tr>

It looks like you could pull nodes like these from the page's HTML and save all the information to do what you want with it. The first node could be disregarded, since "TV" appears multiple times in its subnodes.

Hope this helps!

-Larry

answered 2013-11-06T16:12:00.967

Okay, so I have tested some code based on your actual request and come up with the following:

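require 'nokogiri'
require 'open-uri'

url = "http://en.wikipedia.org/wiki/Richard_Dreyfuss"
# URI.open comes from open-uri; a bare open() no longer accepts URLs in Ruby 3+
doc = Nokogiri::HTML(URI.open(url))
all_links = doc.search("//a[@href]")
all_links.each do |link|
  # Text of the enclosing <p>; empty when the link is not inside a paragraph
  p_text = link.ancestors("p").text
  link_index = p_text.index(link.text)
  unless link_index.nil?
    # Look at most 50 characters back from where the link text starts
    search_back = link_index > 50 ? link_index - 50 : 0
    puts link['href'] if p_text[search_back..link_index].downcase.include?("film")
  end
end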

Output

#=>/wiki/American_Graffiti
   /wiki/Jaws_(film)
   /wiki/Close_Encounters_of_the_Third_Kind
   /wiki/The_Graduate
   /wiki/The_Apprenticeship_of_Duddy_Kravitz_(film)
   /wiki/Down_And_Out_In_Beverly_Hills
   /wiki/Stakeout_(1987_film)
   /wiki/Stephen_King
   /wiki/The_Body_(novella)
   /wiki/Poseidon_(film)
   #cite_note-27
   /wiki/Jonathan_Tasini

This seems to satisfy the question you were asking but obviously needs to be modified to fit your needs.

Edit

I added your request to look back only 50 characters in the paragraph. The output is much shorter now, but I am not sure the results will be as useful as you'd like. This answers the question but does not capture exactly what you are hoping for; e.g., the last 2 links are not to films, but they are within 50 characters of the word "film".

answered 2013-11-06T16:22:51.600