pandas - Pulling IMDb IDs for titles in a search list

Question

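Is it possible to get all the IMDb IDs for titles that meet a set of search criteria (such as number of votes, language, release year, etc.)?

My first priority is to compile a list of all IMDb IDs that are classified as feature films and have more than 25,000 votes (that is, the titles eligible to appear on the Top 250 list), as it appears here. At the time of posting, 4,296 films meet these criteria.

(If you are unfamiliar with IMDb IDs: an IMDb ID is a unique 7-digit code associated with every film/person/character/etc. in the database. For example, for the movie "Drive" (2011), the IMDb ID is "0780504".)

In the future, however, it would be helpful to set the search criteria as I see fit, just as I can when typing in the URL (using &num_votes=##, &year=##, &title_type=##, ...).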
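For illustration, the query-string parameters mentioned above could be assembled programmatically. This is only a minimal sketch: the helper name `build_search_url` and its keyword arguments are hypothetical, the base address is the one used in the search URLs elsewhere in this thread, and IMDb may have changed which parameters its search pages accept.

```python
from urllib.parse import urlencode

# Base address as it appears in the search URLs used in this thread.
SEARCH_URL = "http://www.imdb.com/search/title"


def build_search_url(min_votes=None, year=None, title_type=None, page=1):
    """Build a search URL from optional criteria (hypothetical helper)."""
    params = {}
    if min_votes is not None:
        # A trailing comma leaves the upper bound open: "25000," = 25,000 or more
        params["num_votes"] = f"{min_votes},"
    if year is not None:
        params["year"] = str(year)
    if title_type is not None:
        params["title_type"] = title_type
    params["page"] = str(page)
    # safe="," keeps the comma literal instead of percent-encoding it
    return SEARCH_URL + "?" + urlencode(params, safe=",")


print(build_search_url(min_votes=25000, title_type="feature"))
```

Only the criteria that are actually passed end up in the URL, so the same helper covers both the 25,000-vote list and narrower searches.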

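I have been using IMDbPY to pull information on individual movie titles with great success, and I would love it if the search feature I describe were accessible through that library.

Up until now, I have been generating random 7-digit strings and testing whether they meet my criteria, but this is inefficient because I waste processing time on superfluous IDs.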

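from imdb import IMDb, IMDbError
import random

i = IMDb(accessSystem='http')

# Draw 11,000 random 7-digit ID strings
movies = []
for _ in range(11000):
    randID = str(random.randint(0, 7221897)).zfill(7)
    movies.append(randID)

matching_ids = []
for m in movies:
    try:
        movie = i.get_movie(m)
    except IMDbError as err:
        print(err)
        continue  # skip IDs that could not be fetched

    if str(movie) == '':
        continue

    # Keep feature films only
    kind = movie.get('kind')
    if kind != 'movie':
        continue

    votes = movie.get('votes')
    if votes is None:
        continue

    if votes >= 25000:
        # The original snippet is cut off here; collecting the ID is an assumption
        matching_ids.append(m)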
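To put a rough number on how wasteful the random approach is, here is a back-of-the-envelope estimate. It assumes, as a simplification, that the 4,296 qualifying films are spread uniformly over the 7,221,898 ID values the loop above can draw (0 through 7221897); the function name is hypothetical.

```python
def expected_hits(n_draws, n_matching, id_space):
    """Expected number of draws that land on a matching title,
    when IDs are drawn uniformly at random with replacement."""
    return n_draws * n_matching / id_space


# 11,000 draws, 4,296 qualifying films, 7,221,898 possible ID values
hits = expected_hits(11000, 4296, 7221898)
print(f"expected matches: {hits:.1f}")
```

Under that assumption, 11,000 requests would be expected to turn up only about six or seven qualifying films.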

score 2 · Accepted Answer

Take a look at http://www.omdbapi.com/. You can use the API directly and search by title or by ID.

In Python 3:

import urllib.request
urllib.request.urlopen("http://www.omdbapi.com/?apikey=27939b55&s=moana").read()
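The call above returns raw bytes. A minimal sketch of pulling the IDs out of such a response follows, assuming the body is JSON with a `Search` list whose items carry an `imdbID` field and a top-level `Response` flag set to `"True"` on success. The helper names, the `page` parameter, and the sample payload are illustrative assumptions, not taken from the answer.

```python
import json
from urllib.parse import urlencode


def omdb_search_url(title, api_key, page=1):
    """Build a title-search URL (hypothetical helper; `page` is an assumption)."""
    query = urlencode({"apikey": api_key, "s": title, "page": page})
    return "http://www.omdbapi.com/?" + query


def extract_imdb_ids(payload):
    """Return the imdbID of every search hit in a JSON response body."""
    data = json.loads(payload)
    if data.get("Response") != "True":
        return []  # the API signalled an error or no results
    return [item["imdbID"] for item in data.get("Search", [])]


# Illustrative payload in the assumed shape, using the "Drive" ID from the question
sample = (
    '{"Search": [{"Title": "Drive", "Year": "2011", '
    '"imdbID": "tt0780504", "Type": "movie"}], '
    '"totalResults": "1", "Response": "True"}'
)
print(extract_imdb_ids(sample))
```

In real use, `payload` would be the result of `urllib.request.urlopen(omdb_search_url(...)).read()`.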

score 0 · Accepted Answer

I found a solution using Beautiful Soup, based on a tutorial written by Alexandru Olteanu.

Here is my code:

from requests import get
from bs4 import BeautifulSoup
import re
import math
from time import time, sleep
from random import randint
from IPython.display import clear_output
from warnings import warn

url = "http://www.imdb.com/search/title?num_votes=25000,&title_type=feature&view=simple&sort=num_votes,desc&page=1&ref_=adv_nxt"
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

num_films_text = html_soup.find_all('div', class_ = 'desc')
num_films = re.search(r'of ([\d,]+) titles', str(num_films_text[0])).group(1)
num_films=int(num_films.replace(',', ''))
print(num_films)

num_pages = math.ceil(num_films/50)
print(num_pages)

ids = []
start_time = time()
requests = 0

# For every page in the interval
for page in range(1,num_pages+1):    
    # Make a get request    
    url = "http://www.imdb.com/search/title?num_votes=25000,&title_type=feature&view=simple&sort=num_votes,desc&page="+str(page)+"&ref_=adv_nxt"
    response = get(url)

    # Pause the loop
    sleep(randint(8,15))  

    # Monitor the requests
    requests += 1
    sleep(randint(1,3))
    elapsed_time = time() - start_time
    print('Request: {}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
    clear_output(wait = True) 

    # Throw a warning for non-200 status codes
    if response.status_code != 200:
        warn('Request: {}; Status code: {}'.format(requests, response.status_code))   

    # Break the loop if the number of requests is greater than expected
    if requests > num_pages:
        warn('Number of requests was greater than expected.')  
        break

    # Parse the content of the request with BeautifulSoup
    page_html = BeautifulSoup(response.text, 'html.parser')

    # Select all the 50 movie containers from a single page
    movie_containers = page_html.find_all('div', class_ = 'lister-item mode-simple')

    # Scrape the ID 
    for container in movie_containers:
        # Raw string avoids an invalid-escape warning; skip containers without an ID
        match = re.search(r'tt(\d+)/', str(container.a))
        if match:
            ids.append(match.group(1))
print(ids)
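The ID-extraction step can also be tried in isolation, without any network access. The sketch below uses only the standard library and assumes that title links in the page have the form `/title/tt<digits>/`. The function name and the sample HTML are hypothetical (the second ID is a placeholder), and the de-duplication is an addition that the code above does not perform.

```python
import re

# Assumed link shape: /title/tt<digits>/
TITLE_ID = re.compile(r"/title/tt(\d+)/")


def extract_title_ids(html):
    """Return the unique title IDs found in html, in order of first appearance."""
    # dict.fromkeys drops repeats while preserving insertion order
    return list(dict.fromkeys(TITLE_ID.findall(html)))


# Made-up fragment: the same title is linked twice, once with a tracking suffix
sample_html = (
    '<a href="/title/tt0780504/">Drive</a> '
    '<a href="/title/tt0780504/?ref_=x"><img></a> '
    '<a href="/title/tt0000001/">Placeholder</a>'
)
print(extract_title_ids(sample_html))
```

Matching on the full `/title/tt.../` path is a bit stricter than searching for `tt(\d+)/` alone, which makes stray matches elsewhere in the markup less likely.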
