
I'm not very good at regex and have looked everywhere I could. I could use some help parsing this page (http://www.imdb.com/search/title?count=100&groups=oscar_best_picture_winners&sort=year,desc&ref_=nv_ch_osc_3) to get the movie names. P.S. A dummy regex would do too.


2 Answers


Short answer

This is almost the same as your previous question, and so is the answer... just with a modified regex.

#<td class="number">(\d+).</td>.*?<a href="/title/tt\d+/">(.*?)</a>#s

https://stackoverflow.com/a/19600974/2573622


Extended answer

About regex

For more information, you may want to take a look at the following link:

http://www.regular-expressions.info/

Click Tutorial in the top menu bar; it explains nearly everything about regex.

Building the regex

First, you have to get the relevant HTML (for one film) from the page...

<td class="number">RANK.</td>
  <td class="image">
    <a href="/title/tt000000/" title="FILM TITLE (YEAR)"><img src="http://imdb.com/path-to-image.jpg" height="74" width="54" alt="FILM TITLE (YEAR)" title="FILM TITLE (YEAR)"></a>
  </td>
  <td class="title">
    

<span class="wlb_wrapper" data-tconst="tt000000" data-size="small" data-caller-name="search"></span>

    <a href="/title/tt000000/">FILM TITLE</a>

Then you strip out the noise and the variable information...

<td class="number">RANK.</td>.*?<a href="/title/tt\d+/">FILM TITLE</a>

Then add your capture groups...

<td class="number">(RANK).</td>.*?<a href="/title/tt\d+/">(FILM TITLE)</a>

And that's it:

 #<td class="number">(\d+).</td>.*?<a href="/title/tt\d+/">(.*?)</a>#s

The s modifier after the closing pattern delimiter makes the regex engine match newlines with . as well.
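To see what the s modifier changes, here is a minimal sketch that runs the pattern with and without it against a single row of markup. The $row fragment is a hand-written stand-in modelled on the template above (the tt0000001 ID is a placeholder used only for illustration), not a live copy of the page.

```php
<?php
// Hand-written sample row modelled on the template above; the ID is a placeholder.
$row = <<<'HTML'
<td class="number">1.</td>
  <td class="image">
    <a href="/title/tt0000001/" title="Argo (2012)"><img src="http://imdb.com/path-to-image.jpg" height="74" width="54" alt="Argo (2012)" title="Argo (2012)"></a>
  </td>
  <td class="title">
    <a href="/title/tt0000001/">Argo</a>
HTML;

$pattern = '<td class="number">(\d+).</td>.*?<a href="/title/tt\d+/">(.*?)</a>';

// Without s, "." stops at line breaks, so ".*?" cannot bridge the rank cell and the title link.
$withoutS = preg_match('#' . $pattern . '#', $row, $m1);

// With s, "." also matches newlines, and the lazy ".*?" reaches the first plain title link.
$withS = preg_match('#' . $pattern . '#s', $row, $m2);

var_dump($withoutS); // int(0)
var_dump($withS);    // int(1)
print_r($m2);        // [0] => full match, [1] => 1, [2] => Argo
```

The image link is skipped because its href is followed by a title attribute rather than a closing >, so only the plain title link satisfies the pattern.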

With code

Same as in the previous answer (with the modified regex):

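// Download the search results page (needs allow_url_fopen to be enabled).
$page = file_get_contents('http://www.imdb.com/search/title?count=100&groups=oscar_best_picture_winners&sort=year,desc&ref_=nv_ch_osc_3');
if ($page === false) {
    die('Could not download the page.');
}

// Collect every rank (group 1) and film title (group 2).
preg_match_all('#<td class="number">(\d+).</td>.*?<a href="/title/tt\d+/">(.*?)</a>#s', $page, $matches);

// Build a rank => title map.
$filmList = array_combine($matches[1], $matches[2]);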
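For readers unsure what preg_match_all hands back, the following is a minimal offline sketch of the same two calls. It swaps the live download for a hand-written two-row $page string (the tt IDs are placeholders and the markup is simplified), so the shape of $matches and $filmList can be inspected without requesting anything from IMDb.

```php
<?php
// Offline stand-in for the downloaded page: two simplified rows, placeholder IDs.
$page = <<<'HTML'
<td class="number">1.</td>
  <td class="title">
    <a href="/title/tt0000001/">Argo</a>
<td class="number">2.</td>
  <td class="title">
    <a href="/title/tt0000002/">The Artist</a>
HTML;

$count = preg_match_all('#<td class="number">(\d+).</td>.*?<a href="/title/tt\d+/">(.*?)</a>#s', $page, $matches);

// $matches[0] holds the full matches, $matches[1] the ranks, $matches[2] the titles.
print_r($matches[1]); // ranks: "1", "2"
print_r($matches[2]); // titles: "Argo", "The Artist"

// array_combine() pairs them up: ranks become keys, titles become values.
// Numeric string keys such as "1" are stored by PHP as integer keys.
$filmList = array_combine($matches[1], $matches[2]);
var_dump($filmList === [1 => 'Argo', 2 => 'The Artist']); // bool(true)
```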

Then you can do:

echo $filmList[1];

/**
Output:

Argo

*/

echo array_search("The Artist", $filmList);

/**
Output:

2

*/
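One detail worth guarding against: array_search returns false when the title is not in the list, and a loose comparison could confuse that with a falsy key such as 0. A small sketch, using a hand-built $filmList in place of the scraped one (the missing title is made up for the example):

```php
<?php
// Hand-built stand-in for the scraped list (rank => title).
$filmList = [1 => 'Argo', 2 => 'The Artist'];

$rank = array_search('The Artist', $filmList);      // 2
$missing = array_search('No Such Film', $filmList); // false: title not in the list

// Compare with === so a miss (false) is never mistaken for a falsy key such as 0.
if ($missing === false) {
    echo "Not found\n";
}
```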

http://php.net/manual/en/reference.pcre.pattern.modifiers.php
http://php.net/file_get_contents
http://php.net/preg_match_all
http://php.net/array_combine
http://php.net/array_search

Answered 2013-10-27T00:03:12.553

Not sure which backslashes you do or don't need:

href=\"\/title\/tt.*height=\"74\" width=\"54\" alt=\"([^"]*)\"

The useful result is in capture group 1 (\1 or $1).
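In PHP, picking # as the pattern delimiter and using a single-quoted string means none of those backslashes are needed. Below is a hedged sketch of how this pattern could be used, run against hand-written sample lines (the tt IDs and image URL are placeholders). Note that it captures the alt text, which in the markup shown in the other answer includes the year.

```php
<?php
// Hand-written sample lines modelled on the image-cell markup; IDs and image URL are placeholders.
$page = <<<'HTML'
<a href="/title/tt0000001/" title="Argo (2012)"><img src="http://imdb.com/path-to-image.jpg" height="74" width="54" alt="Argo (2012)" title="Argo (2012)"></a>
<a href="/title/tt0000002/" title="The Artist (2011)"><img src="http://imdb.com/path-to-image.jpg" height="74" width="54" alt="The Artist (2011)" title="The Artist (2011)"></a>
HTML;

// No s modifier here: "." must not cross line breaks, or the greedy ".*" would
// run to the last image on the page and only one match would be found.
preg_match_all('#href="/title/tt.*height="74" width="54" alt="([^"]*)"#', $page, $matches);

print_r($matches[1]); // "Argo (2012)", "The Artist (2011)"
```

This relies on each image link sitting on its own line and on the fixed 74 by 54 thumbnail size, so it would need adjusting if the page layout differs.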

Answered 2013-10-26T14:32:06.613