php - PHP Regex - How to get multiple words between scraped content

Question

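I am trying to fetch the latest "What's Hot" topics from Alexa for my son's research project. Basically I just want to scrape the words and insert them into a MySQL database.

What I have so far is: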

<?php
function getAlexa($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0");
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
$grab = getAlexa("http://www.alexa.com/whatshot");

// missing part to get everything between title=''

// mysql connection details are included
include "connectiondetails.php";

// mysql table setup is id (int 11) AI, word (varchar 100) UNIQUE
$insert = @mysql_query("INSERT IGNORE INTO words values('','$word')");

?>

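My problem is that I am still new to PHP, and I need to get all the a href titles from alexa.com/whatshot (not the shortened anchor text), that is, everything between title=' and the next '. For example, title='hello world' means the word (string) would be hello world. I just need it to get all 20 words.

The structure is: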

<a href='http://www.alexa.com/whatshot?q=loretta+swit+turns+75' title='Loretta Swit Turns 75'>Loretta Swit Turns 75</a></li><li>2. <a href='http://www.alexa.com/whatshot?q=where+do+i+vote' title='where do I vote'>where do I vote</a></li><li>3. <a href='http://www.alexa.com/whatshot?q=nj+earthquake' title='NJ earthquake'>NJ earthquake</a></li><li>4. <a href='http://www.alexa.com/whatshot?q=yvonne+strahovski' title='Yvonne Strahovski'>Yvonne Strahovski</a></li><li>5. <a href='http://www.alexa.com/whatshot?q=early+voting+results' title='early voting results'>early voting results</a></li></ul><ul class='hotsearches' start='6'><li>6. <a href='http://www.alexa.com/whatshot?q=milt+campbell+dies' title='Milt Campbell Dies'>Milt Campbell Dies</a></li><li>7. <a href='http://www.alexa.com/whatshot?q=bristol+palin+suit+tossed' title='Bristol Palin suit tossed'>Bristol Palin suit...</a></li><li>8. <a href='http://www.alexa.com/whatshot?q=a+gay+lesbian' title='a gay lesbian'>a gay lesbian</a></li><li>9. <a href='http://www.alexa.com/whatshot?q=navy+skipper+fired' title='Navy skipper fired'>Navy skipper fired</a></li><li>10. <a href='http://www.alexa.com/whatshot?q=single+mom+no+tip' title='single mom no tip'>single mom no tip</a></li></ul><ul class='hotsearches' start='11'><li>11. <a href='http://www.alexa.com/whatshot?q=craigslist' title='craigslist'>craigslist</a></li><li>12. <a href='http://www.alexa.com/whatshot?q=nate+silver' title='Nate Silver'>Nate Silver</a></li><li>13. <a href='http://www.alexa.com/whatshot?q=real+clear+politics' title='real Clear Politics'>real Clear Politics</a></li><li>14. <a href='http://www.alexa.com/whatshot?q=93-year-old+bodybuilder' title='93-year-old bodybuilder'>93-year-old bodybu...</a></li><li>15. <a href='http://www.alexa.com/whatshot?q=wreck+it+ralph' title='Wreck It Ralph'>Wreck It Ralph</a></li></ul><ul class='hotsearches' start='16'><li>16. 
<a href='http://www.alexa.com/whatshot?q=kickstarter' title='Kickstarter'>Kickstarter</a></li><li>17. <a href='http://www.alexa.com/whatshot?q=african+painted+dogs' title='African painted dogs'>African painted dogs</a></li><li>18. <a href='http://www.alexa.com/whatshot?q=red+dawn' title='Red Dawn'>Red Dawn</a></li><li>19. <a href='http://www.alexa.com/whatshot?q=instagram' title='Instagram'>Instagram</a></li><li>20. <a href='http://www.alexa.com/whatshot?q=iphone+5' title='iPhone 5'>iPhone 5</a>

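So if all goes well, I will have 20 words in my database once the scrape is done.

Thank you for taking the time to read this.

Any help is very much appreciated.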

score 2 · Accepted Answer

You can use the following regular expression to parse all title attributes/values out of the given input:

title=(?:(?:"([^"]+)")|(?:'([^']+)'))

To use it, you can call PHP's preg_match_all() (assuming your HTML is in your $grab variable):

$titles = array();
preg_match_all('/title=(?:(?:"([^"]+)")|(?:\'([^\']+)\'))/i', $grab, $titles);

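This regex will try to match any value inside double quotes or single quotes. From your sample output, it looks like they are all single-quoted. The matched titles in the $titles array will be split into two groups. The first group is in $titles[1]; these are all the titles in double quotes. $titles[2] contains the list of titles in single quotes.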
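To see how the two capture groups are laid out, here is a minimal sketch on a hypothetical two-link sample. The second link uses double quotes purely for illustration; the markup in the question is single-quoted throughout.

```php
<?php
// Hypothetical sample: one single-quoted and one double-quoted title.
$grab = "<a href='/whatshot?q=nj+earthquake' title='NJ earthquake'>NJ earthquake</a>"
      . '<a href="/whatshot?q=red+dawn" title="Red Dawn">Red Dawn</a>';

$titles = array();
$count = preg_match_all('/title=(?:(?:"([^"]+)")|(?:\'([^\']+)\'))/i', $grab, $titles);

// $titles[0] holds the full matches, $titles[1] the double-quoted values
// and $titles[2] the single-quoted values. A group that did not take part
// in a given match is filled with an empty string.
print_r($titles);
```

Those empty strings are why the groups need filtering before they are used.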

You can merge them together with array_merge() and then use array_filter() to remove the empty values. Then you can loop through them as usual:

$titles = array_filter(array_merge($titles[1], $titles[2]));
foreach ($titles as $title) {
    // ... do whatever you need/want
}
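Putting the pieces together, the extraction could be wrapped up roughly like this. It is only a sketch: extractTitles() is a made-up helper name, and the explicit callback, array_unique() and array_values() calls are my additions rather than part of the snippet above.

```php
<?php
// extractTitles() is a hypothetical helper, not an existing PHP function.
function extractTitles($html)
{
    $matches = array();
    preg_match_all('/title=(?:(?:"([^"]+)")|(?:\'([^\']+)\'))/i', $html, $matches);

    // Merge both capture groups and drop the empty placeholders.
    // The explicit callback keeps a title such as "0", which a bare
    // array_filter() call would discard as falsy.
    $titles = array_filter(
        array_merge($matches[1], $matches[2]),
        function ($title) { return $title !== ''; }
    );

    // array_filter() preserves keys, so reindex after removing duplicates.
    return array_values(array_unique($titles));
}

// Two links taken from the structure shown in the question.
$sample = "<li>1. <a href='http://www.alexa.com/whatshot?q=loretta+swit+turns+75' title='Loretta Swit Turns 75'>Loretta Swit Turns 75</a></li>"
        . "<li>7. <a href='http://www.alexa.com/whatshot?q=bristol+palin+suit+tossed' title='Bristol Palin suit tossed'>Bristol Palin suit...</a></li>";

foreach (extractTitles($sample) as $title) {
    echo $title, "\n";
}
```

Note that array_merge() lists every double-quoted title before the single-quoted ones, so page order is only preserved when the markup uses one quote style, as it does in the question.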

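Update (escaped quotes in titles)
Based on the comments, I realized that my original regex does not correctly match titles that contain escaped quotes (e.g. title="foo \"bar\""). The following regex (ported from this answer) should handle this: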

title=(?:(?:"([^"\\]*(?:\\.[^"\\]*)*)")|(?:\'([^'\\]*(?:\\.[^'\\]*)*)\'))

To use it in preg_match_all(), you would use:

preg_match_all('/title=(?:(?:"([^"\\\\]*(?:\\\\.[^"\\\\]*)*)")|(?:\'([^\'\\\\]*(?:\\\\.[^\'\\\\]*)*)\'))/i', $grab, $titles);

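Although it is longer and probably unreadable without an explanation (let me know if you need one and I will post it), it is definitely worth it if you start noticing data lost because of escaped quotes!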
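A quick side-by-side comparison may make the difference concrete. The input below is a hypothetical string with backslash-escaped quotes, and the stripslashes() call at the end is an optional extra step I am assuming you might want, not something the regex requires.

```php
<?php
// Hypothetical input with backslash-escaped quotes inside the title value.
$grab = '<a href="#" title="foo \"bar\"">foo</a>';

$simple  = '/title=(?:(?:"([^"]+)")|(?:\'([^\']+)\'))/i';
$escaped = '/title=(?:(?:"([^"\\\\]*(?:\\\\.[^"\\\\]*)*)")|(?:\'([^\'\\\\]*(?:\\\\.[^\'\\\\]*)*)\'))/i';

$before = array();
$after  = array();
preg_match_all($simple, $grab, $before);
preg_match_all($escaped, $grab, $after);

echo $before[1][0], "\n";              // cut off at the first escaped quote
echo $after[1][0], "\n";               // whole value, backslashes included
echo stripslashes($after[1][0]), "\n"; // optional: remove the escaping backslashes
```

The first pattern stops at the quote that follows the backslash, while the second treats a backslash plus the next character as one unit and keeps going until the real closing quote.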

score 0 · Accepted Answer

You want this: http://www.php.net/manual/en/function.preg-match-all.php

And this: http://www.regular-expressions.info/reference.html

Assuming you have the regex part worked out, you would do something like this:

$titles = array();
preg_match_all('@YOUR_REGEX_HERE@i', $grab, $titles); // The "@" is just a regex delimiter; modifiers such as "i" go after the closing delimiter
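As a rough illustration of the "@" delimiter with a real pattern filled in, here is a sketch that assumes every title is single-quoted, as in the sample from the question. The pattern is my simplification, not something this answer supplied.

```php
<?php
// One link taken from the structure shown in the question.
$grab = "<a href='http://www.alexa.com/whatshot?q=iphone+5' title='iPhone 5'>iPhone 5</a>";

$titles = array();
// "@" opens and closes the pattern; the "i" modifier follows the closing "@".
// Assumption: titles are always wrapped in single quotes.
preg_match_all('@title=\'([^\']+)\'@i', $grab, $titles);

print_r($titles[1]);
```

Using "@" instead of "/" means any slashes inside the pattern would not need escaping, which can be handy when matching URLs.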
