php - Scrape unique image URLs from HTML

Question

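I'm using PHP to cURL a web page (some URL entered by the user; let's assume it's valid). Example: http://www.youtube.com/watch?v=Hovbx6rvBaA

I need to parse the HTML and extract all de-duplicated URLs that look like images. Not just the ones in <img src="">, but any URL on that page ending in jpe?g|bmp|gif|png, etc. (In other words, I don't want to parse the DOM; I want to use a regex.)

I then plan to cURL those URLs to get their width and height and to make sure they really are images, so don't worry about security-related stuff.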

score 5 · Accepted Answer

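What's wrong with using the DOM? It gives you much better control over the context of the information, and a much higher likelihood that what you pull out are actually URLs.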

<?php
$resultFromCurl = '
    <html>
    <body>
    <img src="hello.jpg" />
    <a href="yep.jpg">Yep</a>
    <table background="yep.jpg">
    </table>
    <p>
        Perhaps you should check out foo.jpg! I promise it 
        is safe for work.
    </p>
    </body>
    </html>
';

// These are all the attributes I could think of that
// can contain URLs.
$queries = array(
    '//table/@background',
    '//img/@src',
    '//input/@src',
    '//a/@href',
    '//area/@href',
    '//img/@longdesc',
);

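// loadHTML() is an instance method; calling it statically fails on PHP 8.
// The @ suppresses warnings about malformed real-world markup.
$dom = new DOMDocument();
@$dom->loadHTML($resultFromCurl);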
$xpath = new DOMXPath($dom);

$urls = array();
foreach ($queries as $query) {
    foreach ($xpath->query($query) as $link) {
        // Case-insensitive, and includes bmp as the question asked.
        if (preg_match('@\.(gif|jpe?g|png|bmp)$@i', $link->textContent))
            $urls[$link->textContent] = true;
    }
}

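// Also pick up image-looking names mentioned in plain text (e.g. foo.jpg).
if (preg_match_all('@\b[^\s]+\.(?:gif|jpe?g|png|bmp)\b@i', $dom->textContent, $matches)) {
    // $matches[0] holds every full match; iterating $matches itself
    // would only record the first one.
    foreach ($matches[0] as $m) {
        $urls[$m] = true;
    }
}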

$urls = array_keys($urls);
var_dump($urls);
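
Since the question mentions fetching each candidate afterwards to confirm it really is an image, here is a minimal sketch of that check. It assumes the bytes have already been downloaded (for example with cURL). `describeImage` is a hypothetical helper name, not part of the code above; it relies on PHP's built-in `getimagesizefromstring()`.

```php
<?php
// Hypothetical helper: returns width, height and MIME type for image bytes,
// or null if the bytes are not a recognisable image.
function describeImage(string $bytes): ?array
{
    // getimagesizefromstring() returns false (and may emit a warning)
    // when the data is not an image it understands.
    $info = @getimagesizefromstring($bytes);
    if ($info === false) {
        return null;
    }
    return array(
        'width'  => $info[0],
        'height' => $info[1],
        'mime'   => $info['mime'],
    );
}

// Demo with an inline 1x1 GIF so the sketch runs without network access.
$gif = base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');
var_dump(describeImage($gif));             // width 1, height 1, mime image/gif
var_dump(describeImage('<html></html>'));  // NULL
```

Note that relative entries such as hello.jpg would first have to be resolved against the page URL before they can be fetched.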

score 1 · Accepted Answer

Collect all the image URLs into an array, then use array_unique() to remove the duplicates.

$my_image_links = array_unique( $my_image_links );
// No more duplicates
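
One detail worth knowing: `array_unique()` keeps the keys of the first occurrences, so the result can have gaps in its numeric indexes. A small sketch with made-up sample values, using `array_values()` to reindex:

```php
<?php
// Sample data, made up for illustration.
$my_image_links = array('a.jpg', 'b.png', 'a.jpg', 'c.gif');

// array_unique() keeps the first occurrence of each value and preserves
// the original keys, so the keys here are 0, 1 and 3.
$unique = array_unique($my_image_links);

// array_values() reindexes the result to 0, 1, 2.
$unique = array_values($unique);
```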

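If you really want to do this with a regex, then we can assume each image name will be surrounded by ', ", a space, tab, newline or start of line, >, <, or whatever else you can think of. So we can do: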

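// Inside a character class ^ is a literal caret, so "start of line" is
// written as an alternative instead. The trailing delimiter is a lookahead
// so that it is not consumed and can also delimit the next match.
$pattern = '/(?:^|[\'"\s>])([^\'"\s<>]+\.(?:jpe?g|bmp|gif|png))(?=[\'"\s<]|$)/im';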
preg_match_all($pattern, html_entity_decode($resultFromCurl), $matches);
$imgs = array_unique($matches[1]);

The above will capture the image link in, for example:

<p>Hai guys look at this ==> http://blah.com/lolcats.JPEG</p>
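
To see the regex approach end to end, here is a self-contained sketch run against that sample line plus two extra tags. The function name `extractImageUrls`, the sample markup, and the exact pattern (it treats `<` and `>` as delimiters and uses a lookahead for the trailing delimiter) are illustrative assumptions rather than the only way to write it.

```php
<?php
// Hypothetical helper: pulls de-duplicated image-looking URLs out of raw HTML.
function extractImageUrls(string $html): array
{
    // Leading delimiter: start of line, a quote, whitespace or '>'.
    // Trailing delimiter (lookahead, not consumed): a quote, whitespace,
    // '<' or end of line.
    $pattern = '/(?:^|[\'"\s>])([^\'"\s<>]+\.(?:jpe?g|bmp|gif|png))(?=[\'"\s<]|$)/im';
    preg_match_all($pattern, html_entity_decode($html), $matches);

    // Group 1 holds the URLs; drop duplicates and reindex.
    return array_values(array_unique($matches[1]));
}

$html = '<p>Hai guys look at this ==> http://blah.com/lolcats.JPEG</p>'
      . '<img src="cat.png"> <a href=\'cat.png\'>again</a>';

print_r(extractImageUrls($html));
// Yields http://blah.com/lolcats.JPEG and cat.png (the second cat.png is dropped).
```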

