php - Searching a very large file with PHP to extract a block very efficiently

Question

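I've been having a real headache lately parsing metadata from video files, and found that part of the problem is that vendors of video production software disregard various standards (or at least interpret them differently), among other reasons.

As a result, I need to be able to scan very large video (and image) files, in various formats, containers and codecs, and dig out the metadata. I already have FFMpeg, ExifTool, Imagick and Exiv2 each handling different types of metadata in various file types, and have been through various other options to fill some other gaps (so please don't suggest libraries or other tools, I've tried them all :)).

Now I've moved on to scanning the large files (up to 2GB each) for an XMP block (which is commonly written to movie files by the Adobe suite and some other software). I've written a function to do it, but I'm concerned it could be improved.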

function extractBlockReverse($file, $searchStart, $searchEnd)
{
    $handle = fopen($file, "rb"); // binary mode: video files are not text
    if($handle)
    {
        $startLen = strlen($searchStart);
        $endLen = strlen($searchEnd);

        for($pos = 0, 
                $output = '', 
                $length = 0, 
                $finished = false, 
                $target = '';
            $length < 10000 && 
                !$finished && 
                fseek($handle, $pos, SEEK_END) !== -1; 
            $pos--)
        {
            $currChar = fgetc($handle);
            if(!empty($output))
            {
                $output = $currChar . $output;
                $length++;

                $target = $currChar . substr($target, 0, $startLen - 1);
                $finished = ($target == $searchStart);
            }
            else
            {
                $target = $currChar . substr($target, 0, $endLen - 1);
                if($target == $searchEnd)
                {
                    $output = $target;
                    $length = $length + $endLen;
                    $target = '';
                }
            }
        }

        fclose($handle);
        return $output;
    }
    else
    {
        // fopen() failed, so the file is missing or unreadable
        throw new Exception('Could not open file: ' . $file);
    }
}

echo extractBlockReverse("very_large_video_file.mov", 
    '<x:xmpmeta', 
    '</x:xmpmeta>');

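This works OK for now, but I'd really like to get the most out of PHP here without crippling my server, so I'm wondering whether there is a better way to do this (or tweaks to the code that would improve it), as this approach seems a bit over the top for something as simple as finding a couple of strings and pulling out whatever is between them.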

score 3 · Accepted Answer

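You can use one of the fast string searching algorithms, such as Knuth-Morris-Pratt or Boyer-Moore, to find the positions of the start and end tags, and then read all the data between them.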

You should measure their performance though: with such short search patterns, the constant-factor overhead of the chosen algorithm might outweigh its benefit.
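As a rough illustration of this suggestion, here is a minimal sketch of Boyer-Moore-Horspool, a simplified Boyer-Moore variant chosen here only because it is short. It searches a buffer that has already been read into memory. The function name `horspoolSearch` is hypothetical and does not come from the answer; PHP's built-in `strpos` is the natural baseline to benchmark it against.

```php
<?php
// Hypothetical helper: Boyer-Moore-Horspool search.
// Returns the offset of $needle in $haystack at or after $offset,
// or -1 when there is no match.
function horspoolSearch(string $haystack, string $needle, int $offset = 0): int
{
    $n = strlen($haystack);
    $m = strlen($needle);
    if ($m === 0 || $m > $n - $offset) {
        return -1;
    }

    // Shift table: for every byte of the needle except the last one,
    // the distance from its last occurrence to the end of the needle.
    $shift = [];
    for ($i = 0; $i < $m - 1; $i++) {
        $shift[$needle[$i]] = $m - 1 - $i;
    }

    $pos = $offset;
    while ($pos <= $n - $m) {
        // Compare the current window against the needle, right to left.
        $j = $m - 1;
        while ($j >= 0 && $haystack[$pos + $j] === $needle[$j]) {
            $j--;
        }
        if ($j < 0) {
            return $pos;
        }
        // Skip ahead based on the byte aligned with the needle's last position;
        // bytes that never occur in the needle allow a full-length skip.
        $last = $haystack[$pos + $m - 1];
        $pos += $shift[$last] ?? $m;
    }
    return -1;
}

// Usage: locate both tags, then cut out everything between them (inclusive).
$buffer = 'junk<x:xmpmeta>payload</x:xmpmeta>junk';
$start = horspoolSearch($buffer, '<x:xmpmeta');
$end = horspoolSearch($buffer, '</x:xmpmeta>', max($start, 0));
if ($start >= 0 && $end >= 0) {
    echo substr($buffer, $start, $end - $start + strlen('</x:xmpmeta>')), "\n";
}
```

Because the inner loop runs in PHP rather than in native code, timing it against `strpos` on real files is exactly the kind of measurement the answer recommends.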

score 1 · Accepted Answer

With files this big, I think that the most important optimization would be to NOT search the string everywhere. I don't believe that a video or image will ever have an XML block smack in the middle - or if it has, it will likely be garbage.

Okay, it IS possible - TIFF can do this, and JPEG too, and PNG; so why not video formats? But in real world applications, loose-format metadata such as XMP are usually stored last. More rarely, they are stored near the beginning of the file, but that's less common.

Also, I think that most XMP blocks will not be very large (even if Adobe routinely pads them in order to be able to "almost always" quickly update them in-place).

So my first attempt would be to extract the first, say, 100 KB and last 100 KB of information from the file. Then scan these two blocks for the "<x:xmpmeta" start tag.

If the search does not succeed, you will still be able to run the exhaustive search, but if it succeeds it will return in one ten-thousandth of the time. Conversely, even if this "trick" only succeeded one time in one thousand, it would still be worthwhile.
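The head-and-tail idea could be sketched roughly as follows. This is a minimal sketch, assuming the whole XMP packet fits inside a single window. The function name `extractXmpFromEnds` and the 100 KB default are illustrative choices rather than part of the answer, and the tag strings are taken from the question.

```php
<?php
// Hypothetical helper: look for an XMP packet only in the last and first
// $window bytes of $file. Returns the packet, or null when neither window
// holds a complete one, so the caller can fall back to an exhaustive scan.
function extractXmpFromEnds(string $file, int $window = 102400): ?string
{
    $startTag = '<x:xmpmeta';
    $endTag = '</x:xmpmeta>';

    $handle = fopen($file, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Could not open file: $file");
    }

    try {
        $size = filesize($file);
        // Check the tail first, since the answer expects XMP near the end.
        $offsets = [max(0, $size - $window), 0];

        foreach ($offsets as $offset) {
            fseek($handle, $offset);
            $chunk = fread($handle, $window);
            if ($chunk === false) {
                continue;
            }
            $start = strpos($chunk, $startTag);
            if ($start === false) {
                continue;
            }
            // Only accept an end tag that comes after the start tag.
            $end = strpos($chunk, $endTag, $start);
            if ($end === false) {
                continue;
            }
            return substr($chunk, $start, $end - $start + strlen($endTag));
        }
        return null;
    } finally {
        fclose($handle);
    }
}
```

A caller would try this first and run the slower full-file search only when it returns `null`, for example when the packet straddles a window boundary or sits in the middle of the file.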
