perl - Efficiently downloading 10-K files from the SEC website

Question

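I am using the following Perl code to mass-download 10-Ks from the SEC website. However, every few hundred files I get an "Out of memory!" message, when the script apparently gets stuck processing a particularly large 10-K filing. Any ideas on how to avoid this "Out of memory!" error for large files?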

#!/usr/bin/perl
use strict;
use warnings;
use LWP;

my $ua = LWP::UserAgent->new;

open LOG , ">download_log.txt" or die $!;
######## make sure the file with the ids/urls is in the 
######## same folder as the perl script
open DLIST, "downloadlist.txt" or die $!;
my @file = <DLIST>;

foreach my $line (@file) {
        #next if 0.999 > rand ;
        #print "Now processing file: $line\n" ;
    my ($nr, $get_file) = split /,/, $line;
    chomp $get_file;
    $get_file = "http://www.sec.gov/Archives/" . $get_file;
    if ($get_file =~ m/([0-9|-]+).txt/ ) {
        my $filename = $nr . ".txt";
        open OUT, ">$filename" or die $!;
        print "file $nr \n";
        my $response =$ua->get($get_file);
        if ($response->is_success) {
            print OUT $response->content;
            close OUT;
        } else {
            print LOG "Error in $filename - $nr \n" ;
        }
    }
}

score 1 · Accepted Answer

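I recently ran into a similar problem when using threads and thousands of LWP requests. I never figured out where the memory leak was, but switching to HTTP::Tiny resolved it.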

Going from LWP to HTTP::Tiny is simple:

use HTTP::Tiny;

my $ua = HTTP::Tiny->new;

my $response =$ua->get($get_file);
if ($response->{success}) {
    print OUT $response->{content};

...of course, HTTP::Tiny can also do the saving part for you, just as LWP can.

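You could also try creating a new LWP object inside the loop, in the hope that garbage collection kicks in, but that didn't work for me either. Something inside the LWP monster is leaking.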

Edit: Trying to download a 2 GB file into a string could also be a problem; the mirror method should solve that for you.
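As a rough illustration of the mirror approach, here is a minimal sketch. The helper name fetch_to_file and the command-line handling are my own assumptions, not part of the asker's script. HTTP::Tiny's mirror method writes the body to disk as it arrives, so the full filing never has to sit in a Perl string.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical helper (name is an assumption): save $url to $filename
# without keeping the response body in memory.
sub fetch_to_file {
    my ($http, $url, $filename) = @_;

    # mirror() streams the body into a temporary file and renames it
    # to $filename only if the request succeeds.
    my $response = $http->mirror($url, $filename);

    return $response->{success}
        ? 'successful'
        : "FAILED: $response->{status} $response->{reason}";
}

# Only touch the network when a URL and a filename are given as arguments.
if (@ARGV == 2) {
    my $http = HTTP::Tiny->new;
    print fetch_to_file($http, @ARGV), "\n";
}
```

Inside the original loop this would replace the get/print pair, for example fetch_to_file($ua, $get_file, $filename). Note that mirror also treats a 304 (not modified) reply as success, so files that already exist locally are not rewritten.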

score 1 · Accepted Answer

Just have LWP store the response data directly in a file instead of in the HTTP::Response object. It is also simpler to code that way.

Here is an example program. I am unable to test it at the moment, but it does compile.

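I have noticed recently that a lot of people write code that reads an entire file into memory before processing the data, and I don't understand why it is so popular. It wastes memory, and it is often harder to write a solution that way. I have changed your program to read the download list file one line at a time and use each line directly, rather than storing the whole file in an array.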

use strict;
use warnings 'all';

use LWP;

my $ua = LWP::UserAgent->new;

open my $dl_fh,  '<', 'downloadlist.txt' or die "Can't open download list file: $!";

open my $log_fh, '>', 'download_log.txt' or die "Can't open log file: $!";

STDOUT->autoflush;

while ( <$dl_fh> ) {

    # next if 0.999 > rand;
    # print "Now fetching file: $_";

    chomp;
    my ($num, $dl_file) = split /,/;

    unless ( $dl_file =~ /[0-9|-]+\.txt/ ) {
        print $log_fh qq{Skipping invalid file "$dl_file"\n};
        next;
    }

    my $url      = "http://www.sec.gov/Archives/$dl_file";
    my $filename = "$num.txt";

    print qq{Fetching file $filename\n};

    my $resp = $ua->get($url, ':content_file' => $filename);

    printf $log_fh qq{Download of "%s" %s\n},
            $filename,
            $resp->is_success ?
            'successful' :
            'FAILED: ' . $resp->status_line;
}
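
The line handling in the loop above can be exercised without touching the network. The following is only a sketch: parse_line is a hypothetical helper, the sample lines are made up, and the pattern is slightly tighter than the one in the program (inside a character class the | is a literal pipe, so it is dropped here, and the match is anchored to the end of the path).

```perl
use strict;
use warnings;

# Hypothetical helper: split one "id,path" line and validate the path.
# Returns ($num, $dl_file) for a usable line and an empty list otherwise.
sub parse_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;    # tolerate Unix and Windows line endings
    my ($num, $dl_file) = split /,/, $line, 2;
    return unless defined $num && length $num && defined $dl_file;
    return unless $dl_file =~ /[0-9-]+\.txt\z/;
    return ($num, $dl_file);
}

# Made-up sample lines; real entries in downloadlist.txt may look different.
my $sample = "1,edgar/data/0000000/0000000000-00-000000.txt\n2,not-a-filing\n";

# Read the list one line at a time from an in-memory filehandle.
open my $fh, '<', \$sample or die "Can't open in-memory list: $!";
while ( my $line = <$fh> ) {
    my ($num, $dl_file) = parse_line($line) or next;
    print "$num -> $dl_file\n";
}
```

Keeping the parsing in its own sub makes it easy to check odd lines (missing comma, wrong extension) before starting a long download run.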
