html - Reading a web page with Perl's LWP - output differs from the downloaded HTML page

Question

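I am trying to access and use different pages at NCBI, for example
http://www.ncbi.nlm.nih.gov/nuccore/NM_000036. However, when I use the 'get' function from Perl's LWP::Simple, I do not get the same output that I get when I save the page manually (using Firefox's "Save as HTML" option). What I get from 'get' is missing the data I need.

Am I doing something wrong? Should I be using a different tool?

My script is: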

use strict;
use warnings;
use LWP::Simple;

my $input_name = 'GENES.txt';

open(INPUT, '<', $input_name) || die "unable to open $input_name: $!";
open(OUTPUT, '>', 'Selected_Genes') || die "unable to open Selected_Genes: $!";

my $line;

while ($line = <INPUT>)
{
    chomp $line;
    print OUTPUT '>' . $line . "\n";

    # e.g. http://www.ncbi.nlm.nih.gov/nuccore/NM_000036
    my $URL = 'http://www.ncbi.nlm.nih.gov/nuccore/' . $line;

    my $text = get($URL);
    print $text . "\n";
    $text =~ m!\r?\n\r?\s+\/translation="((?:(?:[^"])\r?\n?\r?)*)"!;
    print OUTPUT $1 . "\n";
}

Thanks in advance!

score 3 · Accepted Answer

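The page at http://www.ncbi.nlm.nih.gov/nuccore/NM_000036 does a lot of JavaScript processing and loads a bunch of content dynamically. LWP::UserAgent will not do that for you, because it cannot run JavaScript.

I suggest you use Firebug or the Chrome developer tools to see what happens in the browser. You will see that it makes an XHR request to this URL: http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?val=289547499&db=nuccore&dopt=genbank&extrafeat=976&fmt_mask=0&retmode=html&withmarkup=on&log$=seqview&maxdownloadsize=1000000

I am not sure how those parameters map to NM_000036, though.

Since this is probably a public service, and I assume you are allowed to fetch that data, you should consider asking them whether there is a proper API you can hit instead of screen-scraping their website.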
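For illustration, the viewer URL quoted above can be assembled from its parts rather than pasted as one long string. This is only a sketch: the helper name `viewer_url` is hypothetical, it keeps just a subset of the parameters seen in the XHR request, and whether the server accepts that subset is an assumption. The values are plain alphanumerics, so no URL escaping is done here.

```perl
use strict;
use warnings;

# Hypothetical helper: build the sviewer URL for a numeric id.
# Only some of the parameters from the observed XHR request are kept;
# that the server accepts this subset is an assumption.
sub viewer_url {
    my ($id) = @_;
    my @params = (
        val     => $id,
        db      => 'nuccore',
        dopt    => 'genbank',
        retmode => 'html',
    );
    my @pairs;
    # Consume the list two items at a time to keep the parameter order.
    while (my ($key, $value) = splice(@params, 0, 2)) {
        push @pairs, "$key=$value";
    }
    return 'http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?' . join('&', @pairs);
}

print viewer_url(289547499), "\n";
```

The resulting string can then be passed to LWP::Simple's `get` in the same way as the original page URL.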

score 1 · Accepted Answer

The content you are looking for is generated by JavaScript. You need to parse the HTML from the first response and find the ID of the data you want:

<meta name="ncbi_uidlist" content="289547499" />

Next, you need to make another request to a URL of the form: http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?val=ID_YOU_HAVE

Something like this (untested!):

my $URL = 'http://www.ncbi.nlm.nih.gov/nuccore/' . $line;

my $html = get($URL);

my ($id) = $html =~ m{name="ncbi_uidlist" \s+ content="([^"]+)"}xi;
if ($id) {
    # Second request: fetch the record that the page loads via JavaScript.
    $html = get("http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?val=" . $id);
    # Only print when the /translation qualifier was actually found.
    if ($html =~ m!\r?\n\r?\s+\/translation="((?:(?:[^"])\r?\n?\r?)*)"!) {
        print OUTPUT $1 . "\n";
    }
}
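The two parsing steps above can be tried without touching the network. The sketch below wraps them in two hypothetical helpers, `extract_uid` and `extract_translation`, and runs them on made-up sample strings. It assumes the second response contains the `/translation="..."` qualifier as plain text wrapped over several lines; if the real response carries extra markup inside the qualifier, the pattern would need adjusting.

```perl
use strict;
use warnings;

# Return the content of the <meta name="ncbi_uidlist"> tag, or undef.
sub extract_uid {
    my ($html) = @_;
    my ($id) = $html =~ m{name="ncbi_uidlist" \s+ content="([^"]+)"}xi;
    return $id;
}

# Return the sequence inside /translation="...", with the line wrapping
# removed, or undef (empty list in list context) if it is absent.
sub extract_translation {
    my ($text) = @_;
    my ($seq) = $text =~ m{/translation="([^"]*)"};
    return unless defined $seq;
    $seq =~ s/\s+//g;    # join wrapped lines
    return $seq;
}

# Made-up sample inputs, for illustration only.
my $sample_html   = '<meta name="ncbi_uidlist" content="289547499" />';
my $sample_record = <<'END';
     CDS             1..30
                     /translation="MSTAVLENPG
                     LGRKLSDF"
END

print extract_uid($sample_html), "\n";            # prints 289547499
print extract_translation($sample_record), "\n";  # prints MSTAVLENPGLGRKLSDF
```

Keeping the extraction in small functions makes it easy to check the patterns against a saved copy of the response before running the full download loop.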
