perl - How to hash in Perl

Question

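I am finding the unique URLs in a log file, along with the response time, which I can get from $line[7]. I am using a hash to get the unique URLs.

How do I get the count for each unique URL?
How do I get the average response time along with the count for each unique URL?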

With the code below, I get

url1
url2
url3

But I want that plus the average response time and count for each URL:

URL     Av.RT   Count
url1    10.5    125
url2    9.3     356
url3    7.8     98

Code:

#!/usr/bin/perl
open(IN, "web1.txt") or die "can not open file";

# Hash to store final list of unique IPs
my %uniqueURLs = ();
my $z; 
# Read log file line by line
while (<IN>) {
@line = split(" ",$_);
$uniqueURLs{$line[9]}=1;
}

# Go through the hash table and print the keys
# which are the unique IPs
for $url (keys %uniqueURLs) {
print $url . "\n";
}

score 2 · Accepted Answer

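Read up on references. Also read up on modern Perl practices, which will help improve your programming skills.

Instead of just using the keys of your unique-URL hash, you can store information in the hash values. Let's start with a count for each unique URL: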

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;
use feature qw(say);

use constant {
    WEB_FILE => "web1.txt",
};

open my $web_fh, "<", WEB_FILE;   #Autodie will catch this for you
my %unique_urls;
while ( my $line = <$web_fh> ) {
    my $url = (split /\s+/, $line)[9];
    if ( not exists $unique_urls{$url} ) {  #Not really needed
        $unique_urls{$url} = 0;
    }
    $unique_urls{$url} += 1;
}
close $web_fh;

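Now each key in your %unique_urls hash maps to the number of times that URL was seen.

By the way, this is your code written in a more modern style. The use strict; and use warnings; pragmas will catch about 90% of standard programming errors. use autodie; raises an exception for the things you forgot to check. In this case, the program automatically dies if the file can't be opened.

The three-argument version of the open command is preferred, and so is using a scalar variable for the file handle. Scalar file handles are easier to pass to subroutines, and the file is closed automatically when the file handle goes out of scope.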
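As a minimal sketch of those two points, the script below passes a lexical file handle to a subroutine and lets it close when it goes out of scope. The helper name count_lines and the sample text are made up for illustration, and the handle is opened on an in-memory string rather than on web1.txt so the example runs without any file.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;

# Hypothetical helper: counts the lines readable from any file handle.
sub count_lines {
    my ($fh) = @_;
    my $count = 0;
    while ( my $line = <$fh> ) {
        $count++;
    }
    return $count;
}

# Made-up sample data standing in for the contents of web1.txt.
my $sample = "line one\nline two\nline three\n";

my $line_count;
{
    # Three-argument open; a scalar reference reads from the string in memory.
    open my $fh, "<", \$sample;
    $line_count = count_lines($fh);
}   # $fh goes out of scope here, so Perl closes it automatically

print "Lines: $line_count\n";
```

With a bareword handle such as IN, the handle is a package-wide global, so it cannot be scoped to a block in the same way.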

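However, we want to store two items per hash entry. We want to store the count, and we want to store something that will help us find the average response time. This is where references come in.

In Perl, variables deal with single data items. Scalar variables (like $foo) hold a single data item. Arrays and hashes (like @foo and %foo) hold lists of single data items. References help you get around this limitation.

Let's look at an array of people: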

$person[0] = "Bob";
$person[1] = "Ted";
$person[2] = "Carol";
$person[3] = "Alice";

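However, people are more than just first names. They have last names, phone numbers, addresses, and so on. Let's take a look at a hash for Bob: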

my %bob_hash;
$bob_hash{FIRST_NAME} = "Bob";
$bob_hash{LAST_NAME} = "Jones";
$bob_hash{PHONE} = "555-1234";

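We can take a reference to this hash by putting a backslash in front of it. A reference is simply the memory address where the hash is stored: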

$bob_reference = \%bob_hash;
print "$bob_reference\n";   # Prints out something like HASH(0x7fbf79004140)

However, that memory address is a single item, so it can be stored in our array of people!

$person[0] = $bob_reference;

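If we want to get at the items behind a reference, we dereference it by putting the sigil of the correct data type in front. Since this is a hash, we use %: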

%bob_hash = %{ $person[0] };

Perl provides an easy way to dereference a hash element with the -> syntax:

$person[0]->{FIRST_NAME} = "Bob";
$person[0]->{LAST_NAME}  = "Jones";
$person[0]->{PHONE}  = "555-1212";
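To see the pieces above working together, here is a small self-contained sketch. The names and phone numbers for Carol and Ted are made up for illustration; the point is only that each array element is a hash reference, and that %{ ... } copies the hash while -> reaches into the original.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# One hash per person; the field values are illustrative only.
my %bob_hash   = ( FIRST_NAME => "Bob",   LAST_NAME => "Jones", PHONE => "555-1234" );
my %carol_hash = ( FIRST_NAME => "Carol", LAST_NAME => "Smith", PHONE => "555-9876" );

# The array holds references (single scalar values), not the hashes themselves.
my @person = ( \%bob_hash, \%carol_hash );

# An anonymous hash reference can be stored directly, without a named hash.
push @person, { FIRST_NAME => "Ted", LAST_NAME => "Brown", PHONE => "555-0000" };

# Dereferencing the whole hash with %{ ... } produces a copy.
my %copy = %{ $person[0] };

# The arrow syntax reaches a single field through the reference.
for my $p (@person) {
    print "$p->{FIRST_NAME} $p->{LAST_NAME}: $p->{PHONE}\n";
}

# Writing through the reference changes the original hash, not the copy.
$person[0]->{PHONE} = "555-1212";
print "Original: $bob_hash{PHONE}, copy: $copy{PHONE}\n";
```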

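We will use the same technique in %unique_urls to store the number of instances and the total response time. (The average is the total time divided by the number of instances.)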

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;
use feature qw(say);

use constant {
    WEB_FILE => "web1.txt",
};

open my $web_fh, "<", WEB_FILE;   #Autodie will catch this for you
my %unique_urls;
while ( my $line = <$web_fh> ) {
    my $url = (split /\s+/, $line)[9];
    my $response_time = (split /\s+/, $line)[10];   #Taking a guess        
    if ( not exists $unique_urls{$url} ) {  #Not really needed
        $unique_urls{$url}->{INSTANCES} = 0;
        $unique_urls{$url}->{TOTAL_RESP_TIME} = 0;
    }
    $unique_urls{$url}->{INSTANCES} += 1;
    $unique_urls{$url}->{TOTAL_RESP_TIME} += $response_time;
}
close $web_fh;

Now we can print them out:

printf "%-20.20s  %-6s  %-8s\n", "URL", "INST", "AVE";
for my $url ( sort keys %unique_urls ) {
    my $total_resp_time = $unique_urls{$url}->{TOTAL_RESP_TIME};
    my $instances = $unique_urls{$url}->{INSTANCES};
    my $average = $total_resp_time / $instances;
    printf "%-20.20s  %-6d  %-8.5f\n", $url, $instances, $average;
}

I like using printf for tables.
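For a version that can be run without a log file, here is a minimal sketch that applies the same idea to a few made-up log lines. The sample lines, and the assumption that the URL is field 9 and the response time is field 10, are illustrative only; adjust the indices to match the real log format. It also relies on autovivification, so the explicit initialization is left out.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Made-up log lines: 11 whitespace-separated fields, with the URL in
# field 9 and the response time in field 10 (both positions are assumptions).
my @log_lines = (
    "f0 f1 f2 f3 f4 f5 f6 f7 f8 /index.html 10",
    "f0 f1 f2 f3 f4 f5 f6 f7 f8 /about.html 4",
    "f0 f1 f2 f3 f4 f5 f6 f7 f8 /index.html 20",
    "f0 f1 f2 f3 f4 f5 f6 f7 f8 /index.html 30",
);

my %unique_urls;
for my $line (@log_lines) {
    # Split once and take both fields with a list slice.
    my ( $url, $response_time ) = ( split " ", $line )[ 9, 10 ];
    next unless defined $url && defined $response_time;    # skip short lines

    # Autovivification creates the inner hash on first use.
    $unique_urls{$url}{INSTANCES}       += 1;
    $unique_urls{$url}{TOTAL_RESP_TIME} += $response_time;
}

my %average;
printf "%-20.20s  %-6s  %-8s\n", "URL", "INST", "AVE";
for my $url ( sort keys %unique_urls ) {
    my $entry = $unique_urls{$url};
    $average{$url} = $entry->{TOTAL_RESP_TIME} / $entry->{INSTANCES};
    printf "%-20.20s  %-6d  %-8.5f\n", $url, $entry->{INSTANCES}, $average{$url};
}
```

Storing the total rather than the average keeps the per-line work to two additions; the division happens once per URL at print time.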

score 2 · Accepted Answer

Store a listref in each of your hash values:

$uniqueURLs{$line[9]} = [ <avg response time>, <count> ];

Adjust the elements accordingly, for example the count:

if (defined($uniqueURLs{$line[9]})) {
    # url known, increment count,
    # update average response time with data from current log entry
    $uniqueURLs{$line[9]}->[0] =
        (($uniqueURLs{$line[9]}->[0] * $uniqueURLs{$line[9]}->[1]) + ($line[7] + 0.0))
           / ($uniqueURLs{$line[9]}->[1] + 1)
    ;
    $uniqueURLs{$line[9]}->[1] += 1;
}
else {
    # url not yet known,
    # init count with 1 and average response time with actual response time from log entry 
    $uniqueURLs{$line[9]} = [ $line[7] + 0.0, 1 ];
}

Print the results:

# Go through the hash table and print each unique URL
# with its average response time and count
for my $url (keys %uniqueURLs) {
    printf ( "%s %f %d\n", $url, $uniqueURLs{$url}->[0], $uniqueURLs{$url}->[1]);
}

Adding 0.0 is a safeguard that forces the string from the log to be treated as a floating-point number.
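As a quick check of the incremental update, the sketch below runs the same formula over a few made-up response times and compares the result with the plain total divided by the count. The sample values are assumptions chosen for illustration. Note that Perl's arithmetic operators already convert numeric strings to numbers, so the update works here without an explicit + 0.0.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Made-up response times, kept as strings the way split would return them.
my @times = ( "10", "20.5", "7" );

my ( $avg, $count, $total ) = ( 0, 0, 0 );
for my $t (@times) {
    # Incremental update: new_avg = (old_avg * count + value) / (count + 1)
    $avg = ( $avg * $count + $t ) / ( $count + 1 );
    $count += 1;
    $total += $t;    # arithmetic treats the numeric string as a number
}

printf "running average: %.4f\n", $avg;
printf "total / count:   %.4f\n", $total / $count;
```

Both approaches give the same average; keeping a running total is usually simpler and avoids repeated division.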

score 0 · Accepted Answer

Instead of setting the value to 1 here:

$uniqueURLs{$line[9]}=1;

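store a data structure that records the response time and the number of times that URL has been seen (so that you can compute the average correctly). You can use an array ref or a hash ref, whichever you prefer. If the key doesn't exist yet, the URL has not been seen before, and you can set some initial values.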

# Initialize 3-element arrayref: [count, total, average]
$uniqueURLs{$line[9]}       = [0, 0, 0] if not exists $uniqueURLs{$line[9]};
$uniqueURLs{$line[9]}->[0]++;           # Count
$uniqueURLs{$line[9]}->[1] += $line[7]; # Total time
# Calculate average
$uniqueURLs{$line[9]}->[2]  = $uniqueURLs{$line[9]}->[1] / $uniqueURLs{$line[9]}->[0];

One way to get the number of unique URLs is to count the keys:

print scalar(keys %uniqueURLs); # Print number of unique URLs

In your loop, you can print out the URL and the average time like this:

for $url (keys %uniqueURLs) {
   print $url, ' - ', $uniqueURLs{$url}->[2], " seconds\n";
}
