php - Is there a way to efficiently identify/extract headers from a table: Perl

Question

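I am trying to write a Perl script that generates XML from arbitrary tabular data available in a text file. For the sake of discussion, suppose I want to take the output of the Linux command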

 df -k

and parse it in my Perl script to generate XML on the fly.

Sample check_disk_usage.log

 Filesystem           1K-blocks      Used Available Use% Mounted on
 /dev/sda3             56776092   5431448  48413988  11% /
 /dev/sda1               101086     18993     76874  20% /boot
 tmpfs                  2021888         0   2021888   0% /dev/shm

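Now, to generate the XML, I need to extract the headers from this table and store them in an array for later use (they will serve as the opening and closing tags in the XML). This is how I am doing it: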

 open my $file, '<', "$dir/check_disk_usage.log"
     or die "Cannot open $dir/check_disk_usage.log: $!";
 my $firstLine = <$file>;
 close $file;

 my (@header) = $firstLine =~ /(\S+)/g;

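That is, I look for every run of one or more non-whitespace characters (effectively a word) and store the matches in an array. This works fine as long as each header name is a single word,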

 e.g. Filesystem, 1K-blocks, Used, etc.

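However, it breaks when it encounters a header name such as "Mounted on", because "Mounted" and "on" are treated as separate matches and are therefore stored as separate array elements. Is there a way to efficiently identify/extract the headers from a table?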
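To make the failure concrete, here is a minimal sketch that runs the same `\S+` match on the header line from the sample log. The variable name `$first_line` is only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Header line copied from the sample check_disk_usage.log.
my $first_line = 'Filesystem           1K-blocks      Used Available Use% Mounted on';

# Same approach as in the question: every run of non-whitespace is one field.
my @header = $first_line =~ /(\S+)/g;

# The table has 6 columns, but the match yields 7 fields because
# "Mounted" and "on" end up as separate entries.
printf "%d fields: %s\n", scalar @header, join ' | ', @header;
```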

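PS: I know I could use awk to replace the offending pattern with something else and then parse the file. But I would need to know the "offending pattern" in advance, which is not feasible because I intend this script to work on any arbitrary tabular data.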

PPS: Although I am using Perl, I am also open to other solutions (e.g. PHP).

Thanks for your help.

score 1 · Accepted Answer

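From the look of the data, the values are separated at the character positions where every line has whitespace. If a position holds a space on some lines but not on others, it is not a delimiter. That leads to using a mask to determine where to split the headers.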
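As a rough illustration of the mask idea, the sketch below builds the mask for the sample rows and prints it under the header line. `column_mask` is a hypothetical helper name, and the rows are hard-coded only so the snippet runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample rows from the question, embedded so the sketch is self-contained.
my @lines = (
    'Filesystem           1K-blocks      Used Available Use% Mounted on',
    '/dev/sda3             56776092   5431448  48413988  11% /',
    '/dev/sda1               101086     18993     76874  20% /boot',
    'tmpfs                  2021888         0   2021888   0% /dev/shm',
);

# column_mask (hypothetical helper) returns a string with '1' at every
# position that holds a non-whitespace character in at least one row and
# '0' at positions that are blank in every row that reaches them.
sub column_mask {
    my @rows = @_;
    my @mask;
    for my $row (@rows) {
        my @chars = split //, $row;
        for my $i (0 .. $#chars) {
            $mask[$i] ||= $chars[$i] =~ /\S/ ? 1 : 0;
        }
    }
    return join '', @mask;
}

my $mask = column_mask(@lines);
print "$lines[0]\n$mask\n";

# Each run of 1s is one column, so counting the runs gives the column count.
my $columns = () = $mask =~ /1+/g;
print "columns: $columns\n";
```

The space inside "Mounted on" does not become a boundary because "/dev/shm" has a non-blank character at that position in another row.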

It's a bit ugly, but:

#!/usr/bin/perl
# Read the file provided on STDIN (or named on the command line), determine
# the column boundaries, and print the individual elements of each line.
use strict;
use warnings;

my @lines = map { chomp; $_ } <>;

# The mask records whether a character position has ever held a
# NON-whitespace character in any line (1) or not (0).
my @mask  = ();

foreach my $line (@lines) {
    my @line = split //, $line;
    foreach my $index (0..$#line) {
        $mask[$index] ||= $line[$index] =~ /\S/ ? 1 : 0;
    }
}

# At this point the mask indicates where to split based on the zeros within it.
# Want to turn this into substr ranges.
# So 000011110000 would become 4, 4

my @substrings = (); # will contain [from, length]
my $last_transition = 0;
my $last_value = $mask[0];

# When the mask transitions from 0 to 1 or 1 to 0, $last_transition is updated.
# When the last value was a 1, a column has just ended and its range is recorded.
foreach my $index (1..$#mask) {
    if ($mask[$index] != $last_value) {
        if ($last_value) {
            push @substrings, [$last_transition, ($index - $last_transition)];
        }
        $last_transition = $index;
        $last_value = $mask[$index];
    }
}
# Handle the end of the line, which is considered a transition to 0
if ( $last_value ) {
    push @substrings, [$last_transition, ($#mask + 1 - $last_transition)];
}

# Just print them to show that it works, you would collect these instead.
foreach my $line (@lines) {
    foreach my $split (@substrings) {
        # A short line may end before this column starts.
        my $element = $split->[0] < length($line)
            ? substr($line, $split->[0], $split->[1]) : '';
        # /g is needed so that both leading and trailing whitespace are removed.
        $element =~ s/(?:^\s+|\s+$)//g;
        print "$line -> $element\n";
    }
}

Output:

Filesystem           1K-blocks      Used Available Use% Mounted on -> Filesystem
Filesystem           1K-blocks      Used Available Use% Mounted on -> 1K-blocks
Filesystem           1K-blocks      Used Available Use% Mounted on -> Used 
Filesystem           1K-blocks      Used Available Use% Mounted on -> Available
Filesystem           1K-blocks      Used Available Use% Mounted on -> Use%
Filesystem           1K-blocks      Used Available Use% Mounted on -> Mounted on
/dev/sda3             56776092   5431448  48413988  11% / -> /dev/sda3
/dev/sda3             56776092   5431448  48413988  11% / -> 56776092 
/dev/sda3             56776092   5431448  48413988  11% / -> 5431448
/dev/sda3             56776092   5431448  48413988  11% / -> 48413988 
/dev/sda3             56776092   5431448  48413988  11% / -> 11% 
/dev/sda3             56776092   5431448  48413988  11% / -> /
/dev/sda1               101086     18993     76874  20% /boot -> /dev/sda1
/dev/sda1               101086     18993     76874  20% /boot -> 101086 
/dev/sda1               101086     18993     76874  20% /boot -> 18993 
/dev/sda1               101086     18993     76874  20% /boot -> 76874 
/dev/sda1               101086     18993     76874  20% /boot -> 20% 
/dev/sda1               101086     18993     76874  20% /boot -> /boot
tmpfs                  2021888         0   2021888   0% /dev/shm -> tmpfs
tmpfs                  2021888         0   2021888   0% /dev/shm -> 2021888 
tmpfs                  2021888         0   2021888   0% /dev/shm -> 0 
tmpfs                  2021888         0   2021888   0% /dev/shm -> 2021888 
tmpfs                  2021888         0   2021888   0% /dev/shm -> 0% 
tmpfs                  2021888         0   2021888   0% /dev/shm -> /dev/shm

Obviously, you would process the first line as the element names rather than printing it.
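A possible way to take that last step, sketched under some assumptions: the fields have already been split per column (hard-coded here), and the header names are cleaned up before being used as tags, since XML element names cannot contain spaces or "%" and cannot start with a digit. The helpers `xml_name` and `xml_text` and the wrapper elements `<table>` and `<row>` are hypothetical choices for this example, not something the question prescribes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fields as the mask-based split would return them for the sample log
# (hard-coded here so the sketch runs on its own).
my @header = ('Filesystem', '1K-blocks', 'Used', 'Available', 'Use%', 'Mounted on');
my @rows   = (
    ['/dev/sda3', '56776092', '5431448', '48413988', '11%', '/'],
    ['/dev/sda1', '101086',   '18993',   '76874',    '20%', '/boot'],
);

# Hypothetical helper: turn a header into a usable XML element name.
# Unsupported characters become '_' and a leading '_' is added when the
# name would otherwise not start with a letter or underscore.
sub xml_name {
    my ($name) = @_;
    $name =~ s/[^A-Za-z0-9_.-]/_/g;
    $name = "_$name" if $name !~ /^[A-Za-z_]/;
    return $name;
}

# Hypothetical helper: escape the characters that are special in XML text.
sub xml_text {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

my @tags = map { xml_name($_) } @header;

# Build one <row> per data line, with one child element per column.
my $xml = "<table>\n";
for my $row (@rows) {
    $xml .= "  <row>\n";
    for my $i (0 .. $#tags) {
        my $value = defined $row->[$i] ? $row->[$i] : '';
        $xml .= "    <$tags[$i]>" . xml_text($value) . "</$tags[$i]>\n";
    }
    $xml .= "  </row>\n";
}
$xml .= "</table>\n";
print $xml;
```

With this mapping, "Mounted on" becomes `Mounted_on`, "Use%" becomes `Use_`, and "1K-blocks" becomes `_1K-blocks`. If the original header text matters, it could be kept in an attribute instead.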
