perl - Can two Perl scripts that use different input record separators be joined into one?

Question

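Two Perl scripts, using different input record separators, work together to convert a LaTeX file into human-readable phrases and sentences that are easy to search. Of course, they could be wrapped together by a single shell script. But I am curious whether they can be merged into a single Perl script.

The reason for these scripts: finding "two three" in short.tex, for example, would be a hassle. After conversion, grep 'two three' returns the first paragraph.

For any LaTeX file (here, short.tex), the scripts are invoked as follows.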

cat short.tex | try1.pl | try2.pl

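try1.pl works on paragraphs. It gets rid of LaTeX comments. It makes sure that each word is separated from its neighbors by a single space, so that no sneaky tabs, form feeds, etc. lurk between words. The resulting paragraph occupies one line, consisting of visible characters separated by single spaces, followed by a sequence of at least two newlines.

try2.pl slurps the whole file. It makes sure that paragraphs are separated from each other by exactly two newlines. It also makes sure that the last line of the file is nontrivial, containing visible characters.

Can two operations like these, which depend on different input record separators, be elegantly joined into a single Perl script, say big.pl? For example, could the work of try1.pl and try2.pl be done by two functions, or by bracketed segments, within the larger script?

(Incidentally, is there a Stack Overflow keyword for "input record separator"?)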

### File try1.pl:

#!/usr/bin/perl
use strict;
use warnings;
use 5.18.2;
local $/ = ""; # input record separator: loop through one paragraph at a time. position marker $ comes only at end of paragraph.
while (<>) {
    s/[\x25].*\n/ /g; # remove all LaTeX comments. They start with %
    s/[\t\f\r ]+/ /g; # collapse each "run" of whitespace to one single space
    s/^\s*\n/\n/g; # any line that looks blank is converted to a pure newline;
    s/(.)\n/$1/g; # Any line that does not look blank is joined to the subsequent line
    print;
    print "\n\n"; # make sure each paragraph is separated from its fellows by newlines
}

### File try2.pl:

#!/usr/bin/perl
use strict;
use warnings;
use 5.18.2;
local $/ = undef; # input record separator: entire text or file is a single record.
while (<>) {
    s/[\n][\n]+/\n\n/g;    # exactly 2 blank lines separate paragraphs. Like cat -s
    s/[\n]+$/\n/; # last line is nontrivial; no blank line at the end
    print;
}

### File short.tex:

\paragraph{One}
% comment
two % also 2
three % or 3

% comment
% comment

% comment
% comment

% comment

% comment

So they said%
that they had done it.

% comment
% comment
% comment





Fleas.

% comment

% comment

After conversion:

\paragraph{One} two three

So they said that they had done it.

Fleas.

score 1 · Accepted Answer

To merge try1.pl and try2.pl into one script, you could try:

local $/ = "";
my @lines;
while (<>) {
    [...]    # Same code as in try1.pl except print statements
    push @lines, $_;
}

$lines[-1] =~ s/\n+$/\n/;
print for @lines;
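As a fuller illustration of the same idea, here is a minimal sketch, not the asker's exact code. It assumes a helper I have called clean_latex (a made-up name), simplified regexes in place of the elided try1.pl body, and a small in-memory sample instead of short.tex. Because the cleaned paragraphs are joined with exactly two newlines, no separate slurp pass is needed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read paragraphs from a filehandle, clean each one,
# and return the cleaned text as a single string.
sub clean_latex {
    my ($fh) = @_;
    local $/ = "";    # paragraph mode, restored when the sub returns
    my @paragraphs;
    while ( defined( my $para = <$fh> ) ) {
        $para =~ s/(?<!\\)%.*\n/ /g;    # drop an unescaped % through end of line
        $para =~ s/\s+/ /g;             # collapse whitespace runs, newlines included
        $para =~ s/^ | $//g;            # trim the single space left at either end
        push @paragraphs, $para if length $para;    # skip comment-only paragraphs
    }
    return @paragraphs ? join( "\n\n", @paragraphs ) . "\n" : "";
}

# Sample input (an assumption for the demo; any filehandle would do).
my $sample = <<'TEX';
\paragraph{One}
% comment
two % also 2
three % or 3

% comment

So they said%
that they had done it.
TEX

open my $fh, '<', \$sample or die "Cannot read from string: $!";
print clean_latex($fh);
```

Passing \*STDIN instead of the in-memory handle would let the script sit at the end of the original cat short.tex pipeline.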

score 1 · Accepted Answer

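A pipeline connects the output of one process to the input of another. Neither side knows about the other, or cares how it operates.

However, putting things together like this breaks the Unix pipeline philosophy of small tools that each excel at a very narrow job. If you join these two things, you will always have to do both tasks even when you want only one (you could add configuration to turn one off, but that is a lot of work).

I process a lot of LaTeX, and I control everything through a Makefile. I don't really care what the commands look like, and I don't even have to remember what they are: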

short-clean.tex: short.tex
    cat short.tex | try1.pl | try2.pl > $@

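Let's do it anyway

I'll limit myself to the constraint of basic concatenation rather than a complete rewrite or rearrangement, mostly because there is some interesting stuff to show.

Consider what happens if you join the two programs by simply appending the text of the second program to the end of the text of the first.

The output of the original first program still goes to standard output, and the second program no longer gets that output as its input.
The input to the program has probably been exhausted by the original first program, so the second program has nothing left to read. That's fine, because it would only have read the unprocessed input meant for the first program.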
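The second point can be seen in a small sketch. It assumes an in-memory filehandle standing in for standard input and two bare blocks standing in for the two concatenated programs; the variable names are made up for the demo.

```perl
use strict;
use warnings;

my $text = "one\ntwo\n\nthree\n";
open my $in, '<', \$text or die "Cannot read from string: $!";

# "First program": paragraph mode, reads until the handle is exhausted.
my $paragraphs = 0;
{
    local $/ = "";
    while ( defined( my $record = <$in> ) ) {
        $paragraphs++;
    }
}

# "Second program": slurp mode on the same handle, which is already at EOF.
my $leftover;
{
    local $/;
    $leftover = <$in>;
}
my $leftover_length = length( $leftover // '' );

print "paragraphs read by the first loop: $paragraphs\n";
print "characters left for the second loop: $leftover_length\n";
```

The first loop sees two paragraphs and the second loop gets nothing, so the second stage has to be handed the first stage's output explicitly.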

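There are various ways to fix this, but none of them make much sense when you already have two working programs that do their jobs. I'd put it in the Makefile and forget about it.

But suppose you really do want everything in one file.

Rewrite the first part to send its output to a filehandle connected to a string. Its output is now in the program's memory. This uses basically the same interface, and you can even use select to make it the default filehandle.
Rewrite the second part to read from a filehandle connected to that string.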
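Here is a minimal sketch of those two rewrites using select, so the first stage's bare print calls need no filehandle argument. The sample text and variable names are assumptions for the demo; tr///r and s///r need Perl 5.14 or later.

```perl
use strict;
use warnings;

my $input = "The quick     brown dog\njumped        over the lazy fox.\n";
open my $source, '<', \$input or die "Cannot read from string: $!";

# Stage 1: everything a bare print emits is captured in $buffer.
my $buffer = '';
open my $capture, '>', \$buffer or die "Cannot write to string: $!";
my $previous = select $capture;    # $capture is now the default output handle
while ( my $line = <$source> ) {
    print( $line =~ tr/a-z/A-Z/r );
}
select $previous;                  # restore the original default handle
close $capture;

# Stage 2: read stage 1's output back from the same scalar.
open my $replay, '<', \$buffer or die "Cannot read from string: $!";
my $result = '';
while ( my $line = <$replay> ) {
    $result .= $line =~ s/\h+/ /gr;
}
print $result;
```

Restoring the previously selected handle matters: without it, the final print would also land in the closed capture handle.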

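Alternatively, you can do the same thing by writing to a temporary file in the first part, then reading that temporary file in the second part.

A more sophisticated program would have the first program write to a pipe (inside the program) that the second program reads from at the same time. However, you would have to rewrite almost everything so that the two programs run simultaneously.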

Here's program 1, which uppercases most letters:

#!/usr/bin/perl
use v5.26;
$|++;
while( <<>> ) { # safer line input operator
    print tr/a-z/A-Z/r;
    }

Here's program 2, which collapses whitespace:

#!/usr/bin/perl
use v5.26;
$|++;
while( <<>> ) { # safer line input operator
    print s/\h+/ /gr; # \h leaves the newline alone, as in the later versions
    }

Run one after the other, they get the job done:

$ perl program1.pl
The quick brown dog jumped over the lazy fox.
THE QUICK BROWN DOG JUMPED OVER THE LAZY FOX.
^D

$ perl program2.pl
The quick     brown dog jumped        over the lazy fox.
The quick brown dog jumped over the lazy fox.
^D

$ perl program1.pl | perl program2.pl
The quick     brown dog jumped        over the lazy fox.
THE QUICK BROWN DOG JUMPED OVER THE LAZY FOX.
^D

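Now I want to combine these. First, I'll make some changes that don't affect the operation but will make things easier later. Instead of using the implicit filehandles, I'll make them explicit and one step removed from the actual filehandles:

Program 1: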

#!/usr/bin/perl
use v5.26;
$|++;
my $output_fh = \*STDOUT;
while( <<>> ) { # safer line input operator
    print { $output_fh } tr/a-z/A-Z/r;
    }

Program 2:

#!/usr/bin/perl
$|++;
my $input_fh = \*STDIN;
while( <$input_fh> ) { # read from the explicit filehandle
    print s/\h+/ /gr;
    }

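Now I have the chance to change those filehandles without disturbing the meat of the program. The while doesn't know or care what that filehandle is, so let's first write to a file in program 1 and then read from that same file in program 2:

Program 1: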

#!/usr/bin/perl
use v5.26;
open my $output_fh, '>', 'program1.out' or die "$!";
while( <<>> ) { # safer line input operator
    print { $output_fh } tr/a-z/A-Z/r;
    }
close $output_fh;

Program 2:

#!/usr/bin/perl
$|++;
open my $input_fh, '<', 'program1.out' or die "$!";
while( <$input_fh> ) { # read program 1's output file
    print s/\h+/ /gr;
    }

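However, you can no longer run these in a pipeline, because program 1 doesn't use standard output and program 2 doesn't read standard input. Run them one after the other instead: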

% perl program1.pl
% perl program2.pl

However, you can now join the programs, shebang and all:

#!/usr/bin/perl
use v5.26;

open my $output_fh, '>', 'program1.out' or die "$!";
while( <<>> ) { # safer line input operator
    print { $output_fh } tr/a-z/A-Z/r;
    }
close $output_fh;

#!/usr/bin/perl
$|++;
open my $input_fh, '<', 'program1.out' or die "$!";
while( <$input_fh> ) { # read program 1's output file
    print s/\h+/ /gr;
    }

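You could skip the file and use a string instead, but at that point you've gone beyond merely concatenating files and need a bit of coordination so that both parts share the scalar holding the data. Still, the meat of the program doesn't care how you made those filehandles: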

#!/usr/bin/perl
use v5.26;

my $output_string;

open my $output_fh, '>', \ $output_string or die "$!";
while( <<>> ) { # safer line input operator
    print { $output_fh } tr/a-z/A-Z/r;
    }
close $output_fh;

#!/usr/bin/perl
$|++;
open my $input_fh, '<', \ $output_string or die "$!";
while( <$input_fh> ) { # read back from the shared string
    print s/\h+/ /gr;
    }

So let's go a step further and do what the shell was already doing for us.

#!/usr/bin/perl
use v5.26;

pipe my $input_fh, my $output_fh;
$output_fh->autoflush(1);

while( <<>> ) { # safer line input operator
    print { $output_fh } tr/a-z/A-Z/r;
    }
close $output_fh;

while( <$input_fh> ) { # read from the pipe
    print s/\h+/ /gr;
    }

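From here it gets a bit tricky, and I'm not going to take the next step of polling filehandles so that one thing can write while the next reads. There are plenty of things that can do that for you. And by now you're doing a lot of work to avoid something that was already simple and working.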
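One way to get writer and reader running at the same time without polling is fork. This is only a sketch under assumptions: a Unix-like system, a helper I have named run_pipeline (not from the answer), and an in-memory demo input. The child runs stage 1 and writes into the pipe while the parent runs stage 2 and reads from it, so the pipe buffer cannot fill up and stall a single process.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: run both stages concurrently, connected by a pipe.
sub run_pipeline {
    my ( $in, $out ) = @_;

    pipe( my $reader, my $writer ) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ( $pid == 0 ) {    # child: stage 1 writes to the pipe
        close $reader;
        while ( my $line = <$in> ) {
            print {$writer} $line =~ tr/a-z/A-Z/r;
        }
        close $writer;       # flushes the child's output into the pipe
        POSIX::_exit(0);     # exit without flushing buffers inherited from the parent
    }

    close $writer;        # parent: stage 2 reads from the pipe
    while ( my $line = <$reader> ) {
        print {$out} $line =~ s/\h+/ /gr;
    }
    close $reader;
    waitpid $pid, 0;
    return $? == 0;
}

my $demo_input = "The quick     brown dog jumped        over the lazy fox.\n";
open my $demo_in, '<', \$demo_input or die "Cannot read from string: $!";
run_pipeline( $demo_in, \*STDOUT );
```

The parent must close its copy of the write end before reading, otherwise the reader never sees end of file.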

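The next step is to separate the code into functions (probably in a library) instead of all this nonsense, and to deal with those chunks of code as named things that hide their details: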

use Local::Util qw(remove_comments minify);

while( <<>> ) {
    my $result = remove_comments($_);
    $result = minify( $result );
    ...
    }
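The answer names the two helpers but does not show their bodies. A rough sketch of what they might contain, written as plain subs rather than as a Local::Util module, with regexes that are my assumption and not the author's:

```perl
use strict;
use warnings;

# Hypothetical body: drop an unescaped % and the rest of its line.
sub remove_comments {
    my ($text) = @_;
    return $text =~ s/(?<!\\)%.*$//mgr;
}

# Hypothetical body: collapse runs of horizontal whitespace to one space.
sub minify {
    my ($text) = @_;
    return $text =~ s/\h+/ /gr;
}

my $sample = "two   % also 2\nthree\t\t% or 3\n50\\% off\n";
print minify( remove_comments($sample) );
```

Each sub takes a string and returns a new one, which is what lets them be chained in any order.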

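This can get even fancier, where you simply go through a series of steps without knowing what they are or how many there will be. And since all the baby steps are independent, you're basically back to the notion of a pipeline: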

use Local::Util qw(get_input remove_comments minify);

my $result;
my @steps = qw(get_input remove_comments minify);
while( ! eof() ) {  # or whatever
    no strict 'refs';
    $result = &{$_}( $result ) for @steps;
    }

A better approach is to turn it into an object so you can skip the soft references:

use Local::Processor;

my @steps = qw(get_input remove_comments minify);
my $processor = Local::Processor->new( @steps );

my $result;
while( ! eof() ) {  # or whatever
    $result = $processor->$_($result) for @steps;
    }
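Local::Processor is not shown in the answer, so the following is a guess at a minimal version, not the real module. The step bodies and the run method are assumptions, and the get_input step is left out to keep the sketch short. Looking methods up with can gives a clear error for an unknown step name.

```perl
use strict;
use warnings;

# A hypothetical, minimal Local::Processor.
package Local::Processor {
    sub new {
        my ( $class, @steps ) = @_;
        return bless { steps => [@steps] }, $class;
    }

    sub steps { return @{ $_[0]{steps} } }

    sub remove_comments {
        my ( $self, $text ) = @_;
        return $text =~ s/(?<!\\)%.*$//mgr;
    }

    sub minify {
        my ( $self, $text ) = @_;
        return $text =~ s/\h+/ /gr;
    }

    # Apply every configured step in order, by method name.
    sub run {
        my ( $self, $text ) = @_;
        for my $step ( $self->steps ) {
            my $method = $self->can($step)
                or die "Unknown step '$step'\n";
            $text = $self->$method($text);
        }
        return $text;
    }
}

my $processor = Local::Processor->new(qw(remove_comments minify));
print $processor->run("two   % also 2\nthree\n");
```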

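Just as before, the meat of the program doesn't care or know in advance what the steps are. That means you can move the order of the steps into configuration and use the same program for any combination and order: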

use Local::Config;
use Local::Processor;

my @steps = Local::Config->new->get_steps;
my $processor = Local::Processor->new;

my $result;
while( ! eof() ) {  # or whatever
    $result = $processor->$_($result) for @steps;
    }

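I write a lot about this sort of thing in Mastering Perl and Effective Perl Programming. But just because you can do something doesn't mean you should. This reinvents a lot of what make can already do for you. I wouldn't do this sort of thing without a good reason; bash and make would have to be very annoying to motivate me to go this far.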

score 0 · Accepted Answer

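The motivating problem was to generate a "cleaned" version of a LaTeX file, one in which complex phrases or sentences are easy to search with regular expressions.

The single Perl script below does the job, whereas before I needed one shell script and two Perl scripts, requiring three Perl invocations. This new single script contains three consecutive loops, each with a different input record separator.

First loop:

input = STDIN, or a file passed as an argument;
record separator = default, loop line by line;
print result to fileafterperlLIN, a temporary file on the hard disk.

Second loop:

input = fileafterperlLIN;
record separator = "", loop paragraph by paragraph;
print result to fileafterperlPRG, a temporary file on the hard disk.

Third loop:

input = fileafterperlPRG;
record separator = undef, slurp the whole file;
print result to STDOUT.

The disadvantage is that two files are written to and read from the hard disk, which may slow things down. The advantages are that the operation seems to need only one process, and that all the code resides in a single file, which should be easier to maintain.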

#!/usr/bin/perl
# 2019v04v05vFriv17h18m41s

use strict;
use warnings;
use 5.18.2;

my $diagnose;
my $diagnosticstring;
my $exitcode;
my $userName =  $ENV{'LOGNAME'};
my $scriptpath;
my $scriptname;
my $scriptdirectory;
my $cdld;
my $fileafterperlLIN;
my $fileafterperlPRG;
my $handlefileafterperlLIN;
my $handlefileafterperlPRG;
my $encoding;
my $count;

sub diagnosticmessage {
    return unless ( $diagnose );
    print STDERR "$scriptname: ";
    foreach $diagnosticstring (@_) {
        print STDERR "$diagnosticstring\n"; # print, not printf: the message is data, not a format string
    }
}

# Routine setup
$scriptpath = $0;
$scriptname = $scriptpath;
$scriptname =~ s|.*\x2f([^\x2f]+)$|$1|;
$cdld = "$ENV{'cdld'}"; # A directory to hold temporary files used by scripts
$exitcode = system("test -d $cdld && test -w $cdld || { printf '%s\n' 'cdld not a writeable directory'; exit 1; }");
die "$scriptname: system returned exitcode=$exitcode: bail\n" unless $exitcode == 0;

$scriptdirectory = "$cdld/$scriptname"; # To hold temporary files used by this script
$exitcode = system("test -d $scriptdirectory || mkdir $scriptdirectory");
die "$scriptname: system returned exitcode=$exitcode: bail\n" unless $exitcode == 0;
diagnosticmessage ( "scriptdirectory=$scriptdirectory" );
$exitcode = system("test -w $scriptdirectory && test -x $scriptdirectory || exit 1;");
die "$scriptname: system returned exitcode=$exitcode: $scriptdirectory not writeable or not executable. bail\n" unless $exitcode == 0;
$fileafterperlLIN = "$scriptdirectory/afterperlLIN.tex";
diagnosticmessage ( "fileafterperlLIN=$fileafterperlLIN" );
$exitcode = system("printf '' > $fileafterperlLIN;");
die "$scriptname: system returned exitcode=$exitcode: bail\n" unless $exitcode == 0;
$fileafterperlPRG = "$scriptdirectory/afterperlPRG.tex";
diagnosticmessage ( "fileafterperlPRG=$fileafterperlPRG" );
$exitcode=system("printf '' > $fileafterperlPRG;");
die "$scriptname: system returned exitcode=$exitcode: bail\n" unless $exitcode == 0;

# This script's job: starting with a LaTeX file, which may compile beautifully in pdflatex but be difficult
# to read visually or search automatically,
# (1) convert any line that looks blank --- a "trivial line", containing only whitespace --- to a pure newline. This is because
#     (a) LaTeX interprets any whitespace line following a non-blank or "nontrivial" line as end of paragraph, whereas
#     (b) Perl needs two consecutive newlines to signal end of paragraph.
# (2) remove all LaTeX comments;
# (3) deal with the \unskip LaTeX construct, etc.
# The result will be
# (4) each LaTeX paragraph will occupy a unique line
# (5) exactly one pair of newlines --- visually, one blank line --- will divide each pair of consecutive paragraphs
# (6) first paragraph will be on first line (no opening blank line) and last paragraph will be on last line (no ending blank line)
# (7) whitespace in output will consist of only
#     (a) a single space between readable strings, or
#     (b) double newline between paragraphs
#
$handlefileafterperlLIN = undef;
$handlefileafterperlPRG = undef;
$encoding = ":encoding(UTF-8)";
diagnosticmessage ( "fileafterperlLIN=$fileafterperlLIN" );
open($handlefileafterperlLIN, ">> $encoding", $fileafterperlLIN) || die "$0: can't open $fileafterperlLIN for appending: $!";

# Loop 1 / line:
# Default input record separator: loop through one line at a time, delimited by \n
$count = 0;
while (<>) {
    $count = $count + 1;
    diagnosticmessage ( "line $count" );
    s/^\s*\n/\n/mg; # Convert any trivial line to a pure newline.
    print $handlefileafterperlLIN $_;
}

close($handlefileafterperlLIN);
open($handlefileafterperlLIN, "< $encoding", $fileafterperlLIN) || die "$0: can't open $fileafterperlLIN for reading: $!";
open($handlefileafterperlPRG, ">> $encoding", $fileafterperlPRG) || die "$0: can't open $fileafterperlPRG for appending: $!";

# Loop PRG / paragraph:
local $/ = ""; # Input record separator: loop through one paragraph at a time. position marker $ comes only at end of paragraph.
$count = 0;
while (<$handlefileafterperlLIN>) {
    $count = $count + 1;
    diagnosticmessage ( "paragraph $count" );
    s/(?<!\x5c)[\x25].*\n/ /g; # Remove all LaTeX comments.
    #    They start with % not \% and extend to end of line or newline character. Join to next line.
    #    s/(?<!\x5c)([\x24])/\x2a/g; # 2019v04v01vMonv13h44m09s any $ not preceded by backslash \, replace $ by * or something.
    #    This would be only if we are going to run detex on the output.
    s/(.)\n/$1 /g; # Any line that has something other than newline, and then a newline, is joined to the subsequent line
    s|([^\x2d])\s*(\x2d\x2d\x2d)([^\x2d])|$1 $2$3|g; # consistent treatment of triple hyphen as em dash
    s|([^\x2d])(\x2d\x2d\x2d)\s*([^\x2d])|$1$2 $3|g; # consistent treatment of triple hyphen as em dash, continued
    s/[\x0b\x09\x0c\x20]+/ /gm; # collapse each "run" of whitespace other than newline, to a single space.
    s/\s*[\x5c]unskip(\x7b\x7d)?\s*(\S)/$2/g; # LaTeX whitespace-collapse across newlines
    s/^\s*//; # Any nontrivial line: No indenting. No whitespace in first column.
    print $handlefileafterperlPRG $_;
    print $handlefileafterperlPRG "\n\n"; # make sure each paragraph ends with 2 newlines, hence at least 1 blank line.
}
close($handlefileafterperlPRG);

open($handlefileafterperlPRG, "< $encoding", $fileafterperlPRG) || die "$0: can't open $fileafterperlPRG for reading: $!";

# Loop slurp
local $/ = undef;  # Input record separator: entire file is a single record.
$count = 0;
while (<$handlefileafterperlPRG>) {
    $count = $count + 1;
    diagnosticmessage ( "slurp $count" );
    s/[\n][\n]+/\n\n/g;  # Exactly 2 blank lines (newlines) separate paragraphs. Like cat -s
    s/[\n]+$/\n/;        # Last line is visible or "nontrivial"; no trivial (blank) line at the end
    s/^[\n]+//;          # No trivial (blank) line at the start. The first line is "nontrivial."
    print STDOUT;
}
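If the temporary files turn out to be a bottleneck, the same three passes could be kept in memory. The sketch below is an assumption-laden simplification: pass and clean_in_memory are names I made up, the regexes are a reduced subset of the ones above, and the in-memory filehandles expect text without characters above 0xFF (decoded UTF-8 would need encoding first).

```perl
use strict;
use warnings;

# Hypothetical helper: read $text record by record using $separator as the
# input record separator, and return the concatenated transformed records.
sub pass {
    my ( $text, $separator, $transform ) = @_;
    open my $in, '<', \$text or die "Cannot read from string: $!";
    local $/ = $separator;    # restored automatically when pass() returns
    my $out = '';
    while ( defined( my $record = <$in> ) ) {
        $out .= $transform->($record);
    }
    return $out;
}

sub clean_in_memory {
    my ($latex) = @_;

    # Pass 1, line by line: whitespace-only lines become bare newlines.
    my $lines = pass( $latex, "\n", sub { $_[0] =~ s/^\s*\n/\n/r } );

    # Pass 2, paragraph by paragraph: strip comments and join lines.
    my $paragraphs = pass( $lines, "", sub {
        my ($p) = @_;
        $p =~ s/(?<!\\)%.*\n/ /g;    # remove comments
        $p =~ s/(.)\n/$1 /g;         # join each line to the next
        $p =~ s/\h+/ /g;             # collapse horizontal whitespace
        $p =~ s/^\s+|\s+$//g;        # trim both ends
        return "$p\n\n";
    } );

    # Pass 3, whole text: normalise the blank lines between paragraphs.
    return pass( $paragraphs, undef, sub {
        my ($t) = @_;
        $t =~ s/\n{2,}/\n\n/g;
        $t =~ s/\n+\z/\n/;
        $t =~ s/\A\n+//;
        return $t;
    } );
}

print clean_in_memory("  \nOne % c\ntwo\n \n% only\n\nFleas.\n\n");
```

Each pass gets its own local $/, so the three record separators never interfere with one another.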
