shell - Formatting text with footnotes using regular expressions

Question

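I want to turn the annotations in a text into footnotes. Here is a minimal example of the text.

Paragraph one. This is the first place[1] of paragraph one. This is the second place[2] of paragraph one.

[1] annotation one of paragraph one

[2] annotation two of paragraph one

Paragraph two. This is the first place[1] of paragraph two. This is the second place[2] of paragraph two.

[1] annotation one of paragraph two

[2] annotation two of paragraph two

At the end of each paragraph there are several annotations, each starting with a label such as [1]. Each annotation forms a paragraph of its own.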

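What I want to do is insert these annotations into the text using LaTeX syntax. The desired output for the sample text is:

Paragraph one. This is the first place\footnote{annotation one of paragraph one} of paragraph one. This is the second place\footnote{annotation two of paragraph one} of paragraph one.

Paragraph two. This is the first place\footnote{annotation one of paragraph two} of paragraph two. This is the second place\footnote{annotation two of paragraph two} of paragraph two.

This is more than a simple replacement by pattern matching. It probably has to be done paragraph by paragraph. What do you think is the easiest way to do it?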

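Edit: I have come up with a possible solution using sed.

Remove the line breaks in front of the annotations,

Paragraph one. This is the first place[1] of paragraph one. This is the second place[2] of paragraph one. [1] annotation one of paragraph one [2] annotation two of paragraph one

Paragraph two. This is the first place[1] of paragraph two. This is the second place[2] of paragraph two. [1] annotation one of paragraph two [2] annotation two of paragraph two

Match the pattern

[1] text1 [1] text2 [2]

and replace it with

text2 text1 [2]

Basically, the first [1] is the position where the annotation should be inserted; the material between the second [1] and [2] is the annotation to be relocated.

These questions are related: "Remove line feed/newline for specific lines only" and "How to remove a line feed/newline before a pattern using sed", but because of my lack of regex knowledge I could not make that code work for me.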

score 1 · Accepted Answer

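Fundamentally, sed is the wrong tool for this job. You might be able to write a sed script that preprocesses the file and generates a new sed script to process the file, but you would be tying your hands behind your back when there are many better tools for the task. I would reach for Perl (but I learned Perl more than 20 years ago and Python only a few years ago), but Python could handle it too, and with care you could even use awk. Part of the trouble is that you have to save all the text of paragraph one until you reach the start of paragraph two; only then can you start generating the actual text for paragraph one.

I think the "sed is the wrong tool" comment remains valid even if the sed script captures the paragraph material in the hold space. That material would be the lines that do not start with a square bracket. The trouble is that when you come across a line that starts with a square bracket, you need to write a regex that substitutes the rest of that line into the hold space in place of the bracketed tag. That requires a sort of "dynamic regex". Even if you know that a paragraph will never contain more than 9 footnotes, so that you could consider some sort of hack that writes the code out 9 times, there are still problems with writing the replacement string in the right place.

Here is a simple script in Perl (well, a not very complex script in Perl) that does the job. The "loop in a spin" (three nested loops) makes it a bit hard to understand.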
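To make the "dynamic regex" idea concrete, here is a minimal sketch in Python (not the answer's language, just one of the alternatives it mentions). The pattern is built at run time from the tag found on a note line, and only the first occurrence is replaced. The function name insert_footnote is hypothetical, introduced only for illustration.

```python
import re


def insert_footnote(paragraph: str, tag: str, note: str) -> str:
    """Replace the first '[tag]' in paragraph with a LaTeX footnote.

    Hypothetical helper for illustration; not part of the original answer.
    """
    # The pattern is built from data at run time, which is the step
    # that is awkward to express in a single sed script.
    pattern = re.escape(f"[{tag}]")
    # A function replacement keeps any backslashes in `note` literal.
    return re.sub(pattern, lambda _m: "\\footnote{" + note + "}", paragraph, count=1)


para = "First place[1] and second place[2]."
para = insert_footnote(para, "1", "note one")
para = insert_footnote(para, "2", "note two")
print(para)
# prints: First place\footnote{note one} and second place\footnote{note two}.
```

Because count=1 is used, a tag that appears twice in the paragraph is replaced only once per note line, which matches the behaviour the answer describes for repeated tags.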

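#!/usr/bin/env perl
use strict;
use warnings;

my $para = "";    # text of the paragraph currently being saved

TEXT:
while (<>)
{
    # A note line: optional indent, [number], white space, then the note text.
NOTES:
    while (m/^\s*\[(\d+)]\s+(.*)/)
    {
        my $tag = $1;
        my $note = $2;
        # Replace the first [tag] in the saved paragraph with the footnote.
        $para =~ s/\[$tag]/\\footnote{$note}/m;
        while (<>)
        {
            last if $_ =~ m/^\s*\[/;    # another note line: back to NOTES
            if ($_ !~ m/^\s*$/)
            {
                # Non-blank, non-note line: the next paragraph has started.
                print $para;
                $para = "";
                last NOTES;
            }
            # Otherwise it is a blank line between notes: skip it.
        }
        # Stop once the input is exhausted ($_ is undefined then). Testing
        # 'eof' here would also be true right after reading the file's final
        # line, and a note on that line would be skipped.
        last TEXT unless defined $_;
    }

    $para .= $_;
}

print $para;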

Given the input file:

Paragraph one. This is the first place [1] of paragraph one. This is the second place [2] of paragraph one.

[1] annotation one of paragraph one

[2] annotation two of paragraph one

Paragraph two. This is the first place [1] of paragraph two. This is the second place [2] of paragraph two.

[1] annotation one of paragraph two

[2] annotation two of paragraph two

The output of this script on that file is:

Paragraph one. This is the first place \footnote{annotation one of paragraph one} of paragraph one. This is the second place \footnote{annotation two of paragraph one} of paragraph one.

Paragraph two. This is the first place \footnote{annotation one of paragraph two} of paragraph two. This is the second place \footnote{annotation two of paragraph two} of paragraph two.

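What does the script do?

The outer loop (labelled TEXT) reads lines into $_ until EOF.

The loop labelled NOTES processes the material after a paragraph, up to the start of the next paragraph. It knows a line is a footnote line because it starts with a number in square brackets (possibly indented with white space, and definitely followed by white space after the close square bracket). When it finds such a line, the number is saved in $tag and the replacement text (which must be a single line; there are no extended multiline footnotes here) is saved in $note. The first occurrence of the tag in square brackets in the saved paragraph is then replaced with the footnote notation and the note text (this is the part that is nigh-on impossible in a single run of sed, and, given that footnote numbers repeat across paragraphs, even two runs of sed are problematic).

Having done the substitution (not caring if there was no match to replace), it reads the next line, which is where the loop (and the head) begins to spin. If the newly read line is a note line, the first 'last' exits the innermost while loop and goes back to the next iteration of the NOTES loop. If the line is not a blank line, then we must have just read the first line of the next paragraph, so print the previous paragraph (which now has as many substitutions as are going to be made), empty the saved paragraph, and exit the NOTES loop. Otherwise, ignore the blank line in the midst of the notes.

After the inner loop, check whether we reached EOF, and exit the main loop if so. Otherwise, add the paragraph line just read to the saved paragraph.

Finally, print the last saved paragraph.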
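The steps above can also be written as a flat state machine rather than nested loops. The following is a rough Python sketch of the same idea, not a translation the answer itself provides; the names NOTE_RE and convert are hypothetical. It assumes, as the Perl script does, that every note fits on one line and that note lines start with a bracketed number.

```python
import re

# Optional indent, [number], white space, then the note text (one line).
NOTE_RE = re.compile(r"^\s*\[(\d+)\]\s+(.*)")


def convert(lines):
    """Yield each paragraph with its trailing '[n] note' lines folded in."""
    para = ""          # saved text of the current paragraph
    seen_note = False  # True once a note line follows the paragraph
    for line in lines:
        m = NOTE_RE.match(line)
        if m:
            tag, note = m.groups()
            # Replace only the first occurrence of the tag.
            para = para.replace(f"[{tag}]", "\\footnote{" + note + "}", 1)
            seen_note = True
        elif not line.strip():
            if not seen_note:
                para += line       # blank line right after the paragraph text
            # blank lines between notes are dropped
        else:
            if seen_note:
                yield para         # the next paragraph has started
                para, seen_note = "", False
            para += line
    if para:
        yield para


sample = [
    "P one[1] and[2].\n", "\n", "[1] n1\n", "\n", "[2] n2\n", "\n",
    "P two[1].\n", "\n", "[1] m1\n",
]
print("".join(convert(sample)), end="")
```

To run it on a file, the sample list could be replaced by an open file object, since iterating over a file yields its lines.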

This has not been exhaustively tested. I've not generated paragraphs with references to missing notes, or notes without references, or notes out of sequence. I think it would 'handle' those by ignoring the issues; there'd still be a reference to the missing note, and unreferenced notes would simply not show up in the output. If the same note number reference appears twice in a paragraph but there's only one note number after the paragraph, the second and subsequent ones are ignored. If the same note number appears twice ('text[1] more[1]') and the notes after the paragraph repeat the number ('[1] note 1A', '[1] note 1B'), then the first will be replaced with 'note 1A' and the second with 'note 1B'. I've not tested multiline paragraphs (but I don't expect trouble). Multiline qualifiers aren't needed for the replacement regex because the reference to a tag cannot be split over lines and isn't anchored on a line.

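Handling multiline footnotes is an exercise for the reader (and not an entirely trivial one). Among other things, you cannot start the substitution for a multiline footnote until you find a blank line, another footnote line, or the start of the next paragraph.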
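One possible way to approach that exercise is sketched below in Python. It is only a sketch under a stated assumption that the answer does not make: a non-blank line directly after a note line (with no blank line in between) is treated as a continuation of that note, and continuation lines are joined with single spaces. The names NOTE_START and convert_multiline are hypothetical.

```python
import re

NOTE_START = re.compile(r"^\s*\[(\d+)\]\s+(.*)")


def convert_multiline(lines):
    """Return a list of paragraphs with (possibly multiline) notes folded in.

    Assumption: a note continues until a blank line, another note line,
    or end of input.
    """
    out = []          # finished paragraphs
    para = ""         # saved text of the current paragraph
    pending = None    # (tag, [note lines]) for a note still being collected
    in_notes = False  # True once the first note after a paragraph was seen

    def flush_note():
        # The substitution is deferred until the whole note is known.
        nonlocal para, pending
        if pending is not None:
            tag, parts = pending
            note = " ".join(parts)
            para = para.replace(f"[{tag}]", "\\footnote{" + note + "}", 1)
            pending = None

    for line in lines:
        m = NOTE_START.match(line)
        if m:
            flush_note()                        # another note line ends the previous one
            pending = (m.group(1), [m.group(2).strip()])
            in_notes = True
        elif not line.strip():
            flush_note()                        # a blank line ends the note
            if not in_notes:
                para += line
        elif pending is not None:
            pending[1].append(line.strip())     # continuation of the current note
        else:
            if in_notes:
                out.append(para)                # the next paragraph has started
                para, in_notes = "", False
            para += line
    flush_note()
    if para:
        out.append(para)
    return out


demo = ["Text[1] here.\n", "\n", "[1] a long note\n", "that continues\n", "\n", "Next para.\n"]
print("".join(convert_multiline(demo)), end="")
```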

score 0 · Accepted Answer

A less verbose (and less documented) perl version

perl -00 -pe '
    @markers = m{(\[\d+\])}g;
    for $i (0..$#markers) {
        $footnote = <>;
        ($marker, $text) = $footnote =~ m{(\[\d+\])\s+(.*)};
        s{\Q$marker\E}{\\footnote{$text}};
    }
' file

This assumes that if a paragraph contains 5 footnote markers, it will be followed by 5 footnotes.
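For comparison, the same paragraph-mode idea can be sketched in Python: split the input on blank lines, count the markers in a text block, and consume that many following blocks as its footnotes. This is an illustrative sketch, not part of the answer; the function name convert_blocks is hypothetical, and it relies on the same assumption that the number of notes after a paragraph equals the number of markers in it.

```python
import re

MARKER = re.compile(r"\[\d+\]")
NOTE = re.compile(r"(\[\d+\])\s+(.*)")


def convert_blocks(text: str) -> str:
    """Fold the note blocks that follow each paragraph into that paragraph."""
    # Blocks are separated by one or more blank lines (like perl -00).
    blocks = iter(re.split(r"\n\s*\n", text.strip()))
    result = []
    for block in blocks:
        # One following block is consumed per marker found in the paragraph.
        for _ in MARKER.findall(block):
            note_block = next(blocks, None)
            if note_block is None:
                break                      # ran out of input
            m = NOTE.search(note_block)
            if m:
                marker, note = m.groups()
                block = block.replace(marker, "\\footnote{" + note + "}", 1)
        result.append(block)
    return "\n\n".join(result) + "\n"


text = "A[1] b[2].\n\n[1] one\n\n[2] two\n\nC[1].\n\n[1] three\n"
print(convert_blocks(text), end="")
```

If a paragraph has more markers than notes, this sketch (like the one-liner) will swallow the next paragraph as if it were a note block, so the stated assumption matters.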
