perl - Use sed (or awk, perl, etc.) to identify the first occurrence of a markdown heading

Question

I have a series of files with a YAML header followed by markdown subtitles, which look like this:

Minimal example input file:

---
layout: post
tags: 
  - might 
  - be
  - variable 
  - number 
  - of 
  - these
category: ecology
---



my (h2 size) title
------------------

some text


possible other titles we don't want
-----------------------------------

more text more text

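As I have tried to indicate, the size of the YAML header and the line on which the first subtitle appears vary, so I cannot count on knowing the line number of any change ahead of time. I want to identify the first title (which should also be the first non-blank text after the closing ---). I then want to write that text into the YAML header as shown below, removing the title from the body and leaving the rest of the text unchanged:

Target output file: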

---
layout: post
tags: 
  - might 
  - be
  - variable 
  - number 
  - of 
  - these
category: ecology
title: my (h2 size) title
---



some text

possible other titles we don't want
-----------------------------------

more text more text

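It seems this should be a reasonable task for sed/awk etc., but my use of these tools is pretty basic and I have not been able to puzzle this one out.

I see I can search between words with sed -n '/word1/,/word2/p', but I am not sure how to convert that into a search between the second occurrence of ^---$ and the first occurrence of ^----+-$ (a line with more than 3 dashes), then how to drop the extra blank lines, and then how to paste the result into the YAML header above.

Perhaps with this many steps perl would be a better choice than sed, but I am even less familiar with it. Thanks for any hints or suggestions.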

score 2 · Accepted Answer

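Just do 2 passes: the first (when NR==FNR) to find the title and the line number before which you want it printed, the second to print it and the other lines when the line number is appropriate: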

$ cat tst.awk
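# Pass 1 (NR==FNR): record where the YAML header ends and where the title sits.
NR==FNR {
   if (hdrEnd && !title && NF)          {title = $0; titleStart=FNR; titleEnd=FNR+1 }
   if (hdrStart && !hdrEnd && /^---$/)  {hdrEnd   = FNR }   # only the first closing "---" counts
   if (!hdrStart && /^---$/)            {hdrStart = FNR }
   next
}
# Pass 2: print the title just before the closing "---", then skip the title and its underline.
FNR == hdrEnd { print "title:", title }
(FNR < titleStart) || (FNR > titleEnd)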

$ awk -f tst.awk file file
---
layout: post
tags: 
  - might 
  - be
  - variable 
  - number 
  - of 
  - these
category: ecology
title: my (h2 size) title
---




some text


possible other titles we don't want
-----------------------------------

more text more text

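hdrStart is the line number where the header starts, and so on. If you want to skip more lines around the title than just the text and the underline that follows it, just pad titleStart and titleEnd to FNR-1 and FNR+2 or whatever. FNR (File Number of Records) is the current line number within the currently open file, while NR (Number of Records) is the total number of lines read so far across all previously and currently open files.

If you do not want to specify the file name twice on the command line, you can duplicate it in awk's BEGIN section: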
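To make the NR/FNR distinction concrete, here is a rough Python sketch that mimics the two counters while the same lines are read twice. The function name `awk_counters` and the sample lines are made up for illustration; this only models the counting, it is not how awk itself is implemented.

```python
def awk_counters(files):
    """Yield (NR, FNR, line) the way awk numbers records across input files.

    `files` is a list of lists of lines, a hypothetical stand-in for real files.
    """
    nr = 0
    for lines in files:
        fnr = 0                      # FNR restarts at the top of every file
        for line in lines:
            nr += 1                  # NR keeps counting across files
            fnr += 1
            yield nr, fnr, line


sample = ["---", "layout: post", "---"]
# Pass the same "file" twice, as in `awk -f tst.awk file file`.
records = list(awk_counters([sample, sample]))
first_pass = [line for nr, fnr, line in records if nr == fnr]
second_pass = [line for nr, fnr, line in records if nr != fnr]
print(len(first_pass), len(second_pass))  # prints: 3 3
```

NR and FNR agree only while the first file is being read, which is why the `NR==FNR` block acts as the first pass.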

$ cat tst.awk
BEGIN{ ARGV[ARGC] = ARGV[ARGC-1]; ARGC++ }   # append a second copy of the last file name
NR==FNR {
   if (hdrEnd && !title && NF)          {title = $0; titleStart=FNR; titleEnd=FNR+1 }
   if (hdrStart && !hdrEnd && /^---$/)  {hdrEnd   = FNR }
   if (!hdrStart && /^---$/)            {hdrStart = FNR }
   next
}
FNR == hdrEnd { print "title:", title }
(FNR < titleStart) || (FNR > titleEnd)

Then you would just invoke the script as:

$ awk -f tst.awk file

EDIT: Actually, here is an alternative that does not use the 2-pass approach and is arguably simpler:

$ cat tst.awk
(state == 0) && /^---$/ { state=1; print; next }
(state == 1) && /^---$/ { state=2; next }
(state == 2) && /^./    { state=3; printf "title: %s\n---\n",$0; next }
(state == 3) && /^-+$/  { state=4; next }

state != 2 { print }

$ awk -f tst.awk file
---
layout: post
tags: 
  - might 
  - be
  - variable 
  - number 
  - of 
  - these
category: ecology
title: my (h2 size) title
---

some text


possible other titles we don't want
-----------------------------------

more text more text

If you are familiar with state machines, it should be obvious what it is doing; if not, let me know.
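For anyone who finds the awk terse, the same state machine can be sketched in Python. This is only an illustration of the states used above; the function name `add_title` is hypothetical, and the input is assumed to be a list of lines without trailing newlines.

```python
import re


def add_title(lines):
    """Single-pass state machine mirroring the awk rules above."""
    out = []
    state = 0
    for line in lines:
        if state == 0 and line == "---":
            state = 1                # opening delimiter: now inside the header
            out.append(line)
        elif state == 1 and line == "---":
            state = 2                # closing delimiter: hold it back for now
        elif state == 2:
            if line:                 # first non-empty line is the title
                out.append("title: " + line)
                out.append("---")    # emit the delimiter that was held back
                state = 3
            # blank lines before the title are dropped
        elif state == 3 and re.fullmatch(r"-+", line):
            state = 4                # drop the title's underline
        else:
            out.append(line)         # everything else passes through
    return out
```

States 0 to 4 correspond to: before the header, inside the header, looking for the title, looking for the underline, and copying the rest of the body.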

score 1 · Accepted Answer

A quick and dirty bit of perl:

$/=undef;  # no record separator, so that the following read slurps the full file
my $all=<STDIN>;
my @parts=split(/^(----*)$/m,$all); # split into sections delimited by all-dash lines (delimiters are kept)
my @head=split("\n",$parts[2]);  # split the header into lines
my @tit=split("\n",$parts[4]);  # split the title section into lines
push @head,"title: ".pop @tit;  # remove the last line from the title section and append it to the header
$parts[2]=join("\n",@head)."\n"; # rebuild the header
$parts[4]=join("\n",@tit);       # rebuild the title section
$parts[5]="";                    # drop the title's underline
print join("",@parts);           # rebuild everything and print to stdout

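This may not be robust enough for you: it does not care whether a delimiter has exactly 3 dashes or more, it assumes UNIX newlines, it does not check that the title is non-blank, and so on. But it might be useful as a starting point, or if you only need to run it once. An alternative is to read all the lines into an array in memory, loop over them looking for the delimiter lines, and move the title line.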
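The split-on-delimiter idea carries over to other languages as well. Below is a minimal Python sketch of it, assuming UNIX newlines and a file that starts with the opening ---; the function name `move_title` is made up for illustration and the same robustness caveats apply.

```python
import re


def move_title(text):
    # A capturing group makes re.split keep the delimiter lines in the result,
    # so the pieces can simply be joined back together afterwards.
    parts = re.split(r"^(-{3,})$", text, flags=re.MULTILINE)
    # Layout assumed for a well-formed file:
    #   parts[1] opening "---", parts[2] header body, parts[3] closing "---",
    #   parts[4] blank lines + title, parts[5] underline, parts[6:] the rest
    if len(parts) < 7:
        return text                      # unexpected shape: leave untouched
    section = parts[4].rstrip("\n").split("\n")
    title = section.pop()                # last line before the underline
    parts[2] += "title: " + title + "\n"
    parts[4] = "\n".join(section)        # what remains is blank lines
    parts[5] = ""                        # drop the underline
    return "".join(parts)
```

It could be driven from standard input with something like `sys.stdout.write(move_title(sys.stdin.read()))`.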

score 0 · Accepted Answer

Good old Python:

with open("i.yaml") as fp:
    lines = fp.readlines()

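in_body = False  # becomes True once the closing "---" of the YAML header is seen
target = -1      # 0-based index of the closing "---" line
source = -1      # 0-based index of the title line

for i, line in enumerate(lines):
    if in_body and line.strip() != "":
        source = i  # first non-blank line after the header is the title
        break
    if i > 0 and line.strip() == "---":
        in_body = True
        target = i

# Insert the title just before the closing "---"; this shifts the title line to source + 1.
lines[target:target] = ["title: " + lines[source]]
del lines[source + 1]  # remove the title line from the body
del lines[source + 1]  # remove its dashed underline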

with open("o.yaml", "w") as fp:
    fp.writelines(lines)

score 0 · Accepted Answer

Maybe this Perl code will help you find a solution:

#!/usr/bin/env perl

use Modern::Perl;
use File::Slurp;

my @file_content = read_file('test.yml');
my ($start, $stop, $title);
foreach my $line (@file_content) {

    if ($line =~ m{ --- }xms) {
        if (!$start) {
            $start = 1;
        }
        else {
            $stop = 1;
            next;
        }
    }    

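    if ($line && $stop && $line =~ m{\w}xms) {   # first line containing a word character after the header
        chomp($title = $line);                   # keep the title without its trailing newline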
        last;
    }


}

say "Title: $title";

Output for the data above: Title: my (h2 size) title
