bash - Dynamic document indexing with awk

Question

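I have a document in which I need to dynamically create/update an index. I am trying to accomplish this with awk. I have a partially working example, but now I'm stumped.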

The example document is below.

number.txt:
    #) Title
    #) Title
    #) Title
    #.#) Subtitle
    #.#.#) Section
    #.#) Subtitle
    #) Title
    #) Title
    #.#) Subtitle
    #.#.#) Section
    #.#) Subtitle
    #.#.#) Section
    #.#.#.#) Subsection
    #) Title
    #) Title
    #.#) Subtitle
    #.#.#) Section
    #.#.#.#) Subsection
    #.#.#.#) Subsection

The desired output would be:

1) Title
2) Title
3) Title
3.1) Subtitle
3.1.1) Section
3.2) Subtitle
4) Title
5) Title
5.1) Subtitle
5.1.1) Section
5.2) Subtitle
5.2.1) Section
5.2.1.1) Subsection
6) Title
7) Title
7.1) Subtitle
7.1.1) Section
7.1.1.1) Subsection
7.1.1.2) Subsection

The partially working awk code I have is below.

numbers.sh:
    awk '{for(w=1;w<=NF;w++)if($w~/^#\)/){sub(/^#/,++i)}}1' number.txt

Any help with this would be greatly appreciated.
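To make "partially working" concrete, here is a small sketch that runs the one-liner above on three sample lines. The function name number_top_level is only an illustrative wrapper, not part of the question. The test `$w ~ /^#\)/` matches only a field that starts with a bare `#)`, so top-level titles are numbered while a `#.#)` line passes through unchanged.

```sh
# The question's one-liner, wrapped in a function (hypothetical name).
number_top_level() {
  awk '{for(w=1;w<=NF;w++)if($w~/^#\)/){sub(/^#/,++i)}}1' "$@"
}

# Only the two top-level lines get a number; the "#.#)" line is left as-is.
printf '%s\n' '#) Title' '#.#) Subtitle' '#) Title' | number_top_level
```

With this input the sketch prints `1) Title`, then the untouched `#.#) Subtitle`, then `2) Title`.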

score 4 · Accepted Answer

I have implemented an AWK script for you! It also works for indexes with more than four levels! ;)

I will try to explain it a bit with inline comments:

#!/usr/bin/awk -f

# Clears the "array" starting from "from"                                       
function cleanArray(array,from){                                                
    for(w=from;w<=length(array);w++){                                           
        array[w]=0                                                              
    }                                                                           
}                                                                               

# This is executed only one time at beginning.                                  
BEGIN {                                                                         
    # The key of this array will be used to point to the "text index".
    # I.E., an array with (1 2 2) means an index "1.2.2)"           
    array[1]=0      
}                                                                               

# This block will be executed for every line.                                   
{                                                                               
    # Amount of "#" found.                                                      
    amount=0                                                                    

    # In this line will be stored the result of the line.                       
    line=""                                                                     

    # Let's save the entire line in a variable to modify it.                    
    rest_of_line=$0                                                             

    # While the line still starts with "#"...                                   
    while(rest_of_line ~ /^#/){                                                 

        # We remove the first 2 characters.                                     
        rest_of_line=substr(rest_of_line, 3, length(rest_of_line))              

        # We found one "#", let's count it!                                     
        amount++                                                                

        # The line still starts with "#"?                                       
        if(rest_of_line ~ /^#/){                                                
            # yes, it still starts.                                             

            # let's print the appropriate number and a ".".
            line=line""array[amount]                                            
            line=line"."                                                        
        }else{                                                                  
            # no, so we must add 1 to the old value of the array.       
            array[amount]++                                                     

            # And we must clean the array if it stores more values              
            # starting from amount plus 1. We don't want to keep                
            # storing garbage numbers that may harm our accounting              
            # for the next line.                                                
            cleanArray(array,amount + 1)                                        

            # let's print the appropriate number and a ")".
            line=line""array[amount]                                            
            line=line")"                                                        
        }                                                                       
    }                                                                           

    # Great! We have the line with the appropriate indexes!
    print line""rest_of_line                                                    
}

So, if you save it as script.awk, you can make it executable by adding execute permission to the file:

chmod u+x script.awk

Finally, you can run it:

./script.awk <path_to_number.txt>

For example, if you saved script.awk in the same directory as number.txt, change to that directory and run:

./script.awk number.txt

So, if you have this number.txt:

#) Title
#) Title
#) Title
#.#) Subtitle
#.#.#) Section
#.#) Subtitle
#) Title
#) Title
#.#) Subtitle
#.#.#) Section
#.#) Subtitle
#.#.#) Section
#.#.#.#) Subsection
#) Title
#) Title
#.#) Subtitle
#.#.#) Section
#.#.#.#) Subsection
#.#.#.#.#) Subsection
#.#.#.#.#) Subsection
#.#.#.#.#) Subsection
#.#.#.#.#.#) Subsection
#.#.#.#.#) Subsection
#.#.#.#.#.#) Subsection
#.#.#.#.#.#) Subsection
#.#.#.#.#.#) Subsection
#.#.#.#.#.#) Subsection
#.#.#.#.#) Subsection
#.#.#.#) Subsection
#.#.#) Section

This will be the output (note that the solution is not limited by the number of "#"):

1) Title
2) Title
3) Title
3.1) Subtitle
3.1.1) Section
3.2) Subtitle
4) Title
5) Title
5.1) Subtitle
5.1.1) Section
5.2) Subtitle
5.2.1) Section
5.2.1.1) Subsection
6) Title
7) Title
7.1) Subtitle
7.1.1) Section
7.1.1.1) Subsection
7.1.1.1.1) Subsection
7.1.1.1.2) Subsection
7.1.1.1.3) Subsection
7.1.1.1.3.1) Subsection
7.1.1.1.4) Subsection
7.1.1.1.4.1) Subsection
7.1.1.1.4.2) Subsection
7.1.1.1.4.3) Subsection
7.1.1.1.4.4) Subsection
7.1.1.1.5) Subsection
7.1.1.2) Subsection
7.1.2) Section

I hope it helps you!

score 4 · Accepted Answer

awk to the rescue!

I'm not sure this is the best way to do it, but it works...

awk    'BEGIN{d="."}
/#\.#\.#\.#/ {sub("#.#.#.#", i d a[i] d b[i d a[i]] d (++c[i d a[i] d b[i d a[i]]]))}
   /#\.#\.#/ {sub("#.#.#"  , i d a[i] d (++b[i d a[i]]))}
      /#\.#/ {sub("#.#"    , i d (++a[i]))}
         /#/ {sub("#"      , (++i))} 1'

Update: the above is limited to 4 levels. Here is a better one that handles an unlimited number of levels:

 awk '{d=split($1,a,"#")-1;                # find the depth
       c[d]++;                             # increase counter for current          
       for(i=pd+1;i<=d;i++) c[i]=1;        # reset when depth increases
       for(i=1;i<=d;i++) {sub(/#/,c[i])};  # replace digits one by one
       pd=d} 1'                            # set previous depth and print

Perhaps the reset step could be combined with the main loop, but I think it is clearer this way.
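The depth calculation is the key step, so here it is in isolation. This is a minimal sketch; the helper name depth_of and the array name parts are made up for illustration. Splitting the first field on "#" yields one more piece than there are "#" characters, which is why 1 is subtracted.

```sh
# Print the heading depth of a single line (hypothetical helper).
depth_of() {
  # "#.#.#)" splits on "#" into "", ".", ".", ")": 4 pieces, so depth 3
  printf '%s\n' "$1" | awk '{ print split($1, parts, "#") - 1 }'
}

depth_of '#) Title'         # depth 1
depth_of '#.#.#) Section'   # depth 3
```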

Update 2:

I think that, following this logic, the following is the shortest.

$ awk '{d=split($1,_,"#")-1;      # find the depth
        c[d]++;                   # increment counter for current depth
        for(i=1;i<=d;i++)         # start replacement
           {if(i>pd)c[i]=1;       # reset the counters
            sub(/#/,c[i])         # replace placeholders with counters
           }
           pd=d} 1' file          # set the previous depth

Or as a one-liner:

$ awk '{d=split($1,_,"#")-1;c[d]++;for(i=1;i<=d;i++){if(i>pd)c[i]=1;sub(/#/,c[i])}pd=d}1'

score 2 · Accepted Answer

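Same approach as @karakfa's (short and sweet), and with the same caveat about an assumed maximum number of subheading levels, but shorter and more efficient: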

awk 'BEGIN{d="."}
  /#\.#\.#\.#/ {sub("#.#.#.#", i d a d b d (++c) )}
     /#\.#\.#/ {sub("#.#.#"  , i d a d (++b) );  c=0;}
        /#\.#/ {sub("#.#"    , i d (++a));       b=0;}
           /#/ {sub("#"      , (++i));           a=0;} 1'

score 2 · Accepted Answer

Here's my take on this. Tested in FreeBSD, so I expect it will work pretty much anywhere...

#!/usr/bin/awk -f

BEGIN {
  depth=1;
}

$1 ~ /^#(\.#)*\)$/ {
  thisdepth=split($1, _, ".");

  if (thisdepth < depth) {
    # end of subsection, back out to current depth by deleting array values
    for (; depth>thisdepth; depth--) {
      delete value[depth];
    }
  }
  depth=thisdepth;

  # Increment value of last member
  value[depth]++;

  # And substitute it into the current line.
  for (i=1; i<=depth; i++) {
    sub(/#/, value[i], $0);
  }
}

1

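The basic idea is that we maintain an array (value[]) of the nested chapter values. After updating the array as required, we step through the values, each time substituting the first occurrence of the octothorpe (#) with the current value for that position of the array.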

This will handle any level of nesting, and as I mentioned above, it should work in both GNU (Linux) and non-GNU (FreeBSD, OSX, etc.) versions of awk.

And of course, if one-liners are your thing, this can be compacted:

awk -vd=1 '$1~/^#(\.#)*\)$/{t=split($1,_,".");if(t<d)for(;d>t;d--)delete v[d];d=t;v[d]++;for(i=1;i<=d;i++)sub(/#/,v[i],$0)}1'

For ease of reading, it could also be expressed like this:

awk -vd=1 '$1~/^#(\.#)*\)$/{              # match only the lines we care about
    t=split($1,_,".");                    # this line has 't' levels
    if (t<d) for(;d>t;d--) delete v[d];   # if levels decrease, trim the array
    d=t; v[d]++;                          # reset our depth, increment last number
    for (i=1;i<=d;i++) sub(/#/,v[i],$0)   # replace hash characters one by one
  } 1'                                    # and print.

UPDATE

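After thinking about this for a while, I realized it can be shrunk further. The for loop contains its own condition, so there is no need to put it inside an if. So: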

awk '{
    t=split($1,_,".");                  # get current depth
    v[t]++;                             # increment counter for depth
    for(;d>t;d--) delete v[d];          # delete record for previous deeper counters
    d=t;                                # record current depth for next round
    for (i=1;i<=d;i++) sub(/#/,v[i],$0) # replace hashes as required.
  } 1'

This of course shrinks down to a one-liner like this:

awk '{t=split($1,_,".");v[t]++;for(;d>t;d--)delete v[d];d=t;for(i=1;i<=d;i++)sub(/#/,v[i],$0)}1' file

Obviously, you can add the initial match condition if you need to, so that you only process lines that look like titles.
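As a sketch of that variant, the shortened script can be combined with the match condition from the first version. The wrapper name renumber_headings is hypothetical. Without the guard, an ordinary text line would also run `v[t]++` with t=1 and silently bump the top-level counter; with it, such lines are printed untouched.

```sh
# Renumber heading lines only; everything else passes through unchanged.
# (renumber_headings is an illustrative name, not from the answer.)
renumber_headings() {
  awk '$1 ~ /^#(\.#)*\)$/ {              # only lines whose first word is a heading marker
         t = split($1, _, ".")           # depth of this heading
         v[t]++                          # bump the counter at that depth
         for (; d > t; d--) delete v[d]  # drop counters of deeper levels
         d = t                           # remember depth for the next line
         for (i = 1; i <= d; i++) sub(/#/, v[i], $0)
       } 1' "$@"
}

printf '%s\n' '#) Title' 'some body text' '#) Title' '#.#) Subtitle' | renumber_headings
```

Here the body line does not disturb the numbering: the second title still becomes `2) Title` and the subtitle `2.1) Subtitle`.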

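Despite being a few characters longer, I believe this version runs slightly faster than karakfa's similar solution, probably because it avoids the extra if on each iteration of the for loop.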

UPDATE #2

I'm including this because I found it fun and interesting. You can do this in bash alone, with no need for awk. And it's not much longer in terms of code.

#!/usr/bin/env bash

while read word line; do
  if [[ $word =~ [#](\.#)*\) ]]; then
    IFS=. read -ra a <<<"$word"
    t=${#a[@]}
    ((v[t]++))
    for (( ; d > t ; d-- )); do unset "v[$d]"; done
    d=$t   # remember the numeric depth ($t, not the literal string "t")
    for (( i=1 ; i <= t ; i++ )); do
      word=${word/[#]/${v[i]}}
    done
  fi
  echo "$word $line"
done < input.txt

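This follows the same logic as the awk script above, but works entirely in bash, using parameter expansion to replace the # characters. One flaw it suffers from is that it does not preserve whitespace around the first word on every line, so you lose any indentation. With a bit of work, that could be mitigated too.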
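One possible way to keep the indentation, offered as a rough sketch rather than a polished solution: read each line whole, peel off the leading whitespace and the first word with parameter expansion, and glue the pieces back together afterwards. It needs bash (arrays and `[[ ]]`), reads from standard input, and the function and variable names are all made up for this example.

```bash
# bash only. Renumber headings while preserving indentation and inner spacing.
renumber_keep_indent() {
  local line indent rest word tail t i d=0
  local re='^#(\.#)*\)$'            # heading marker such as "#)" or "#.#.#)"
  local -a v=() parts=()
  while IFS= read -r line || [[ -n $line ]]; do
    indent=${line%%[![:space:]]*}   # leading whitespace, possibly empty
    rest=${line#"$indent"}
    word=${rest%%[[:space:]]*}      # first word of the line
    tail=${rest#"$word"}            # remainder, original spacing intact
    if [[ $word =~ $re ]]; then
      IFS=. read -ra parts <<<"$word"
      t=${#parts[@]}                # depth = number of dot-separated parts
      v[t]=$(( ${v[t]:-0} + 1 ))    # bump the counter at this depth
      for (( ; d > t; d-- )); do unset "v[$d]"; done
      d=$t
      for (( i = 1; i <= t; i++ )); do
        word=${word/[#]/${v[i]}}    # replace the first remaining "#"
      done
    fi
    printf '%s%s%s\n' "$indent" "$word" "$tail"
  done
}
```

It would be used as `renumber_keep_indent < input.txt`. Lines that do not start with a heading marker are echoed exactly as they were read.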

Enjoy.

score 2 · Accepted Answer

gawk

awk 'function w(){
    k=m>s?m:s
    for(i=1;i<=k;i++){
        if(i>m){
            a[i]=0
        }
        else{
            a[i]=(i==m)?++a[i]:a[i]   #ended "#" increase
            sub("#",a[i]=a[i]?a[i]:1) 
        }
    }
    s=m
}
{m=split($1,t,"#")-1;w()}1' file



1) Title
2) Title
3) Title
3.1) Subtitle
3.1.1) Section
3.2) Subtitle
4) Title
5) Title
5.1) Subtitle
5.1.1) Section
5.2) Subtitle
5.2.1) Section
5.2.1.1) Subsection
6) Title
7) Title
7.1) Subtitle
7.1.1) Section
7.1.1.1) Subsection
7.1.1.2) Subsection

score 1 · Accepted Answer

Here is another approach.

An explanation is provided below the code.

awk 'BEGIN {n0=1; prev=0}
   {n1=split($1, elems, ".");  # Get the number of pound signs
    dif = (n1-n0);             # Increase in topic depth from previous line
    scale = (10 ^ dif);        # 10 raised to dif
    current=(int(prev*scale)+1);  # scale the number by change in depth
    withdots=gensub(/([0-9])/, "\\1." , "g", current);  # dot after each digit (gensub is gawk-only)
    sub(/\.$/, ")", withdots);    # turn the trailing dot into ")"
    {print withdots, $2 }
     n0=n1;
     prev=current}' number.txt


1) Title
2) Title
3) Title
3.1) Subtitle
3.1.1) Section
3.2) Subtitle
4) Title

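Think of the topic numbers as decimal numbers.
We get the current number from the previous one with the formula int(prev * 10 ^ dif) + 1,

where dif = (increase in number of levels from the previous line). Initially dif is 0, so we get 2 from 1 and 3 from 2,
via 1 * (10 ^ 0) + 1 = 1 * 1 + 1 = 2
and 2 * (10 ^ 0) + 1 = 2 * 1 + 1 = 3

Then we get 31 from 3 via 3 * (10 ^ 1) + 1 = 31,
and 32 from 311 via int(311 * (10 ^ -1)) + 1 = 31 + 1 = 32, and so on.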
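The arithmetic can be traced on its own with a small sketch that takes one depth per line and prints the running value of current. The function name trace_current is hypothetical, and plain POSIX awk is assumed since no gensub is needed here.

```sh
# Read one heading depth per line; print the packed decimal index for each.
trace_current() {
  awk 'BEGIN { n0 = 1; prev = 0 }
       { dif = $1 - n0                       # change in depth from previous line
         current = int(prev * 10 ^ dif) + 1  # shift digits left or right, then add 1
         print current
         n0 = $1; prev = current }'
}

# Depths for: Title, Title, Title, Subtitle, Section, Subtitle, Title
printf '%s\n' 1 1 1 2 3 2 1 | trace_current
```

For those depths it prints 1, 2, 3, 31, 311, 32 and 4, one per line. Because each level occupies exactly one decimal digit, this scheme appears to be limited to at most nine entries per level: a tenth top-level title would yield 10, which reads as two levels once dots are inserted.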
