awk - Using awk on prose

Question

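I have a sorted list of phrases, list.txt. I want to use awk to remove every entry in that list from a lengthy prose file, replacing each one with a return (newline). It is not hard to find examples of using awk to compare two files, but they all assume both files are neatly structured, and prose is not.

Here is the relevant part of the script: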

#! /bin/sh
...

sed '
s/[0-9]/\n/g        # strip out all numbers, replace with returns
s/[@€•\!¡%“”"_–=\*\&\/\?¿\,\.]/\n/g
' $1 > $1.z.tmp

cp stowplist.txt strip1.tmp

awk 'BEGIN { FS = "\t" } ; { print $1 }' SpanishGlossary.utf8 >> strip1.tmp
#sh ./awkwords SpanishGlossary.utf8 >> strip1.tmp

sort -u strip1.tmp > strip2.tmp

awk '{ print length(), $0 | "sort -rn" }' strip2.tmp > strip3.tmp
#echo "List ordered by length."

#echo "Now creating new script." # THIS AFFECTS THE SCRIPT, NOT THE OUTPUT FILE.
sed '
s/[0-9]//g      # strip out all numbers
s/[\t^\ *\ $]// # strip tabs, leading and trailing spaces
/^.\{0,5\}$/d       # delete lines with less than five characters
/^$/d           # delete blank lines
s/^/\\y/g           # begin word boundary
s/$/\\y/g           #end word boundary
s/\ /\\ /g      # make spaces into literals
' strip3.tmp > strip.tmp

echo "Eliminating existing entries. This may take a while."
awk 'NR==FNR{p = p s $0; s="|" ;next} {gsub(p,"\n");print}' strip.tmp $1.z.tmp > $1.1.tmp

...

Here is a representative sample of strip.tmp:

\yinfraestructura\ de\ la\ fabricación\y
\yFecha\ de\ Vencimiento\ del\ Contrato\y
\yfactores\ importantes\ a\ considerar\y
\yexcepto\ lo\ estrictamente\ personal\y
\yexamen\ de\ los\ ojos\ con\ dilatación\y
\yes\ un\ estado\ capitalista\ corrupto\y
\yes\ un\ derecho\ legal\ reconocido\ en\y
\yestimular\ la\ capacidad\ productiva\y
\yestimación\ de\ la\ edad\ gestacional\y
\yEste\ Programa\ de\ Transición\ Verde\y
\yEstán\ permanentemente\ enfrentados\y

Finally, a representative sample of the input text, with punctuation replaced by newlines:

Es la historia de más de un siglo del cooperativismo en Argentina
 con empresas en todos los rincones de nuestra geografía y en todos los sectores de la economía

En plena crisis del sistema económico mundial
 con creciente alarma frente al deterioro a escala planetaria de las condiciones medio ambientales
 la comunidad internacional ha declarado
 desde la Organización de las Naciones Unidas
 a éste como el Año Internacional de las Cooperativas

No es casualidad
 el mundo está buscando nuevos caminos
 nuevos paradigmas para organizarse

score 2 · Accepted Answer

@Kent posted:

awk 'NR==FNR{p[$0];next}{a[FNR]=$0}END{for(i=1;i<=FNR;i++){for(v in p)gsub(v,"",a[i]);print a[i]}}' file1 file2

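I changed the variable l to v for readability - never use l as a variable name, since it looks far too much like the number 1.

The above reads all of file2 into an array and then loops over that array doing the replacements, rather than simply doing the replacements as each line is read, e.g.: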

awk 'NR==FNR{p[$0];next} {for(v in p)gsub(v,"");print}' file1 file2

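A faster alternative, though, is to build a single RE string instead of an array of phrases to remove, so that you do one gsub() on each line of file2 rather than one gsub() per phrase in file1: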

awk 'NR==FNR{p = p s $0; s="|" ;next} {gsub(p,"");print}' file1 file2
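
A side effect worth noting, offered as a hedged aside: the per-phrase loop visits p in an unspecified order, so a short phrase may be removed before a longer phrase that contains it, whereas a single alternation should match leftmost-longest on POSIX-conforming awks regardless of the order in file1. A minimal sketch, where the phrases and the temporary file names are made up for illustration:

```shell
#!/bin/sh
# Hypothetical sample data: the short phrase is a prefix of the long one.
tmpdir=$(mktemp -d)
printf '%s\n' 'good' 'good for you' > "$tmpdir/file1"
printf '%s\n' 'line1 blah (good for you) blah' > "$tmpdir/file2"

# Join every phrase into one alternation: good|good for you
# A leftmost-longest matcher should consume the whole long phrase,
# even though the short one is listed first.
result=$(awk 'NR==FNR { p = p s $0; s = "|"; next }
              { gsub(p, ""); print }' "$tmpdir/file1" "$tmpdir/file2")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

If the awk in use does not behave this way, sorting the list longest-first, as the script in the question already does, is a reasonable fallback.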

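Note that all of the above do RE comparisons, so any RE metacharacters in file1 will affect what matches in file2. Since you are comparing against a sed solution, I assume that is fine.

If all you care about is speed, then this GNU awk solution might be faster: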
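
If the metacharacter behaviour is not wanted, one possible workaround is to treat each phrase as a plain string and cut it out with index() and substr() instead of gsub(). This is only a sketch under assumptions: the sample phrases, the file names, and the helper variable names (phrases, out, pos) are all invented here, and it is slower than the single-RE approach because it loops over every phrase for every line.

```shell
#!/bin/sh
# Hypothetical sample data containing RE metacharacters.
tmpdir=$(mktemp -d)
printf '%s\n' 'cost (approx.)' 'a+b' > "$tmpdir/phrases.txt"
printf '%s\n' 'the cost (approx.) is a+b today' 'aab costs approx' > "$tmpdir/prose.txt"

result=$(awk '
NR == FNR { if ($0 != "") phrases[++n] = $0; next }   # first file: literal phrases
{
    line = $0
    for (i = 1; i <= n; i++) {
        out = ""
        # index() is a plain substring search, so "(", "." and "+" are not special
        while ((pos = index(line, phrases[i])) > 0) {
            out = out substr(line, 1, pos - 1)               # keep text before the hit
            line = substr(line, pos + length(phrases[i]))    # continue after the hit
        }
        line = out line
    }
    print line
}' "$tmpdir/phrases.txt" "$tmpdir/prose.txt")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

The second sample line is there to show the difference: as an RE, a+b would match "aab", while the literal search leaves that line alone.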

$ gawk -v RS='\0' -v FS='\n' -v OFS='|' 'NR==FNR{NF--; p=$0; next} {gsub(p,"");print}' file1 file2
line1 blah () blah ()
line2 blah () blah ()()
line3 blah blah () ()()

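but it is quite cryptic, uses more memory than the others, and is not very extensible, so I would not bother with it.

I would use the solution above that builds "p" as a single RE and does one gsub() on each line.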
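
Adapted to the question, which wants each phrase replaced by a newline rather than deleted, that recommendation might look like the sketch below. The sample phrases and temporary file names are assumptions. The blank-line check is an added precaution, not part of the original answer: an empty line in the list would otherwise put an empty alternative into the RE, and how that is handled is not something to rely on.

```shell
#!/bin/sh
# Hypothetical sample data; note the blank line in the phrase list.
tmpdir=$(mktemp -d)
printf '%s\n' 'hi there' '' 'awk is nice' > "$tmpdir/list.txt"
printf '%s\n' 'uno(hi there)dos(awk is nice)tres' > "$tmpdir/prose.txt"

result=$(awk '
NR == FNR {
    # Build one alternation from the non-empty lines of the list.
    if ($0 != "") { p = p s $0; s = "|" }
    next
}
p != "" { gsub(p, "\n") }   # one gsub() per prose line, phrase -> newline
{ print }
' "$tmpdir/list.txt" "$tmpdir/prose.txt")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```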

score 1 · Accepted Answer

This one-liner may work for you:

 awk 'NR==FNR{p[$0];next}{a[FNR]=$0}END{for(i=1;i<=FNR;i++){for(l in p)gsub(l,"",a[i]);print a[i]}}' file1 file2

Note:

file1 is your list.txt
file2 is your prose

A small example:

kent$  head file*
==> file1 <==
good for you
hi there
awk is nice

==> file2 <==
line1 blah (hi there) blah (good for you)
line2 blah (awk is nice) blah (hi there)(good for you)
line3 blah blah (good for you) (awk is nice)(hi there)

kent$  awk 'NR==FNR{p[$0];next}{a[FNR]=$0}END{for(i=1;i<=FNR;i++){for(l in p)gsub(l,"",a[i]);print a[i]}}' file1 file2
line1 blah () blah ()
line2 blah () blah ()()
line3 blah blah () ()()
