This code reads all the .stm files in a directory, processes them, and saves the result to a file.
cat db/some_path/stm/*.stm | sort -k1,1 -k2,2 -k4,4n | \
sed -e 's:<F0_M>:<o,f0,male>:' \
-e 's:<F0_F>:<o,f0,female>:' \
-e 's:([0-9])::g' \
-e 's:<sil>::g' \
-e 's:([^ ]*)$::' | \
awk '{ $2 = "A"; print $0; }'
} | local/join_suffix.py > data/dev.orig/stm
Sample output:
AimeeMullins_2009P A inter_segment_gap 0 17.82 <o,,unknown> ignore_time_segment_in_scoring
AimeeMullins_2009P A AimeeMullins 17.82 28.81 <o,f0,female> i'd like to share with you a discovery that i made a few months ago while writing an article for italian wired i always keep my thesaurus handy whenever i'm writing anything but
AimeeMullins_2009P A AimeeMullins 28.81 40.266 <o,f0,female> i'd already finished editing the piece and i realized that i had never once in my life looked up the word disabled to see what i'd find let me read you the entry
AimeeMullins_2009P A inter_segment_gap 40.266 41.418 <o,,unknown> ignore_time_segment_in_scoring
I don't understand how the sed -e expressions format the text.
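My understanding of
awk '{ $2 = "A"; print $0; }'
is that, for each line, it takes the second word, checks whether it equals A, and then prints the first word. But what do expressions such as -e 's:<sil>::g' mean?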
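One way to see what each piece does is to run the same sed and awk commands on a single line. The sketch below does that; the input line is hypothetical (made up for illustration, not taken from the real .stm files), and the variable names `line`, `cleaned`, and `final` are assumptions of this example only.

```shell
# Hypothetical input line, invented for this demo.
line='talk1 1 spk1 0.00 5.00 <F0_M> hello(2) <sil> world (talk1-001)'

# Each -e adds one editing command; sed applies them in order to every line.
# s:OLD:NEW: is the substitute command, using ':' as the delimiter instead of '/'.
# A trailing 'g' replaces every match on the line, not just the first one.
cleaned=$(printf '%s\n' "$line" | sed \
  -e 's:<F0_M>:<o,f0,male>:' \
  -e 's:<F0_F>:<o,f0,female>:' \
  -e 's:([0-9])::g' \
  -e 's:<sil>::g' \
  -e 's:([^ ]*)$::')
# 1st/2nd: rewrite the speaker tag <F0_M> or <F0_F> into <o,f0,male> or <o,f0,female>.
# 3rd: delete every "(digit)" marker such as "(2)"; parentheses are literal here.
# 4th: delete every "<sil>" token (empty replacement).
# 5th: delete a parenthesised token without spaces at the end of the line.
printf '%s\n' "$cleaned"

# In awk, $2 = "A" is an assignment, not a comparison: field 2 is overwritten
# with "A", and print $0 then prints the whole (rebuilt) line.
final=$(printf '%s\n' "$cleaned" | awk '{ $2 = "A"; print $0; }')
printf '%s\n' "$final"
# prints: talk1 A spk1 0.00 5.00 <o,f0,male> hello world
```

Because assigning to a field makes awk rebuild the line with single spaces between fields, the leftover double space and trailing space from the sed deletions also disappear in the final output.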