
The goal: To loop through a folder of text files, extract all the end-of-line, word-wrapped, hyphenated words, and collate them into a list.

001.txt be-littled
001.txt dev-eloper
002.txt sand-wich
...

The purpose is to scan the list and differentiate the valid hyphenated words from the merely word-wrapped ones (e.g., twenty-four versus dev-eloper).

My current Bash/sed script catches most of the words correctly, which is enough for my purposes. I know it needs some tweaking (for example, when the hyphenated word ends the paragraph).

But right now, I can't get the current filename into the pattern space.

for f in *.txt
  do
    sed -rn 'N;/PATTERN/!{D};s:PATTERN:\3-\5\n\7:;P;D' * > output.txt;
  done

..where PATTERN = (^.)( +)(.+)(-\n)(\S+)( +)(.$)

or

for f in *.txt; do sed -rn 'N;/(^.*)( +)(.+)(-\n)(\S+)( +)(.*$)/!{D};s:(^.*)( +)(.+)(-\n)(\S+)( +)(.*$):\3-\5\n\7:;P;D' * > output.txt;done

I tried putting '"$f"' just before the \3 but that just prepends the last page on all lines (i.e., '250.txt be-littled').

I suspect my code isn't doing exactly what I think it's doing. :-) Maybe I don't grok the loop order of sed within bash.

I'm using Ubuntu 12.10 and just started learning bash and sed a few weeks ago. I'm open to suggestions.

Thanks,


3 Answers


I don't know what you mean by word-wrapped, but this might work:

grep -oH "[^ ]*-[^ ]*$" *.txt | sed 's/:/ /'

The trailing sed call is only there to make the output match yours: it replaces the first : that grep adds with a space.

Output:

$ cat 001.txt 
be-littled
dev-eloper

$ cat 002.txt 
sand-wich

$ grep -oH "[^ ]*-[^ ]*$" *.txt | sed 's/:/ /'
001.txt be-littled
001.txt dev-eloper
002.txt sand-wich

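Note: to refine the expression passed to grep so that it meets your needs, I would first have to understand those needs; I really didn't get the idea from "word-wrapped".

answered 2013-03-02T17:21:50.877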

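I don't see why you didn't succeed in writing the filename. You wrote that you tried '"$f"' before the \3, which I think should work. I did almost the same thing, but used double quotes for the entire sed command so that I didn't need the '"..."' construct.
You should also use >> instead of > when writing to the result file; otherwise you overwrite the result file for each new file in the loop. It's probably a typo, but I think you should have $f rather than * at the end of the sed line:
... * > output.txt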

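Use double quotes for the sed command, a space after the ! in ! {D}, >> output.txt for the redirection, and write the filename $f in the substitution (also use @ as the substitution delimiter so that <file>: can appear in the result):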

for f in *.txt; do
  sed -rn "N;/(^.*)( +)(.+)(-\n)(\S+)( +)(.*$)/! {D};s@(^.*)( +)(.+)(-\n)(\S+)( +)(.*$)@$f: \3-\5\n\7@;P;D" "$f" >> output.txt
done

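I haven't reviewed your pattern, but it seemed to work well when I tested it.

I tried it on two small files: one containing the wrapped words from your question, and another with some lines of "dummy words".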

[]$ cat tf1.txt
asdf asdf be-
littled asdf asdf
asfd dev-
eloper asdf sand-
wich asdf asdf.
[]$ cat tf2.txt
asfd abc-
de lsdk laskfjd
asdf asdf 1234-
56 sdl sdg
sdfg

Output:

[]$ ./tfwordwrap.sh
tf1.txt: be-littled
tf1.txt: dev-eloper
tf1.txt: sand-wich
tf2.txt: abc-de
tf2.txt: 1234-56
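
The symptom described in the question (every line tagged with the last filename) can be reproduced in a scratch directory. This is only a sketch: the files a.txt and b.txt and the output names buggy.log and fixed.log are made up for the demo, and a trivial substitution stands in for the real pattern.

```shell
#!/bin/bash
# Demo only: a.txt, b.txt, buggy.log and fixed.log are hypothetical names.
cd "$(mktemp -d)" || exit 1
printf 'one\n' > a.txt
printf 'two\n' > b.txt

# Shape of the loop in the question: every pass reads all files (*.txt),
# and '>' truncates the result, so only the last pass (f=b.txt) survives.
for f in *.txt; do
  sed "s@^@$f: @" *.txt > buggy.log
done

# Corrected shape: each pass reads only "$f", and '>>' appends.
for f in *.txt; do
  sed "s@^@$f: @" "$f" >> fixed.log
done

cat buggy.log   # b.txt: one / b.txt: two
cat fixed.log   # a.txt: one / b.txt: two
```

The outputs are written with a .log extension on purpose: a result file named output.txt would itself match *.txt on a second run.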
answered 2013-03-02T19:30:20.760

I don't know how to get the current filename with sed. If you don't mind using perl, try this perl script:

use strict;
use warnings;

my $hyphen;

while (<>) {
    next if (m/^\s*$/);

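    if ($hyphen) {
        # Complete the wrapped word with the first word of this line.
        # Capturing into a variable avoids reusing a stale $1 if nothing matches.
        my ($rest) = m/^\s*(\w+)/;
        print defined $rest ? "$rest\n" : "\n";
        $hyphen = 0;
    }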

    if (m/(\w+-)\s*$/) {
        print "$ARGV $1";
        $hyphen = 1;
    }
}

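This script prints the last hyphenated fragment of a line together with the filename and sets a flag. On the next line, it checks this flag and prints the first word of that line. It also skips empty lines.

You call it as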

perl hyphen.pl file1.txt file2.txt ...
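
For comparison, the same flag idea can be sketched with awk, which exposes the current input file as the built-in variable FILENAME. This is an illustration rather than a drop-in replacement: the function name list_wrapped is made up, fragments are split on whitespace instead of \w+, and the pending fragment is reset at the start of each file (the Perl script above does not do that).

```shell
#!/bin/bash
# Sketch only: list_wrapped is a hypothetical helper name.
list_wrapped() {
  awk '
    FNR == 1 { pending = "" }        # do not carry a fragment across files
    NF == 0  { next }                # skip empty or whitespace-only lines
    pending != "" {                  # previous line ended with a hyphen
      print pending $1               # join the fragment with the first word
      pending = ""
    }
    $NF ~ /-$/ { pending = FILENAME " " $NF }   # remember "file frag-"
  ' "$@"
}

# Usage: list_wrapped file1.txt file2.txt ...
```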
answered 2013-03-02T16:39:46.583