The goal: loop through a folder of text files, extract every hyphenated word that word wrapping split across the end of a line, and collate them into a list like this:
001.txt be-littled
001.txt dev-eloper
002.txt sand-wich
...
The purpose is to scan the list and separate the genuinely hyphenated words from those that were merely split by word wrapping (e.g., twenty-four versus dev-eloper).
My current Bash/sed script catches most of the words correctly, which is enough for my purposes. I know it still needs some tweaking (for example, when the hyphenated word ends the paragraph).
But right now, I can't get the current filename into the pattern space.
for f in *.txt
do
sed -rn 'N;/PATTERN/!{D};s:PATTERN:\3-\5\n\7:;P;D' * > output.txt;
done
where PATTERN = (^.*)( +)(.+)(-\n)(\S+)( +)(.*$)
or, written out in full on one line:
for f in *.txt; do sed -rn 'N;/(^.*)( +)(.+)(-\n)(\S+)( +)(.*$)/!{D};s:(^.*)( +)(.+)(-\n)(\S+)( +)(.*$):\3-\5\n\7:;P;D' * > output.txt;done
I tried putting '"$f"' just before the \3, but that just prepends the name of the last page to every line (e.g., '250.txt be-littled').
I suspect my code isn't doing exactly what I think it's doing. :-) Maybe I don't grok the loop order of sed within bash.
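One thing I have not ruled out: inside the loop, `*` hands every file to sed on each pass, and `>` truncates output.txt each time, so only the result of the last pass (with the last value of "$f") would survive. Below is a rough sketch of the variant that reads only "$f" and leaves the redirect to the caller. It assumes GNU sed and filenames free of ':', '&', and backslashes, and `collect_hyphenated` is just a name I made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: collect_hyphenated is a hypothetical helper name.
# Assumes GNU sed (-r, \S, \n in the replacement) and filenames without
# ':', '&', or backslashes, because the name is spliced into the s::: command.
collect_hyphenated() {
  local dir=$1 f name
  local pat='(^.*)( +)(.+)(-\n)(\S+)( +)(.*$)'
  for f in "$dir"/*.txt; do
    [ -e "$f" ] || continue   # the glob matched nothing
    name=${f##*/}             # strip the directory part
    # Only "$f" is read here, so "$name" is the file the match came from.
    sed -rn 'N;/'"$pat"'/!{D};s:'"$pat"':'"$name"' \3-\5\n\7:;P;D' "$f"
  done
}
```

Calling it as `collect_hyphenated . > output.txt` redirects once, after the whole loop has run, so earlier files are not overwritten by later ones.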
I'm using Ubuntu 12.10 and just started learning bash and sed a few weeks ago. I'm open to suggestions.
Thanks,