perl - 修复.txt文件中的断线的脚本？

Question

我想在我的 Kindle 上正确阅读书籍。

为了实现我的梦想，我需要一个脚本来修复 txt 文件中的断行。

例如，如果 txt 文件有这一行：

He watched Kahlan as she walked with her shoulders slumped
down.

...然后它应该通过删除单词“down”之前的换行符来修复它：

He watched Kahlan as she walked with her shoulders slumped down.

那么，程序员同胞们，（a）最简单的方法和（b）最好的语言是什么？

ps 该解决方案将涉及在第 1 列中搜索一个小写字母，并删除它之前的换行符以将这些行拼接在一起。在我试图修复的小说中，这种“流氓换行”出现了 120 万次。

score 2 · Accepted Answer

有很多方法可以做到这一点。我会推荐一些类似于 Perl、Python 或 Ruby 的东西。如果您希望使用快速而肮脏的单线来做到这一点，那么 Perl 在该部门具有优势。

例如，这将满足您的要求：

# Slurp entire file.
# Convert newlines followed by lower-case letter.
perl -p -e 'BEGIN {$/ = undef}    s/\n(?=[a-z])/ /g' book.txt

但如果段落由 2 个换行符分隔，这可能会更好。

# Process file a "paragraph" at a time.
# Convert newlines followed by at least 2 characters.
perl -p -e 'BEGIN {$/ = qq{\n\n}} s/\n(?=..)/ /g'    book.txt

score 1 · Accepted Answer

如果段落之间有空格：按段落读取文本（设置$/ = "\n\n"'），然后使用CPAN 中的Text::Autoformat。

示例（用常规文件句柄替换 DATA ——我只是在示例中使用它方便）：

use strict;
use warnings;
use Text::Autoformat;

local $/ = "\n\n";
while (<DATA>) {
    print autoformat $_, {left=>1, right=>80};
}


__DATA__
He watched Kahlan as she walked with her shoulders slumped 
down. 

He watched Kahlan as she walked with her shoulders slumped 
down. 
He watched Kahlan as she walked with her shoulders slumped 
down. 
He watched Kahlan as she walked with her shoulders slumped 
down. 

He watched Kahlan as she walked with her shoulders slumped 
down. 
He watched Kahlan as she walked with her shoulders slumped 
down.

输出：

He watched Kahlan as she walked with her shoulders slumped down.

He watched Kahlan as she walked with her shoulders slumped down. He watched
Kahlan as she walked with her shoulders slumped down. He watched Kahlan as she
walked with her shoulders slumped down.

He watched Kahlan as she walked with her shoulders slumped down. He watched
Kahlan as she walked with her shoulders slumped down.

score 0 · Accepted Answer

如果段落之间有换行符，您也许可以在一个好的文本编辑器中打开它，该编辑器具有“展开文本”选项。其中之一是 Mac 的 TextMate，但也可能有适用于 Windows 的选项。

score 0 · Accepted Answer

我会说解析这本书并寻找换行符的出现。如果它在一段时间后没有出现，则将其删除。唯一的问题是它在这种特殊情况下不起作用：

他看着 Kahlan 垂着肩膀走路。\n

他看着卡兰低着肩膀走路。

代替：

他看着卡兰低着肩膀走路。他看着卡兰低着肩膀走路。

在这种情况下，您必须确定段落的分隔方式（它们是两个换行符吗？）。如果是这种情况，请在一段时间后检查是否有两个换行符。如果不是，则删除第一个换行符。

score 0 · Accepted Answer

使用正则表达式匹配前面紧跟换行符的小写字符，然后用空格替换该换行符，应该可以解决问题。

这是一个 C# 实现；

    string UnwrapText(string input)
    {
        return Regex.Replace(input, Environment.NewLine + "[a-z]",
                            delegate(Match m)
                            {
                                return m.ToString().Replace(Environment.NewLine, " ");
                            });
    }

score 0 · Accepted Answer

如果段落以制表符开头，最有效的方法可能是删除所有不在制表符之前的换行符并用空格替换它们。

如果没有，您可以取消所有不在 2 个或更多换行符序列中的换行符。

您也可以取消所有不跟随句点的换行符，但如前所述，如果句子结束一行而不是段落结束，这将失败。

score 0 · Accepted Answer

用 vim 打开文件:set tw=0 noai，然后gggqG. 如果文件表现得相当好，那应该去掉段落中的所有换行符，同时保留段落换行符。

perl - 修复.txt文件中的断线的脚本？

7 回答 7

Related

Reference