python - 快速多行正则表达式查找/替换 \r 和 \n

Question

我正在处理大约 1 GB 的大型文本数据集（最小的文件有大约 200 万行）。每一行应该被分成许多列。我说应该是因为有例外；虽然正常的行以结尾\r\n，但很多行被错误地划分为 2 到 3 行。

鉴于有 10 列，每行应该具有以下格式：

col_1 | col_2 | col_3 | ... | col_10\r\n

异常具有以下格式：

1.  col_1 | col_2 | col_3 ...\n
    ... | col_10\r\n

2.  col_1 | col_2 | col_3 ...\n
    ... | col_10\n
    \r\n

纠正这些异常的最快方法是什么？我使用正则表达式(^[^\r\n]*)\n（替换为$1）在文本编辑器（Mac 上的 TextMate）中对 1000 行样本进行了简单的查找/替换，并且效果很好。但文本编辑器显然无法处理大文件（>= 200 万行）。是否可以使用sed或grep（或在其他命令行工具中，甚至在 Python 中）使用等价的正则表达式来完成，以及如何？

score 1 · Accepted Answer

你的方法：

perl -pe 's/(^[^\r\n]*)\n/\1/' input > output

或者，消极的回顾：

perl -pe 's/(?<!\r)\n//' input > output

或者，删除所有\n并替换\r为\r\n：

perl -pe 's/\n//; s/\r/\r\n/' input > output

score 1 · Accepted Answer

为什么不awk？：

awk 'BEGIN{RS="\r\n"; FS="\n"; OFS=" "; ORS="\r\n";} {print $1,$2}' input

或 tr + sed：

cat input | tr '\n' ' ' | tr '\r' '\n' | sed 's/^ \(.*\)/\1\r/g'

python - 快速多行正则表达式查找/替换 \r 和 \n

2 回答 2

Related

Reference