python - Remove unwanted non-printable characters from large CSV files with millions of records - in Python 3 or 2.7

Question

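I receive large CSV files with millions of records, delimited by a comma, | or ^.
Some fields contain non-printable characters such as CR|LF, which are interpreted as the end of the record. This is on Windows 10.

I need to write Python that goes through the file and removes the CR|LF inside fields. However, I cannot remove all of them, because then the lines would be merged.

I have gone through several posts here on how to remove non-printable characters. I thought of loading the file into a pandas DataFrame, then checking every field for CR|LF and removing it. That seems a bit complicated. If you have quick code for doing this, it would be a great help.

Thanks in advance.

Sample file: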

record1, 111. texta, textb CR|LF
record2, 111. teCR|LF
xta, textb CR|LF
record3, 111. texta, textb CR|LF

The sample output file should be:

record1, 111. texta, textb CR|LF
record2, 111. texta, textb CR|LF
record3, 111. texta, textb CR|LF

CR = carriage return = 0x0d, LF = line feed = 0x0a

score 0 · Accepted Answer

Edit: Wed, 18 Sep, 23:46 UTC+2

Note:

record1|111. texta|textb|111CR|LF
record2|111. teCR|LF 
xta|text|111CR|LF
record3|111. texta|textb|111CR|LF

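This is the file we are going to parse.

Since we have a CSV file, we can assume that the data type of a given column is consistent across rows. In the sample above, the last column is always a number.

Because of this assumption, we can describe a genuine end of line, i.e. a numeric last field followed by CR|LF, with a regular expression (\|\d+CR\|LF).

If the regular expression does not match, we can remove the carriage return, because it is not the end of the line.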
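
To make the rule concrete, here is a small sketch that applies the same pattern to the first three sample lines. The helper name ends_record is made up for illustration, and the lines contain the literal placeholder text CR|LF, as in the sample, not real control characters.

```python
import re

# Same pattern as in the answer: a numeric last field followed by the
# literal placeholder text "CR|LF" used in the sample file.
END_OF_RECORD = re.compile(r"\|\d+CR\|LF")


def ends_record(line):
    """Return True if the line closes a record (numeric last field + CR|LF)."""
    return END_OF_RECORD.search(line) is not None


sample = [
    "record1|111. texta|textb|111CR|LF",
    "record2|111. teCR|LF",
    "xta|text|111CR|LF",
]
flags = [ends_record(line) for line in sample]
print(flags)  # [True, False, True]
```

Only the second line is flagged as incomplete, so only its CR|LF would be removed.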

import pandas as pd
from io import StringIO
import re

# A line is complete when it ends with `|<number>CR|LF` (numeric last field)
pattern = re.compile(r"\|\d+CR\|LF")
# Open the file and read the content
with open("a.txt") as f:
    data = f.readlines()
not_parsed = data.copy()
_max = len(data)
i = 0
parsed_data = []
# Iterate the data
while i < _max:
    # Remove the trailing newline and surrounding whitespace
    line = data[i].strip()
    # If the pattern does not match, the CR|LF sits inside a field:
    # drop it and join this line with the next one
    if not pattern.search(line) and i + 1 < _max:
        line = line.replace("CR|LF", "").strip()
        line = line + data[i + 1].strip()
        i += 1
    if line != "":
        parsed_data.append(line)
    i += 1
# Remove the remaining end-of-record markers
data = [line.replace("CR|LF", "") for line in parsed_data]

print("DATA BEFORE -> {}".format("".join(not_parsed)))
print("DATA NOW -> {}".format("\n".join(data)))
# Load the csv using pandas (the sample has no header row)
DATA = StringIO("\n".join(data))
df = pd.read_csv(DATA, delimiter="|", header=None)

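How does this remove the unwanted CR|LF but keep the wanted ones?

The file itself is not modified; it is read into a list, line by line. The carriage return is then replaced only where the regular expression does not match, and the result is loaded as a DataFrame.

Note: this was tested on Linux, which uses \n as the new line.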
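
The same idea can be sketched as a generator that processes one line at a time instead of reading the whole file into a list, which may matter with millions of records. This is only a sketch under the answer's assumptions: the delimiter is |, the last field of every record is numeric, and the lines end with real CR/LF characters rather than the placeholder text. The names repaired_records and LAST_FIELD are made up for illustration.

```python
import re

# Assumption (same as in the answer): the last field of every record is
# numeric, so a complete record ends with "<delimiter><digits>".
LAST_FIELD = re.compile(r"\|\d+$")


def repaired_records(lines):
    """Yield complete records, gluing together lines split by a stray CR/LF."""
    pending = ""
    for raw in lines:
        piece = raw.rstrip("\r\n")
        if not piece and not pending:
            continue  # skip blank lines between records
        pending += piece
        if LAST_FIELD.search(pending):
            yield pending
            pending = ""
    if pending:
        yield pending  # trailing fragment that never completed; emit as-is


sample = [
    "record1|111. texta|textb|111\r\n",
    "record2|111. te\r\n",
    "xta|text|111\r\n",
    "record3|111. texta|textb|111\r\n",
]
records = list(repaired_records(sample))
print(records)
```

For a real file, the lines could come from open(path, newline=""), which hands over the original line endings untranslated, and each record can be written out as soon as it is yielded. One limitation: a fragment that happens to end in |digits is treated as a complete record, so the rule is only as reliable as the assumption about the last column.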

score 0 · Accepted Answer

Run this script on your file (name it fix_csv.py, for example) to clean it:

#!/usr/bin/env python3

import sys

if len(sys.argv) < 3:
    sys.stderr.write('Please give the input filename and an output filename.\n')
    sys.exit(1)

# set the correct number of fields
nf = 3
# set the delimiter
delim = ','

inpf = sys.argv[1]
outf = sys.argv[2]

# In text mode Python translates '\n' to the platform line ending on write,
# so writing os.linesep here would produce '\r\r\n' on Windows.
newline = '\n'

with open(inpf, 'r') as inf, open(outf, 'w') as of:
    cache = []  # fields of a record that is still incomplete
    for line in inf:
        line = line.strip()
        if not line:
            continue  # ignore empty lines
        ls = line.split(delim)
        if len(ls) < nf or cache:
            if not cache:
                # first fragment of a broken record
                cache = ls
            else:
                # continuation: glue its first piece onto the broken field
                cache[-1] += ls[0]
                cache += ls[1:]
            if len(cache) == nf:
                of.write(delim.join(cache) + newline)
                cache = []
        else:
            of.write(line + newline)

Call it like this:

./fix_csv.py input.dat output.dat

Output:

record1, 111. texta, textb
record2, 111. texta, textb
record3, 111. texta, textb
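
Since the script relies on every record having exactly nf fields, it may be worth checking the field counts before and after cleaning. The sketch below is not part of the answer; field_counts is a hypothetical helper, and it assumes, like the script, that the delimiter never appears inside a quoted field.

```python
from collections import Counter


def field_counts(lines, delim=","):
    """Count how many non-empty lines have each number of fields."""
    return Counter(
        len(line.rstrip("\r\n").split(delim))
        for line in lines
        if line.strip()
    )


broken = [
    "record1, 111. texta, textb\n",
    "record2, 111. te\n",
    "xta, textb\n",
    "record3, 111. texta, textb\n",
]
fixed = [
    "record1, 111. texta, textb\n",
    "record2, 111. texta, textb\n",
    "record3, 111. texta, textb\n",
]
print(dict(field_counts(broken)))  # {3: 2, 2: 2}
print(dict(field_counts(fixed)))   # {3: 3}
```

A cleaned file should show a single field count. Note that a break inside the last field leaves the first fragment with the full number of fields, so neither this check nor the script can detect that case from field counts alone.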
