python - How to dynamically identify an unknown delimiter in a data file?

Question

I have three input data files. Each uses a different delimiter for the data it contains. Data file one looks like this:

apples | bananas | oranges | grapes

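Data file two looks like this:

quarter, dime, nickel, penny

Data file three looks like this:

horse cow pig chicken goat

(The varying number of columns is intentional as well.)

My idea was to count the non-alphabetic characters and assume that the one with the highest count is the delimiter. However, the files with non-space delimiters also have spaces before and after each delimiter, so the space wins in all three files. Here is my code: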

def count_chars(s):
    """Count how often each candidate separator occurs in s."""
    valid_seps = [' ', '|', ',', ';', '\t']
    cnt = {}
    for c in s:
        if c in valid_seps:
            cnt[c] = cnt.get(c, 0) + 1
    return cnt

infile = 'pipe.txt'  # or 'comma.txt' or 'space.txt'
with open(infile, 'r') as f:
    records = f.read()
print(count_chars(records))

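It prints a dictionary with the counts of all the acceptable characters. In every case the space wins, so I can't rely on it to tell me what the delimiter is.

But I can't think of a better way to do this.

Any suggestions?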

score 96 · Accepted Answer

How about trying the Sniffer class from Python's standard csv module: http://docs.python.org/library/csv.html#csv.Sniffer

import csv

sniffer = csv.Sniffer()
dialect = sniffer.sniff('quarter, dime, nickel, penny')
print(dialect.delimiter)
# prints ,
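Since every non-space delimiter in the question is padded with spaces, one possible refinement is to tell the sniffer which characters to consider and to treat a space as the fallback. The following is only a sketch under that assumption: the helper names `sniff_delimiter` and `parse_fields` and the candidate string are made up for illustration, while `csv.Sniffer.sniff(sample, delimiters=...)`, `csv.Error`, and `skipinitialspace` come from the standard csv module.

```python
import csv


def sniff_delimiter(sample, candidates="|,;\t"):
    """Guess the delimiter among non-space candidates; fall back to a space."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # The sniffer found none of the candidates, so assume
        # whitespace-separated data (an assumption, not a certainty).
        return " "


def parse_fields(sample):
    """Split every line of sample into fields with padding removed."""
    delim = sniff_delimiter(sample)
    if delim == " ":
        return [line.split() for line in sample.splitlines()]
    reader = csv.reader(sample.splitlines(), delimiter=delim,
                        skipinitialspace=True)
    return [[field.strip() for field in row] for row in reader]


print(sniff_delimiter("apples | bananas | oranges | grapes"))
print(parse_fields("apples | bananas | oranges | grapes"))
```

Restricting the candidates keeps the padding spaces out of the running, at the cost of having to list the plausible delimiters up front.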

score 5 · Accepted Answer

If you're using Python, I'd suggest just calling re.split on the line with all the valid expected separators:

>>> import re
>>> l = "big long list of space separated words"
>>> re.split(r'[ ,|;"]+', l)
['big', 'long', 'list', 'of', 'space', 'separated', 'words']

The only issue would be if one of the files used a separator character as part of the data.
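As a rough illustration, here is the same split applied to the three sample lines from the question, followed by the caveat just mentioned. The helper name `split_fields` is made up for this sketch, and a tab has been added to the character class as an assumption.

```python
import re

# Every character that may act as a separator, or as padding around one.
SEPARATORS = re.compile(r'[ ,|;"\t]+')


def split_fields(line):
    """Split a line on any run of separator characters, dropping empties."""
    return [field for field in SEPARATORS.split(line.strip()) if field]


for sample in ("apples | bananas | oranges | grapes",
               "quarter, dime, nickel, penny",
               "horse cow pig chicken goat"):
    print(split_fields(sample))

# The caveat: a separator inside the data splits the value as well.
print(split_fields("New York, Boston"))  # ['New', 'York', 'Boston']
```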

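If you must identify the separator, your best bet is to count everything except spaces. If there are almost no occurrences, then it's probably a space; otherwise, it's the character with the highest count.

Unfortunately, there's really no way to be sure. You may have space-separated data filled with commas, or you may have |-separated data filled with semicolons. It may not always work.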
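A minimal sketch of that counting heuristic could look like the following. The function name `guess_delimiter`, the candidate string, and the `min_count` threshold (standing in for "almost no occurrences") are assumptions made for illustration, not something prescribed above.

```python
from collections import Counter


def guess_delimiter(text, candidates="|,;\t", min_count=1):
    """Return the most frequent non-space candidate, or a space if none qualifies."""
    counts = Counter(ch for ch in text if ch in candidates)
    if not counts:
        return " "
    best, best_count = counts.most_common(1)[0]
    # Too few occurrences: treat the data as space-separated.
    return best if best_count >= min_count else " "


print(guess_delimiter("apples | bananas | oranges | grapes"))
print(guess_delimiter("quarter, dime, nickel, penny"))
print(repr(guess_delimiter("horse cow pig chicken goat")))
```

Raising `min_count` makes the guess more conservative for files where a stray comma or semicolon appears in otherwise space-separated data.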

score 1 · Accepted Answer

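Because of the problem with spaces, I ended up going with the regular expression. Here's my finished code, in case anyone is interested or can use anything else in it. On a tangential note, it would be neat to find a way to identify the column order dynamically, but I realize that's a bit trickier. In the meantime, I'm falling back on old tricks to sort that out.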

import glob
import os
import re

# Fragment from inside a method: self._input_dir and self._file_mask are set elsewhere.
for infile in glob.glob(os.path.join(self._input_dir, self._file_mask)):
    # couldn't quite figure out a way to make this a single block
    # (rather than three separate if/elifs). But you can see the split is
    # generalized already, so if anyone can come up with a better way,
    # I'm all ears!! :)
    for row in open(infile, 'r').readlines():
        if infile.find('comma') > -1:
            datefmt = "%m/%d/%Y"
            last, first, gender, color, dobraw = \
                [x.strip() for x in re.split(r'[ ,|;"\t]+', row)]
        elif infile.find('space') > -1:
            datefmt = "%m-%d-%Y"
            last, first, unused, gender, dobraw, color = \
                [x.strip() for x in re.split(r'[ ,|;"\t]+', row)]
        elif infile.find('pipe') > -1:
            datefmt = "%m-%d-%Y"
            last, first, unused, gender, color, dobraw = \
                [x.strip() for x in re.split(r'[ ,|;"\t]+', row)]
            # There is also a way to do this with csv.Sniffer, but the
            # spaces around the pipe delimiter also confuse sniffer, so
            # I couldn't use it.
        else:
            raise ValueError(infile + " is not an acceptable input file.")
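One possible way to collapse the three branches into a single block is to move what differs between them (date format and column order) into a lookup table. This is only a sketch: the `LAYOUTS` table, the `parse_row` helper, the sample row, and the use of `datetime.strptime` on the date field are assumptions for illustration, since the code above does not show how `datefmt` is used afterwards.

```python
import re
from datetime import datetime

# Hypothetical table: file-name keyword -> (date format, column order).
LAYOUTS = {
    "comma": ("%m/%d/%Y", ("last", "first", "gender", "color", "dobraw")),
    "space": ("%m-%d-%Y", ("last", "first", "unused", "gender", "dobraw", "color")),
    "pipe": ("%m-%d-%Y", ("last", "first", "unused", "gender", "color", "dobraw")),
}
SPLIT_RE = re.compile(r'[ ,|;"\t]+')


def parse_row(infile, row):
    """Split one row according to the layout selected by the file name."""
    for keyword, (datefmt, fields) in LAYOUTS.items():
        if keyword in infile:
            values = SPLIT_RE.split(row.strip())
            if len(values) != len(fields):
                raise ValueError("unexpected number of columns in " + infile)
            record = dict(zip(fields, values))
            record.pop("unused", None)  # drop the placeholder column, if any
            # Assumed use of datefmt: parse the raw date of birth.
            record["dob"] = datetime.strptime(record.pop("dobraw"), datefmt).date()
            return record
    raise ValueError(infile + " is not an acceptable input file.")


# Made-up sample row for a comma-delimited file.
print(parse_row("comma.txt", "Smith, John, Male, Blue, 02/13/1943\n"))
```

Supporting a new file type then means adding one entry to the table rather than another elif branch.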

score 0 · Accepted Answer

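We can determine the right delimiter most of the time from some prior information (such as a list of common delimiters) together with a frequency count: every line should contain the same number of delimiters.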

from itertools import islice


def head(filename: str, n: int):
    """Return up to the first n lines of the file, without the line break."""
    with open(filename) as f:
        return [line.rstrip('\n') for line in islice(f, n)]


def detect_delimiter(filename: str, n=2):
    sample_lines = head(filename, n)
    if not sample_lines:
        return ','  # empty file: fall back to the default
    # A space is tried last so that padding around another delimiter does not win.
    common_delimiters = [',', ';', '\t', '|', ':', ' ']
    for d in common_delimiters:
        ref = sample_lines[0].count(d)
        # Accept d if it occurs, and equally often in every sampled line.
        if ref > 0 and all(line.count(d) == ref for line in sample_lines[1:]):
            return d
    return ','

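n=2 lines is usually enough; check more lines for a more reliable answer. Of course, there are cases (often contrived ones) that lead to a false detection, but they are unlikely to come up in practice.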
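To make the trade-off concrete, here is a small sketch of the same consistency check applied to in-memory lines instead of a file. The function name `detect_from_lines` and the sample data (pipe-delimited rows with decimal commas) are made up for illustration. With two lines the comma is picked by mistake; a third line breaks the tie.

```python
def detect_from_lines(lines, candidates=(',', ';', '\t', ' ', '|', ':')):
    """Return the first candidate that occurs equally often in every line."""
    for d in candidates:
        ref = lines[0].count(d)
        if ref > 0 and all(line.count(d) == ref for line in lines[1:]):
            return d
    return ','  # same default as above


# Contrived data: '|' is the real delimiter, ',' is a decimal comma.
two_lines = ["3,5|apple|1,2", "4,0|pear|0,7"]
three_lines = two_lines + ["7|plum|1,1"]

print(detect_from_lines(two_lines))    # ',' (false detection)
print(detect_from_lines(three_lines))  # '|'
```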

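Here I use an efficient Python implementation of a head function that reads only the first n lines of the file. See my answer on how to read the first N lines of a file.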
