
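I am trying to parse a huge tab-delimited file (a TSV file) and convert it to a comma-separated values file. The problem is that not all entries in the TSV file are complete: some fields are missing, which shows up as multiple consecutive tabs between entries. When I convert the file to CSV, I want "n.a" in those positions, indicating that the record has no value for that field.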

For example, consider this sample of student records (1 tab = 4 spaces; please bear with my poor formatting):

Name    Age    Department    GPA
Kevin    21    Computer Science    3.4
Tom    20        3.8
Kelsey    22    Psychology        (2 tab spaces here)

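In the example above, the first line holds the field headers and each following line is one record. Tom is missing the Department field and Kelsey is missing the GPA field. My output should look like this: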

"Name","Age","Department","GPA"
"Kevin","21","Computer Science","3.4"
"Tom","20","n.a","3.8"
"Kelsey","22","Psychology","n.a"

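My questions:
1) How can I solve this? Python, Java, bash, awk: any scripting approach is fine.
2) Note that the space between "Computer" and "Science" in the Department field of the second line is preserved. The script must therefore split on tabs only and not treat spaces as separators.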

It is very important to get this exactly right, because I will be feeding the data into a search index. Thanks in advance.


4 Answers


This can be done quite simply in Python, as follows:

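import sys

infile, outfile = sys.argv[1:3]

with open(infile) as inf, open(outfile, 'w') as outf:
    for line in inf:
        # Split on tabs only, so spaces inside a field are preserved.
        fields = line.rstrip('\n').split('\t')
        # Replace empty fields with "n.a" and wrap every field in double quotes.
        quoted = ['"%s"' % (f or 'n.a') for f in fields]
        outf.write(','.join(quoted) + '\n')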

The script would be invoked like this:

python convert_csv.py infile outfile
answered 2012-08-10T22:09:04.520

One way, using awk:

awk '
    ## Split line with tabs, join them in output with commas.
    BEGIN {
        FS = "\t";
        OFS = ",";
    }

    ## For each line, check if any field is blank, and substitute with
    ## "n.a". Add double quotes, recompute line and print.
    {
        for ( i = 1; i <= NF; i++ ) {
            if ( $i == "" ) {
                $i = "n.a";
            }
            $i = "\"" $i "\"";
        }
        $1 = $1;
        print $0;
    }
' infile

Running it yields the following output:

"Name","Age","Department","GPA"
"Kevin","21","Computer Science","3.4"
"Tom","20","n.a","3.8"
"Kelsey","22","Psychology","n.a"
answered 2012-08-10T22:15:25.417

Just use split('\t') on each line...

>>> x="a\t\tb"
>>> x
'a\t\tb'
>>> print x
a               b
>>> x.split("\t")
['a', '', 'b']
>>>
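
As a rough sketch of how that split result could be turned into a cleaned-up record: the helper name `fill_missing` and the `placeholder` parameter below are hypothetical, introduced only for illustration. The sketch assumes each line ends with a newline that should be stripped before splitting, so that a trailing empty field is detected as well.

```python
def fill_missing(line, placeholder="n.a"):
    """Split one TSV line on tabs and replace empty fields with a placeholder."""
    # Strip the line ending first; otherwise the last field would be "\n", not "".
    fields = line.rstrip("\r\n").split("\t")
    # Spaces inside a field are untouched because only tabs act as separators.
    return [field if field else placeholder for field in fields]


print(fill_missing("Tom\t20\t\t3.8\n"))
print(fill_missing("Kelsey\t22\tPsychology\t\n"))
```

The returned list can then be joined with commas (and quoted, if needed) before being written out.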
answered 2012-08-10T22:11:24.047

In Python:

inputFile = open("yourFile.tsv", "r")
outputFile = open("output.csv", "w")

for line in inputFile:
    # Strip the newline first so a trailing empty field is detected too.
    entry = line.rstrip("\n").split("\t")
    for i in range(len(entry)):
        if entry[i] == '':
            entry[i] = "n.a"
    outputFile.write(",".join(entry) + "\n")

inputFile.close()
outputFile.close()

This should work, although it is not particularly Pythonic.
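
A possibly more idiomatic variant, sketched here with the standard-library `csv` module, which also produces the fully quoted output the question asks for. The function name `tsv_to_csv` and the in-memory sample are assumptions made for illustration; with real files, both would typically be opened with `newline=""` as the `csv` documentation recommends.

```python
import csv
import io


def tsv_to_csv(src, dst, placeholder="n.a"):
    """Copy tab-separated rows from src to dst as fully quoted CSV."""
    # QUOTE_NONE on input: treat quote characters in the TSV as ordinary text.
    reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
    # QUOTE_ALL on output: wrap every field in double quotes.
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in reader:
        # Empty strings are falsy, so they are swapped for the placeholder.
        writer.writerow([field or placeholder for field in row])


# In-memory stand-in for the student records in the question.
sample = "Name\tAge\tDepartment\tGPA\nTom\t20\t\t3.8\nKelsey\t22\tPsychology\t\n"
out = io.StringIO()
tsv_to_csv(io.StringIO(sample), out)
print(out.getvalue())
```

Because rows are processed one at a time, this approach should also cope with a very large input file without loading it into memory.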

answered 2012-08-10T22:12:55.217