python - Counting multiple tabs in a TSV file

Question

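I am trying to parse a huge tab-delimited file (a TSV file) and convert it into a comma-separated values file. The problem is that not every record in the TSV file is complete: some fields are empty, which shows up as multiple consecutive tabs between entries. When I convert the file to CSV, I want "n.a" to appear in those positions, indicating that the record has no entry for that field.

For example, consider this sample of student records (1 tab = 4 spaces; please bear with my poor formatting):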

Name    Age    Department    GPA
Kevin    21    Computer Science    3.4
Tom    20        3.8
Kelsey    22    Psychology        (2 tab spaces here)

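In the example above, the first line holds the field headers and every following line is one record. Tom is missing the "Department" field and Kelsey is missing the "GPA" field. My output should look like this: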

"Name","Age","Department","GPA"
"Kevin","21","Computer Science","3.4"
"Tom","20","n.a","3.8"
"Kelsey","22","Psychology","n.a"

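My questions:
1) How do I solve this? Python, Java, bash, awk, any scripting approach is fine.
2) Note that in the second line, under the "Department" field, the space between "Computer" and "Science" is left untouched. So the script must split on tabs only and must not treat spaces as separators.

Getting this exactly right is very important, because I will be feeding the data to a search index. Thanks in advance.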

score 4 · Accepted Answer

This can be done quite simply in Python, as follows:

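import sys

infile, outfile = sys.argv[1:3]

# The output file must be opened for writing ('w').
with open(infile) as inf, open(outfile, 'w') as outf:
    for line in inf:
        # Split on tabs only, so spaces inside a field are preserved.
        fields = line.rstrip('\n').split('\t')
        # Replace empty fields with "n.a" and quote every field.
        outf.write(','.join('"%s"' % (f or 'n.a') for f in fields) + '\n')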

The script would be run like this:

python convert_csv.py infile outfile
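
As a hedged alternative to building the output line by hand, here is a minimal sketch that uses Python's standard csv module for both reading and writing. The function name tsv_to_csv and the placeholder parameter are illustrative assumptions, not part of the original answer. The sketch also assumes that quote characters in the input are plain text.

```python
import csv
import io


def tsv_to_csv(src, dst, placeholder="n.a"):
    """Copy tab-separated rows from src to dst as fully quoted CSV.

    tsv_to_csv and placeholder are hypothetical names used for this sketch.
    """
    # QUOTE_NONE on input: quote characters in the TSV are treated as plain text.
    reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
    # QUOTE_ALL on output: every field is wrapped in double quotes.
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in reader:
        # Empty strings mark missing fields; swap them for the placeholder.
        writer.writerow([field if field else placeholder for field in row])


if __name__ == "__main__":
    sample = (
        "Name\tAge\tDepartment\tGPA\n"
        "Tom\t20\t\t3.8\n"
        "Kelsey\t22\tPsychology\t\n"
    )
    out = io.StringIO()
    tsv_to_csv(io.StringIO(sample), out)
    print(out.getvalue(), end="")
```

With real files rather than in-memory strings, both files would typically be opened with newline="" as the csv module documentation recommends. One possible advantage of this route is that fields containing commas or double quotes are escaped by the writer.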

score 1 · Accepted Answer

One way using awk:

awk '
    ## Split line with tabs, join them in output with commas.
    BEGIN {
        FS = "\t";
        OFS = ",";
    }

    ## For each line, check if any field is blank, and substitute with
    ## "n.a". Add double quotes, recompute line and print.
    {
        for ( i = 1; i <= NF; i++ ) {
            if ( $i == "" ) {
                $i = "n.a";
            }
            $i = "\"" $i "\"";
        }
        $1 = $1;
        print $0;
    }
' infile

Running it produces the following output:

"Name","Age","Department","GPA"
"Kevin","21","Computer Science","3.4"
"Tom","20","n.a","3.8"
"Kelsey","22","Psychology","n.a"

score 0 · Accepted Answer

Just use split('\t') on each line ...

>>> x="a\t\tb"
>>> x
'a\t\tb'
>>> print x
a               b
>>> x.split("\t")
['a', '', 'b']
>>>
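
To connect the session above to the sample data in the question, here is a small sketch (written for Python 3) that contrasts splitting on the tab character with splitting on any whitespace. The variable names are illustrative only.

```python
row = "Tom\t20\t\t3.8"

# Splitting on the tab character keeps the empty Department field.
by_tab = row.split("\t")
print(by_tab)         # ['Tom', '20', '', '3.8']

# Splitting on any whitespace collapses the consecutive tabs,
# so the empty column silently disappears.
by_whitespace = row.split()
print(by_whitespace)  # ['Tom', '20', '3.8']

# Spaces inside a field survive a tab split.
kevin = "Kevin\t21\tComputer Science\t3.4".split("\t")
print(kevin)          # ['Kevin', '21', 'Computer Science', '3.4']
```

This is why the separator should be passed explicitly: split() with no argument would both drop the empty field and break "Computer Science" into two pieces.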

score 0 · Accepted Answer

In Python,

inputFile = open("yourFile.tsv", "r")
outputFile = open("output.csv", "w")

for line in inputFile:
    # Strip the newline first so an empty last field is detected too.
    entry = line.rstrip("\n").split("\t")
    for i in range(len(entry)):
        if entry[i] == '':
            entry[i] = "n.a"
    outputFile.write(",".join(entry) + "\n")

inputFile.close()
outputFile.close()

should work, although it is not particularly Pythonic.
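
For comparison, a somewhat more idiomatic variant of the same loop might look like the sketch below. The helper names fill_blanks and convert are hypothetical and chosen only for illustration. Like the loop above, it does not quote fields or escape commas inside them.

```python
def fill_blanks(line, placeholder="n.a"):
    """Turn one tab-separated line into a comma-separated line.

    fill_blanks and placeholder are hypothetical names for this sketch.
    """
    # Drop the line ending first so an empty last field is seen as ''.
    fields = line.rstrip("\r\n").split("\t")
    # `field or placeholder` substitutes the placeholder for empty strings.
    return ",".join(field or placeholder for field in fields)


def convert(in_path, out_path):
    """Convert a whole file; the with-statement closes both files."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(fill_blanks(line) + "\n")


if __name__ == "__main__":
    print(fill_blanks("Tom\t20\t\t3.8\n"))              # Tom,20,n.a,3.8
    print(fill_blanks("Kelsey\t22\tPsychology\t\n"))    # Kelsey,22,Psychology,n.a
```

Keeping the per-line logic in its own function makes it easy to check against a few sample rows before running it over a large file.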
