attributes - Adding an attribute to a large dataset using command-line tools

Question

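I have a very large dataset (about 150 MB; 500 targets; more than 700,000 attributes). I need to add one attribute at the end of each row. The data file I'm working with has the following structure: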

@relation 'filename'
@attribute "place" string
@attribute "institution" string
@attribute "food" string
@attribute "book" string

@data
3.8,6,0,0,church
86.3,0,63.1,0,man
0,0,0,37,woman

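I need to add one informational attribute to every line after @data. However, because of the sheer number of attributes, I can't open and modify the data in a text editor. The attribute I need is contained in a separate tab-delimited file with the following structure: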

church  1
man 1
woman   0

The desired result would make the dataset look like this:

@relation 'filename'
@attribute "place" string
@attribute "institution" string
@attribute "food" string
@attribute "book" string

@data
3.8,6,0,0,church,1
86.3,0,63.1,0,man,1
0,0,0,37,woman,0

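The command would match the last field of each line after @data against the lines of the second file and, where they match, append the corresponding 0 or 1.

I've been looking for a solution to this, and my searches mostly turned up answers pointing toward a text editor. As I mentioned earlier, the problem with a text editor isn't necessarily opening the file (UltraEdit, for example, can handle most files of this size). It's manually inserting an attribute after more than 700,000 attributes, which is very time-consuming.

So I'm asking the community: can the desired result be achieved with command-line tools (awk, grep, etc.)?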

score 1 · Accepted Answer

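Python is great for this because it's installed by default on many POSIX-based systems :)

Now, some caveats:

This is simple Python, meant for you to learn from as you go, so it could be optimized further
It reads the entire file into memory while processing, so if your file is in the gigabytes it will put some strain on your computer
If you want to see what's going on, I suggest throwing in some print statements or stepping through the program with the Python debugger.

Here's what I came up with: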

lookup = {}
output_list = []

# build a lookup based on the lookup file
# (text mode, so rows are str on both Python 2 and Python 3)
with open('lookup.csv', 'r') as lookup_file:
    rows = lookup_file.readlines()

    for row in rows:
        # skip blank lines, which cannot be unpacked into key and value
        if not row.strip():
            continue
        key, value = row.split()
        lookup[key] = value

# loop through the big file and add the values
with open('input-big-data.txt', 'r') as input_file:

    rows = input_file.readlines()
    target_zone = False

    for row in rows:

        # keep a copy of every row
        output_for_this_row = row

        # skip the normal attribute rows
        if row.startswith('@'):
            target_zone = False

        # check to see if we are in the 'target zone'
        if row.startswith('@data'):
            target_zone = True

        # start parsing the rows, but not if they have the attribute flag
        # or are blank (a blank row has no last item to look up)
        if target_zone and row.strip() and not row.startswith('@'):
            # do your data processing here
            # strip to clobber the newline, then break it into pieces
            row_list = row.strip().split(',')
            # grab the last item
            lookup_key = row_list[-1].strip()
            # grab the value for that last item
            row_list.append(lookup[lookup_key])
            # put the row back in its original state
            output_for_this_row = ",".join(row_list) + "\n"

        output_list.append(output_for_this_row)


# write every row, modified or not, to the output file
with open('output-big-data.txt', 'w') as output_file:
    for line in output_list:
        output_file.write(line)

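I've commented the code quite thoroughly throughout, so it should be self-explanatory.

Based on the files in your question, I named them, in order: input-big-data.txt, lookup.csv and output-big-data.txt.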
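Since the question asks for something usable from the command line, the hard-coded file names could instead be taken as arguments. The following is only a sketch of that idea: the function name `parse_args` and the argument names `input_file`, `lookup_file` and `output_file` are made up for illustration and are not part of the script in this answer.

```python
import argparse


def parse_args(argv=None):
    """Parse three positional file names (all names here are hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Append a looked-up attribute to each row after @data."
    )
    parser.add_argument("input_file", help="the large dataset file")
    parser.add_argument("lookup_file", help="tab-separated label/value file")
    parser.add_argument("output_file", help="where to write the result")
    # argv=None makes argparse read sys.argv[1:], the usual command-line case
    return parser.parse_args(argv)
```

With something like this in place, `args.input_file`, `args.lookup_file` and `args.output_file` would replace the three literal file names in the `open()` calls.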

Here is the output from my example:

@relation 'filename'
@attribute "place" string
@attribute "institution" string
@attribute "food" string
@attribute "book" string

@data
3.8,6,0,0,church,1
86.3,0,63.1,0,man,1
0,0,0,37,woman,0
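
If the file ever grows into the gigabyte range, the memory caveat above can be sidestepped by writing each row out as soon as it is read instead of collecting everything in a list. This is a minimal sketch of that variant, not a tested replacement: the function names `load_lookup` and `append_attribute` are made up for illustration, and falling back to `?` (the ARFF missing-value marker) for labels absent from the lookup file is an assumption; the question does not say what should happen in that case.

```python
def load_lookup(lookup_path):
    """Read the tab-separated lookup file into a dict: label -> value."""
    lookup = {}
    with open(lookup_path, "r") as lookup_file:
        for line in lookup_file:
            parts = line.split()
            if len(parts) == 2:  # skip blank or malformed lines
                lookup[parts[0]] = parts[1]
    return lookup


def append_attribute(input_path, lookup_path, output_path, missing="?"):
    """Stream input_path to output_path one line at a time, appending the
    looked-up value to every non-empty row that follows the @data marker."""
    lookup = load_lookup(lookup_path)
    in_data = False
    with open(input_path, "r") as src, open(output_path, "w") as dst:
        for line in src:
            stripped = line.strip()
            if in_data and stripped and not stripped.startswith("@"):
                # the last comma-separated field is the lookup key
                key = stripped.split(",")[-1].strip()
                # assumption: unknown labels get the `missing` marker
                dst.write(stripped + "," + lookup.get(key, missing) + "\n")
                continue
            if stripped.startswith("@data"):
                in_data = True
            # header lines and blank lines pass through unchanged
            dst.write(line)
```

Calling `append_attribute('input-big-data.txt', 'lookup.csv', 'output-big-data.txt')` would then hold only the lookup table and one row in memory at a time.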

HTH,
Aaron

score 0 · Accepted Answer

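As the other answer notes, Python solves this problem quite simply, as shown by the solution I found and used on this blog: http://margerytech.blogspot.it/2011/03/python-appending-column-to-end-of-tab.html.

It isn't a command-line tool (which, as I noted in the question, is what I wanted to use), but it solves the problem all the same.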
