
I need to do some data processing on a CSV that is structured as follows:

[Image of the CSV structure]

I need to collapse all the data in the TEXT column from the rows whose FIELD entry is blank, so that the result looks like this:

FIELD              TEXT

P0190001, RACE OF HOUSEHOLDER BY HOUSEHOLD TYPE(8) Universe:Households White Family Households: Married-couple family: With related children

P0190002, RACE OF HOUSEHOLDER BY HOUSEHOLD TYPE(8) Universe:Households White Family Households: Married-couple family: No related children

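...and so on. (The number of blank FIELD entries before the first valid entry is not always two; it may be more or fewer.)

Is there a simple, efficient way to do this for a large CSV file (60,000 unique FIELD values)? I am looking for a way to do this on the command line rather than by writing a program.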


1 Answer


This isn't a command-line solution, but it was a fun script to write.

import csv

# Python 3: the csv module expects text-mode files opened with newline=''.
csv_reader = csv.reader(open('data.csv', newline=''))

# Read first two rows of field text out as a prefix.
prefix = ' '.join(next(csv_reader)[2].strip() for i in range(2))

def collapsed_row_iter():
    # Holds one text value per nesting level, outermost first.
    depth_value_list = []
    for (_, field_id, field_text, _) in csv_reader:
        # Count number of leading <SPACE> chars to determine depth.
        pre_strip_text_len = len(field_text)
        field_text = field_text.lstrip()
        depth = pre_strip_text_len - len(field_text)

        depth_value_list_len = len(depth_value_list)
        if depth == depth_value_list_len + 1:
            # Append a new depth value.
            depth_value_list.append(field_text.rstrip())

        elif depth <= depth_value_list_len:
            # Truncate list to depth, append new value.
            del depth_value_list[depth:]
            depth_value_list.append(field_text.rstrip())

        else:
            # Depth value is greater than current_depth + 1
            raise ValueError('unexpected depth %d' % depth)

        # Only yield the row if field_id value is non-NULL.
        if field_id:
            yield (field_id, '%s %s' % (prefix, ' '.join(depth_value_list)))

# Get CSV writer object, write the header, then write each collapsed row.
with open('collapsed.csv', 'w', newline='') as out_file:
    csv_writer = csv.writer(out_file)
    csv_writer.writerow(['FIELD', 'TEXT'])

    # Iterate over collapsed rows, writing each to the output CSV.
    for (field_id, collapsed_text) in collapsed_row_iter():
        csv_writer.writerow([field_id, collapsed_text])

Output:

FIELD,TEXT
P0190001,RACE OF HOUSEHOLDER BY HOUSEHOLD TYPE (8) Universe: Households White Family Households: Married-couple family: With related children
P0190002,RACE OF HOUSEHOLDER BY HOUSEHOLD TYPE (8) Universe: Households White Family Households: Married-couple family: No related children
P0190003,"RACE OF HOUSEHOLDER BY HOUSEHOLD TYPE (8) Universe: Households White Family Households: Other family: Male householder, no wife present: With related children"
P0190004,"RACE OF HOUSEHOLDER BY HOUSEHOLD TYPE (8) Universe: Households White Family Households: Other family: Male householder, no wife present: No related children"
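If the indentation in your file is not exactly one space per level, a variation on the same idea may be easier to adapt. The sketch below is not the script above: it keeps a stack of (indent, label) pairs and pops every entry that is at the same or a deeper indent before pushing the current row. The function name `collapse_rows`, the two-column row shape, the shortened prefix, and the sample rows are all made up for illustration; the real file layout is only known from the screenshot in the question.

```python
def collapse_rows(rows, prefix=""):
    """Yield (field_id, full_text) for every row that has a field id.

    rows: iterable of (field_id, text) pairs, where the number of leading
    spaces in text encodes the nesting level (any width per level).
    """
    stack = []  # (indent, label) pairs, outermost level first
    for field_id, text in rows:
        label = text.strip()
        indent = len(text) - len(text.lstrip(" "))
        # Discard labels that sit at the same or a deeper indent.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        stack.append((indent, label))
        if field_id:
            parts = [prefix] if prefix else []
            parts.extend(lbl for _, lbl in stack)
            yield field_id, " ".join(parts)


# Hypothetical sample data, two spaces per level.
sample_rows = [
    ("", "White"),
    ("", "  Family Households:"),
    ("", "    Married-couple family:"),
    ("P0190001", "      With related children"),
    ("P0190002", "      No related children"),
    ("", "    Other family:"),
    ("P0190003", "      With related children"),
]

collapsed = list(collapse_rows(sample_rows, prefix="RACE OF HOUSEHOLDER"))
for field_id, text in collapsed:
    print(field_id, text, sep=",")
```

Because the stack is compared by indent width rather than by list length, a jump of several spaces between levels is accepted instead of raising an error. Whether that leniency is what you want depends on how clean the source file is.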
Answered 2013-09-02T22:40:16.823