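I have a set of tabulated, report-style data files to clean for my research. Instead of the usual tidy column-by-column format, each dataset is laid out as a separate tabulation per county (as shown below):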

1CURRENT DATE: XXX               AGE,SEX, RACE AND ETHNICITY OF PERSONS  PAGE    1
 BEGINNING DATE FOR DATA TOTALS: 01/83                    COUNTY    001
 ENDING DATE FOR DATA TOTALS: 12/83                                                                       RECORD COUNT    36
              Gender     Age_20    Age_21     Age_22   Age_23    Asian    Hispanic    White
Robbery       F           1          2          2        2         3         3          3
              M           3          3          2        2         4         3          3
Fraud         F           1          2          2        2         3         3          2
              M           2          3          2        2         4         3          3  
Arson         F           1          2          2        2         3         3          3
              M           4          3          2        2         4         3          4

1CURRENT DATE: XXX               AGE,SEX, RACE AND ETHNICITY OF PERSONS  PAGE    4
 BEGINNING DATE FOR DATA TOTALS: 01/83                    COUNTY    002
 ENDING DATE FOR DATA TOTALS: 12/83                                                                       RECORD COUNT    36
              Gender     Age_20    Age_21     Age_22   Age_23    Asian    Hispanic    White
Robbery       F           1          2          2        2         3         3          3
              M           2          3          2        2         4         4          3
Fraud         F           1          2          2        2         3         3          2
              M           2          3          2        2         4         6          3  
Arson         F           1          2          2        2         3         3          3
              M           4          3          2        2         4         3          4

1CURRENT DATE: XXX               AGE,SEX, RACE AND ETHNICITY OF PERSONS  PAGE    7
 BEGINNING DATE FOR DATA TOTALS: 01/83                    COUNTY    003
 ENDING DATE FOR DATA TOTALS: 12/83                                                                       RECORD COUNT    36
              Gender     Age_20    Age_21     Age_22   Age_23    Asian    Hispanic    White
Robbery       F           1          2          2        2         3         3          3
              M           3          3          2        2         4         3          3
Fraud         F           1          2          1        4         3         3          2
              M           2          3          2        2         4         3          3  
Arson         F           1          2          4        2         3         3          3
              M           4          3          2        2         4         3          4

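Because of this tabulated layout, I cannot import these datasets directly into Excel or Stata for further analysis. My plan is to copy and paste each county's ID (i.e. COUNTY 003, COUNTY 002, etc.) together with a specific type of crime to build a new column-oriented dataset like this: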

              Gender     Age_20    Age_21     Age_22   Age_23    Asian    Hispanic    White    County
Robbery       F           1          2          2        2         3         2          3        001
Robbery       F           1          2          2        2         2         3          3        002
Robbery       F           1          2          2        2         3         3          3        003

and then clean the data further in this new dataset.

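I searched online and found that Python can copy specific parts of a file and paste them into a new document. But I am really new to Python; my experience is mostly with Stata and SPSS. I do not know exactly what code would do this kind of copy-and-paste work.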

1 Answer

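You might want to look at pandas. The details will vary with your format, but it doesn't take much to massage your data into something cleaner. There are prettier, less hard-coded ways to do the following, but here is an almost stream-of-consciousness example: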

import pandas as pd

# read in a fixed-width file: 14 characters for the crime name,
# then eight 10-character fields
df = pd.read_fwf("crime.tsv", widths=[14] + [10]*8, header=None)
# clean up the strings (on Python 3, str replaces basestring)
df = df.apply(lambda col: col.map(lambda x: x.strip() if isinstance(x, str) else x))

# make a new column: on the page-header rows column 5 reads "COUNTY"
# and column 6 holds the county ID; every other row gets NaN
df["County"] = df[6].where(df[5] == "COUNTY")
# fill the county info forwards into the empty places
df["County"] = df["County"].ffill()

# fill the crime information forwards
df[0] = df[0].ffill()

# reset the columns from one of the examples
# (row 3 is the first "Gender  Age_20 ..." heading line)
df.columns = ["Crime"] + list(df.iloc[3, 1:9]) + ["County"]
# get rid of any of the headings left in the table
df = df[df["Gender"] != "Gender"]

# toss anything which still has empty cells
df = df.dropna()

# reset the index, and fix the types
df = df.set_index(["Crime", "Gender", "County"]).astype(int)
df = df.reset_index()

produces

>>> df
      Crime Gender County  Age_20  Age_21  Age_22  Age_23  Asian  Hispanic  White
0   Robbery      F    001       1       2       2       2      3         3      3
1   Robbery      M    001       3       3       2       2      4         3      3
2     Fraud      F    001       1       2       2       2      3         3      2
3     Fraud      M    001       2       3       2       2      4         3      3
4     Arson      F    001       1       2       2       2      3         3      3
5     Arson      M    001       4       3       2       2      4         3      4
6   Robbery      F    002       1       2       2       2      3         3      3
7   Robbery      M    002       2       3       2       2      4         4      3
8     Fraud      F    002       1       2       2       2      3         3      2
9     Fraud      M    002       2       3       2       2      4         6      3
10    Arson      F    002       1       2       2       2      3         3      3
11    Arson      M    002       4       3       2       2      4         3      4
12  Robbery      F    003       1       2       2       2      3         3      3
13  Robbery      M    003       3       3       2       2      4         3      3
14    Fraud      F    003       1       2       1       4      3         3      2
15    Fraud      M    003       2       3       2       2      4         3      3
16    Arson      F    003       1       2       4       2      3         3      3
17    Arson      M    003       4       3       2       2      4         3      4
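
As for the "less hard-coded" route mentioned above, here is one possible sketch that uses only the standard library instead of pandas. It finds the county with a regular expression and splits data rows on whitespace rather than relying on fixed column widths. The `parse_report` function, the `COUNTY_RE` pattern, and the shortened `SAMPLE` text are all made up for illustration, and the sketch assumes that every count is a non-negative integer and that each page repeats the `Gender ...` heading line.

```python
import csv
import io
import re

# Hypothetical two-county excerpt in the layout shown in the question.
SAMPLE = """\
1CURRENT DATE: XXX     AGE,SEX, RACE AND ETHNICITY OF PERSONS  PAGE    1
 BEGINNING DATE FOR DATA TOTALS: 01/83          COUNTY    001
 ENDING DATE FOR DATA TOTALS: 12/83             RECORD COUNT    36
              Gender     Age_20    Age_21     Age_22   Age_23    Asian    Hispanic    White
Robbery       F           1          2          2        2         3         3          3
              M           3          3          2        2         4         3          3

1CURRENT DATE: XXX     AGE,SEX, RACE AND ETHNICITY OF PERSONS  PAGE    4
 BEGINNING DATE FOR DATA TOTALS: 01/83          COUNTY    002
 ENDING DATE FOR DATA TOTALS: 12/83             RECORD COUNT    36
              Gender     Age_20    Age_21     Age_22   Age_23    Asian    Hispanic    White
Fraud         F           1          2          2        2         3         3          2
              M           2          3          2        2         4         6          3
"""

# Matches "COUNTY    001" but not "RECORD COUNT    36".
COUNTY_RE = re.compile(r"\bCOUNTY\s+(\d+)")


def parse_report(lines):
    """Return (header, rows); each row is [crime, gender, counts..., county]."""
    header, rows = None, []
    county = crime = None
    for line in lines:
        match = COUNTY_RE.search(line)
        if match:                       # page-header line carrying the county ID
            county = match.group(1)
            continue
        fields = line.split()
        if not fields:                  # blank line between pages
            continue
        if fields[0] == "Gender":       # column heading, repeated on every page
            header = ["Crime"] + fields + ["County"]
            continue
        if header is None:
            continue
        n = len(header) - 3             # number of count columns
        # A data row ends with n integers preceded by the gender code.
        if len(fields) < n + 1 or not all(f.isdigit() for f in fields[-n:]):
            continue                    # any other header line
        name = " ".join(fields[:-n - 1])
        if name:                        # crime name appears only on the first row
            crime = name
        gender = fields[-n - 1]
        rows.append([crime, gender] + [int(f) for f in fields[-n:]] + [county])
    return header, rows


header, rows = parse_report(io.StringIO(SAMPLE))

# Write the tidy table as CSV (to a string here; a real file works the same way).
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(header)
writer.writerows(rows)
print(out.getvalue())
```

With a real file, `parse_report(open("crime.tsv"))` would take the place of the `io.StringIO(SAMPLE)` call. Reading the counts from the right-hand end of each row means a multi-word crime name should still land in the `Crime` column, though that case is not in the sample data.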

after which we can do all sorts of neat things.
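
For instance, the per-county rows for one crime and gender that the question asks for become a one-line filter. The snippet below is only a sketch: it builds a small stand-in `df` with a few of the columns so it can run on its own, and the names `robbery_f`, `totals`, and `buf` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the cleaned `df` (a few rows and columns only).
df = pd.DataFrame({
    "Crime": ["Robbery", "Robbery", "Robbery", "Fraud"],
    "Gender": ["F", "F", "F", "M"],
    "County": ["001", "002", "003", "002"],
    "Age_20": [1, 1, 1, 2],
    "Hispanic": [3, 3, 3, 6],
})

# Rows for one crime and gender across all counties.
robbery_f = df[(df["Crime"] == "Robbery") & (df["Gender"] == "F")]

# Totals per crime, summed over counties and genders.
totals = df.groupby("Crime")[["Age_20", "Hispanic"]].sum()

# Write CSV text (pass a file name instead of `buf` to save to disk).
buf = io.StringIO()
robbery_f.to_csv(buf, index=False)
print(buf.getvalue())
```

A CSV written this way can then be opened in Excel or read into Stata. It is worth checking on import that county codes such as `001` are kept as text, since leading zeros may otherwise be dropped.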

answered 2013-03-01T04:34:04.907