python - Regex to parse a well-formatted multi-line data dictionary

Question

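I am trying to read and parse the data dictionary for the Census Bureau's American Community Survey Public Use Microdata Sample (PUMS) data release, as found here.

It is reasonably well formatted, although there are a few lapses where explanatory notes have been inserted.

I think my preferred outcome is a data frame with one row per variable, with all of the value labels for a given variable serialized into a dictionary stored in a value-dictionary field on the same row (a hierarchical JSON-like format would not be bad either, just more complicated).

I got as far as the following code: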

 import pandas as pd
 import re
 import urllib2
 # urlopen() returns a file-like object; .read() gives the text as a string.
 data = urllib2.urlopen('http://www.census.gov/acs/www/Downloads/data_documentation/pums/DataDict/PUMSDataDict13.txt').read()

 ## replace newline characters so we can use dots and find everything until a double
 ## carriage return (replaced to ||) with a lookahead assertion.
 data=data.replace('\n','|')

 datadict=pd.DataFrame(re.findall(r"([A-Z]{2,8})\s{2,9}([0-9]{1})\s{2,6}\|\s{2,4}([A-Za-z\-\(\) ]{3,85})",data,re.MULTILINE),columns=['variable','width','description'])
 datadict.head(6)

+----+----------+-------+------------------------------------------------+
|    | variable | width | description                                    |
+----+----------+-------+------------------------------------------------+
| 0  | RT       | 1     | Record Type                                    |
+----+----------+-------+------------------------------------------------+
| 1  | SERIALNO | 7     | Housing unit                                   |
+----+----------+-------+------------------------------------------------+
| 2  | DIVISION | 1     | Division code                                  |
+----+----------+-------+------------------------------------------------+
| 3  | PUMA     | 5     | Public use microdata area code (PUMA) based on |
+----+----------+-------+------------------------------------------------+
| 4  | REGION   | 1     | Region code                                    |
+----+----------+-------+------------------------------------------------+
| 5  | ST       | 2     | State Code                                     |
+----+----------+-------+------------------------------------------------+

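So far, so good. The list of variables is there, along with each variable's width in characters.

I can extend this to capture the additional lines (where the value labels live), like so: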

 datadict_exp=pd.DataFrame(
     re.findall(r"([A-Z]{2,9})\s{2,9}([0-9]{1})\s{2,6}\|\s{4}([A-Za-z\-\(\)\;\<\> 0-9]{2,85})\|\s{11,15}([a-z0-9]{0,2})[ ]\.([A-Za-z/\-\(\) ]{2,120})",
                data,re.MULTILINE),
     columns=['variable','width','description','value_1','label_1'])
 datadict_exp.head(5)

+----+----------+-------+---------------------------------------------------+---------+--------------+
| id | variable | width | description                                       | value_1 | label_1      |
+----+----------+-------+---------------------------------------------------+---------+--------------+
| 0  | DIVISION | 1     | Division code                                     | 0       | Puerto Rico  |
+----+----------+-------+---------------------------------------------------+---------+--------------+
| 1  | REGION   | 1     | Region code                                       | 1       | Northeast    |
+----+----------+-------+---------------------------------------------------+---------+--------------+
| 2  | ST       | 2     | State Code                                        | 1       | Alabama/AL   |
+----+----------+-------+---------------------------------------------------+---------+--------------+
| 3  | NP       | 2     | Number of person records following this housin... | 0       | Vacant unit  |
+----+----------+-------+---------------------------------------------------+---------+--------------+
| 4  | TYPE     | 1     | Type of unit                                      | 1       | Housing unit |
+----+----------+-------+---------------------------------------------------+---------+--------------+

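That gets the first value and its associated label. My regex question is how to repeat the multi-line match from `\s{11,15}` onward through to the end of the variable. Some variables have a large number of unique values (ST, the state code, is followed by about 50 lines giving the value and label for each state).

Early on I replaced the carriage returns in the source file with pipes, thinking I could shamelessly rely on the dot to match everything up to a double carriage return, which marks the end of that particular variable. That is where I got stuck.

So: how do I repeat a multi-line pattern an arbitrary number of times?

(A later complication is that some variables are not fully enumerated in the dictionary but instead show a range of valid values. NP, for example [the number of persons associated with the same household], is written as `02..20` followed by a description. If I don't account for this, my parse will of course miss those entries.)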
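One possible way to collect every value/label line without a single giant pattern is to apply two smaller patterns in sequence: split the text into per-variable blocks on blank lines, then run `re.findall` inside each block. The sketch below is not from the original post. The `SAMPLE` text, the `HEADER` and `CODE` patterns, and `parse_blocks` are hypothetical names, and the sample only imitates the layout of the real file. Ranges such as `02..20` are kept as a single key rather than expanded.

```python
import re

# Hypothetical excerpt that imitates the layout of the PUMS data dictionary.
SAMPLE = """\
REGION     1
    Region code
           1 .Northeast
           2 .Midwest
           3 .South

NP         2
    Number of person records following this housing record
           00 .Vacant unit
           01 .One person record
           02..20 .Number of person records
"""

# Stage 1: unindented "NAME  WIDTH" line followed by an indented description line.
HEADER = re.compile(
    r"^([A-Z][A-Z0-9]{1,8})[ \t]+(\d+)[ \t]*\n[ \t]+(.+)$", re.MULTILINE)
# Stage 2: indented code (a single value or a "lo..hi" range), then " ." and a label.
CODE = re.compile(r"^[ \t]+(\S+)[ \t]+\.(.+)$", re.MULTILINE)


def parse_blocks(text):
    """Return one record per variable, with all value labels in a dict."""
    records = []
    # Variables are separated by blank lines, so split on those first.
    for block in re.split(r"\n\s*\n", text.strip()):
        header = HEADER.search(block)
        if header is None:
            continue  # free-text notes and the like are skipped in this sketch
        name, width, description = header.groups()
        records.append({
            "variable": name,
            "width": int(width),
            "description": description.strip(),
            # findall returns every (code, label) pair in the block.
            "values": dict(CODE.findall(block)),
        })
    return records


records = parse_blocks(SAMPLE)
for record in records:
    print(record["variable"], record["values"])
```

Because `findall` is applied per block, the number of value lines no longer needs to be encoded in the pattern itself. A list of such records can be passed to `pandas.DataFrame` to get one row per variable with the labels held in the `values` column.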

score 1 · Accepted Answer

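This isn't a regex solution, but I parsed PUMSDataDict2013.txt and PUMS_Data_Dictionary_2009-2013.txt (Census ACS 2013 documentation, FTP server) with the Python 3.x script below. I then used pandas.DataFrame.from_dict and pandas.concat to create a hierarchical data frame, also shown below.

Python 3.x function for parsing PUMSDataDict2013.txt and PUMS_Data_Dictionary_2009-2013.txt: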

import collections
import os


def parse_pumsdatadict(path:str) -> collections.OrderedDict:
    r"""Parse ACS PUMS Data Dictionaries.

    Args:
        path (str): Path to downloaded data dictionary.

    Returns:
        ddict (collections.OrderedDict): Parsed data dictionary with original
            key order preserved.

    Raises:
        FileNotFoundError: Raised if `path` does not exist.

    Notes:
        * Only some data dictionaries have been tested.[^urls]
        * Values are all strings. No data types are inferred from the
            original file.
        * Example structure of returned `ddict`:
            ddict['title'] = '2013 ACS PUMS DATA DICTIONARY'
            ddict['date'] = 'August 7, 2015'
            ddict['record_types']['HOUSING RECORD']['RT']\
                ['length'] = '1'
                ['description'] = 'Record Type'
                ['var_codes']['H'] = 'Housing Record or Group Quarters Unit'
            ddict['record_types']['HOUSING RECORD'][...]
            ddict['record_types']['PERSON RECORD'][...]
            ddict['notes'] =
                ['Note for both Industry and Occupation lists...',
                 '*  In cases where the SOC occupation code ends...',
                 ...]

    References:
        [^urls]: http://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/
            PUMSDataDict2013.txt
            PUMS_Data_Dictionary_2009-2013.txt

    """
    # Check arguments.
    if not os.path.exists(path):
        raise FileNotFoundError(
            "Path does not exist:\n{path}".format(path=path))
    # Parse data dictionary.
    # Note:
    # * Data dictionary keys and values are "codes for variables",
    #   using the ACS terminology,
    #   https://www.census.gov/programs-surveys/acs/technical-documentation/pums/documentation.html
    # * The data dictionary is not all encoded in UTF-8. Replace encoding
    #   errors when found.
    # * Catch instances of inconsistently formatted data.
    ddict = collections.OrderedDict()
    with open(path, encoding='utf-8', errors='replace') as fobj:
        # Data dictionary name is line 1.
        ddict['title'] = fobj.readline().strip()
        # Data dictionary date is line 2.
        ddict['date'] = fobj.readline().strip()    
        # Initialize flags to catch lines.
        (catch_var_name, catch_var_desc,
         catch_var_code, catch_var_note) = (None, )*4
        var_name = None
        var_name_last = 'PWGTP80' # Necessary for unformatted end-of-file notes.
        for line in fobj:
            # Replace tabs with 4 spaces
            line = line.replace('\t', ' '*4).rstrip()
            # Record type is section header 'HOUSING RECORD' or 'PERSON RECORD'.
            if (line.strip() == 'HOUSING RECORD'
                or line.strip() == 'PERSON RECORD'):
                record_type = line.strip()
                if 'record_types' not in ddict:
                    ddict['record_types'] = collections.OrderedDict()
                ddict['record_types'][record_type] = collections.OrderedDict()
            # A newline precedes a variable name.
            # A newline follows the last variable code.
            elif line == '':
                # Example inconsistent format case:
                # WGTP54     5
                #     Housing Weight replicate 54
                #
                #           -9999..09999 .Integer weight of housing unit
                if (catch_var_code
                    and 'var_codes' not in ddict['record_types'][record_type][var_name]):
                    pass
                # Terminate the previous variable block and look for the next
                # variable name, unless past last variable name.
                else:
                    catch_var_code = False
                    catch_var_note = False
                    if var_name != var_name_last:
                        catch_var_name = True
            # Variable name is 1 line with 0 space indent.
            # Variable name is followed by variable description.
            # Variable note is optional.
            # Variable note is preceded by newline.
            # Variable note is 1+ lines.
            # Variable note is followed by newline.
            elif (catch_var_name and not line.startswith(' ') 
                and var_name != var_name_last):
                # Example: "Note: Public use microdata areas (PUMAs) ..."
                if line.lower().startswith('note:'):
                    var_note = line.strip() # type(var_note) == str
                    if 'notes' not in ddict['record_types'][record_type][var_name]:
                        ddict['record_types'][record_type][var_name]['notes'] = list()
                    # Append a new note.
                    ddict['record_types'][record_type][var_name]['notes'].append(var_note)
                    catch_var_note = True
                # Example: """
                # Note: Public Use Microdata Areas (PUMAs) designate areas ...
                # population.  Use with ST for unique code. PUMA00 applies ...
                # ...
                # """
                elif catch_var_note:
                    var_note = line.strip() # type(var_note) == str
                    if 'notes' not in ddict['record_types'][record_type][var_name]:
                        ddict['record_types'][record_type][var_name]['notes'] = list()
                    # Concatenate to most recent note.
                    ddict['record_types'][record_type][var_name]['notes'][-1] += ' '+var_note
                # Example: "NWAB       1 (UNEDITED - See 'Employment Status Recode' (ESR))"
                else:
                    # type(var_note) == list
                    (var_name, var_len, *var_note) = line.strip().split(maxsplit=2)
                    ddict['record_types'][record_type][var_name] = collections.OrderedDict()
                    ddict['record_types'][record_type][var_name]['length'] = var_len
                    # Append a new note if exists.
                    if len(var_note) > 0:
                        if 'notes' not in ddict['record_types'][record_type][var_name]:
                            ddict['record_types'][record_type][var_name]['notes'] = list()
                        ddict['record_types'][record_type][var_name]['notes'].append(var_note[0])
                    catch_var_name = False
                    catch_var_desc = True
                    var_desc_indent = None
            # Variable description is 1+ lines with 1+ space indent.
            # Variable description is followed by variable code(s).
            # Variable code(s) is 1+ line with larger whitespace indent
            # than variable description. Example:"""
            # PUMA00     5      
            #     Public use microdata area code (PUMA) based on Census 2000 definition for data
            #     collected prior to 2012. Use in combination with PUMA10.          
            #           00100..08200 .Public use microdata area codes 
            #                   77777 .Combination of 01801, 01802, and 01905 in Louisiana
            #             -0009 .Code classification is Not Applicable because data 
            #                         .collected in 2012 or later            
            # """
            # The last variable code is followed by a newline.
            elif (catch_var_desc or catch_var_code) and line.startswith(' '):
                indent = len(line) - len(line.lstrip())
                # For line 1 of variable description.
                if catch_var_desc and var_desc_indent is None:
                    var_desc_indent = indent
                    var_desc = line.strip()
                    ddict['record_types'][record_type][var_name]['description'] = var_desc
                # For lines 2+ of variable description.
                elif catch_var_desc and indent <= var_desc_indent:
                    var_desc = line.strip()
                    ddict['record_types'][record_type][var_name]['description'] += ' '+var_desc
                # For lines 1+ of variable codes.
                else:
                    catch_var_desc = False
                    catch_var_code = True
                    is_valid_code = None
                    if not line.strip().startswith('.'):
                        # Example case: "01 .One person record (one person in household or"
                        if ' .' in line:
                            (var_code, var_code_desc) = line.strip().split(
                                sep=' .', maxsplit=1)
                            is_valid_code = True
                        # Example inconsistent format case:"""
                        #            bbbb. N/A (age less than 15 years; never married)
                        # """
                        elif '. ' in line:
                            (var_code, var_code_desc) = line.strip().split(
                                sep='. ', maxsplit=1)
                            is_valid_code = True
                        else:
                            raise AssertionError(
                                "Program error. Line unaccounted for:\n" +
                                "{line}".format(line=line))
                        if is_valid_code:
                            if 'var_codes' not in ddict['record_types'][record_type][var_name]:
                                ddict['record_types'][record_type][var_name]['var_codes'] = collections.OrderedDict()
                            ddict['record_types'][record_type][var_name]['var_codes'][var_code] = var_code_desc
                    # Example case: ".any person in group quarters)"
                    else:
                        var_code_desc = line.strip().lstrip('.')
                        ddict['record_types'][record_type][var_name]['var_codes'][var_code] += ' '+var_code_desc
            # Example inconsistent format case:"""
            # ADJHSG     7      
            # Adjustment factor for housing dollar amounts (6 implied decimal places)
            # """
            elif (catch_var_desc and
                'description' not in ddict['record_types'][record_type][var_name]):
                var_desc = line.strip()
                ddict['record_types'][record_type][var_name]['description'] = var_desc
                catch_var_desc = False
                catch_var_code = True
            # Example inconsistent format case:"""
            # WGTP10     5
            #     Housing Weight replicate 10
            #           -9999..09999 .Integer weight of housing unit
            # WGTP11     5
            #     Housing Weight replicate 11
            #           -9999..09999 .Integer weight of housing unit
            # """
            elif ((var_name == 'WGTP10' and 'WGTP11' in line)
                or (var_name == 'YOEP12' and 'ANC' in line)):
                # type(var_note) == list
                (var_name, var_len, *var_note) = line.strip().split(maxsplit=2)
                ddict['record_types'][record_type][var_name] = collections.OrderedDict()
                ddict['record_types'][record_type][var_name]['length'] = var_len
                if len(var_note) > 0:
                    if 'notes' not in ddict['record_types'][record_type][var_name]:
                        ddict['record_types'][record_type][var_name]['notes'] = list()
                    ddict['record_types'][record_type][var_name]['notes'].append(var_note[0])
                catch_var_name = False
                catch_var_desc = True
                var_desc_indent = None
            else:
                if (catch_var_name, catch_var_desc,
                    catch_var_code, catch_var_note) != (False, )*4:
                    raise AssertionError(
                        "Program error. All flags to catch lines should be set " +
                        "to `False` by end-of-file.")
                if var_name != var_name_last:
                    raise AssertionError(
                        "Program error. End-of-file notes should only be read "+
                        "after `var_name_last` has been processed.")
                if 'notes' not in ddict:
                    ddict['notes'] = list()
                ddict['notes'].append(line)
    return ddict
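As a rough illustration of how the nested structure described in the docstring might be consumed, the sketch below flattens it into one tuple per variable code. It does not call `parse_pumsdatadict`; the small `ddict` here is a hand-built, hypothetical stand-in with the same shape, and `iter_var_codes` is an assumed helper name, not part of the original answer.

```python
from collections import OrderedDict

# Hypothetical miniature with the same shape as the parser's return value.
ddict = OrderedDict()
ddict["title"] = "2013 ACS PUMS DATA DICTIONARY"
ddict["record_types"] = OrderedDict()

housing = OrderedDict()
housing["RT"] = {
    "length": "1",
    "description": "Record Type",
    "var_codes": OrderedDict([("H", "Housing Record or Group Quarters Unit")]),
}
# A variable with no enumerated codes has no 'var_codes' key.
housing["SERIALNO"] = {"length": "7", "description": "Housing unit"}
ddict["record_types"]["HOUSING RECORD"] = housing

person = OrderedDict()
person["RT"] = {
    "length": "1",
    "description": "Record Type",
    "var_codes": OrderedDict([("P", "Person Record")]),
}
ddict["record_types"]["PERSON RECORD"] = person


def iter_var_codes(ddict):
    """Yield (record_type, var_name, code, label) for every listed code."""
    for record_type, variables in ddict["record_types"].items():
        for var_name, info in variables.items():
            # .get() tolerates variables that list no codes at all.
            for code, label in info.get("var_codes", {}).items():
                yield (record_type, var_name, code, label)


rows = list(iter_var_codes(ddict))
for row in rows:
    print(row)
```

A long table like this is one alternative to storing a dictionary per row, at the cost of repeating the variable name on every line.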

Creating the hierarchical data frame (formatted below as Jupyter Notebook cells):

In [ ]:
import pandas as pd
ddict = parse_pumsdatadict(path=r'/path/to/PUMSDataDict2013.txt')
tmp = dict()
for record_type in ddict['record_types']:
    tmp[record_type] = pd.DataFrame.from_dict(ddict['record_types'][record_type], orient='index')
df_ddict = pd.concat(tmp, names=['record_type', 'var_name'])
df_ddict.head()

Out[ ]:
# Rendered output of `df_ddict.head()`:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th></th>
      <th>length</th>
      <th>description</th>
      <th>var_codes</th>
      <th>notes</th>
    </tr>
    <tr>
      <th>record_type</th>
      <th>var_name</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th rowspan="5" valign="top">HOUSING RECORD</th>
      <th>ACCESS</th>
      <td>1</td>
      <td>Access to the Internet</td>
      <td>{'b': 'N/A (GQ)', '1': 'Yes, with subscription...</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>ACR</th>
      <td>1</td>
      <td>Lot size</td>
      <td>{'b': 'N/A (GQ/not a one-family house or mobil...</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>ADJHSG</th>
      <td>7</td>
      <td>Adjustment factor for housing dollar amounts (...</td>
      <td>{'1000000': '2013 factor (1.000000)'}</td>
      <td>[Note: The value of ADJHSG inflation-adjusts r...</td>
    </tr>
    <tr>
      <th>ADJINC</th>
      <td>7</td>
      <td>Adjustment factor for income and earnings doll...</td>
      <td>{'1007549': '2013 factor (1.007549)'}</td>
      <td>[Note: The value of ADJINC inflation-adjusts r...</td>
    </tr>
    <tr>
      <th>AGS</th>
      <td>1</td>
      <td>Sales of Agriculture Products (Yearly sales)</td>
      <td>{'b': 'N/A (GQ/vacant/not a one family house o...</td>
      <td>[Note: no adjustment factor is applied to AGS.]</td>
    </tr>
  </tbody>
</table>
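To see how `from_dict` and `concat` combine into a two-level index without downloading the real file, here is a minimal sketch on a hand-written dictionary. The `record_types` contents are hypothetical stand-ins for the parser output; only `pandas.DataFrame.from_dict(..., orient='index')` and `pandas.concat(..., names=...)` are taken from the answer above.

```python
import pandas as pd

# Hypothetical miniature of ddict['record_types'] as returned by the parser.
record_types = {
    "HOUSING RECORD": {
        "RT": {"length": "1", "description": "Record Type",
               "var_codes": {"H": "Housing Record or Group Quarters Unit"}},
        "DIVISION": {"length": "1", "description": "Division code",
                     "var_codes": {"0": "Puerto Rico", "1": "New England"}},
    },
    "PERSON RECORD": {
        "RT": {"length": "1", "description": "Record Type",
               "var_codes": {"P": "Person Record"}},
    },
}

# One frame per record type: outer keys become the row index, and the
# inner keys (length, description, var_codes) become columns.
frames = {
    record_type: pd.DataFrame.from_dict(variables, orient="index")
    for record_type, variables in record_types.items()
}

# Concatenating a dict of frames uses the dict keys as the outer index level.
df = pd.concat(frames, names=["record_type", "var_name"])

# Look up one variable by (record type, variable name).
division = df.loc[("HOUSING RECORD", "DIVISION")]
print(division["description"], division["var_codes"])
```

Each `var_codes` cell holds an ordinary Python dict, which matches the one-row-per-variable layout requested in the question.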
