我正在使用交通部的一些大型政府数据集,这些数据集以制表符分隔的文本文件形式提供,并附有数据字典。例如,汽车投诉文件是一个 670Mb 的未标记数据文件(解压缩时),并带有一个字典。以下是一些摘录:
Last updated: April 24, 2014
FIELDS:
=======
Field# Name Type/Size Description
------ --------- --------- --------------------------------------
1 CMPLID CHAR(9) NHTSA'S INTERNAL UNIQUE SEQUENCE NUMBER.
IS AN UPDATEABLE FIELD,THUS DATA FOR A
GIVEN RECORD POTENTIALLY COULD CHANGE FROM
ONE DATA OUTPUT FILE TO THE NEXT.
2 ODINO CHAR(9) NHTSA'S INTERNAL REFERENCE NUMBER.
THIS NUMBER MAY BE REPEATED FOR
MULTIPLE COMPONENTS.
ALSO, IF LDATE IS PRIOR TO DEC 15, 2002,
THIS NUMBER MAY BE REPEATED FOR MULTIPLE
PRODUCTS OWNED BY THE SAME COMPLAINANT.
一些字段的外键如下所示:
21 CMPL_TYPE CHAR(4) SOURCE OF COMPLAINT CODE:
CAG =CONSUMER ACTION GROUP
CON =FORWARDED FROM A CONGRESSIONAL OFFICE
DP =DEFECT PETITION,RESULT OF A DEFECT PETITION
EVOQ =HOTLINE VOQ
EWR =EARLY WARNING REPORTING
INS =INSURANCE COMPANY
IVOQ =NHTSA WEB SITE
LETR =CONSUMER LETTER
MAVQ =NHTSA MOBILE APP
MIVQ =NHTSA MOBILE APP
MVOQ =OPTICAL MARKED VOQ
RC =RECALL COMPLAINT,RESULT OF A RECALL INVESTIGATION
RP =RECALL PETITION,RESULT OF A RECALL PETITION
SVOQ =PORTABLE SAFETY COMPLAINT FORM (PDF)
VOQ =NHTSA VEHICLE OWNERS QUESTIONNAIRE
Microsoft Access有导入说明,我没有,如果有,也不会使用。但我认为这个数据字典应该是机器可读的。
我的问题:这个数据字典是某种标准格式吗?我试过用谷歌搜索,但如果没有正确的术语,很难做到这一点。我想导入R,但我很灵活,只要它可以以编程方式完成。