r - Reading an oddly formatted CSV file

Question

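I'm looking into downloading some data from the statistics.gov.scot website. For example, I would like to get some data on hospital admission rates. The query that returns the data table I'm interested in has this format: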

http://statistics.gov.scot/slice/observations.csv?&dataset=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Freconvictions&http%3A%2F%2Fpurl.org%2Flinked-data%2Fcube%23measureType=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fmeasure-properties%2Fratio&http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fdimension%2Fage=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fconcept%2Fage%2Fall&http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fdimension%2Fgender=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fconcept%2Fgender%2Fall

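The URL above can be opened directly by anyone who wants to try it. The query produces a *.CSV file with the relevant information, but the file's format poses some challenges.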

File example

The file content looks like this:

Generated by http://statistics.gov.scot,2016-03-15T10:41:28+00:00
http://statistics.gov.scot/data/hospital-admissions,Hospital Admissions
measure type,""
Admission Type,""
Age,""
Gender,""
Measure (cell values): ,"Ratio (Rate Per 100,000 Population)"

,,http://reference.data.gov.uk/id/year/2002,http://reference.data.gov.uk/id/year/2003,http://reference.data.gov.uk/id/year/2004,http://reference.data.gov.uk/id/year/2005,http://reference.data.gov.uk/id/year/2006,http://reference.data.gov.uk/id/year/2007,http://reference.data.gov.uk/id/year/2008,http://reference.data.gov.uk/id/year/2009,http://reference.data.gov.uk/id/year/2010,http://reference.data.gov.uk/id/year/2011,http://reference.data.gov.uk/id/year/2012
http://purl.org/linked-data/sdmx/2009/dimension#refArea,Reference Area,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012
http://statistics.gov.scot/id/statistical-geography/S92000003,Scotland,"9,351","9,262","9,261","9,347","9,723","10,517","10,293","10,150","10,024","10,232","10,194"

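When imported into Excel, the file keeps its full width, with the metadata rows sitting above the data table.

However, when imported into R via read.csv, it looks like this: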

> head(problematicFile)
                                                   V1                        V2
1             Generated by http://statistics.gov.scot 2016-03-15T10:36:29+00:00
2 http://statistics.gov.scot/data/hospital-admissions       Hospital Admissions
3                                        measure type                          
4                                      Admission Type                          
5                                                 Age                          
6                                              Gender

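Problem

The import via read.csv returns only two columns. My guess is that the problem is related to some of the initial columns being empty. I would like to read this file in a way similar to the Excel import. The point is that I intend to use the values from row 7 in columns A and B and, naturally, the data table below them. I'm happy for the resulting data.frame to have NA values in place of empty cells, as long as it has the same dimensions as in Excel. I tried: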

read.csv(file = link, header = FALSE, na.strings = "", fill = TRUE)

But I keep running into the same problem.

Desired result

The desired result should look like this (hand-made extract):

Generated by http://statistics.gov.scot 2016-03-15T10:41:28+00:00   NA  NA  NA  NA  NA  NA  NA
http://statistics.gov.scot/data/hospital-admissions Hospital Admissions NA  NA  NA  NA  NA  NA  NA
measure type    NA  NA  NA  NA  NA  NA  NA  NA
Admission Type  NA  NA  NA  NA  NA  NA  NA  NA
Age NA  NA  NA  NA  NA  NA  NA  NA
Gender  NA  NA  NA  NA  NA  NA  NA  NA
Measure (cell values):  Ratio (Rate Per 100,000 Population)         NA  NA  NA  NA  NA
NA  NA  NA  NA  NA  NA  NA  NA  NA
NA  NA  http://reference.data.gov.uk/id/year/2002   http://reference.data.gov.uk/id/year/2003   http://reference.data.gov.uk/id/year/2004   http://reference.data.gov.uk/id/year/2005   http://reference.data.gov.uk/id/year/2006   http://reference.data.gov.uk/id/year/2007   http://reference.data.gov.uk/id/year/2008
http://purl.org/linked-data/sdmx/2009/dimension#refArea Reference Area  2002    2003    2004    2005    2006    2007    2008
http://statistics.gov.scot/id/statistical-geography/S92000003   Scotland    9,351   9,262   9,261   9,347   9,723   10,517  10,293
http://statistics.gov.scot/id/statistical-geography/S16000082   Angus South 8,236   8,500   8,523   8,371   8,616   8,978   9,325
http://statistics.gov.scot/id/statistical-geography/S16000106   Edinburgh Northern and Leith    9,040   8,040   7,925   9,042   10,355  11,833  8,916
http://statistics.gov.scot/id/statistical-geography/S16000140   Renfrewshire South  9,391   9,122   9,491   9,586   10,425  10,900  11,065
http://statistics.gov.scot/id/statistical-geography/S16000108   Edinburgh Southern  5,878   5,910   6,101   6,035   7,426   9,343   6,766
http://statistics.gov.scot/id/statistical-geography/S16000075   Aberdeen Donside    10,047  10,963  10,629  10,512  10,383  10,787  10,685
http://statistics.gov.scot/id/statistical-geography/S16000137   Perthshire North    9,388   9,524   7,799   9,350   9,543   9,791   9,991
http://statistics.gov.scot/id/statistical-geography/S16000077   Aberdeenshire East  7,211   7,300   7,153   7,411   7,435   7,268   7,547
http://statistics.gov.scot/id/statistical-geography/S16000114   Galloway and West Dumfries  9,861   9,165   8,143   9,258   7,508   10,213  10,399
http://statistics.gov.scot/id/statistical-geography/S16000096   Dumbarton   8,703   8,570   8,727   9,310   9,389   9,885   10,237

Screenshot

To illustrate further, I would like to keep the dimensions and fill the missing values with NAs.

score 2 · Accepted Answer

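Parsing the metadata out of the header is a bit tricky. You may prefer to download the whole normalised dataset instead of that cross-tabulated slice.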

> reconv <- read.csv("http://statistics.gov.scot/downloads/cube-table?uri=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Freconvictions")

> head(reconv)

  GeographyCode DateCode Measurement                              Units Value Gender Age
1     S92000003     2003        Mean Average reconvictions per offender  0.62    All All
2     S92000003     2004        Mean Average reconvictions per offender  0.33    All All
3     S92000003     2004        Mean Average reconvictions per offender  0.61    All All
4     S92000003     2005        Mean Average reconvictions per offender  0.60    All All
5     S92000003     2006        Mean Average reconvictions per offender  0.60    All All
6     S92000003     2007        Mean Average reconvictions per offender  0.11    All All

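This puts all the metadata into factor levels, so you don't have to parse it. (In R 4.0.0 and later, read.csv returns character columns by default; pass stringsAsFactors = TRUE to get factors as shown here.)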

> str(reconv)

'data.frame':   10119 obs. of  7 variables:
 $ GeographyCode: Factor w/ 26 levels "S12000005","S12000006",..: 26 26 26 26 26 26 26 26 26 26 ...
 $ DateCode     : int  2003 2004 2004 2005 2006 2007 2007 2008 2008 2009 ...
 $ Measurement  : Factor w/ 2 levels "Mean","Ratio": 1 1 1 1 1 1 1 1 1 1 ...
 $ Units        : Factor w/ 2 levels "Average reconvictions per offender",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Value        : num  0.62 0.33 0.61 0.6 0.6 0.11 0.57 0.6 0.33 0.33 ...
 $ Gender       : Factor w/ 3 levels "All","Female",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Age          : Factor w/ 6 levels "21-25","26-30",..: 4 4 4 4 4 4 4 4 4 4 ...

You can then select the slice you are interested in:

> slice <- subset(reconv, Measurement=="Ratio" & Gender=="All" & Age=="All")

And, if needed, get back to the original cross-tabulated slice:

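> library(reshape2)
> # first() is not defined in base R or reshape2, so take the first value per cell explicitly
> dcast(slice, GeographyCode ~ DateCode, value.var = "Value", fun.aggregate = function(x) x[1])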

   GeographyCode 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
1      S12000005 41.4 34.3 41.0 40.7 37.4 37.2 33.3 34.6 35.8 33.0 32.8
2      S12000006 34.9 36.0 31.9 34.2 31.1 28.7 27.9 29.6 27.5 26.8 27.0
3      S12000008 33.7 33.2 33.7 33.2 31.7 32.8 30.4 31.5 29.1 28.1 28.7
4      S12000010 26.7 24.5 25.7 26.9 26.7 27.8 29.3 25.1 22.4 29.0 28.2
5      S12000013 31.7 26.1 30.6 35.4 31.6 25.9 24.0 18.9 30.5 22.8 18.6
...

score 1 · Accepted Answer

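You need to specify col.names manually to force read.csv to read more columns. Also setting na.strings to an empty string will give you NA values in the empty cells.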

read.csv(<parameters>, col.names=c("Col1","Col2".....), na.strings="")
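For context, read.table (which read.csv wraps) works out the number of columns from the first five lines of input unless col.names is longer, which is why the short metadata rows hide the wider table. A minimal sketch on an inline sample follows; the abbreviated file content and the column names Col1 to Col4 are made up purely for illustration:

```r
# Inline stand-in for the downloaded file (abbreviated, illustrative values)
sample_csv <- paste(
  "Generated by http://statistics.gov.scot,2016-03-15T10:41:28+00:00",
  "http://statistics.gov.scot/data/hospital-admissions,Hospital Admissions",
  "measure type,\"\"",
  "Admission Type,\"\"",
  "Age,\"\"",
  "Gender,\"\"",
  ",,2002,2003",
  "S92000003,Scotland,\"9,351\",\"9,262\"",
  sep = "\n"
)

# Without col.names: the width is inferred from the first five lines (two fields)
narrow <- read.csv(text = sample_csv, header = FALSE, na.strings = "", fill = TRUE)

# With four column names, all four fields of the wide rows are kept
wide <- read.csv(text = sample_csv, header = FALSE, na.strings = "",
                 col.names = paste0("Col", 1:4))

ncol(narrow)
ncol(wide)
```

Comparing the two ncol() results shows the effect of col.names on an otherwise identical call.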

score 0 · Accepted Answer

You can specify the number of columns by using read.table and supplying the column names:

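read.table(file = link,
           fill = TRUE,
           sep = ",",
           quote = "\"",        # as in read.csv: only double quotes delimit fields
           comment.char = "",   # the header URLs contain '#', read.table's default comment character
           na.strings = "",
           col.names = paste("c", 1:13, sep = ""))  # the sample has 2 label columns + 11 year columns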

However, I don't know whether this is a good solution, since you need to know the number of columns a priori.
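One possible way around hard-coding the column count is to measure it first with count.fields from base R's utils package. The sketch below assumes the file has been saved locally; the helper name read_ragged_csv, the variable local_file, and the temporary file content are all hypothetical stand-ins for the real download:

```r
# Hypothetical helper: read a ragged CSV, padding short rows with NA
read_ragged_csv <- function(path) {
  # Number of fields on each line; comment.char = "" because the URLs contain '#'
  n_fields <- count.fields(path, sep = ",", quote = "\"", comment.char = "")
  n_cols <- max(n_fields, na.rm = TRUE)
  read.csv(path, header = FALSE, na.strings = "", fill = TRUE,
           col.names = paste0("c", seq_len(n_cols)))
}

# Temporary file standing in for the downloaded slice (illustrative content)
local_file <- tempfile(fileext = ".csv")
writeLines(c(
  "Measure (cell values): ,\"Ratio (Rate Per 100,000 Population)\"",
  "",
  ",,http://reference.data.gov.uk/id/year/2002,http://reference.data.gov.uk/id/year/2003",
  "http://purl.org/linked-data/sdmx/2009/dimension#refArea,Reference Area,2002,2003",
  "http://statistics.gov.scot/id/statistical-geography/S92000003,Scotland,\"9,351\",\"9,262\""
), local_file)

ragged <- read_ragged_csv(local_file)
dim(ragged)
```

The file is scanned twice, which seems an acceptable cost for files of this size.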

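Another approach is to read the whole CSV as strings. You can then preprocess it by storing the header in another object (a list, for example) and use only the "table part" as a data frame.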
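A minimal sketch of this second approach, assuming (as in the sample in the question) that the first blank line separates the metadata from the table and that the row of year URIs can be skipped. The variable names and the inline sample are illustrative only; in practice raw_lines would come from something like readLines(link):

```r
# Illustrative stand-in for the lines of the downloaded file
raw_lines <- c(
  "Generated by http://statistics.gov.scot,2016-03-15T10:41:28+00:00",
  "Measure (cell values): ,\"Ratio (Rate Per 100,000 Population)\"",
  "",
  ",,http://reference.data.gov.uk/id/year/2002,http://reference.data.gov.uk/id/year/2003",
  "http://purl.org/linked-data/sdmx/2009/dimension#refArea,Reference Area,2002,2003",
  "http://statistics.gov.scot/id/statistical-geography/S92000003,Scotland,\"9,351\",\"9,262\""
)

# Assumption: the first empty line marks the end of the metadata block
blank_at <- which(raw_lines == "")[1]

# Metadata: key/value pairs kept as a two-column data frame
metadata <- read.csv(text = raw_lines[seq_len(blank_at - 1)],
                     header = FALSE, col.names = c("key", "value"))

# Table: skip the row of year URIs and use the next row as the header
table_lines <- raw_lines[(blank_at + 2):length(raw_lines)]
observations <- read.csv(text = table_lines, header = TRUE, check.names = FALSE)

# Values such as "9,351" arrive as text; strip the thousands separator
year_cols <- names(observations)[-(1:2)]
observations[year_cols] <- lapply(observations[year_cols],
                                  function(x) as.numeric(gsub(",", "", x)))
```

Keeping the metadata and the observations in separate objects means neither has to be padded with NA to match the other's width.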
