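I'm working with a large data frame and want to perform a pivot-table-style operation on it. I've been trying to use the reshape2 package, but for some reason my melted data frame isn't being reshaped.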

I want to take a frame like this:

County      Industry   Type    Variable   Value 
LA          Plumbing     Tax       Rev       1000 
LA          Plumbing     No tax    Emp       100 
LA          Plumbing     Tax       Pay       500

and turn it into this (aggregated form):

        Plumbing       Tailors
County  Rev   Emp  Pay Rev   Emp  Pay
LA      1000  100  500 1000  50   65

I'm running the following code:

dcast(m.data, county ~ variable + industry)

But it doesn't change the data frame at all. Where am I going wrong?

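Edit:

Here is more information about the problem. Before I get to the melted data frame, I do some fairly ugly cleanup to get the data where I want it. I know the code below isn't ideal and really should be fixed, but basically I load multiple CSV files (with the same column names), combine them, recode some values, drop some columns, select a subset of the data, melt it, and then try to reshape it with dcast. I removed the code where I recode some of the values, but that part seems to work fine. Here is some of the code: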

data1 <- read.table("census_data_r_1.csv",header=TRUE,sep=",",stringsAsFactors=FALSE) 
data2 <- read.table("census_data_r_2.csv",header=TRUE,sep=",", stringsAsFactors=FALSE)
fulldata <- rbind(data1,data2)
delete <- c("GEO.id","GEO.id2","NAICS.id","OPTAX.id","YEAR.id")
fulldata <- fulldata[, !(names(fulldata) %in% delete)]
colnames(fulldata) <- c("county","industry","tax_type","firms","revenue","payroll","num_employees","non_emp_firms","non_emp_firms_rev")
fulldata[c("firms","revenue","payroll","num_employees","non_emp_firms","non_emp_firms_rev")] <- recode.variables(fulldata[c("firms","revenue","payroll","num_employees","non_emp_firms","non_emp_firms_rev")],"'N' -> 'Nothing';'D' -> 'Withheld';'b' -> 20;'c' -> 100;'e' -> 250;'a' -> 10;'g' -> 1000;'f' -> 500;'Q' -> 'No Rev Collected';'h' -> 2500;'i' -> 5000;'j' -> 10000;'l' -> 50000;'k' -> 25000;'S' -> 'Bad Data';'m' -> 100000;")
fulldata.sub <- subset(fulldata, subset = (tax_type %in% c('Total', 'All establishments')) & (!(revenue %in% c('Nothing', 'Withheld','No Rev Collected'))) & (!(non_emp_firms %in% c('Nothing','Withheld'))))
m.data <- melt(fulldata.sub, id.vars = 1:3)
dcast(m.data, county ~ variable, sum)

Now I get the following error:

Error in structure(ordered, dim = ns) : dims [product 18300] do not match the length of object [0]

Output of dput(head(fulldata.sub,40)):

structure(list(county = c("Autauga County, Alabama", "Autauga County, Alabama", 
"Autauga County, Alabama", "Autauga County, Alabama", "Autauga County, Alabama", 
"Autauga County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Barbour County, Alabama", "Barbour County, Alabama"), industry = c("Rental and leasing services", 
"Professional, scientific, and technical services", "Professional, scientific, and technical services", 
"Accounting, tax preparation, bookkeeping, and payroll services", 
"Accounting, tax preparation, bookkeeping, and payroll services", 
"Architectural, engineering, and related services", "Real estate and rental and leasing", 
"Real estate", "Lessors of real estate", "Offices of real estate agents and brokers", 
"Offices of real estate agents and brokers", "Activities related to real estate", 
"Real estate property managers", "Offices of real estate appraisers", 
"Consumer goods rental", "Accounting, tax preparation, bookkeeping, and payroll services", 
"Accounting, tax preparation, bookkeeping, and payroll services", 
"Offices of certified public accountants", "Tax preparation services", 
"Architectural, engineering, and related services", "Architectural services", 
"Engineering services", "Specialized design services", "Computer systems design and related services", 
"Computer systems design and related services", "Management, scientific, and technical consulting services", 
"Advertising, public relations, and related services", "Veterinary services", 
"Administrative and support and waste management and remediation services", 
"Administrative and support services", "Employment services", 
"Business support services", "Investigation and security services", 
"Services to buildings and dwellings", "Exterminating and pest control services", 
"Janitorial services", "Landscaping services", "Waste management and remediation services", 
"Lessors of real estate", "Legal services"), tax_type = c("Total", 
"All establishments", "All establishments", "All establishments", 
"All establishments", "All establishments", "Total", "Total", 
"Total", "Total", "Total", "Total", "Total", "Total", "Total", 
"All establishments", "All establishments", "All establishments", 
"All establishments", "All establishments", "All establishments", 
"All establishments", "All establishments", "All establishments", 
"All establishments", "All establishments", "All establishments", 
"All establishments", "Total", "Total", "Total", "Total", "Total", 
"Total", "Total", "Total", "Total", "Total", "Total", "All establishments"
), firms = c("10", "61", "61", "14", "14", "10", "358", "312", 
"77", "161", "161", "74", "52", "16", "28", "79", "79", "36", 
"20", "77", "13", "37", "19", "27", "27", "63", "17", "26", "250", 
"238", "26", "14", "17", "157", "16", "29", "96", "12", "11", 
"19"), revenue = c("8433", "42285", "42285", "8581", "8581", 
"5571", "266692", "201777", "59742", "104768", "104768", "37267", 
"32141", "4615", "20691", "33203", "33203", "19805", "3160", 
"39318", "10494", "21167", "6833", "12391", "12391", "21496", 
"11097", "18388", "163661", "145935", "30746", "4048", "13849", 
"77076", "9934", "15832", "47411", "17726", "1585", "6439"), 
    payroll = c("1641", "15473", "15473", "3506", "3506", "2229", 
    "59476", "47937", "4053", "30180", "30180", "13704", "11902", 
    "1674", "4854", "17298", "17298", "9718", "1122", "15263", 
    "3688", "8649", "908", "4429", "4429", "7335", "2634", "6073", 
    "67526", "62354", "19529", "1002", "6824", "27688", "3181", 
    "8632", "14434", "5172", "265", "1431"), num_employees = c("56", 
    "386", "386", "127", "127", "41", "1987", "1643", "160", 
    "1030", "1030", "453", "406", "42", "217", "491", "491", 
    "217", "138", "356", "69", "204", "45", "111", "111", "165", 
    "101", "282", "2807", "2686", "806", "53", "399", "1241", 
    "110", "399", "675", "121", "23", "36"), non_emp_firms = c("8", 
    "330", "330", "49", "49", "35", "2358", "2289", "648", "840", 
    "840", "801", "186", "32", "19", "208", "208", "20", "40", 
    "203", "21", "74", "107", "99", "99", "356", "82", "10", 
    "1452", "1435", "25", "153", "61", "982", "12", "526", "350", 
    "17", "40", "16"), non_emp_firms_rev = c("882", "10111", 
    "10111", "493", "493", "1280", "164778", "160968", "55888", 
    "33321", "33321", "71759", "25870", "1504", "692", "2961", 
    "2961", "533", "466", "9220", "889", "5387", "4448", "3235", 
    "3235", "14395", "10337", "602", "35998", "33953", "708", 
    "3991", "806", "18726", "329", "6246", "9974", "2045", "1978", 
    "488")), .Names = c("county", "industry", "tax_type", "firms", 
"revenue", "payroll", "num_employees", "non_emp_firms", "non_emp_firms_rev"
), row.names = c(6L, 7L, 9L, 19L, 21L, 25L, 54L, 55L, 56L, 65L, 
66L, 70L, 71L, 74L, 77L, 99L, 101L, 103L, 105L, 109L, 111L, 115L, 
119L, 125L, 127L, 131L, 139L, 143L, 147L, 148L, 152L, 155L, 159L, 
162L, 163L, 165L, 167L, 169L, 174L, 180L), class = "data.frame")

Edit

One more edit, including the output of str(fulldata.sub) and str(m.data):

'data.frame':	130098 obs. of  9 variables:
 $ county           : Factor w/ 3237 levels "Abbeville County, South Carolina",..: 121 121 121 121 121 121 121 121 131 131 ...
 $ industry         : Factor w/ 369 levels "Accounting, tax preparation, bookkeeping, and payroll services",..: 283 239 239 1 1 33 358 358 274 273 ...
 $ tax_type         : Factor w/ 4 levels "All establishments",..: 4 1 1 1 1 1 1 1 4 4 ...
 $ firms            : num  10 61 61 14 14 10 4 4 358 312 ...
 $ revenue          : num  31466 21347 21347 31717 31717 ...
 $ payroll          : num  5521 4863 4863 13729 13729 ...
 $ num_employees    : 362546 571 571 ...
 $ non_emp_firms    : num  3122 1887 1887 2486 2486 ...
 $ non_emp_firms_rev: num  17550 96 96 12669 12669 ...
'data.frame':	780588 obs. of  5 variables:
 $ county  : Factor w/ 3237 levels "Abbeville County, South Carolina",..: 121 121 121 121 121 121 121 121 131 131 ...
 $ industry: Factor w/ 369 levels "Accounting, tax preparation, bookkeeping, and payroll services",..: 283 239 239 1 1 33 358 358 274 273 ...
 $ tax_type: Factor w/ 4 levels "All establishments",..: 4 1 1 1 1 1 1 1 4 4 ...
 $ variable: Factor w/ 6 levels "firms","revenue",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ value   : num  10 61 61 14 14 10 4 4 358 312 ...

1 Answer

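If you look at str(fulldata.sub), you'll see that the numbers in columns 4 through 9 are stored as characters. So when you melt(), the strings are converted to factors. You then try to evaluate sum() on a variable of type factor, which cannot be computed.

You can simply convert the columns to numeric, like so: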

...    
fulldata.sub[4:9] <- sapply(fulldata.sub[4:9],as.numeric)
# Then run your melt/cast sequence
m.data <- melt(fulldata.sub, id.vars = 1:3)
dcast(m.data, county ~ variable, sum)
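
To make the sequence above concrete, here is a minimal sketch on made-up data. The data frame `toy` and its values are hypothetical, chosen only to mirror the shape of the question's data (three id columns followed by numeric-looking character columns). It assumes reshape2 is installed.

```r
library(reshape2)

# Hypothetical toy data: measure columns stored as character,
# as in the dput() output from the question
toy <- data.frame(
  county   = c("LA", "LA", "OC", "OC"),
  industry = c("Plumbing", "Tailors", "Plumbing", "Tailors"),
  tax_type = "Total",
  revenue  = c("1000", "1000", "300", "200"),
  payroll  = c("500", "65", "40", "30"),
  stringsAsFactors = FALSE
)

# Convert the measure columns (4 and 5 here) to numeric before melting
toy[4:5] <- lapply(toy[4:5], as.numeric)

# Long format: one row per county/industry/tax_type/variable
m.toy <- melt(toy, id.vars = 1:3)

# One row per county, one column per variable, summed over industries
by_county <- dcast(m.toy, county ~ variable, fun.aggregate = sum)

# One column per industry/variable pair, closer to the layout asked for
wide <- dcast(m.toy, county ~ industry + variable, fun.aggregate = sum)

print(by_county)
print(wide)
```

Note that the formula terms must match the column names of the melted frame (`variable`, `industry`), and that an aggregation function is needed whenever several rows fall into the same cell.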

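Alternatively, fix the data import. The problem is probably caused by strings such as "", "-", "n/a", or "na" in otherwise numeric columns. With read.csv you can handle this by setting the na.strings=c("erroneous_string","other_erroneous_string",...) argument.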
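
As a rough illustration of the import-side fix, the sketch below reads a small inline CSV. The CSV text is invented for the example, and treating "D" and "N" as the missing-value markers is an assumption based on the flag codes that appear in the question's recoding step. Your files may use different markers.

```r
# Hypothetical CSV text standing in for one of the census files;
# "D" and "N" play the role of non-numeric flag codes
csv_text <- "county,firms,revenue
Autauga,10,8433
Baldwin,D,266692
Barbour,19,N"

# Declaring the flag codes as NA lets read.csv parse the columns as numbers
dat <- read.csv(text = csv_text,
                na.strings = c("D", "N"),
                stringsAsFactors = FALSE)

str(dat)

# Sums now work; na.rm = TRUE skips the flagged cells
totals <- colSums(dat[c("firms", "revenue")], na.rm = TRUE)
print(totals)
```

With the columns numeric from the start, the later as.numeric conversion is unnecessary, though the NA cells still need to be handled (for example with na.rm = TRUE) when aggregating.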

answered 2012-11-13T01:12:09.393