r - 按因子计数的子集

Question

我正在使用内布拉斯加州城市的统一犯罪报告数据（一个慷慨的分类），并以 5 年为增量计算了 1995 年至 2010 年主要分类的犯罪率。

我想绘制多年来的犯罪率。然而，由于 UCR 的报告方式，并非所有城市都报告了所有年份的价值。

我对 R 相当陌生，但一位同事建议我尝试创建一个 for 循环，以计算每个城市名称的唯一值。然后我可以使用这些计数来删除数据或对数据进行子集化，以便我至少有至少三个观察值可用于绘图。这大约是我所得到的，并且那里的东西不起作用。不幸的是，我需要在本周剩下的时间里关注一些更紧迫的问题，所以我想我会把它扔给社区以获得一些见解。

代码和名称数据如下。谢谢。

drop = NULL
city.names <- unique(cnames)

for (i in 1:length(city.names)){
  x = sum(cnames==i)
 if (x < 3) {c(drop,i)}
}

有 191 个观测值，有 64 个唯一名称。数据是 csv 并导入为

data <- read.csv("cities.csv", header=TRUE, sep=",")

"","year","cnames" "1",1995,"Beatrice" "2",1995,"Bellevue" "3",1995,"Columbus" "4",1995,"Fremont" "5",1995,"Grand Island" "6",1995,"Hastings" "7",1995,"Kearney" "8",1995,"La Vista" "9",1995,"Lincoln" "10",1995,"Norfolk" "11",1995,"North Platte" "12",1995,"Omaha" "13",1995,"Papillion" "14",1995,"Scottsbluff" "15",1995,"South Sioux City" "16",2000,"Bellevue" "17",2000,"Columbus" "18",2000,"Fremont" "19",2000,"Grand Island" "20",2000,"Hastings" "21",2000,"Kearney" "22",2000,"La Vista" "23",2000,"Lincoln" "24",2000,"Norfolk" "25",2000,"Omaha" "26",2000,"Papillion" "27",2000,"Scottsbluff" "28",2000,"South Sioux City" "29",2005,"Alliance" "30",2005,"Ashland" "31",2005,"Auburn" "32",2005,"Bayard" "33",2005,"Beatrice" "34",2005,"Bellevue" "35",2005,"Blair" "36",2005,"Bridgeport" "37",2005,"Broken Bow" "38",2005,"Central City" "39",2005,"Chadron" "40",2005,"Columbus" "41",2005,"Cozad" "42",2005,"Crete" "43",2005,"David City" "44",2005,"Elkhorn" "45",2005,"Falls City" "46",2005,"Fremont" "47",2005,"Gering" "48",2005,"Gothenburg" "49",2005,"Grand Island" "50",2005,"Hastings" "51",2005,"Holdrege" "52",2005,"Imperial" "53",2005,"Kearney" "54",2005,"La Vista" "55",2005,"Lexington" "56",2005,"Lincoln" "57",2005,"Lyons" "58",2005,"Madison" "59",2005,"McCook" "60",2005,"Milford" "61",2005,"Minden" "62",2005,"Mitchell" "63",2005,"Nebraska City" "64",2005,"Norfolk" "65",2005,"North Platte" "66",2005,"Ogallala" "67",2005,"Omaha" "68",2005,"O'Neill" "69",2005,"Ord" "70",2005,"Papillion" "71",2005,"Plainview" "72",2005,"Plattsmouth" "73",2005,"Ralston" "74",2005,"Schuyler" "75",2005,"Scottsbluff" "76",2005,"Seward" "77",2005,"Sidney" "78",2005,"South Sioux City" "79",2005,"St. Paul" "80",2005,"Superior" "81",2005,"Valley" "82",2005,"Wahoo" "83",2005,"West Point" "84",2005,"Wymore" "85",2005,"York" "86",2010,"Alliance" "87",2010,"Ashland" "88",2010,"Auburn" "89",2010,"Aurora" "90",2010,"Bayard" "91",2010,"Beatrice" "92",2010,"Bellevue" "93",2010,"Bennington" "94",2010,"Blair" "95",2010,"Bridgeport" "96",2010,"Broken Bow" "97",2010,"Central City" "98",2010,"Chadron" "99",2010,"Columbus" "100",2010,"Cozad" "101",2010,"Crete" "102",2010,"Falls City" "103",2010,"Fremont" "104",2010,"Gering" "105",2010,"Gothenburg" "106",2010,"Grand Island" "107",2010,"Hastings" "108",2010,"Holdrege" "109",2010,"Imperial" "110",2010,"Kearney" "111",2010,"La Vista" "112",2010,"Lexington" "113",2010,"Lincoln" "114",2010,"Lyons" "115",2010,"Madison" "116",2010,"McCook" "117",2010,"Milford" "118",2010,"Minden" "119",2010,"Nebraska City" "120",2010,"Norfolk" "121",2010,"North Platte" "122",2010,"Ogallala" "123",2010,"Omaha" "124",2010,"O'Neill" "125",2010,"Papillion" "126",2010,"Plainview" "127",2010,"Plattsmouth" "128",2010,"Ralston" "129",2010,"Scottsbluff" "130",2010,"Seward" "131",2010,"Sidney" "132",2010,"South Sioux City" "133",2010,"Superior" "134",2010,"Valentine" "135",2010,"Valley" "136",2010,"Wahoo" "137",2010,"Wayne" "138",2010,"West Point" "139",2010,"Wilber" "140",2010,"York" "141",2013,"Alliance" "142",2013,"Ashland" "143",2013,"Aurora" "144",2013,"Beatrice" "145",2013,"Bellevue" "146",2013,"Bennington" "147",2013,"Blair" "148",2013,"Bridgeport" "149",2013,"Broken Bow" "150",2013,"Central City" "151",2013,"Chadron" "152",2013,"Columbus" "153",2013,"Cozad" "154",2013,"Crete" "155",2013,"Falls City" "156",2013,"Fremont" "157",2013,"Gering" "158",2013,"Gordon" "159",2013,"Gothenburg" "160",2013,"Grand Island" "161",2013,"Hastings" "162",2013,"Holdrege" "163",2013,"Imperial" "164",2013,"Kearney" "165",2013,"Kimball" "166",2013,"La Vista" "167",2013,"Lexington" "168",2013,"Lincoln" "169",2013,"Madison" "170",2013,"McCook" "171",2013,"Milford" "172",2013,"Minden" "173",2013,"Mitchell" "174",2013,"Nebraska City" "175",2013,"Norfolk" "176",2013,"Ogallala" "177",2013,"Omaha" "178",2013,"O'Neill" "179",2013,"Papillion" "180",2013,"Plattsmouth" "181",2013,"Ralston" "182",2013,"Scottsbluff" "183",2013,"Seward" "184",2013,"South Sioux City" "185",2013,"Superior" "186",2013,"Valentine" "187",2013,"Valley" "188",2013,"Wahoo" "189",2013,"West Point" "190",2013,"Wilber" "191",2013,"York"

score 2 · Accepted Answer

对于按列的“频率”进行子集，在其他包中和其他包中有许多选项base R。一种选择是table在“cnames”列上使用函数并获取频率。输出将是一个与每个唯一“cnames”vector对应的“key/values ”。names/frequency检查这些值是否小于 3 ( tbl <3)，这给出了“真/假”的逻辑索引。使用该索引对“tbl”的名称进行子集化，并使用该索引对“cnames”列进行索引%in%。我展示了两种方法，一种带有 negation ( !) 和 using <，另一种带有>=

 tbl <- table(data$cnames)
 data[!data$cnames %in% names(tbl)[tbl <3],]

或者

 data[data$cnames %in% names(tbl)[tbl >=3],]

或者使用ave获取length每个唯一的“cnames”并通过>=运算符获取逻辑索引。 ave以与原始数据集中相同的顺序返回输出。这可以用于子集。

 data[with(data, ave(seq_along(cnames), cnames, FUN=length)>=3),]

如果您使用data.table，代码将更紧凑，并且对于大数据集更快。使用将“data.frame”转换为“data.table” ，为每个唯一的“cnames”setDT分配计数 ( )，最后使用n:=.N>=

library(data.table)
setDT(data)[,n:=.N, cnames][n>=3]

r - 按因子计数的子集

1 回答 1

Related

Reference