r - Subsetting a data frame in R based on a factor variable formatted as a range (xx-xx)

Question

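I have been stuck on this problem for several hours, and I suspect I am missing something obvious.

Here is my problem:

I have a data frame in an .xlsx file, which can be downloaded here.

I loaded this data frame into R using RStudio on a Mac and named it demoData. It has 5 variables (AgeRange, Women, Men, Total and Year).

I cannot subset this data frame using a condition on AgeRange. The variable is formatted as xx-xx (00-04 means people aged 00 to 04). When I try, I get a message saying that no rows satisfy the condition. The class of the variable "AgeRange" is factor.

Here is my code: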

demoData[demoData$AgeRange=="00-04",]

Thank you for your help.

Edit (from Arun): here is the output of head(demoData):

     Age Feminin Masculin. Ensemble Annee
1 00-04     720       745     1465  2004 
2 05-09     745       767     1512  2004 
3 10-14     813       830     1643  2004 
4 15-19     824       820     1644  2004 
5 20-24     839       823     1662  2004 
6 25-29     752       699     1450  2004 

# str(demoData)
'data.frame':   272 obs. of  5 variables:
 $ Age      : Factor w/ 16 levels "00-04 ","05-09 ",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Feminin  : Factor w/ 216 levels "138 ","139 ",..: 112 124 164 165 174 130 106 86 78 66 ...
 $ Masculin.: Factor w/ 201 levels "120 ","122 ",..: 132 141 174 169 170 124 111 89 90 75 ...
 $ Ensemble : Factor w/ 242 levels "1041 ","1044 ",..: 53 66 115 116 119 50 38 14 9 238 ...
 $ Annee    : Factor w/ 17 levels "2004 ","2005",..: 1 1 1 1 1 1 1 1 1 1 ...

score 1 · Accepted Answer

I read your xlsx file using the xlsx package:

library(xlsx)
df <- read.xlsx("C:/Users/swatson1/Downloads/Evolution_Population_2004_2020.xlsx", 1)

It looks like this:

> df
        Age Feminin MasculinÂ. Ensemble  Annee
1   00-04Â    720Â       745Â    1465Â  2004Â 
2   05-09Â    745Â       767Â    1512Â  2004Â

You can overwrite each column, dropping the extra characters, for example:

df$Age<-substr(df$Age,1,5)

Alternatively, gsub can be used on any column, regardless of the length of the entries:

df$Age<-gsub("Â\\s","",df$Age)

Then your code will work:

df[df$Age=="00-04",]
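
To make the failure and the fix easier to reproduce without the original file, here is a minimal sketch on toy data. It assumes the stray character is a non-breaking space (U+00A0) appended to every value, and that the columns were imported as factors, as the str() output in the question suggests. The toy values and the name subset_00_04 are illustrative only.

```r
# Toy data mimicking the imported file: each value carries a trailing
# non-breaking space (U+00A0) and is stored as a factor (assumption).
df <- data.frame(
  Age     = factor(c("00-04\u00a0", "05-09\u00a0", "10-14\u00a0")),
  Feminin = factor(c("720\u00a0", "745\u00a0", "813\u00a0"))
)

# The comparison matches nothing because of the invisible trailing character
n_before <- nrow(df[df$Age == "00-04", ])

# Keep only the first five characters of Age
# (substr coerces the factor to character first)
df$Age <- substr(df$Age, 1, 5)

# Strip the non-breaking space from the count column, then convert to numeric
df$Feminin <- as.numeric(gsub("\u00a0", "", as.character(df$Feminin), fixed = TRUE))

# The original subsetting expression now finds the row
subset_00_04 <- df[df$Age == "00-04", ]
```

Converting the count columns to numeric after cleaning is optional for the subsetting itself, but it is probably what you want before doing any arithmetic on them.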

score 0 · Accepted Answer

# copied from the Excel file
str1 <- "00-04 "
utf8ToInt(str1)
#[1]  48  48  45  48  52 160

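There appears to be a non-breaking space (code point 160) at the end of the string. Clean up your file.

You should be able to remove the non-breaking space with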

df$Age <- gsub(intToUtf8(160),"",df$Age)
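
Since the str() output in the question suggests that every column is affected, the same replacement could be applied to the whole data frame at once. The following is a sketch on toy data, not the original file; the toy values and the names nbsp and result are assumptions made for illustration.

```r
nbsp <- intToUtf8(160)  # the non-breaking space, U+00A0

# Toy data with a trailing non-breaking space in every value (assumption)
df <- data.frame(
  Age   = factor(paste0(c("00-04", "05-09"), nbsp)),
  Annee = factor(paste0(c("2004", "2004"), nbsp))
)

# Confirm that the last code point of the first Age value is 160
last_code <- tail(utf8ToInt(as.character(df$Age[1])), 1)

# Remove the non-breaking space from every column;
# the columns become plain character vectors
df[] <- lapply(df, function(col) gsub(nbsp, "", as.character(col), fixed = TRUE))

# Convert the year column to integer now that it is clean
df$Annee <- as.integer(df$Annee)

result <- df[df$Age == "00-04", ]
```

Using fixed = TRUE treats the non-breaking space as a literal string rather than a regular expression, which avoids depending on how a given regex engine classifies that character.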
