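I have a CSV file with 1.9 million rows and 32 columns. My RAM is limited, which makes loading the whole file into memory impractical. I am therefore considering using a database, but I have no in-depth knowledge of the subject. I have searched this site and have not found a workable solution so far.
The CSV file looks like this: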
Case,Event,P01,P02,P03,P04,P05,P06,P07,P08,P09,P10,P11,P12,P13,P14,P15,P16,P17,P18,P19,P20,P21,P22,P23,P24,P25,P26,P27,P28,P29,P30
C000039,E97553,8,10,90,-0.34176313227395744,-5.581162038780728E-4,-0.12090388100201072,-1.5172412910939355,-0.9075283173030568,2.0571877671625742,-0.002902632819930783,-0.6761896565590585,-0.7258602353522214,0.8684602429202587,0.0023189312896576167,0.002318939470525324,-0.1881462494296103,-0.0014303471592995315,-0.03133299206977217,7.72338072867324E-4,-0.08952068388668191,-1.4536398437657685,-0.020065144945600275,-0.16276139919188118,0.6915962670997067,-1.593412697264055,-1.563877781707804,-1.4921751129092755,4.701551108078644,6,-0.688302560842075
C000039,E23039,8,10,90,-0.3420173545012358,-5.581162038780728E-4,-1.6563770995734233,-1.5386562526752448,-1.3604342580422861,2.1025445031625525,-0.0028504751366762804,-0.6103972392687121,-2.0390388918403284,-1.7249948885013526,0.00231891181914203,0.0023189141684282384,-0.18603688853814693,-0.0014303471592995315,-0.03182759137355937,0.001011754948131039,0.13009444290656555,-1.737249614361576,-0.015763602969926262,-0.16276139919188118,0.7133868949811379,-1.624962995908364,-1.5946762525901037,-1.5362787555380522,4.751479927607516,6,-0.688302560842075
C000039,E23039,35,10,90,-0.3593468363273839,-5.581162038780728E-4,-2.2590624066428937,-1.540784192984501,-1.3651511418164592,0.05539868728273849,-0.00225912499740972,0.20899232681704485,-2.2007336302050633,-2.518401278903022,0.0023189850665203673,0.0023189834133465186,-0.1386548782028836,-0.0013092574968056093,-0.0315006293688149,9.042390365542781E-4,-0.3514180333671346,-1.8007561969675518,-0.008593259125791147,-2.295351187387221,0.6329101442826701,-1.8095530459660578,-1.7748676145152822,-1.495347406256394,2.553693742122162,34,-0.6882806822066699
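.... .... up to 1.9 million rows
As you can see, values in the "Case" column repeat, but I only want to keep the unique records before importing the data into a data frame. So I used this: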
f<-file("test.csv")
bigdf <- sqldf("select * from 'f' where Case in (select Case from 'f' group by Case having count(*) = 1)", dbname = tempfile(), file.format = list(header = T, row.names = F))
But I get this error:
Error in sqliteExecStatement(con, statement, bind.data) :
RS-DBI driver: (error in statement: near "in": syntax error)
Am I missing something obvious here? Many thanks in advance.
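
One possible cause, offered as a guess rather than a confirmed fix: CASE is a keyword in SQL (it starts a CASE ... WHEN expression), so SQLite may be reading "where Case in" as the start of such an expression and then failing at "in". A minimal sketch of the same query with the column name quoted as an identifier is below. The three-row CSV and the csv_path name are hypothetical stand-ins for the real file, the table is referenced as a bare f instead of 'f', and whether the column keeps the name Case after import may depend on the sqldf and RSQLite versions installed.

```r
library(sqldf)

# Hypothetical miniature of the real file, written to a temporary path
# purely for illustration (the real file would be "test.csv").
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Case,Event,P01",
  "C000039,E97553,8",
  "C000039,E23039,8",
  "C000040,E11111,35"
), csv_path)

f <- file(csv_path)

# Same query as above, but with Case double-quoted so that SQLite treats
# it as a column name instead of the CASE keyword. The R string uses
# single quotes so the SQL double quotes need no escaping.
bigdf <- sqldf(
  'select * from f
   where "Case" in (select "Case" from f group by "Case" having count(*) = 1)',
  dbname = tempfile(),
  file.format = list(header = TRUE, row.names = FALSE)
)

print(bigdf)
```

Note that "having count(*) = 1" keeps only the cases that occur exactly once in the file; if the goal is one row per case instead, a different query would be needed.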