When I try to parse more than 7,000 txt files in R with the readtext library (which accompanies the quanteda library), I get the following warning.

Warning message: In (function (..., deparse.level = 1) : number of columns of result is not a multiple of vector length (arg 2030)

How can I determine which txt file causes the warning?

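The verbose option does not show where the warning occurs. For your information, when I parse just two files I get the output below (by the way, if I parse only one document at a time, no warning is shown).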

Reading texts from /Users/OS/surfdrive/Competenties/Data-step-1/BinnenlandsBestuur/1982/9-12/Office Lens 20170308-102311.jpg.txt
Reading texts from /Users/OS/surfdrive/Competenties/Data-step-1/BinnenlandsBestuur/1983/Office Lens 20170308-103518.jpg.txt
, using glob pattern
 ... reading (txt) file: Office Lens 20170308-102311.jpg.txt
, using glob pattern
 ... reading (txt) file: Office Lens 20170308-103518.jpg.txt
read 2 documents.
Warning messages:
1: In (function (..., deparse.level = 1) : number of columns of result is not a multiple of vector length (arg 2)
2: In if (verbosity == 2 & nchar(msg) > 70) pad <- paste0("\n", pad) : the condition has length > 1 and only the first element will be used

Session info
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] C/C/C/C/C/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tm.plugin.webmining_1.3 XML_3.98-1.7            readtext_0.50           RoogleVision_0.0.1.1   
 [5] outliers_0.14           stringdist_0.9.4.4      ltm_1.0-0               polycor_0.7-9          
 [9] msm_1.6.4               MASS_7.3-47             psych_1.7.5             WriteXLS_4.0.0         
[13] plyr_1.8.4              metafor_2.0-0           Matrix_1.2-9            metaSEM_0.9.14         
[17] OpenMx_2.7.12           xlsx_0.5.7              xlsxjars_0.6.1          rJava_0.9-8            
[21] readxl_1.0.0            quanteda_0.9.9-65       koRpus.lang.nl_0.01-3   koRpus_0.11-1          
[25] sylly_0.1-1             jsonlite_1.5            httr_1.2.1             

loaded via a namespace (and not attached):
 [1] sylly.ru_0.1-1      splines_3.4.0       ellipse_0.3-8       RcppParallel_4.3.20 shiny_1.0.3        
 [6] sylly.it_0.1-1      expm_0.999-2        sylly.es_0.1-1      cellranger_1.1.0    slam_0.1-40        
[11] yaml_2.1.14         backports_1.1.0     lattice_0.20-35     digest_0.6.12       googleAuthR_0.5.1  
[16] colorspace_1.3-2    htmltools_0.3.6     httpuv_1.3.3        tm_0.7-1            devtools_1.13.2    
[21] xtable_1.8-2        mvtnorm_1.0-6       scales_0.4.1        tibble_1.3.3        openssl_0.9.6      
[26] ggplot2_2.2.1       withr_1.0.2         lazyeval_0.2.0      NLP_0.1-10          mnormt_1.5-5       
[31] RJSONIO_1.3-0       survival_2.41-3     magrittr_1.5        mime_0.5            memoise_1.1.0      
[36] evaluate_0.10       boilerpipeR_1.3     nlme_3.1-131        foreign_0.8-67      rsconnect_0.8      
[41] tools_3.4.0         data.table_1.10.4   stringr_1.2.0       munsell_0.4.3       compiler_3.4.0     
[46] rlang_0.1.1         grid_3.4.0          RCurl_1.95-4.8      bitops_1.0-6        rmarkdown_1.5      
[51] gtable_0.2.0        curl_2.6            R6_2.2.2            sylly.en_0.1-1      knitr_1.16         
[56] fastmatch_1.1-0     sylly.fr_0.1-1      rprojroot_1.2       stringi_1.1.5       parallel_3.4.0     
[61] sylly.de_0.1-1      Rcpp_0.12.11 

Thank you, Peter

PS: If this information is insufficient, I will post a reproducible example on the GitHub page.

1 Answer

You can use purrr to find the files whose columns do not match what you expect.

First, let's create some demo data in which one file has different column names from the other three...

library(tidyverse)
library(purrr)
library(stringr)
old_wd <- getwd()
setwd(tempdir())

demo_data <- tibble(x = rnorm(327),
                    y = rnorm(327),
                    z = rnorm(327))

write_csv(demo_data, "demo1.csv")
write_csv(demo_data, "demo2.csv")
write_csv(demo_data, "demo3.csv")

bad_data <-
  tibble(
    x = rnorm(327),
    y = rnorm(327),
    z = rnorm(327),
    extra_column = rnorm(327)
  )

write_csv(bad_data, "demo4.csv")

Now define what the column names should be. For this example, the correct names are x, y, and z.

correct_names <- c("x", "y", "z")

This function reads a csv and checks whether its column names match correct_names.

get_csv_names <- function(path){
  # identical() avoids the recycling warning that `==` raises when the lengths differ
  c(path, identical(names(read_csv(path)), correct_names))
}

I assume you want to process all csv files in the working directory. Otherwise you will want to change the value of files below...

# Anchor the pattern so that only file names ending in ".csv" match
files <- list.files(pattern = "\\.csv$")

Now simply map get_csv_names over files. Note that demo4.csv has the value FALSE, which means its column names do not match the ones you specified in correct_names...

map(files, get_csv_names)

# [[1]]
# [1] "demo1.csv" "TRUE"     
# 
# [[2]]
# [1] "demo2.csv" "TRUE"     
# 
# [[3]]
# [1] "demo3.csv" "TRUE"     
# 
# [[4]]
# [1] "demo4.csv" "FALSE"  
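
With thousands of inputs, a variant of the check above that returns only the offending files may be easier to scan. The sketch below uses base R rather than readr and purrr so that it runs without extra packages. `has_expected_names`, the `csv_name_check` directory, and the demo file names are hypothetical and introduced only for illustration.

```r
# Hypothetical helper: TRUE when a csv's header equals `expected` exactly.
has_expected_names <- function(path, expected) {
  # Reading a single row is enough to obtain the header.
  header <- names(read.csv(path, nrows = 1, check.names = FALSE))
  identical(header, expected)
}

# Self-contained demo in a fresh sub-directory of tempdir().
demo_dir <- file.path(tempdir(), "csv_name_check")
dir.create(demo_dir, showWarnings = FALSE)
good <- data.frame(x = 1:3, y = 4:6, z = 7:9)
bad  <- cbind(good, extra_column = 10:12)
write.csv(good, file.path(demo_dir, "demo1.csv"), row.names = FALSE)
write.csv(bad,  file.path(demo_dir, "demo4.csv"), row.names = FALSE)

# Check every csv in the directory; vapply guarantees a logical vector.
paths <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
ok <- vapply(paths, has_expected_names, logical(1),
             expected = c("x", "y", "z"))

# Keep only the files whose header does not match.
mismatched <- basename(paths[!ok])
print(mismatched)
```

Because the result is a plain character vector, it can be passed straight to whatever inspects or repairs the files next.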

Since we changed the working directory at the start, it is best to reset it at the end.

setwd(old_wd)
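
As for the original warning itself, the text `(function (..., deparse.level = 1)` looks like a call to `rbind` made through `do.call`. Base R's `rbind` reports the position of an argument whose length does not divide the number of columns of the result. If readtext combines its per-file results that way (an assumption, not confirmed from its source), `arg 2030` would point at the 2030th file in processing order. The toy list below is made up and only stands in for such per-file results.

```r
# Toy stand-in for per-file results: the third element is shorter.
parts <- list(
  a.txt = c("id1", "text1", "x"),
  b.txt = c("id2", "text2", "y"),
  c.txt = c("id3", "text3")
)

# Capture the warning text instead of letting it print at the end.
msg <- tryCatch(
  {
    do.call(rbind, parts)
    NA_character_  # reached only if no warning was raised
  },
  warning = function(w) conditionMessage(w)
)
print(msg)

# The same check without relying on the warning: flag every element
# whose length does not divide the longest length.
lens <- lengths(parts)
suspects <- names(parts)[max(lens) %% lens != 0]
print(suspects)
```

Applied to real data, the equivalent step would be to look up the flagged position in the vector of file paths, for example `file_paths[2030]`, where `file_paths` is a hypothetical vector listing the files in the order they were read.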
Answered 2017-08-02T12:56:34.077