I'm a biologist and completely new to Python; before this I learned a little R.
I have a very large text file (3 GB, too big to handle in R) in which all values are separated by commas, but the extension is .txt (I don't know whether that matters). What I want to do is:
1. read it into Python as an object equivalent to a data frame in R,
2. remove the middle columns to reduce the object's size,
3. write it out to a .txt file, and
4. take what remains back into R.
I'd be glad for any help. Thank you.
There's no need to go through Python first. Your problem looks a lot like this question. The accepted answer there reads the large file iteratively and creates a new, smaller file. Other good alternatives are to use sqlite together with the sqldf package, or to use the ff package. The last approach works particularly well here, since the number of columns is small compared to the number of rows.
This will take minimal memory as it does not load the whole file at once.
import csv
with open('in.txt', 'rb') as f_in, open('out.csv', 'wb') as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    for row in reader:
        # keep the first two columns and the last three columns
        writer.writerow(row[:2] + row[-3:])
Note: If using Python 3, change the file modes to 'r' and 'w', respectively, and pass newline='' to open() as the csv module requires.
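For Python 3, the same streaming filter could look like the sketch below. The filenames are placeholders, and the tiny sample file is made-up data just so the snippet runs as-is:

```python
import csv

def keep_edge_columns(src, dst):
    # Python 3 version: text mode with newline='' so the csv module
    # handles line endings itself; still processes one row at a time.
    with open(src, 'r', newline='') as f_in, open(dst, 'w', newline='') as f_out:
        writer = csv.writer(f_out)
        for row in csv.reader(f_in):
            # keep the first two columns and the last three columns
            writer.writerow(row[:2] + row[-3:])

# Tiny made-up example so the snippet is self-contained:
with open('in.txt', 'w', newline='') as f:
    f.write('a,b,c,d,e,f,g\n1,2,3,4,5,6,7\n')
keep_edge_columns('in.txt', 'out.csv')
```

Because only one row is ever in memory, this works the same on a 3 GB file as on the toy input above.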
Per CRAN (new features and bug fixes re: devel), the new development version 3.0.0 should allow R to use the page file/swap. On Windows you need to set R_MAX_MEM_SIZE to a suitably large value.
If you insist on using a preprocessing step, the Linux command-line tools are a really good and fast option. If you use Linux, these tools are already installed; under Windows you'll need to first install MinGW or Cygwin. This SO question already provides some nice pointers. In essence, you use the awk tool to iteratively process the text file, creating an output text file as you go. Copying from the accepted answer of the SO question I linked:
awk -F "," '{ split ($8,array," "); sub ("\"","",array[1]); sub (NR,"",$0); sub (",","",$0); print $0 > array[1] }' file.txt
This reads the file, grabs the eighth column, and dumps it to a file. See the answer for more details.
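If awk isn't available, the core idea of that one-liner (stream a big file and route each row to an output file keyed by part of column 8) can be sketched in Python. The filename and sample rows below are made up for illustration:

```python
import csv

# Made-up sample data; 'file.txt' is a placeholder name.
with open('file.txt', 'w', newline='') as f:
    f.write('1,2,3,4,5,6,7,keyA extra,9\n')
    f.write('1,2,3,4,5,6,7,keyB extra,9\n')

# Rough Python equivalent of the awk command: write each row to a file
# named after the first space-separated token of the 8th column.
# Opening the output file per row is slow but keeps memory use minimal.
with open('file.txt', 'r', newline='') as f_in:
    for row in csv.reader(f_in):
        key = row[7].split(' ')[0].strip('"')  # 8th column, quotes stripped
        with open(key, 'a', newline='') as f_out:
            csv.writer(f_out).writerow(row)
```

As with the awk version, nothing larger than a single row is held in memory at once.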