I am looking for a gem that will split a CSV dataset into smaller training and test datasets for a machine learning system. There is a package in R which will do this based on random sampling, but my research has not turned up anything in Ruby. The reason I want to do this in Ruby is that the original dataset is quite large, about 17 million rows (5.5 GB). R expects to load the entire dataset into memory; Ruby is far more flexible. Any suggestions would be appreciated.

4 Answers

This will partition your original data into two files without loading it all into memory. Each row is assigned independently, so the split will be approximately (not exactly) 75/25:

require 'csv'

sample_perc = 0.75  # fraction of rows that go to the training sample

CSV.open('sample.csv', 'w') do |sample_out|
  CSV.open('test.csv', 'w') do |test_out|
    # Stream the input one row at a time; only the current row is in memory.
    CSV.foreach('alldata.csv') do |row|
      (Random.rand < sample_perc ? sample_out : test_out) << row
    end
  end
end
answered 2013-03-30T21:39:15.357

CSV is built into Ruby; you don't need any gem to do this:

require 'csv'

# Open ten output files, then stream the input, appending each row
# to a randomly chosen file.
csvs = (1..10).map { |i| CSV.open("data#{i}.csv", "w") }
begin
  CSV.foreach("data.csv") do |row|
    csvs.sample << row
  end
ensure
  csvs.each(&:close)  # flush and close the output files
end

CSV.foreach will not load the entire file into memory.

answered 2013-03-30T11:39:58.450

You will probably want to write your own code for this, based around Ruby's bundled csv library. There are lots of possibilities for how to split the data, and doing it efficiently over such a large dataset is quite a specialised requirement, whilst also not needing much code.

However, you might have some luck looking through the many sub-features of the ai4r gem.

I've not yet found many mature pre-packaged machine learning algorithms for Ruby of the kind you would find in R or in Python's scikit-learn: no random forests, GBMs etc., or if they exist, they are difficult to find. There is a Ruby interface to R, and there are wrappers for ATLAS, but I have tried neither.

I do make use of ruby-fann (neural nets), and the narray gem is your friend for large numerical datasets.
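
To illustrate the narray point, here is a minimal sketch (the array size is arbitrary) of how NArray holds numeric data in one compact C buffer rather than as individual Ruby objects:

require 'narray'

# One million doubles held in a single contiguous C buffer,
# far more compact than an Array of Float objects.
a = NArray.float(1_000_000)
a.random!(1.0)   # fill in place with values in [0, 1)
puts a.mean      # statistics are computed in C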

answered 2013-03-30T13:58:43.633

You can use the smarter_csv Ruby gem, setting chunk_size to the desired sample size, and then save the chunks as Resque jobs, which can be processed in parallel.

https://github.com/tilo/smarter_csv

See the examples on that GitHub page. A rough sketch of the approach follows below.
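
Here is that sketch; the job class, queue name, and chunk size are placeholders, and the per-chunk processing is left open:

require 'smarter_csv'
require 'resque'

# Hypothetical Resque job; the queue name and perform logic are placeholders.
class ChunkWorker
  @queue = :csv_chunks

  def self.perform(rows)
    # process one chunk of rows here, e.g. write it out as a sample file
  end
end

# Stream the big file in chunks of 10,000 rows; each chunk arrives
# as an array of hashes keyed by the header row.
SmarterCSV.process('alldata.csv', chunk_size: 10_000) do |chunk|
  Resque.enqueue(ChunkWorker, chunk)
end

Note that Resque serialises job arguments as JSON, so for very large chunks you may prefer to write each chunk to a file and enqueue just the file name.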

answered 2013-04-13T18:46:54.560