4

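I'm trying to pick a random line from a large file (> 1 million lines) without picking any duplicates. If I hit a duplicate, I want to keep picking until I find a line that hasn't been picked yet.

What I have so far: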

@already_picked = []

def random_line
  chosen_line = nil
  chosen_line_number = nil
  File.foreach("OSPD4.txt").each_with_index do |line, number| 
    if rand < 1.0/(number+1)
      chosen_line_number = number
      chosen_line = line
    end
  end
  chosen_line
  if @already_picked.include?(chosen_line_number)
    # what here?
  else
    @already_picked << chosen_line_number
  end
end

100.times do |t|
  random_line
end

I'm not sure what to do in the if clause.

4

4 Answers

2

1 million lines isn't very much. If they average 100 bytes per line, that's about 100 MB in memory. So do the simple thing and move on:

File.readlines("file").sample(100)
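As a small check of why this one-liner already covers the no-duplicates requirement: Array#sample(n) draws from distinct indices, so no line comes back twice unless the file itself contains identical lines. The sketch below assumes a throwaway temp file of made-up words standing in for the real word list.

```ruby
require 'tempfile'

# Build a small throwaway file of distinct lines (stand-in for the real file).
tmp = Tempfile.new('words')
1000.times { |i| tmp.puts "word#{i}" }
tmp.flush

# sample(n) chooses n distinct positions in the array, so with distinct
# lines in the file the result contains no repeats.
picked = File.readlines(tmp.path).sample(100).map(&:chomp)

puts picked.size      # number of lines drawn
puts picked.uniq.size # the same number when nothing was drawn twice

tmp.close! # close and delete the temp file
```

If the file can contain identical lines and those should count as duplicates too, calling uniq on the array before sample would handle that.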

If the file grows larger than comfortably fits in memory, the next step is to make one pass over the file to record the position of each line, then pull samples from those positions.

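class RandomLine
  def initialize(fn)
    @file = File.open(fn, 'r')
    # Byte offset of the start of every line. bytesize (not size) keeps the
    # offsets correct for multibyte text, since seek works in bytes.
    offsets = @file.each_line.inject([0]) { |m, l| m << m.last + l.bytesize }
    offsets.pop # the final entry is end-of-file, not the start of a line
    @positions = offsets.shuffle
  end

  def pick
    pos = @positions.pop
    return nil if pos.nil? # every line has already been handed out
    @file.seek(pos)
    @file.gets
  end
end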
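One detail worth isolating in the offset-table approach: IO#seek works in bytes, so offsets should be accumulated with String#bytesize, because String#size counts characters and drifts on multibyte text. A minimal sketch, assuming hypothetical helpers line_offsets and line_at and a throwaway temp file:

```ruby
require 'tempfile'

# Hypothetical helper: byte offset of the start of every line in the file.
def line_offsets(path)
  offsets = []
  pos = 0
  File.foreach(path) do |line|
    offsets << pos
    pos += line.bytesize # seek counts bytes, so use bytesize rather than size
  end
  offsets
end

# Hypothetical helper: read the single line that starts at the given byte offset.
def line_at(path, offset)
  File.open(path) do |f|
    f.seek(offset)
    f.gets
  end
end

tmp = Tempfile.new('lines')
tmp.write("alpha\nbêta\ngamma\n") # "ê" takes two bytes in UTF-8
tmp.flush

offsets = line_offsets(tmp.path)
# Visiting the offsets in shuffled order yields every line exactly once.
lines = offsets.shuffle.map { |o| line_at(tmp.path, o) }

puts offsets.inspect
puts lines.size

tmp.close! # close and delete the temp file
```

Reopening the file on every line_at call keeps the sketch simple; a long-lived file handle, as in the class above, avoids that cost.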
answered 2013-04-14T20:57:18.267
1

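Your approach may read a large part of the file every time a random line is requested. A better approach might be to read the whole file once and build a table of the start position of each line (so you don't have to keep all the data in memory). Assuming the file doesn't change, you can then seek to a random position taken from this table and read a single line, which is fast. One possible implementation: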

class RandomLine
  def initialize(filename)
    @file = File.open(filename)
    @table = [0]
    @picked = []
    File.foreach(filename) do |line|
      # bytesize, not size: seek works in bytes, not characters
      @table << @table.last + line.bytesize
    end
    @table.pop # the last entry is end-of-file, not the start of a line
  end
  def pick
    return nil if @table.empty? # if no more lines, nil
    i = rand(@table.size) # random line
    @file.seek(@table[i]) # go to the line
    @table.delete_at(i)   # remove from the table
    line = @file.readline
    if @picked.include? line
      pick   # same text was already returned; pick another line
    else
      @picked << line
      line
    end
  end
end

Usage:

random_line = RandomLine.new("OSPD4.txt")
100.times do
  puts random_line.pick
end
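If the number of lines needed is known up front, a different technique (swapped in here, not part of the answer above) is reservoir sampling for k items, which extends the one-line loop from the question: a single pass over the file, k lines held in memory, and no repeated line positions. A minimal sketch, assuming a hypothetical helper sample_lines and a throwaway temp file:

```ruby
require 'tempfile'

# Hypothetical helper: pick k lines at distinct positions in one pass
# (reservoir sampling). Memory use is k lines regardless of file size.
def sample_lines(path, k)
  reservoir = []
  File.foreach(path).with_index do |line, i|
    if i < k
      reservoir << line            # fill the reservoir with the first k lines
    else
      j = rand(i + 1)              # uniform integer in 0..i
      reservoir[j] = line if j < k # replace a slot with probability k/(i+1)
    end
  end
  reservoir
end

tmp = Tempfile.new('words')
1000.times { |i| tmp.puts "word#{i}" }
tmp.flush

picked = sample_lines(tmp.path, 100)
puts picked.size

tmp.close! # close and delete the temp file
```

The trade-off is that every call rereads the whole file, so this suits drawing one batch rather than picking lines one at a time.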
answered 2013-04-14T19:26:00.630
1

While it's very noble to go to that much work to avoid reading the file into memory, a million lines isn't all that much. An alternative is to just try a simple solution and only go complex if it's actually slow in practice.

class RandomLine
  def initialize fn
    open(fn, 'r') { |f| @i, @lines = -1, f.readlines.shuffle }
  end

  def pick
    @lines[@i += 1]
  end
end

o = RandomLine.new '/etc/hosts'
while (q = o.pick)
  puts q
end
answered 2013-04-14T20:07:16.837
1

Since File.readlines returns an array of lines, you can just use the #sample method.

File.readlines("OSPD4.txt").sample(100).map{|line| line.chomp }
# using chomp to get rid of EOL
answered 2013-04-14T20:41:34.143