We have 1.6 million records in a flat file. Each record contains three or four short strings of fewer than 100 characters each.
We only need 800K of those records; they get written to a Mongo collection, and the other 800K are ignored.
Processing the file takes about 15 minutes, which works out to roughly 1.67K records/sec. Is that the performance I should expect, or should this process be faster (e.g., 5K records/sec, 10K records/sec)?
The code is below (@skip is a hash of roughly 800K application IDs identifying the records to ignore).
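@skip itself isn't shown; purely for context, here is a minimal sketch of how such a lookup hash might be built, assuming the IDs to skip come from a one-ID-per-line text file (the file name and loading code are assumptions, not part of the original):

# Hypothetical setup: a symbol-keyed hash makes the @skip[id.intern]
# membership check in the update Proc a constant-time lookup.
@skip = {}
File.foreach('application_ids_to_skip.txt') do |line|
  id = line.strip
  @skip[id.intern] = true unless id.empty?
end

The two methods in question: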
def updateApplicationDeviceTypes(dir, limit)
  puts "Updating Application Data (Pass 3 - Device Types)..."
  file = File.join(dir, '/application_device_type')
  cols = getColumns(file)
  device_type_id_col = cols[:device_type_id]

  update = Proc.new do |id, group|
    @applications_coll.update(
      { "itunes_id" => id },
      { :$set => { "devices" => group } }
      # If all records for one id aren't adjacent, you'll need this instead
      #{ :$addToSet => { "devices" => { :$each => group } } }
    ) unless !id or @skip[id.intern]
  end

  getValue = Proc.new { |r| r[device_type_id_col] }

  batchRecords(file, cols[:application_id], update, getValue, limit)
end
# Reads the file, grouping adjacent records that share an id and collecting
# each record's getValue result to an array, before calling "update" on the array/id.
def batchRecords(filename, idCol, update, getValue, limit=nil)
  current_id = nil
  current_group = []

  eachRecord(filename, limit) do |r|
    id = r[idCol]
    value = getValue.call(r)
    if id == current_id and !value.nil?
      current_group << value
    else
      # The id changed (or this row had no value): flush the accumulated
      # group and start a new one.
      update.call(current_id, current_group) unless current_id.nil?
      current_id = id
      current_group = value.nil? ? [] : [value]
    end
  end

  # The loop only flushes a group when it starts a new one, so the
  # final group still needs to be written out here.
  update.call(current_id, current_group)
end
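getColumns and eachRecord aren't shown either. For completeness, here is a minimal sketch of what they might look like, assuming the flat file is tab-delimited with a header row naming the columns (both the delimiter and the header format are assumptions):

# Hypothetical helper: map column names (as symbols) to their indexes,
# based on the file's header row.
def getColumns(filename)
  header = File.open(filename, &:gets)
  Hash[header.chomp.split("\t").each_with_index.map { |name, i| [name.intern, i] }]
end

# Hypothetical helper: yield each data row as an array of fields,
# stopping after `limit` rows when a limit is given.
def eachRecord(filename, limit=nil)
  count = 0
  File.open(filename) do |f|
    f.gets # skip the header row
    while (line = f.gets)
      yield line.chomp.split("\t")
      count += 1
      break if limit and count >= limit
    end
  end
end

Whichever way the helpers are actually implemented, the write pattern is the same: one @applications_coll.update call per application id that isn't in @skip.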