ruby-on-rails - Ruby parallel CSV import

Question

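I'm importing a huge CSV file and I'd like to split it up so the import runs faster (I'm not importing straight into the DB; I do some calculations first). The code looks like this: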

def import_shatem
  require 'csv'

  CSV.foreach("/#{Rails.public_path}/uploads/hshatem2.csv", {:encoding => 'ISO-8859-15:UTF-8', :col_sep => ';', :row_sep => :auto, :headers => :first_row}) do |row|
    @eur_cur = Currency.find_by_currency_name("EUR")
    abrakadabra = row[0].to_s()
    (ename,esupp) = abrakadabra.split(/_/)
    eprice = row[6].to_f / @eur_cur.currency_value
    eqnt = /(\d+)/.match(row[1])[0].to_f

    if ename.present? && ename.size>3
      search_condition = "*" + ename.upcase + "*"

      if esupp.present?
        #supplier = @suppliers.find{|item| item['SUP_BRAND'] =~ Regexp.new(".*#{esupp}.*") }
        supplier = Supplier.where("SUP_BRAND like ?", "%#{esupp}%").first
        logger.warn("!!! *** supp !!!")
      end

      if supplier.present?
        @search = ArtLookup.find(:all, :conditions => ['MATCH (ARL_SEARCH_NUMBER) AGAINST(? IN BOOLEAN MODE) and ARL_KIND = 1', search_condition.gsub(/[^0-9A-Za-z]/, '')])
        @articles = Article.find(:all, :conditions => { :ART_ID => @search.map(&:ARL_ART_ID)})
        #@art_concret = @articles.find_all{|item| item.ART_ARTICLE_NR.gsub(/[^0-9A-Za-z]/, '').include?(ename.gsub(/[^0-9A-Za-z]/, '')) }

        @aa = @articles.find{|item| item['ART_SUP_ID']==supplier.SUP_ID} #| @articles
        if @aa.present?
          @art = Article.find_by_ART_ID(@aa)
        end

        if @art.present?
          #require 'time_diff'
          #cur_time = Time.now.strftime('%Y-%m-%d %H:%M')
          #time_diff_components = Time.diff(@art.datetime_of_update, Time.parse(cur_time))
          limit_time = Time.now + 3.hours
          if (@art.PRICEM.to_f >= eprice.to_f || @art.PRICEM.blank? ) #&& @art.datetime_of_update >= limit_time)
            @art.PRICEM = eprice
            @art.QUANTITYM = eqnt
            @art.datetime_of_update = DateTime.now
            @art.save
          end
        end
      end
    end
  end
end

How can I parallelize this and get a faster import?

score 1 · Accepted Answer

Check out the smarter_csv gem! It can read a CSV file in chunks, and you can then create Sidekiq jobs that process those chunks and insert them into the database.

https://github.com/tilo/smarter_csv
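
To make the chunking idea concrete, here is a minimal sketch that uses only Ruby's standard csv library. It does not use smarter_csv's own API. The helper name each_chunk, the chunk size, and the sample columns are all made up for illustration. In a real app each chunk would presumably be passed to a background job (for example a Sidekiq worker) instead of being collected in an array.

```ruby
require 'csv'
require 'stringio'

# Yield rows from a CSV source in arrays of at most `chunk_size` hashes.
# `each_chunk` is a hypothetical helper, not part of smarter_csv or Rails.
def each_chunk(io, chunk_size:, col_sep: ';')
  CSV.new(io, col_sep: col_sep, headers: true).each_slice(chunk_size) do |rows|
    # Plain hashes are easy to serialize as background-job arguments.
    yield rows.map(&:to_h)
  end
end

# Sample data standing in for the real file.
data = "code;qty;price\nA_X;1;10\nB_Y;2;20\nC_Z;3;30\n"

chunks = []
each_chunk(StringIO.new(data), chunk_size: 2) do |chunk|
  # A real importer would enqueue a job here instead of collecting the chunk.
  chunks << chunk
end

chunks.each_with_index do |chunk, i|
  puts "chunk #{i}: #{chunk.size} rows"
end
```

Whether this helps depends on where the time goes: if every row still runs the same slow queries, splitting the file only spreads that load across workers.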

score 0 · Accepted Answer

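Looking at the code, the bottleneck is going to be the database queries. Running it in parallel will not solve that. Instead, let's see if we can make it more efficient.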
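
One way to check that guess before changing anything is to time each phase of the loop. The sketch below uses Benchmark from the standard library. The TIMINGS hash, the timed helper, and the sleep that stands in for database work are assumptions for illustration, not part of the original code.

```ruby
require 'benchmark'

# Accumulated wall-clock seconds per phase. TIMINGS and `timed` are
# hypothetical helpers for illustration only.
TIMINGS = Hash.new(0.0)

# Run the block, add its duration to the named phase, return its result.
def timed(label)
  result = nil
  TIMINGS[label] += Benchmark.realtime { result = yield }
  result
end

rows = ['A_X;5 pcs;10.0', 'B_Y;2 pcs;20.0']
rows.each do |line|
  fields = timed(:parse) { line.split(';') }
  # Stand-in for the database work; replace with the real queries.
  timed(:lookup) { sleep 0.001; fields[0] }
end

# Print phases from slowest to fastest.
TIMINGS.sort_by { |_, secs| -secs }.each do |label, secs|
  puts format('%-8s %.4fs', label, secs)
end
```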

The biggest problem is probably the article search. It makes multiple queries and then searches in memory. We'll get to that last.

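Currency.find_by_currency_name always returns the same thing. Extract it from the loop. It's unlikely to be the bottleneck, but it will help. And, assuming currency_value is a column on Currency, we can save a little time with pick, which fetches a single value rather than loading the whole record.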

  def currency_value
    @currency_value ||= Currency.where(currency_name: "EUR").pick(:currency_value)
  end

Similarly, Supplier.where can benefit from caching if the CSV contains many repeated values. Use Memoist to cache the return value.

  extend Memoist

  private def find_supplier_for_esupp(esupp)
    return if esupp.blank?
    Supplier.where("SUP_BRAND like ?", "%#{esupp}%").first
  end
  memoize :find_supplier_for_esupp
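
If you would rather not add a gem, per-argument caching can be sketched with a plain Hash. This shows the general idea only and is not how Memoist is implemented. SupplierFinder, its brand list, and query_count are hypothetical stand-ins for the real Supplier query.

```ruby
# A stand-in for an expensive lookup; counts how many times it really runs.
class SupplierFinder
  attr_reader :query_count

  def initialize(brands)
    @brands = brands
    @cache = {}
    @query_count = 0
  end

  # Cache by argument. Checking key? (rather than using ||=) also caches nil
  # results, so a brand that matches nothing is not looked up again.
  def find(esupp)
    return @cache[esupp] if @cache.key?(esupp)

    @cache[esupp] = uncached_find(esupp)
  end

  private

  # Placeholder for Supplier.where("SUP_BRAND like ?", "%#{esupp}%").first
  def uncached_find(esupp)
    @query_count += 1
    @brands.find { |brand| brand.include?(esupp) }
  end
end

finder = SupplierFinder.new(%w[BOSCH VALEO])
%w[BOSCH BOSCH NONE NONE VALEO].each { |name| finder.find(name) }
puts finder.query_count # three distinct arguments, so three real lookups
```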

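A %term% pattern cannot use a normal B-Tree index, so the search may be slow depending on the size of the suppliers table. If you are using PostgreSQL, you can speed this query up with a trigram index.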

add_index :suppliers, :SUP_BRAND, using: 'gin', opclass: :gin_trgm_ops
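
To see why a leading wildcard defeats an ordered index, here is a toy illustration in which a sorted array stands in for the B-Tree. The brand names and both helper methods are made up. Real database indexes are more involved, but the ordering argument is the same.

```ruby
# Sorted array as a stand-in for a B-Tree index on SUP_BRAND.
index = %w[ATE BOSCH BREMBO FEBI SACHS VALEO].sort

# Prefix search ("BRE%"): binary-search to the first candidate, then walk
# forward while the prefix still matches.
def prefix_matches(index, prefix)
  start = index.bsearch_index { |value| value >= prefix }
  return [] if start.nil?

  index.drop(start).take_while { |value| value.start_with?(prefix) }
end

# Infix search ("%EM%"): sort order does not help, so every entry is examined.
def infix_matches(index, term)
  index.select { |value| value.include?(term) }
end

puts prefix_matches(index, 'B').inspect
puts infix_matches(index, 'EM').inspect
```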

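Finally, the article search is probably the biggest bottleneck. It queries ArtLookup, loads all the records, and boils them down to a single column. Then it searches Article, loads all the matches into memory, filters them in memory, and finally queries Article once more.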

Assuming the relationship between Article and ArtLookup is set up properly in the models, this can be reduced to a single query.

  art = Article
    .joins(:art_lookups)
    .merge(
      ArtLookup
        .where(ARL_KIND: 1)
        .where(
          'MATCH (ARL_SEARCH_NUMBER) AGAINST(? IN BOOLEAN MODE)',
          search_condition
        )
    )
    .where(
      ART_SUP_ID: supplier.SUP_ID
    )
    .first

That should be much faster.

Putting it all together, along with a few other improvements such as returning early to avoid all the nested ifs.

require 'csv'
require 'memoist'

class ShatemImporter
  extend Memoist

  # Cache the possibly expensive query to find suppliers.
  private def find_supplier_for_esupp(esupp)
    Supplier.where("SUP_BRAND like ?", "%#{esupp}%").first
  end
  memoize :find_supplier_for_esupp

  # Cache the currency value query outside the loop.
  private def currency_value
    @currency_value ||= Currency.find_by(currency_name: "EUR").currency_value
  end

  def import_shatem(csv_file)
    # Options are passed as keyword arguments, which Ruby 3 requires.
    CSV.foreach(
      csv_file,
      encoding: 'ISO-8859-15:UTF-8',
      col_sep: ';',
      row_sep: :auto,
      headers: :first_row
    ) do |row|
      ename, esupp = row[0].to_s.split('_')
      eprice = row[6].to_f / currency_value
      # First run of digits in the quantity column; 0.0 if there is none.
      eqnt = row[1].to_s[/\d+/].to_f

      # Skip rows without a usable article number or supplier.
      next if ename.blank? || ename.size < 4
      next if esupp.blank?

      supplier = find_supplier_for_esupp(esupp)
      next if !supplier

      # One query instead of two queries plus in-memory filtering.
      article = Article
        .joins(:art_lookups)
        .merge(
          ArtLookup
            .where(ARL_KIND: 1)
            .where(
              'MATCH (ARL_SEARCH_NUMBER) AGAINST(? IN BOOLEAN MODE)',
              "*#{ename.upcase}*"
            )
        )
        .where(
          ART_SUP_ID: supplier.SUP_ID
        )
        .first
      next if !article

      # Only overwrite when there is no price yet or the new one is not higher.
      if article.PRICEM.blank? || article.PRICEM.to_f >= eprice.to_f
        article.update!(
          PRICEM: eprice,
          QUANTITYM: eqnt,
          datetime_of_update: DateTime.now
        )
      end
    end
  end
end
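
The row-parsing step can be checked on its own, without a database. The sketch below pulls it into a small function. ParsedRow, parse_row, and the sample row are hypothetical, and the column positions are taken from the code above. Unlike the code in the question, it yields 0.0 rather than raising when the quantity column contains no digits.

```ruby
# Parsed fields for one CSV row. `ParsedRow` and `parse_row` are hypothetical
# names; column positions follow the importer code.
ParsedRow = Struct.new(:ename, :esupp, :eprice, :eqnt)

def parse_row(row, currency_value)
  ename, esupp = row[0].to_s.split('_')
  eprice = row[6].to_f / currency_value
  # String#[] with a regexp returns the first match or nil; nil.to_f is 0.0.
  eqnt = row[1].to_s[/\d+/].to_f
  ParsedRow.new(ename, esupp, eprice, eqnt)
end

# Made-up sample row: article number and brand, quantity text, price in column 6.
row = ['0986424797_BOSCH', '>5 pcs', nil, nil, nil, nil, '12.50']
parsed = parse_row(row, 2.5)
puts parsed.to_a.inspect
```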

This is written for Rails 6 and is untested; your code looks like Rails 2. But hopefully it gives you some avenues for optimization.
