jruby - HBase shell scan bytes to string conversion

Question

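I would like to scan an HBase table and see integers as strings (not their binary representation). I can do the conversion, but I have no idea how to write a scan statement using the Java API from the HBase shell: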

org.apache.hadoop.hbase.util.Bytes.toString(
  "\x48\x65\x6c\x6c\x6f\x20\x48\x42\x61\x73\x65".to_java_bytes)

org.apache.hadoop.hbase.util.Bytes.toString("Hello HBase".to_java_bytes)

I would be very happy to have an example of a scan that fetches binary data (longs) and outputs normal strings. I am using the HBase shell, not Java.

score 13 · Accepted Answer

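HBase stores data as byte arrays (untyped). Therefore, if you perform a table scan, the data is displayed in a common format (an escaped hexadecimal string), for example:
"\x48\x65\x6c\x6c\x6f\x20\x48\x42\x61\x73\x65" -> Hello HBase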
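To make the "untyped bytes" point concrete, here is a minimal sketch in plain Ruby (no HBase classes, so it runs outside the shell too). The helper name `unescape_hbase` is hypothetical, and it assumes the input contains only `\xNN` escapes and literal printable characters. It also shows that a long written by `Bytes.toBytes` is simply 8 big-endian bytes, which is why the shell cannot know it is a number.

```ruby
# Hypothetical helper: turn the shell's "\xNN" escapes back into raw bytes.
# Assumes only \xNN escapes and literal printable characters appear in the input.
def unescape_hbase(escaped)
  escaped.b.gsub(/\\x([0-9A-Fa-f]{2})/) { [$1.hex].pack("C") }
end

# These bytes happen to be ASCII text, so reading them as UTF-8 gives a string.
text = unescape_hbase('\x48\x65\x6c\x6c\x6f\x20\x48\x42\x61\x73\x65')
puts text.force_encoding("UTF-8")

# The same kind of byte array can just as well hold a number.
# A long is stored as 8 bytes, big-endian; "q>" is the matching Ruby pack format.
long_bytes = [42].pack("q>")
puts long_bytes.bytes.map { |b| format('\x%02X', b) }.join  # escaped form, as a scan would show it
puts long_bytes.unpack("q>").first                          # decoded back to the number
```

Which decoding is right depends entirely on what the writer put into the cell; nothing in the stored bytes says so.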

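If you want to get back the typed value from the serialized byte array, you have to do this manually. You have the following options:

Java code (Bytes.toString(...))
Hack the to_string function in $HBASE/HOME/lib/ruby/hbase/table.rb: replace toStringBinary with toInt for non-meta tables
Write a get/scan JRuby function that converts the byte array to the appropriate type

Since you want it in the HBase shell, consider the last option.
Create a file get_result.rb: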

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.ResultScanner
import org.apache.hadoop.hbase.client.Result
import java.util.ArrayList

# Simple function equivalent to scan 'test', {COLUMNS => 'c:c2'},
# except that each cell value is decoded as a 4-byte integer.
def get_result()
  htable = HTable.new(HBaseConfiguration.new, "test")
  rs = htable.getScanner(Bytes.toBytes("c"), Bytes.toBytes("c2"))
  output = ArrayList.new
  output.add "ROW\t\t\t\t\t\tCOLUMN+CELL"
  rs.each { |r|
    r.raw.each { |kv|
      row = Bytes.toString(kv.getRow)
      fam = Bytes.toString(kv.getFamily)
      ql = Bytes.toString(kv.getQualifier)
      ts = kv.getTimestamp
      # Use Bytes.toLong here instead if the cells hold 8-byte longs
      val = Bytes.toInt(kv.getValue)
      output.add " #{row} \t\t\t\t\t\t column=#{fam}:#{ql}, timestamp=#{ts}, value=#{val}"
    }
  }
  # Release the scanner once all rows have been read
  rs.close
  output.each { |line| puts line }
end

Load it in the HBase shell and use it:

require '/path/to/get_result'
get_result

Note: modify/enhance/fix the code according to your needs.
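One possible enhancement, sketched here in plain Ruby so it can be tried without a cluster: pick the decoder per column instead of hard-coding a single conversion. The column names, the `DECODERS` table and `format_cell` are all made up for illustration. The pack formats `l>` and `q>` read big-endian 4-byte and 8-byte signed integers, which is the layout `Bytes.toInt` and `Bytes.toLong` expect.

```ruby
# Hypothetical per-column decoders; the column names are invented for this example.
DECODERS = {
  "c:c1" => ->(bytes) { bytes.dup.force_encoding("UTF-8") },  # text, like Bytes.toString
  "c:c2" => ->(bytes) { bytes.unpack("l>").first },           # 4-byte int, like Bytes.toInt
  "c:c3" => ->(bytes) { bytes.unpack("q>").first }            # 8-byte long, like Bytes.toLong
}

# Format one cell roughly the way get_result prints it.
# Columns without a registered decoder fall back to Ruby's inspect output.
def format_cell(row, column, timestamp, bytes)
  decoder = DECODERS.fetch(column) { ->(b) { b.inspect } }
  " #{row} \t column=#{column}, timestamp=#{timestamp}, value=#{decoder.call(bytes)}"
end

puts format_cell("row1", "c:c2", 1303668804, [7].pack("l>"))
puts format_cell("row1", "c:c3", 1303668804, [7_000_000_000].pack("q>"))
```

Inside the shell the same idea would map column names to `Bytes.toString`, `Bytes.toInt` or `Bytes.toLong` calls on `kv.getValue`.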

score 3 · Answer

For completeness, it turns out that calling Bytes::toStringBinary gives the hex-escaped sequence you get in the HBase shell:

\x0B\x2_SOME_ASCII_TEXT_\x10\x00...

Bytes::toString, however, tries to deserialize the bytes into a string on the assumption that they are UTF-8, which looks more like:

\u8900\u0710\u0115\u0320\u0000_SOME_UTF8_TEXT_\u4009...
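The difference can be approximated in plain Ruby. This is only a sketch: `to_string_binary` and `to_utf8_string` are hypothetical stand-ins, and the exact set of characters that HBase leaves unescaped may differ between versions.

```ruby
# Rough approximation of Bytes.toStringBinary:
# printable ASCII (except the backslash) is kept, every other byte becomes \xNN.
def to_string_binary(bytes)
  bytes.bytes.map { |b|
    (b >= 0x20 && b <= 0x7E && b != 0x5C) ? b.chr : format('\x%02X', b)
  }.join
end

# Rough approximation of Bytes.toString: interpret the bytes as UTF-8 and
# substitute the Unicode replacement character for invalid sequences.
def to_utf8_string(bytes)
  bytes.dup.force_encoding("UTF-8").scrub
end

# A control byte, some ASCII text, then a long serialized as 8 big-endian bytes.
raw = [0x0B].pack("C") + "text".b + [1234567890123].pack("q>")
puts to_string_binary(raw)
puts to_utf8_string(raw).inspect
```

The first form is lossless and always printable; the second is only meaningful when the cell really contains UTF-8 text.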

score 2 · Answer

You can add a scan_counter command to the HBase shell.

First:

Add the following to /usr/lib/hbase/lib/ruby/hbase/table.rb (after the scan function):

#----------------------------------------------------------------------------------------------
  # Scans whole table or a range of keys and returns rows matching specific criteria, with cell values decoded as longs
  def scan_counter(args = {})
    unless args.kind_of?(Hash)
      raise ArgumentError, "Arguments should be a hash. Failed to parse #{args.inspect}, #{args.class}"
    end

    limit = args.delete("LIMIT") || -1
    maxlength = args.delete("MAXLENGTH") || -1

    if args.any?
      filter = args["FILTER"]
      startrow = args["STARTROW"] || ''
      stoprow = args["STOPROW"]
      timestamp = args["TIMESTAMP"]
      columns = args["COLUMNS"] || args["COLUMN"] || get_all_columns
      # Default to true, but honor an explicit CACHE_BLOCKS => false
      cache = args.has_key?("CACHE_BLOCKS") ? args["CACHE_BLOCKS"] : true
      versions = args["VERSIONS"] || 1
      timerange = args[TIMERANGE]

      # Normalize column names
      columns = [columns] if columns.class == String
      unless columns.kind_of?(Array)
        raise ArgumentError.new("COLUMNS must be specified as a String or an Array")
      end

      scan = if stoprow
        org.apache.hadoop.hbase.client.Scan.new(startrow.to_java_bytes, stoprow.to_java_bytes)
      else
        org.apache.hadoop.hbase.client.Scan.new(startrow.to_java_bytes)
      end

      columns.each { |c| scan.addColumns(c) }
      scan.setFilter(filter) if filter
      scan.setTimeStamp(timestamp) if timestamp
      scan.setCacheBlocks(cache)
      scan.setMaxVersions(versions) if versions > 1
      scan.setTimeRange(timerange[0], timerange[1]) if timerange
    else
      scan = org.apache.hadoop.hbase.client.Scan.new
    end

    # Start the scanner
    scanner = @table.getScanner(scan)
    count = 0
    res = {}
    iter = scanner.iterator

    # Iterate results
    while iter.hasNext
      if limit > 0 && count >= limit
        break
      end

      row = iter.next
      key = org.apache.hadoop.hbase.util.Bytes::toStringBinary(row.getRow)

      row.list.each do |kv|
        family = String.from_java_bytes(kv.getFamily)
        qualifier = org.apache.hadoop.hbase.util.Bytes::toStringBinary(kv.getQualifier)

        column = "#{family}:#{qualifier}"
        cell = to_string_scan_counter(column, kv, maxlength)

        if block_given?
          yield(key, "column=#{column}, #{cell}")
        else
          res[key] ||= {}
          res[key][column] = cell
        end
      end

      # One more row processed
      count += 1
    end

    return ((block_given?) ? count : res)
  end

  #----------------------------------------------------------------------------------------
  # Helper methods

  # Returns a list of column names in the table
  def get_all_columns
    @table.table_descriptor.getFamilies.map do |family|
      "#{family.getNameAsString}:"
    end
  end

  # Checks if current table is one of the 'meta' tables
  def is_meta_table?
    tn = @table.table_name
    org.apache.hadoop.hbase.util.Bytes.equals(tn, org.apache.hadoop.hbase.HConstants::META_TABLE_NAME) || org.apache.hadoop.hbase.util.Bytes.equals(tn, org.apache.hadoop.hbase.HConstants::ROOT_TABLE_NAME)
  end

  # Returns family and (when has it) qualifier for a column name
  def parse_column_name(column)
    split = org.apache.hadoop.hbase.KeyValue.parseColumn(column.to_java_bytes)
    return split[0], (split.length > 1) ? split[1] : nil
  end

  # Make a String of the passed kv
  # Intercept cells whose format we know such as the info:regioninfo in .META.
  def to_string(column, kv, maxlength = -1)
    if is_meta_table?
      if column == 'info:regioninfo' or column == 'info:splitA' or column == 'info:splitB'
        hri = org.apache.hadoop.hbase.util.Writables.getHRegionInfoOrNull(kv.getValue)
        return "timestamp=%d, value=%s" % [kv.getTimestamp, hri.toString]
      end
      if column == 'info:serverstartcode'
        if kv.getValue.length > 0
          str_val = org.apache.hadoop.hbase.util.Bytes.toLong(kv.getValue)
        else
          str_val = org.apache.hadoop.hbase.util.Bytes.toStringBinary(kv.getValue)
        end
        return "timestamp=%d, value=%s" % [kv.getTimestamp, str_val]
      end
    end

    val = "timestamp=#{kv.getTimestamp}, value=#{org.apache.hadoop.hbase.util.Bytes::toStringBinary(kv.getValue)}"
    (maxlength != -1) ? val[0, maxlength] : val
  end


  def to_string_scan_counter(column, kv, maxlength = -1)
    if is_meta_table?
      if column == 'info:regioninfo' or column == 'info:splitA' or column == 'info:splitB'
        hri = org.apache.hadoop.hbase.util.Writables.getHRegionInfoOrNull(kv.getValue)
        return "timestamp=%d, value=%s" % [kv.getTimestamp, hri.toString]
      end
      if column == 'info:serverstartcode'
        if kv.getValue.length > 0
          str_val = org.apache.hadoop.hbase.util.Bytes.toLong(kv.getValue)
        else
          str_val = org.apache.hadoop.hbase.util.Bytes.toStringBinary(kv.getValue)
        end
        return "timestamp=%d, value=%s" % [kv.getTimestamp, str_val]
      end
    end

    val = "timestamp=#{kv.getTimestamp}, value=#{org.apache.hadoop.hbase.util.Bytes::toLong(kv.getValue)}"
    (maxlength != -1) ? val[0, maxlength] : val
  end
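One caveat with to_string_scan_counter: Bytes.toLong expects a value of exactly 8 bytes and, as far as I know, raises an error for anything else, so a single non-counter cell can abort the scan. A length check with a fallback avoids that. The sketch below shows the idea on plain Ruby strings rather than KeyValue objects; `decode_counter` and `LONG_SIZE` are hypothetical names.

```ruby
# Size of a serialized long in bytes.
LONG_SIZE = 8

# Hypothetical helper: decode a cell as a long when it has the right size,
# otherwise fall back to an escaped, printable rendering of the bytes.
def decode_counter(value_bytes)
  if value_bytes.bytesize == LONG_SIZE
    value_bytes.unpack("q>").first  # 8 bytes, big-endian, signed
  else
    value_bytes.bytes.map { |b|
      (b >= 0x20 && b <= 0x7E && b != 0x5C) ? b.chr : format('\x%02X', b)
    }.join
  end
end

puts decode_counter([5].pack("q>"))  # an 8-byte counter value
puts decode_counter("abc")           # not a long, printed as text
```

Note that a string which happens to be exactly 8 bytes long would still be shown as a number, so this guard only protects against length errors, not against mixed data.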

Second:

Add the following file, scan_counter.rb, to /usr/lib/hbase/lib/ruby/shell/commands/:

#
# Copyright 2010 The Apache Software Foundation
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

module Shell
  module Commands
    class ScanCounter < Command
      def help
        return <<-EOF
Scan a table with cell value that is long; pass table name and optionally a dictionary of scanner
specifications.  Scanner specifications may include one or more of:
TIMERANGE, FILTER, LIMIT, STARTROW, STOPROW, TIMESTAMP, MAXLENGTH,
or COLUMNS. If no columns are specified, all columns will be scanned.
To scan all members of a column family, leave the qualifier empty as in
'col_family:'.

Some examples:

  hbase> scan_counter '.META.'
  hbase> scan_counter '.META.', {COLUMNS => 'info:regioninfo'}
  hbase> scan_counter 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
  hbase> scan_counter 't1', {FILTER => org.apache.hadoop.hbase.filter.ColumnPaginationFilter.new(1, 0)}
  hbase> scan_counter 't1', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]}

For experts, there is an additional option -- CACHE_BLOCKS -- which
switches block caching for the scanner on (true) or off (false).  By
default it is enabled.  Examples:

  hbase> scan_counter 't1', {COLUMNS => ['c1', 'c2'], CACHE_BLOCKS => false}
EOF
      end

      def command(table, args = {})
        now = Time.now
        formatter.header(["ROW", "COLUMN+CELL"])

        count = table(table).scan_counter(args) do |row, cells|
          formatter.row([ row, cells ])
        end

        formatter.footer(now, count)
      end
    end
  end
end

Finally:

Add the scan_counter command to /usr/lib/hbase/lib/ruby/shell.rb.

Replace the current command group with this one (you can identify it by 'DATA MANIPULATION COMMANDS'):

Shell.load_command_group(
  'dml',
  :full_name => 'DATA MANIPULATION COMMANDS',
  :commands => %w[
    count
    delete
    deleteall
    get
    get_counter
    incr
    put
    scan
    scan_counter
    truncate
  ]
)
