apache-spark - Decimal data type not storing values correctly in Spark and Hive

Question

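I am having a problem storing values with the decimal data type, and I am not sure whether this is a bug or whether I am doing something wrong.

The data in the file looks like this: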

Column1 column2 column3
steve   100     100.23
ronald  500     20.369
maria   600     19.23

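When I infer the schema in Spark using the CSV reader, it takes column3 as a string, so I cast it to decimal and save it as a table.

Now, when I query the table, it shows the output as follows, with the fractional digits dropped: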

Column1 column2 column3
steve   100     100
ronald  500     20
maria   600     19

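I also tested the same thing in Hive by creating a local table with column3 as decimal and loading the data into it. Again, the values are not stored with their fractional digits.

Any help in this regard would be appreciated.

Here is the code for the above.

Schema of the file in Spark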

root
 |-- DEST_AIRPORT_ID: integer (nullable = true)
 |-- DEST_AIRPORT_SEQ_ID: integer (nullable = true)
 |-- DEST_CITY_MARKET_ID: integer (nullable = true)
 |-- DEST string: string (nullable = true)
 |-- DEST_CITY_NAME: string (nullable = true)
 |-- DEST_STATE_ABR: string (nullable = true)
 |-- DEST_STATE_FIPS: integer (nullable = true)
 |-- DEST_STATE_NM: string (nullable = true)
 |-- DEST_WAC: integer (nullable = true)
 |-- DEST_Miles: double (nullable = true)

Code

from pyspark import SparkContext
sc =SparkContext()

from pyspark.sql.types import *
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

Data=sqlContext.read.format("com.databricks.spark.csv").options(header="true").options(delimiter=",").options(inferSchema="true").load("s3://testbucket/Data_test.csv")

Data1=Data.withColumnRenamed('DEST string','DEST_string')

Data2 =Data1.withColumn('DEST_Miles',Data1.DEST_Miles.cast('Decimal'))

Data2.saveAsTable('Testing_data', mode='overwrite',path='s3://bucketname/Testing_data')

Schema after casting to decimal

root
 |-- DEST_AIRPORT_ID: integer (nullable = true)
 |-- DEST_AIRPORT_SEQ_ID: integer (nullable = true)
 |-- DEST_CITY_MARKET_ID: integer (nullable = true)
 |-- DEST string: string (nullable = true)
 |-- DEST_CITY_NAME: string (nullable = true)
 |-- DEST_STATE_ABR: string (nullable = true)
 |-- DEST_STATE_FIPS: integer (nullable = true)
 |-- DEST_STATE_NM: string (nullable = true)
 |-- DEST_WAC: integer (nullable = true)
 |-- DEST_Miles: decimal (nullable = true)

For Hive

create table Destination(
        DEST_AIRPORT_ID int,
        DEST_AIRPORT_SEQ_ID int,
        DEST_CITY_MARKET_ID int,
        DEST string,
        DEST_CITY_NAME string,
        DEST_STATE_ABR string,
        DEST_STATE_FIPS string,
        DEST_STATE_NM string,
        DEST_WAC int,
        DEST_Miles Decimal(10,0)
      );
INSERT INTO TEST_DATA SELECT * FROM TESTING_data;

Let me know if you need any more information.

Thanks

score 2 · Accepted Answer

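In Hive V0.12, DECIMAL meant "a large floating point", just like NUMBER(38) in Oracle.

But later versions introduced a major change: DECIMAL without any precision/scale specification now means "a large integer", just like NUMBER(10,0) in Oracle.

Reference

Hive Language Manual / Data Types
A lengthy PDF document labeled "Hive Decimal Precision/Scale Support", somewhere on cwiki.apache.org

Bottom line: you have to explicitly define how many digits you want, which is exactly what the ANSI SQL standard has expected for decades. For example, DECIMAL(15,3) holds 12 digits in the integer part plus 3 digits in the fractional part (that is, 15 digits in total, of which the last 3 follow the decimal point).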
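
To make the precision/scale arithmetic concrete, here is a minimal sketch using only Python's standard `decimal` module. The helper `to_sql_decimal` is hypothetical (it is not part of Spark or Hive) and only mimics how a DECIMAL(precision, scale) column rounds a value; raising an error on overflow is an assumption made for illustration, and a real engine may return NULL instead.

```python
from decimal import Decimal, ROUND_HALF_UP

def to_sql_decimal(value: str, precision: int, scale: int) -> Decimal:
    """Hypothetical helper: round `value` to `scale` fractional digits and
    reject it if it needs more than `precision` digits in total."""
    quantum = Decimal(1).scaleb(-scale)  # scale=3 -> Decimal("0.001")
    rounded = Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)
    if len(rounded.as_tuple().digits) > precision:
        raise OverflowError(f"{value} does not fit DECIMAL({precision},{scale})")
    return rounded

values = ["100.23", "20.369", "19.23"]  # column3 from the question

# DECIMAL(10,0): no digits after the decimal point survive.
as_default = [to_sql_decimal(v, 10, 0) for v in values]
print(as_default)  # [Decimal('100'), Decimal('20'), Decimal('19')]

# DECIMAL(15,3): up to 12 integer digits plus 3 fractional digits.
as_15_3 = [to_sql_decimal(v, 15, 3) for v in values]
print(as_15_3)  # [Decimal('100.230'), Decimal('20.369'), Decimal('19.230')]
```

The first result matches the truncated output shown in the question, which suggests that the column was stored with a scale of zero.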

score 1 · Accepted Answer

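The Decimal type in both Spark and Hive defaults to a precision of 10 and a scale of zero. This means that if you do not specify a scale, there will be no digits after the decimal point.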

score 0 · Accepted Answer

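The file uses a different delimiter (a tab, I think), but you are reading it with ",".

Yes, the column is read as a string, but you should not lose any data. Try this: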

>>> lines = spark.read.options( delimiter='\t', header='true').csv("/home/kiran/km/km_hadoop/data/data_tab_sep")
>>> lines.show()
+-------+-------+-------+
|Column1|column2|column3|
+-------+-------+-------+
|  steve|    100| 100.23|
| ronald|    500| 20.369|
|  maria|    600|  19.23|
+-------+-------+-------+

>>> lines.printSchema()
root
 |-- Column1: string (nullable = true)
 |-- column2: string (nullable = true)
 |-- column3: string (nullable = true)

You can cast to DoubleType as shown below. (Note: in your case you do not need it, because you are writing to the file system.)

>>> from pyspark.sql.types import DoubleType
>>> lines.select(lines["column1"], lines["column2"], lines["column3"].cast(DoubleType())).printSchema()
root
 |-- column1: string (nullable = true)
 |-- column2: string (nullable = true)
 |-- column3: double (nullable = true)

score 0 · Accepted Answer

I ran into the same problem when reading data from Oracle, and I was able to fix it by casting:

joinedDF.col("START_EPOCH_TIME").cast("string")
