spark-dataframe - Calculating median and mean with a Spark 1.6 DataFrame on Hadoop: Failed to start database 'metastore_db'

Question

spark-shell --packages com.databricks:spark-csv_2.11:1.2.0

1. Using SQLContext
~~~~~~~~~~~~~~~~~~~~~
import org.apache.spark.sql.SQLContext
val sqlctx = new SQLContext(sc)
import sqlctx._

val df = sqlctx.read.format("com.databricks.spark.csv").option("inferSchema","true").option("delimiter",";").option("header","true").load("/user/cloudera/data.csv")
df.select(avg($"col1")).show() // this works fine
sqlctx.sql("select percentile_approx(balance,0.5) as median from port_bank_table").show()
// or
sqlctx.sql("select percentile(balance,0.5) as median from port_bank_table").show()
// neither of these works; both give the error below

org.apache.spark.sql.AnalysisException: undefined function percentile_approx; line 0 pos 0
  at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:65)
  at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:65)
Using HiveContext
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
So I tried using a HiveContext instead:

scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext

scala> val hivectx = new HiveContext(sc)
18/01/09 22:51:06 WARN metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException
hivectx: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@5be91161

scala> import hivectx._
import hivectx._

Getting the error below
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@be453c4, 
see the next exception for details.

score 0 · Accepted Answer

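I could not find any percentile_approx or percentile function among the Spark aggregate functions. It seems this functionality is not built into Spark DataFrames. For more information, see "How to calculate Percentile of column in a DataFrame in spark?". I hope it helps.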
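If no percentile function is available, one possible workaround for data small enough to fit on the driver is to compute the exact median by hand. The following is only a minimal sketch in plain Scala; the helper name `median` is hypothetical and is not part of Spark.

```scala
// Hypothetical helper (not a Spark API): exact median of a local collection.
// Returns None for empty input.
def median(values: Seq[Double]): Option[Double] = {
  if (values.isEmpty) None
  else {
    val sorted = values.sorted
    val mid = sorted.length / 2
    if (sorted.length % 2 == 1) Some(sorted(mid))        // odd count: middle element
    else Some((sorted(mid - 1) + sorted(mid)) / 2.0)     // even count: mean of the two middle elements
  }
}

println(median(Seq(3.0, 1.0, 2.0)))      // Some(2.0)
println(median(Seq(4.0, 1.0, 3.0, 2.0))) // Some(2.5)
```

In spark-shell this could presumably be fed from the DataFrame in the question, for example `median(df.select($"balance".cast("double")).rdd.map(_.getDouble(0)).collect())`, assuming the column contains no nulls and the collected values fit in driver memory.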

score 0 · Accepted Answer

I don't think so; it should work. To make it work, save the DataFrame as a table
using saveAsTable. You will then be able to run your query using
HiveContext.

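import org.apache.spark.sql.SaveMode

// Save the DataFrame as a Parquet-backed table, replacing any existing table
df.write.mode(SaveMode.Overwrite)
  .format("parquet")
  .saveAsTable("Table_name")

# In my case, "mode" works when written as mode("Overwrite")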

hivectx.sql("select avg(col1) as median from Table_name").show()

It will work.
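The query above computes an average. For the median the question asks about, a sketch of the same idea with percentile_approx might look as follows. It assumes a spark-shell session on Spark 1.6 in which `sc` exists and the HiveContext starts successfully, so it does not address the 'metastore_db' failure itself. The path, table name and column name are taken from the question, and a temporary table is used here in place of saveAsTable.

```scala
// Sketch for spark-shell on Spark 1.6, where `sc` is already defined.
import org.apache.spark.sql.hive.HiveContext

val hivectx = new HiveContext(sc)

// Read the CSV through the HiveContext so the resulting table is visible to it
// (path, options and column name are taken from the question).
val df = hivectx.read
  .format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("delimiter", ";")
  .option("header", "true")
  .load("/user/cloudera/data.csv")

// Register the DataFrame under the table name used in the question's query
df.registerTempTable("port_bank_table")

// percentile_approx is a Hive UDAF, so it is resolved by HiveContext
// rather than by a plain SQLContext
hivectx.sql(
  "select percentile_approx(cast(balance as double), 0.5) as median from port_bank_table"
).show()
```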
