  • apache-zeppelin 0.6
  • apache-spark 1.6
  • SQL

I am trying to find the 20 most frequent words in a set of tweets. filtered contains an array of the words of each tweet. The following:

select explode(filtered) AS words from tweettable 

lists each word as you would expect, but what I want is a count of each word across all tweets, and then the top 20 of those displayed. The following works, but I need to do it in SQL:

df.select(explode($"filtered").as("value"))
  .groupBy("value")
  .count()
  .sort(desc("count"))
  .show(20, false)

I tried GROUP BY words, filtered, and explode(filtered), but all gave errors.


2 Answers


You can use a subquery in the FROM clause:

SELECT value, count(*) AS count
FROM (SELECT explode(filtered) AS value
      FROM tweettable) AS temp
GROUP BY value
ORDER BY count DESC
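
If you are running this from Spark code rather than a pure SQL paragraph, a minimal sketch (assuming Spark 1.6, an existing sqlContext, and a DataFrame df with the filtered array column — register it as a temp table first) might look like the following; note the LIMIT 20, which caps the result at the top 20 the question asks for:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Assumption: df is the DataFrame holding the `filtered` column.
df.registerTempTable("tweettable")

sqlContext.sql("""
  SELECT value, count(*) AS count
  FROM (SELECT explode(filtered) AS value
        FROM tweettable) AS temp
  GROUP BY value
  ORDER BY count DESC
  LIMIT 20
""").show(false)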
Answered 2017-04-16T09:22:49.590

The following code gives you the complete picture of how to achieve what you want. Tested on Spark 1.6.

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.implicits._

val lst = List(Seq("Hello","Hyd","Hello","Mumbai"),Seq("Hello","Mumbai"),Seq("Hello","Delhi","Hello","Banglore"))
case class Tweets(filtered: Seq[String])
val df = sc.parallelize(lst).map(x=>Tweets(x)).toDF 

import org.apache.spark.sql.functions.{explode}
import org.apache.spark.sql.functions.count
df.select(explode($"filtered").as("value"))
  .groupBy("value")
  .agg(count("*").alias("cnt"))
  .orderBy('cnt.desc)
  .show(20, false)

Alternatively, you can use a window function.

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.implicits._

val lst = List(Seq("Hello","Hyd","Hello","Mumbai"),Seq("Hello","Mumbai"),Seq("Hello","Delhi","Hello","Banglore"))
case class Tweets(filtered: Seq[String])
val df = sc.parallelize(lst).map(x=>Tweets(x)).toDF 

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val w = Window.orderBy('cnt.desc)

df.select(explode($"filtered").as("value"))
  .groupBy("value")
  .agg(count("*").alias("cnt"))
  .withColumn("filteredrank", rank.over(w))
  .filter(col("filteredrank") <= 20)
  .show()
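
One caveat worth flagging: rank returns the same rank for tied counts, so on ties this can yield more than 20 rows. If exactly 20 rows are wanted regardless of ties, row_number is one option — a sketch, reusing the same df, window w, and imports as above:

// row_number assigns a distinct, consecutive number even to tied counts,
// so the filter caps the output at exactly 20 rows.
df.select(explode($"filtered").as("value"))
  .groupBy("value")
  .agg(count("*").alias("cnt"))
  .withColumn("filteredrank", row_number().over(w))
  .filter(col("filteredrank") <= 20)
  .show()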
Answered 2017-04-16T12:26:42.250