Questions tagged [spark-dataframe]

3700 questions

0 votes

1 answer

17154 views

scala - Replacing null values with 0 after spark dataframe left outer join

I have two dataframes called left and right.

Then, I join them to get the joined Dataframe. It is a left outer join. Anyone interested in the natjoin function can find it here.

https://gist.github.com/anonymous/f02bd79528ac75f57ae8

Since it is a left outer join, the real_labelVal column has nulls when user_uid is not present in right.

I want to replace the null values in the real_labelVal column with 1.0.

Currently I do the following:

I find the index of the real_labelVal column and use the spark.sql.Row API to set the nulls to 1.0. (This gives me an RDD[Row].)
Then I apply the schema of the joined dataframe to get the cleaned dataframe.

The code is as follows:

Is there an elegant or efficient way to do this?

Googling hasn't helped much. Thanks in advance.
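One approach that may avoid the Row-level rewrite is `DataFrameNaFunctions.fill`, reached through `df.na`. The sketch below is only an illustration: the sample data is hypothetical, it uses a plain column-name join instead of the `natjoin` helper, and it assumes a Spark 2.x or later `SparkSession` (in spark-shell, `spark` already exists).

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; getOrCreate reuses an existing one if present
val spark = SparkSession.builder().master("local[*]").appName("fill-nulls").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for the `left` and `right` dataframes
val left = Seq("u1", "u2", "u3").toDF("user_uid")
val right = Seq(("u1", 0.0), ("u3", 0.5)).toDF("user_uid", "real_labelVal")

// Left outer join on the shared key; u2 has no match, so its real_labelVal is null
val joined = left.join(right, Seq("user_uid"), "left_outer")

// Replace nulls in real_labelVal only, leaving every other column untouched
val cleaned = joined.na.fill(1.0, Seq("real_labelVal"))

cleaned.show()
```

Because `fill` is restricted to the named column, nulls elsewhere in the joined dataframe are preserved, and the schema does not need to be reapplied.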

2015-08-04T01:04:08.573

0 votes

0 answers

1190 views

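apache-spark - Spark Dataframe - Best way to cogroup dataframes

I currently use the databricks library to load CSV files into Dataframes.

I am looking for the best general way to cogroup the loaded dataframes on a specific key, since the cogroup operation is only available on PairRDDs.

I found this post, which implements cogroup functionality for Dataframes, but I imagine there are other approaches:

https://gist.github.com/ahoy-jon/b65754cde98cc48b9b38

Have you ever run into this situation?

Thanks.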
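For comparison, a minimal sketch of the plain PairRDD route: each dataframe is keyed through its underlying `RDD[Row]` and the two are then cogrouped. This is not the implementation from the linked gist; the dataframes, the `id` key column, and the other column names are hypothetical, and a Spark 2.x or later `SparkSession` is assumed.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; getOrCreate reuses an existing one if present
val spark = SparkSession.builder().master("local[*]").appName("cogroup-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data standing in for two dataframes loaded from CSV
val users = Seq((1, "ann"), (2, "bob")).toDF("id", "name")
val orders = Seq((1, "book"), (1, "pen"), (3, "cup")).toDF("id", "item")

// keyBy turns each dataframe's RDD[Row] into a pair RDD keyed by `id`
val usersByKey = users.rdd.keyBy(_.getAs[Int]("id"))
val ordersByKey = orders.rdd.keyBy(_.getAs[Int]("id"))

// cogroup yields, per key, the rows from each side (either side may be empty)
val cogrouped = usersByKey.cogroup(ordersByKey)

cogrouped.collect().foreach { case (id, (us, os)) =>
  println(s"$id -> users=${us.size}, orders=${os.size}")
}
```

The result is an `RDD[(Int, (Iterable[Row], Iterable[Row]))]`, so anything downstream works on rows rather than on dataframe columns.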

apache-spark spark-dataframe

2015-08-04T10:11:49.223

0 votes

1 answer

13942 views

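apache-spark - Grouping a dataframe into a list

I am trying to do some analysis on sets. I have a sample dataset that looks like this:

orders.json

It is just a single field, which is a list of numbers representing IDs.

Here is the Spark script I am trying to run:

In short, creating expanded and grouped works fine. expanded is a list of all possible sets of two IDs where the two IDs were in the same original set. grouped filters out IDs that were matched with themselves, then groups together all the unique pairs of IDs and produces a count for each. The schema and a data sample of grouped are:

So, my question is: how do I now group on the first item in each result so that I have a list of tuples? For the sample data above, I would expect something similar to this:

As you can see in my script with recs, I thought you would start by doing a groupBy on "item1", which is the first item in each row. But after that you are left with a GroupedData object that supports only very limited operations. Really, you can only do aggregations such as sum, avg, etc. I just want to list the tuples from each result.

I could easily use RDD functions at this point, but that departs from using Dataframes. Is there a way to do this with the dataframe functions?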
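In later Spark releases, one way to stay within the dataframe API is the `collect_list` aggregate combined with `struct`. The sketch below is an assumption-laden illustration: the question's schema is not shown, so the `item2` and `count` column names and the sample rows are hypothetical, and a Spark 2.x or later `SparkSession` is assumed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

// Assumption: a local session; getOrCreate reuses an existing one if present
val spark = SparkSession.builder().master("local[*]").appName("group-to-list").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for `grouped`: one row per ID pair with its count
val grouped = Seq((1, 2, 3L), (1, 3, 1L), (2, 1, 3L)).toDF("item1", "item2", "count")

// Pack (item2, count) into a struct, then gather the structs per item1
val recs = grouped
  .groupBy("item1")
  .agg(collect_list(struct($"item2", $"count")).as("pairs"))

recs.show(false)
```

Each output row holds one `item1` value and an array of `(item2, count)` structs; the order of elements inside the array is not guaranteed.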

apache-spark dataframe apache-spark-sql spark-dataframe

2015-08-06T20:00:58.883

0 votes

1 answer

1504 views

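java - How to create a dataframe from a collection of Rows?

I want to manually create a dataframe from a parsed RDD of strings. I already have my StructType, and I can create Rows from RowFactory.create(StructType[]). I see a method called sqlContext.createDataFrame(RDD<Row>, StructType) that takes an RDD and a StructType. So how do I turn my Row objects into an RDD?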
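A minimal sketch of the usual flow, written in Scala: a local collection of `Row` objects is distributed with `parallelize`, and the resulting RDD is passed to `createDataFrame` together with the schema. The schema and rows are hypothetical, and a Spark 2.x or later `SparkSession` is assumed in place of `sqlContext`. The Java API exposes the analogous `JavaSparkContext.parallelize(List<Row>)` and `createDataFrame(JavaRDD<Row>, StructType)`.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Assumption: a local session; getOrCreate reuses an existing one if present
val spark = SparkSession.builder().master("local[*]").appName("rows-to-df").getOrCreate()

// Hypothetical schema and rows; each Row must match the schema field by field
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))
val rows = Seq(Row(1, "ann"), Row(2, "bob"))

// parallelize distributes the local collection as an RDD[Row]
val rowRdd = spark.sparkContext.parallelize(rows)

// Attach the schema to the RDD[Row] to obtain a dataframe
val df = spark.createDataFrame(rowRdd, schema)
df.show()
```

If the rows already come from a transformation such as `map` over an existing RDD of strings, that RDD can be passed to `createDataFrame` directly and the `parallelize` step is unnecessary.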

java apache-spark spark-dataframe

2015-08-10T18:43:40.983

0 votes

8 answers

91148 views