“databricks”的相关标签问题

0 投票

2 回答

9664 浏览

date - 使用 Spark SQL 从 ISO 8601 解析日期时间

想要这样做，但反过来。

我date的 s 是这种格式YYYY-MM-DDThh:mm:ss，我想要两列YYYY-MM-DD，hh:mm如果我愿意，我可以连接某些查询。

使用时出现错误convert()；我认为 Spark SQL 目前不支持此功能。

当我使用date(datetime)ortimestamp(datetime)时，我会返回所有空值。然而，minute(datetime)和hour(datetime)工作。

目前，使用这个

这显然效率不高。

2016-08-23T18:45:55.993

0 投票

1 回答

1113 浏览

scala - spark：模式更改——如果存在，则转换和过滤列上的数据框；如果没有就不要

我使用的数据发生了架构更改。对于结果数据帧，它结合了新旧数据，我想转换和过滤的列曾经不存在于旧数据中。它不是由“null”填充的。我想尽可能地对该列进行转换和过滤，即每当列退出时，我想对其进行转换和过滤；对于没有此类列的早期数据，我将保留每一行。

问题是以下代码会导致java.lang.NullPointerException，因为早期的数据没有“ip”列。

上面的“firstIp”函数只是一个从数组中获取第一个IP地址的udf；它由定义val firstIp = udf[String, String](_.split(",")(0))。我不想按模式将数据分成两部分——那些有“ip”列的和没有的……但是如果不这样分割数据，我的目标可以实现吗？

scala apache-spark apache-spark-sql spark-dataframe databricks

2016-08-26T16:20:23.097

0 投票

3 回答

63694 浏览

scala - 在 Spark 数据框中爆炸嵌套结构

我正在研究一个 Databricks 示例。数据框的架构如下所示：

在示例中，他们展示了如何将员工列分解为 4 个附加列：

我将如何对部门列做类似的事情（即向数据框中添加两个名为“id”和“name”的附加列）？这些方法并不完全相同，我只能弄清楚如何使用以下方法创建一个全新的数据框：

如果我尝试：

我收到警告和错误：

scala apache-spark apache-spark-sql distributed-computing databricks

2016-09-01T15:41:19.603

0 投票

0 回答

106 浏览

scala - 库不在 intellij 中的 sbt 中解析，而是从命令行解析和编译

下面的 sbt 文件不能从 intelliJ Idea 中解析 spark-xml databricks 包，其中 as 可以在命令行中正常工作

sbt 设置为 bundled 和 other 都指向本地安装的 sbt，但无论哪种方式都不起作用。

下面的包从命令行解析并完美运行

scala intellij-idea apache-spark spark-dataframe databricks

2016-09-03T08:41:35.027

0 投票

1 回答

213 浏览

apache-spark - spark是否在单个工作人员中处理大文件

当使用 Apache Spark 处理一个大文件时，例如，sc.textFile("somefile.xml")它是否将其拆分为跨执行器的并行处理，或者将其作为单个执行器中的单个块处理？
使用数据框时implicit XMLContext，Databricks 是否为此类大型数据集处理预先构建了任何优化？

apache-spark spark-dataframe databricks

2016-09-03T09:21:53.890

0 投票

1 回答

1477 浏览

scala - Writing Spark RDD as text file to S3 bucket

I'm attempting to save a Spark RDD as a gzipped text file (or several text files) to an S3 bucket. The S3 bucket is mounted to dbfs. I'm trying to save the file using the following:

But when trying this, I keep getting the error:

However, I do see a few files written to the S3 bucket. I've also tried using rddDataset.repartition(1).saveAsTextFile("/mnt/mymount/myfolder/"), as advised here, but this ended in the same error.

This appears to be similar to this question, so maybe the errors are due to null values in my RDD? But when I try val newRDD = rddDataset.map(line => line).filter(x => x!= null).filter(x => x!=" ").filter(x => x!="") and try to save this RDD, I get the same error.

Additionally, rddDataset.count() throws a similar error. I'm creating rddDataset from a dataframe, which displays all its rows just fine. However, I can reproduce the java.lang.NullPointerException if I convert my original dataframe to an RDD:

I've provided one of the stack traces below:

Also, when I open up the information tab for the stage after running rddDataset.repartition(200).saveAsTextFile(/mnt/mymount/myfolder)I can find the error details:

scala apache-spark rdd spark-dataframe databricks

2016-09-04T22:22:17.420

0 投票

1 回答

1431 浏览

apache-spark - Apache Spark 无法查看输出

我刚刚开始学习 Apache Spark。我正在尝试打印链接的输出，但由于某种原因它没有显示。我也尝试过 links.collect() ， display(links) 但它们都不起作用。任何帮助，将不胜感激。

第二个图像的完整堆栈跟踪：

apache-spark pyspark databricks

2016-09-05T15:45:15.713

0 投票

0 回答

585 浏览

scala - 在 apache spark 中处理 DateType 的空值

无法处理 apache spark 中日期列的空值。我尝试将 null 替换为空字符串以及 null 值。

我得到的错误是 java.text.ParseException: Unparseable date: ""

我正在使用 csv-spark，databricks:spark-redshift_2.11 2.0.1

代码

var originalDataFrame = sqlContext.load( "com.databricks.spark.csv", schema = sourceSchema, Map("path" -> filePath, "header" -> "false", "codec"->"org.apache.hadoop.io.compress.GzipCodec", "delimiter"->"|", "dateFormat" -> dateFormat, "nullValue"->"" ))

提前致谢。

scala apache-spark amazon-redshift databricks

2016-09-09T10:03:30.460

0 投票

1 回答

393 浏览

apache-spark - Spark 2.0 - Databricks xml阅读器输入路径不存在

我正在尝试使用 Databricks XML 文件阅读器 api。

示例代码：

如果我直接给出文件路径，它会寻找一些仓库目录。所以我设置了spark.sql.warehouse.dir选项，但现在它抛出输入路径不存在。

其实是在项目根目录下找的，为什么要找项目根目录呢？

apache-spark apache-spark-sql databricks

2016-09-17T16:04:54.837

0 投票

1 回答

1750 浏览

apache-spark - Spark com.databricks.spark.csv 无法使用 node-snappy 加载 snappy 压缩文件

我在 S3 上有一些使用 snappy 压缩算法（使用node-snappy包）压缩的 csv 文件。我喜欢使用 spark 处理这些文件，com.databricks.spark.csv但我一直收到无效的文件输入错误。

代码：

错误信息：

16/09/24 21:57:25 WARN TaskSetManager：在阶段 0.0 中丢失任务 0.0（TID 0，ip-10-0-32-5.ec2.internal）：java.lang.InternalError：无法解压缩数据。输入无效。在 org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompress(SnappyDecompressor.java:239) 在 org.apache.hadoop 的 org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompressBytesDirect(Native Method)。 io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:88) at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85) at java.io.InputStream.read(InputStream.java:101)在 org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180) 在 org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216) 在 org.apache.hadoop.util.LineReader.readLine （LineReader.java：

apache-spark pyspark snappy databricks apache-spark-2.0

2016-09-24T22:24:08.550

问题标签 [databricks]

Reference