r - Reading a text file in SparkR 1.4.0

Question

Does anyone know how to read a text file in SparkR version 1.4.0? Are there any Spark packages available for that?

score 3 · Accepted Answer

Spark 1.6+

You can use the text input format to read a text file as a DataFrame:

read.df(sqlContext=sqlContext, source="text", path="README.md")
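As a minimal sketch of what can be done with the result, assuming a local Spark 1.6.x installation and the 1.x entry points sparkR.init and sparkRSQL.init, the snippet below writes a small sample file, reads it with the text source, and counts all lines and the non-empty ones. The sample file contents are made up for illustration, and the column name value is an assumption about what the text source produces; printSchema shows the actual name on your version.

```r
library(SparkR)

# Assumption: Spark 1.6.x in local mode; these are the 1.x entry points.
sc <- sparkR.init(master = "local[*]")
sqlContext <- sparkRSQL.init(sc)

# Hypothetical sample file: three lines, one of them empty.
path <- tempfile(fileext = ".txt")
writeLines(c("# Apache Spark", "", "Spark is a fast engine"), path)

# Each line of the file becomes one row of a single string column.
df <- read.df(sqlContext, path = path, source = "text")
printSchema(df)

# Keep only non-empty lines (column name "value" assumed, see the schema).
nonEmpty <- filter(df, df$value != "")

# Collect the counts before shutting the session down.
nTotal <- count(df)
nNonEmpty <- count(nonEmpty)
print(c(total = nTotal, nonEmpty = nNonEmpty))

sparkR.stop()
```

Unlike the spark-csv route described for older versions, this keeps empty lines as rows, so they have to be filtered explicitly if they are not wanted.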

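Spark <= 1.5

The short answer is you don't. SparkR 1.4 has been almost completely stripped of the low-level API, leaving only a limited subset of data frame operations. As you can read on the old SparkR webpage: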

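As of April 2015, SparkR has been officially merged into Apache Spark and is shipping in an upcoming release (1.4). (...) The initial support for Spark in R will be focused on high level operations instead of low level ETL.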

Probably the closest thing is to load the text file using spark-csv:

> df <- read.df(sqlContext, "README.md", source = "com.databricks.spark.csv")
> showDF(limit(df, 5))
+--------------------+
|                  C0|
+--------------------+
|      # Apache Spark|
|Spark is a fast a...|
|high-level APIs i...|
|supports general ...|
|rich set of highe...|
+--------------------+

Since typical RDD operations like map, flatMap, reduce or filter are gone as well, it is probably what you want anyway.

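Now, the low-level API is still there underneath, so you can always do something like the snippet further down, but I doubt it is a good idea. The SparkR developers most likely had a good reason to make it private. To quote the ::: man page: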

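It is typically a design mistake to use ::: in your code since the corresponding object has probably been kept internal for a good reason. Consider contacting the package maintainer if you feel the need to access the object for anything but mere inspection.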

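Even if you are willing to ignore good coding practices, it is most likely not worth your time. The pre-1.4 low-level API is embarrassingly slow and clumsy, and without all the goodness of the Catalyst optimizer the internal 1.4 API is most likely no better.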

> rdd <- SparkR:::textFile(sc, 'README.md')
> counts <- SparkR:::map(rdd, nchar)
> SparkR:::take(counts, 3)

[[1]]
[1] 14

[[2]]
[1] 0

[[3]]
[1] 78

Note that spark-csv, unlike textFile, ignores empty lines.

score 0 · Accepted Answer

Please follow the link http://ampcamp.berkeley.edu/5/exercises/sparkr.html

We can simply use:

 textFile <- textFile(sc, "/home/cloudera/SparkR-pkg/README.md")

Looking at the SparkR code, context.R has a textFile method, so ideally SparkContext should expose a textFile API for creating an RDD, but it is missing from the docs.

# Create an RDD from a text file.
#
# This function reads a text file from HDFS, a local file system (available on all
# nodes), or any Hadoop-supported file system URI, and creates an
# RDD of strings from it.
#
# @param sc SparkContext to use
# @param path Path of file to read. A vector of multiple paths is allowed.
# @param minPartitions Minimum number of partitions to be created. If NULL, the default
#  value is chosen based on available parallelism.
# @return RDD where each item is of type \code{character}
# @export
# @examples
#\dontrun{
#  sc <- sparkR.init()
#  lines <- textFile(sc, "myfile.txt")
#}
textFile <- function(sc, path, minPartitions = NULL) {
  # Allow the user to have a more flexible definiton of the text file path
  path <- suppressWarnings(normalizePath(path))
  # Convert a string vector of paths to a string containing comma separated paths
  path <- paste(path, collapse = ",")

  jrdd <- callJMethod(sc, "textFile", path, getMinPartitions(sc, minPartitions))
  # jrdd is of type JavaRDD[String]
  RDD(jrdd, "string")
}

Source: https://github.com/apache/spark/blob/master/R/pkg/R/context.R

For test cases, see https://github.com/apache/spark/blob/master/R/pkg/inst/tests/test_rdd.R

score 0 · Accepted Answer

In fact, you can also use the databricks/spark-csv package to handle TSV files.

For example,

data <- read.df(sqlContext, "<path_to_tsv_file>", source = "com.databricks.spark.csv", delimiter = "\t")

Many more options are listed here: databricks-spark-csv#features
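A rough sketch of combining two of those options, delimiter and header, on a small tab-separated file. It assumes the sparkR shell was started with spark-csv on the classpath, so that sc and sqlContext already exist; the package version in the comment and the sample data are only illustrative.

```r
library(SparkR)

# Assumption: SparkR was launched with spark-csv available, for example
#   sparkR --packages com.databricks:spark-csv_2.10:1.0.3
# so `sc` and `sqlContext` are already defined by the shell.

# Hypothetical sample data: a header row followed by two records.
path <- tempfile(fileext = ".tsv")
writeLines(c("id\tname", "1\talpha", "2\tbeta"), path)

# header = "true" asks spark-csv to take column names from the first line;
# delimiter = "\t" switches the field separator from comma to tab.
data <- read.df(sqlContext, path, source = "com.databricks.spark.csv",
                delimiter = "\t", header = "true")

cols <- columns(data)
n <- count(data)
showDF(data)
```

Without the header option the first line would be treated as data and the columns would get generated names such as C0, as in the README.md example in the accepted answer.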
