Questions tagged "google-hadoop"

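0 votes

1 answer

1600 views

apache-spark - BigQuery connector for pyspark via Hadoop Input Format example

I have a large dataset stored in a BigQuery table, and I would like to load it into a pyspark RDD for ETL data processing.

I realized that BigQuery supports the Hadoop Input / Output format

https://cloud.google.com/hadoop/writing-with-bigquery-connector

and pyspark should be able to use this interface to create an RDD through the "newAPIHadoopRDD" method.

http://spark.apache.org/docs/latest/api/python/pyspark.html

Unfortunately, the documentation on both ends is scarce and goes beyond my knowledge of Hadoop/Spark/BigQuery. Has anybody figured out how to do this?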
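A minimal sketch of what such a call might look like, assuming the BigQuery connector jar is on the cluster's classpath. The property names and Java class names below are recalled from the connector's documentation and should be checked against the installed version; the helper names `bigquery_input_conf` and `load_bigquery_table` are hypothetical.

```python
def bigquery_input_conf(project_id, dataset_id, table_id, gcs_bucket, temp_gcs_path):
    """Build the Hadoop configuration dictionary passed to newAPIHadoopRDD.

    Assumption: these keys match the BigQuery connector version in use.
    """
    return {
        # Project that is billed for the export and the GCS staging location.
        "mapred.bq.project.id": project_id,
        "mapred.bq.gcs.bucket": gcs_bucket,
        "mapred.bq.temp.gcs.path": temp_gcs_path,
        # Fully qualified input table.
        "mapred.bq.input.project.id": project_id,
        "mapred.bq.input.dataset.id": dataset_id,
        "mapred.bq.input.table.id": table_id,
    }


def load_bigquery_table(sc, conf):
    """Create an RDD of (key, value) pairs from BigQuery.

    `sc` is an existing pyspark SparkContext. Assumption: each value is a
    JSON string for one table row, to be decoded with json.loads downstream.
    """
    return sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf,
    )
```

With a live SparkContext, `load_bigquery_table(sc, conf).map(lambda kv: json.loads(kv[1]))` would then yield one dictionary per row, provided the assumptions above hold.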

2015-07-14T08:11:27.803

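0 votes

1 answer

118 views

apache-spark - Spark 1.4 image for Google Cloud?

With bdutil, the latest tarball version I can find is Spark 1.3.1:

gs://spark-dist/spark-1.3.1-bin-hadoop2.6.tgz

There are a few new DataFrame features in Spark 1.4 that I want to use. Is there any chance a Spark 1.4 image will be available for bdutil, or is there a workaround?

UPDATE:

Following the suggestion from Angus Davis, I downloaded and pointed to spark-1.4.1-bin-hadoop2.6.tgz, and the deployment went well; however, I run into an error when calling SqlContext.parquetFile(). I cannot explain why this exception is possible, since GoogleHadoopFileSystem should be a subclass of org.apache.hadoop.fs.FileSystem. I will continue to investigate this.

I asked a separate question here about the exception

UPDATE:

The error turned out to be a Spark defect; a resolution/workaround is provided in the question above.

Thanks!

Haiying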

apache-spark google-hadoop apache-spark-1.4

2015-07-16T23:27:56.667

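0 votes

1 answer

2315 views

apache-spark - GoogleHadoopFileSystem cannot be cast to hadoop FileSystem?

The original question was about trying to deploy Spark 1.4 on Google Cloud. After downloading and setting

the deployment with bdutil went fine; however, when trying to call SqlContext.parquetFile("gs://my_bucket/some_data.parquet"), it runs into the following exception:

What confuses me is that GoogleHadoopFileSystem should be a subclass of org.apache.hadoop.fs.FileSystem, and I even verified this in the same spark-shell instance:

What am I missing, and is there a workaround? Thanks in advance!

UPDATE: here are my bdutil (version 1.3.1) deployment settings: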

apache-spark google-hadoop

2015-07-17T15:07:30.880

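0 votes

1 answer

3363 views

scala - Spark - Unable to read files from Google Cloud Storage when configuring the gcs connector manually

I have a Spark cluster deployed on Google Cloud using bdutil. I installed a GUI on my driver instance to be able to run IntelliJ from it, so that I can try to run my Spark processes in interactive mode.

The first issue I faced was that spark-env.sh and core-site.xml were not used at all when running from IntelliJ. I finally managed to set the configuration manually in Scala by copying the values from the config files. Is there a way to avoid this?

The last thing that is not working is that, even though the gcs connector seems to "see" the folder I set as the source, I get a java.io.EOFException every time it tries to read an actual file in that folder.

Here is my test code:

The output I get after running it:

What am I missing? Thanks in advance for your help!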
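One way to avoid copying values by hand might be to read every property out of the existing core-site.xml and apply it programmatically. The sketch below is written in Python rather than the Scala used in the post, and it relies on Spark forwarding keys prefixed with `spark.hadoop.` to the Hadoop configuration. The function names are hypothetical, and whether this fixes the IntelliJ setup described here is an assumption.

```python
import xml.etree.ElementTree as ET


def read_hadoop_properties(xml_text):
    """Parse the <property> entries of a Hadoop *-site.xml document into a dict."""
    root = ET.fromstring(xml_text)
    properties = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        # Skip malformed entries that lack a name or a value.
        if name is not None and value is not None:
            properties[name.strip()] = value.strip()
    return properties


def apply_to_spark_conf(spark_conf, properties):
    """Copy each property onto a SparkConf-like object.

    Keys are prefixed with "spark.hadoop." so that Spark passes them on
    to the Hadoop Configuration it creates.
    """
    for name, value in properties.items():
        spark_conf.set("spark.hadoop." + name, value)
    return spark_conf
```

In practice the XML text would come from the cluster's own core-site.xml, so the IDE run and the deployed cluster share one source of configuration.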

scala intellij-idea apache-spark google-hadoop

2015-07-27T12:05:40.233

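0 votes

3 answers

318 views

hadoop - Job tracking URL in Google Compute Engine not working

I am using Google Compute Engine to run MapReduce jobs on Hadoop (pretty much all default configurations). While running a job I get a tracking URL of the form http://PROJECT_NAME:8088/proxy/application_X_Y/ but it fails to open. Did I forget to configure something?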
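One possible cause, which the post does not confirm, is that the host name in the tracking URL only resolves inside the cluster's network. If port 8088 on the master were forwarded to the local machine (for example through an SSH tunnel, which is an assumption here), the URL could be rewritten as in this sketch; `tracking_url_via_tunnel` is a hypothetical helper.

```python
from urllib.parse import urlsplit, urlunsplit


def tracking_url_via_tunnel(tracking_url, local_port=8088):
    """Point a YARN tracking URL at a locally forwarded port.

    Only the host and port are replaced; the path, query and fragment
    are kept unchanged.
    """
    parts = urlsplit(tracking_url)
    local_netloc = "localhost:%d" % local_port
    return urlunsplit(
        (parts.scheme, local_netloc, parts.path, parts.query, parts.fragment)
    )
```

Links inside the proxied page may still point at cluster-internal host names, so this covers only the initial URL.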

hadoop mapreduce google-compute-engine google-hadoop

2015-07-28T16:41:58.020

0 votes

1 answer

1248 views

hadoop - Hive cross join fails on local map join

Is there a direct way to address the following error or overall a better way to use Hive to get the join that I need? Output to a stored table isn't a requirement as I can be content with an INSERT OVERWRITE LOCAL DIRECTORY to a csv.

I am trying to perform the following cross join. ipint is a 9GB table, and geoiplite is 270MB.

I use CROSS JOIN on ipintegers instead of geoiplite because I have read that the rule is for the smaller table to be on the left, larger on the right.

Map and Reduce stages complete to 100% according to HIVE, but then

2015-08-01 04:45:36,947 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8767.09 sec

MapReduce Total cumulative CPU time: 0 days 2 hours 26 minutes 7 seconds 90 msec

Ended Job = job_201508010407_0001

Stage-8 is selected by condition resolver.

Execution log at: /tmp/myuser/.log

2015-08-01 04:45:38 Starting to launch local task to process map join; maximum memory = 12221153280

Execution failed with exit status: 3

Obtaining error information

Task failed!

Task ID: Stage-8

Logs:

/tmp/myuser/hive.log

FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask

MapReduce Jobs Launched: Job 0: Map: 38 Reduce: 1 Cumulative CPU: 8767.09 sec
HDFS Read: 9438495086 HDFS Write: 8575548486 SUCCESS

My hive config:

I have varied SET hive.auto.convert.join between true and false but with the same result.

Here are the errors in the output log from /tmp/myuser/hive.log

I am running the hive client on the Master, a Google Cloud Platform instance of type n1-highmem-8 (8 CPUs, 52 GB), and the workers are n1-highmem-4 (4 CPUs, 26 GB), but I suspect that after MAP and REDUCE a local join (as implied) takes place on the Master. Regardless, in bdutil I configured the JAVAOPTS for the worker nodes (n1-highmem-4) to: n1-highmem-4

SOLUTION EDIT: The solution is to organize the range data into a range tree.
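A minimal sketch of the idea, assuming the geoiplite rows are inclusive, non-overlapping integer ranges. It uses binary search over a sorted list, a simpler stand-in for a full range tree, and the class name `RangeIndex` and the `(start, end, label)` tuple layout are hypothetical.

```python
import bisect


class RangeIndex:
    """Look up which inclusive [start, end] range contains a value."""

    def __init__(self, ranges):
        # ranges: iterable of (start, end, label); assumed non-overlapping.
        self._ranges = sorted(ranges)
        self._starts = [start for start, _, _ in self._ranges]

    def lookup(self, value):
        # Find the last range whose start is <= value.
        i = bisect.bisect_right(self._starts, value) - 1
        if i < 0:
            return None
        _, end, label = self._ranges[i]
        # The value matches only if it also falls before the range's end.
        return label if value <= end else None
```

Each lookup costs O(log n) in the number of ranges, so every IP integer can be resolved without comparing it against every range as a cross join would.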

hadoop join hive cross-join google-hadoop

2015-08-01T17:39:26.433

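0 votes

2 answers

1578 views

apache-spark - Rate limit with the Apache Spark GCS connector

I'm using Spark on a Google Compute Engine cluster with the Google Cloud Storage connector (instead of HDFS, as recommended), and I get a lot of "rate limit" errors, as follows:

Does anyone know of a solution?
Is there a way to control the read/write rate of Spark?
Is there a way to increase the rate limit for my Google project?
Is there a way to use the local hard disk for temp files that don't have to be shared with other workers?

Thanks!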

apache-spark google-cloud-storage google-cloud-platform pyspark google-hadoop

2015-08-06T08:57:23.213

0 votes

2 answers

836 views

hadoop - Hive INSERT OVERWRITE to Google Storage as LOCAL DIRECTORY not working

I use the following Hive query:

I get the following error:

What could I be doing wrong?

hadoop hive google-cloud-storage google-hadoop

2015-09-25T09:16:04.580

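0 votes

1 answer

84 views

hadoop - How to change the default bucket after creating a Google Cloud based Hadoop-enabled cluster?

After creating a Google Cloud based Hadoop-enabled cluster, I want to change the default bucket to another one. How can I do that? I cannot find the answer in the Google Cloud documentation. Thanks!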

hadoop google-cloud-platform google-hadoop

2015-10-27T17:15:59.160

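0 votes

1 answer

2129 views

google-cloud-dataproc - Accessing Google Cloud Storage using the Hadoop FileSystem API

On my machine, I've configured the Hadoop core-site.xml to recognize the gs:// scheme and added gcs-connector-1.2.8.jar as a Hadoop library. I can run hadoop fs -ls gs://mybucket/ and get the expected results. However, if I try to do the equivalent from Java using:

I get the files under the root of my local HDFS rather than those in gs://mybucket/, but with the file names prefixed with gs://mybucket. If I modify the conf with conf.set("fs.default.name", "gs://mybucket"); before getting the fs, then I can see the files on GCS.

My questions are:
1. Is this the expected behavior?
2. Is there any disadvantage to using this Hadoop FileSystem API as opposed to the Google Cloud Storage client API?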

google-cloud-dataproc google-hadoop

2015-11-06T01:02:30.707
