windows - winutils spark windows installation env_variable

Question

I am trying to install Spark 1.6.1 on windows 10 and so far I have done the following...

Downloaded spark 1.6.1, unpacked to some directory and then set SPARK_HOME
Downloaded scala 2.11.8, unpacked to some directory and then set SCALA_HOME
Set the _JAVA_OPTION env variable
Downloaded the winutils from https://github.com/steveloughran/winutils.git by just downloading the zip directory and then set HADOOP_HOME env variable. (Not sure if this was incorrect, I could not clone the directory because of permission denied).

When I go to spark home and run bin\spark-shell I get

'C:\Program' is not recognized as an internal or external command, operable program or batch file.

I must be missing something, I don't see how I could be running the bash scripts anyway from windows environment. But hopefully I don't need to understand just to get this working. I have been following this guy's tutorial - https://hernandezpaul.wordpress.com/2016/01/24/apache-spark-installation-on-windows-10/ . Any help would be appreciated.

score 8 · Accepted Answer

您需要下载 winutils 可执行文件，而不是源代码。

你可以在这里下载它，或者如果你真的想要整个 Hadoop 发行版，你可以在这里找到 2.6.0 二进制文件。然后，您需要设置HADOOP_HOME到包含 winutils.exe 的目录。

另外，请确保您放置 Spark 的目录是不包含空格的目录，这非常重要，否则将无法正常工作。

设置完成后，您无需开始spark-shell.sh，而是开始spark-shell.cmd：

C:\Spark\bin>spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
16/05/18 19:31:56 WARN General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/Spark/lib/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/Spark/bin/../lib/datanucleus-core-3.2.10.jar."
16/05/18 19:31:56 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/Spark/lib/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/Spark/bin/../lib/datanucleus-api-jdo-3.2.6.jar."
16/05/18 19:31:56 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/Spark/lib/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/Spark/bin/../lib/datanucleus-rdbms-3.2.9.jar."
16/05/18 19:31:56 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/05/18 19:31:56 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/05/18 19:32:01 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/05/18 19:32:01 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/05/18 19:32:07 WARN General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/Spark/lib/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/Spark/bin/../lib/datanucleus-core-3.2.10.jar."
16/05/18 19:32:07 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/Spark/lib/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/Spark/bin/../lib/datanucleus-api-jdo-3.2.6.jar."
16/05/18 19:32:07 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/Spark/lib/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/Spark/bin/../lib/datanucleus-rdbms-3.2.9.jar."
16/05/18 19:32:07 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/05/18 19:32:08 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/05/18 19:32:12 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/05/18 19:32:12 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
SQL context available as sqlContext.

scala>

score 0 · Accepted Answer

在 Windows 上，您需要明确指定 hadoop 二进制文件的位置。

以下是设置 spark-scala 独立应用程序的步骤。

下载 winutil.exe 并将其放在 bin 文件夹下的某个文件夹/目录中，例如 c:\hadoop\bin

完整路径类似于 c:\hadoop\bin\winutil.exe

现在在创建 sparkSession 时，我们需要指定此路径。参考下面的代码片段：

包 com.test.config

 import org.apache.spark.sql.SparkSession

 object Spark2Config extends Serializable{
            System.setProperty("hadoop.home.dir", "C:\\hadoop")
            val spark = SparkSession.builder().appName("app_name").master("local").getOrCreate()       
 }

windows - winutils spark windows installation env_variable

2 回答 2

Related

Reference