python - Spark custom Hadoop configuration from Python (PySpark)?

Question

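I have a Python-based script that should run on an Apache Spark cluster.

I have a Hadoop MapReduce InputFormat as the data source for an RDD. No problem here.

The problem is that I would like to construct a custom Hadoop Configuration with additional resource files loaded and properties set. The intention is to use the modified Configuration inside the Python SparkContext.

I can build JVM code that constructs and loads the needed Hadoop Configuration. How do I attach it to Python using PySpark?

Does anybody know how all this can be achieved?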

score 0 · Accepted Answer

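I have solved this puzzle for my case by dropping the requirement to modify the Configuration online and relying only on a custom set of Hadoop configuration *.xml files.

First, I wrote a Java class that adds the configuration of the additional layers to the default resources of org.apache.hadoop.conf.Configuration. Its static initializer appends the default resources: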

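// Package name taken from the call in the Python script.
package com.wellcentive.nosql;

import org.apache.hadoop.conf.Configuration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Configurator {

    // Assumption: SLF4J is used for logging; the original post
    // did not show how LOG was declared.
    private static final Logger LOG = LoggerFactory.getLogger(Configurator.class);

    static {

        // Initialize the default configuration of the needed Hadoop
        // layers by loading the classes that register them.

        try {
            Class.forName("org.apache.hadoop.hdfs.DistributedFileSystem");
        } catch (ClassNotFoundException e) {
            LOG.error("Failed to initialize HDFS configuration layer.", e);
        }

        try {
            Class.forName("org.apache.hadoop.mapreduce.Cluster");
        } catch (ClassNotFoundException e) {
            LOG.error("Failed to initialize YARN/MapReduce configuration layer.", e);
        }

        // Do what HBase itself should do: add the default HBase
        // configuration to the default Hadoop resources.
        Configuration.addDefaultResource("hbase-default.xml");
        Configuration.addDefaultResource("hbase-site.xml");
    }

    // Just a 'callable' handle. It is static so it can be invoked on the
    // class itself; calling it forces the static initializer above to run.
    public static void init() {
    }

}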

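So now, if somebody just loads my Configurator, the following infrastructure configurations are searched for on the classpath: core, HDFS, MapReduce, YARN, HBase. The corresponding files are core-default.xml, core-site.xml, hdfs-default.xml, hdfs-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hbase-default.xml, hbase-site.xml. If I need additional layers, it is no problem to extend the class.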
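Before launching the job it can help to check which of these files are actually visible in the directories you put on the classpath. The following is only a minimal sketch in plain Python: the function name find_config_files and the idea of passing a list of directories are my own assumptions, not part of Hadoop or Spark. Note that the *-default.xml files normally ship inside the Hadoop and HBase jars, so a result of None for them in a directory scan is expected.

```python
import os

# File names listed in the answer above.
EXPECTED_FILES = [
    "core-default.xml", "core-site.xml",
    "hdfs-default.xml", "hdfs-site.xml",
    "mapred-default.xml", "mapred-site.xml",
    "yarn-default.xml", "yarn-site.xml",
    "hbase-default.xml", "hbase-site.xml",
]


def find_config_files(classpath_dirs, names=EXPECTED_FILES):
    """Map each file name to the first directory containing it, or None.

    Hypothetical helper: it only scans plain directories, it does not
    look inside jar files.
    """
    found = {}
    for name in names:
        found[name] = None
        for directory in classpath_dirs:
            if os.path.isfile(os.path.join(directory, name)):
                found[name] = directory
                break
    return found
```

Directories are scanned in the order given, which mirrors how the first match on a classpath wins.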

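Configurator.init() is provided only as a simpler handle for triggering class loading.

Now I need to extend the Python Spark script so that it touches the configurator during Spark context startup: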

from pyspark import SparkContext

# Create a minimal Spark context.
sc = SparkContext(appName="ScriptWithIntegratedConfig")

# It's critical to initialize the configurator so that any
# new org.apache.hadoop.conf.Configuration object loads our resources.
sc._jvm.com.wellcentive.nosql.Configurator.init()

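So now the normal Hadoop new Configuration() construction (which is common inside the PythonRDD infrastructure for Hadoop-based datasets) results in the configuration of all layers being loaded from the classpath, where I can place the configuration for the needed cluster.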
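One way to confirm this from the Python side is to build a fresh Configuration in the driver JVM and read a property back. This is a rough sketch, assuming sc is an already created SparkContext; read_hadoop_property is a hypothetical helper name, and sc._jvm is a private PySpark attribute (the Py4J view of the JVM), so it may change between versions.

```python
def read_hadoop_property(sc, key):
    """Create a new Hadoop Configuration in the driver JVM and read one key.

    Returns the value as a string, or None if the key is not set in any
    of the loaded resources.
    """
    # Calling the Java class through Py4J invokes its no-arg constructor,
    # i.e. the same `new Configuration()` discussed above.
    conf = sc._jvm.org.apache.hadoop.conf.Configuration()
    return conf.get(key)
```

For example, after the Configurator has been initialized, read_hadoop_property(sc, "hbase.zookeeper.quorum") should return the value from your hbase-site.xml if that file was found on the classpath.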

At least it works for me.
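If you still need per-job properties on top of the classpath files, a different route (not what I did above) is to pass a plain dict through the conf argument that the PySpark Hadoop RDD methods accept. Below is a minimal sketch that reads Hadoop-style *.xml files into such a dict. The helper names load_hadoop_xml and build_job_conf are made up for illustration, and the parser deliberately ignores `<final>` markers and ${...} variable expansion.

```python
import xml.etree.ElementTree as ET


def load_hadoop_xml(path):
    """Return the <name>/<value> pairs of a Hadoop *.xml file as a dict."""
    props = {}
    root = ET.parse(path).getroot()
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        # Skip incomplete entries instead of storing None values.
        if name is not None and value is not None:
            props[name.strip()] = value.strip()
    return props


def build_job_conf(xml_paths, overrides=None):
    """Merge several files in order (later files win), then apply overrides."""
    conf = {}
    for path in xml_paths:
        conf.update(load_hadoop_xml(path))
    conf.update(overrides or {})
    return conf
```

The resulting dict can then be given as conf=... to SparkContext.newAPIHadoopRDD or SparkContext.hadoopRDD, so it only affects that one dataset rather than every Configuration created in the JVM.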
