python - How to prevent memory leaks when testing with HiveContext in PySpark

Question

I use pyspark for some data processing and rely on HiveContext for window functions.

To test the code I use TestHiveContext, basically copying the implementation from the pyspark source code:

https://spark.apache.org/docs/preview/api/python/_modules/pyspark/sql/context.html

@classmethod
def _createForTesting(cls, sparkContext):
    """(Internal use only) Create a new HiveContext for testing.

    All test code that touches HiveContext *must* go through this method. Otherwise,
    you may end up launching multiple derby instances and encounter with incredibly
    confusing error messages.
    """
    jsc = sparkContext._jsc.sc()
    jtestHive = sparkContext._jvm.org.apache.spark.sql.hive.test.TestHiveContext(jsc)
    return cls(sparkContext, jtestHive)

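My tests then inherit from a base class that has access to the context.

This worked for a while. However, as I added more tests I started noticing intermittent out-of-memory failures. Now I can't run the test suite without a failure.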

"java.lang.OutOfMemoryError: Java heap space"

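I explicitly stop the Spark context after each test run, but this does not seem to kill the HiveContext. So I believe a new HiveContext is created every time a new test runs, and the old ones are never removed, which causes the memory leak.

Any suggestions on how to tear down the base class so that the HiveContext is killed?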

score 1 · Accepted Answer

If you are happy to use a singleton to hold the Spark/Hive context across all your tests, you can do something like the following.

test_contexts.py:

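from pyspark.sql import HiveContext

_test_spark = None
_test_hive = None

def get_test_spark():
    global _test_spark  # rebind the module-level singleton, not a local
    if _test_spark is None:
        # Create spark context for tests.
        # Not really sure what's involved here for Python.
        _test_spark = ...
    return _test_spark

def get_test_hive():
    global _test_hive  # rebind the module-level singleton, not a local
    if _test_hive is None:
        sc = get_test_spark()
        jsc = sc._jsc.sc()
        jtest_hive = sc._jvm.org.apache.spark.sql.hive.test.TestHiveContext(jsc)
        # Wrap the JVM object in a Python HiveContext, as _createForTesting does.
        _test_hive = HiveContext(sc, jtest_hive)
    return _test_hive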

Then you just import these functions in your tests.

my_test.py:

from test_contexts import get_test_spark, get_test_hive

def test_some_spark_thing():
    sc = get_test_spark()
    sqlContext = get_test_hive()
    # etc
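
Since the question mentions tests that inherit from a base class, the same singleton idea can probably be moved into that base class. The following is only a rough sketch, not something taken from the pyspark source: `SharedContextTestCase` and `create_context` are hypothetical names, and in a real suite `create_context` would presumably return the result of `get_test_hive()`. The context is cached on the base class itself, so every subclass reuses one instance instead of creating a new one per test class.

```python
import unittest


class SharedContextTestCase(unittest.TestCase):
    """Base class that builds one context per process and reuses it."""

    # Cached on the base class (not on subclasses) so all test classes share it.
    _shared_context = None

    @staticmethod
    def create_context():
        # Hypothetical hook: a real suite would return get_test_hive() here.
        raise NotImplementedError("override create_context in a subclass")

    @classmethod
    def setUpClass(cls):
        base = SharedContextTestCase
        if base._shared_context is None:
            # First test class to run pays the creation cost.
            base._shared_context = cls.create_context()
        # Every test class sees the same context object.
        cls.sqlContext = base._shared_context
```

With this layout nothing is torn down between test classes, so only one context should exist for the lifetime of the test process. Stopping the underlying Spark context once at the very end of the run is left to the test runner.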
