
When I run the following code in a Python script (launched directly with `python`), I get the error below. When I start a pyspark shell, import Koalas, create the DataFrame, and call head(), it works fine and gives me the expected output.

Does the SparkSession need to be set up in some specific way for Koalas to work?

from pyspark.sql import SparkSession
import pandas as pd
import databricks.koalas as ks


spark = SparkSession.builder \
        .master("local[*]") \
        .appName("Pycedro Spark Application") \
        .getOrCreate()


kdf = ks.DataFrame({"a" : [4 ,5, 6],
                    "b" : [7, 8, 9],
                    "c" : [10, 11, 12]})

print(kdf.head())

Error when running from a Python script:

    File "/usr/local/Cellar/apache-spark/3.1.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 586, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/usr/local/Cellar/apache-spark/3.1.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 69, in read_command
    command = serializer._read_with_length(file)
  File "/usr/local/Cellar/apache-spark/3.1.1/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 160, in _read_with_length
    return self.loads(obj)
  File "/usr/local/Cellar/apache-spark/3.1.1/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 430, in loads
    return pickle.loads(obj, encoding=encoding)
AttributeError: Can't get attribute '_fill_function' on <module 'pyspark.cloudpickle' from '/usr/local/Cellar/apache-spark/3.1.1/libexec/python/lib/pyspark.zip/pyspark/cloudpickle/__init__.py'>

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:517)
[...]

Versions: Koalas 1.7.0, PySpark 3.0.2



1 Answer


I had a similar problem with PySpark. Upgrading PySpark from version 3.0.2 to 3.1.2 fixed it. Here is some more information:

  • Hadoop version: 3.2.2
  • Spark version: 3.1.2
  • Python version: 3.8.5
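
A quick way to check whether the driver library and the Spark runtime actually agree on a version (a minimal sketch, not from the original answer):

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# pyspark.__version__ is the pip-installed driver library; spark.version is
# the JVM Spark that actually runs the executors. The cloudpickle error
# above typically appears when these two disagree (e.g. 3.0.2 vs 3.1.1).
print("driver pyspark:", pyspark.__version__)
print("cluster spark :", spark.version)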

Interestingly,

df = spark.read.csv("hdfs:///data.csv")
df.show(2)

works fine, but

sc.textFile("hdfs:///data.csv") 
sc.take(2)

fails with the following error. (Presumably this is because RDD actions ship pickled Python functions to the Python workers, which is exactly where a cloudpickle version mismatch surfaces, while the DataFrame read is planned and executed in the JVM.)

AttributeError: Can't get attribute '_fill_function' on <module 'pyspark.cloudpickle' from '/opt/spark/python/lib/pyspark.zip/pyspark/cloudpickle/__init__.py'>

Upgrading PySpark fixed the problem. The idea to upgrade came from this issue: https://issues.apache.org/jira/browse/SPARK-29536
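
If PySpark was installed with pip, the upgrade would typically be `pip install --upgrade pyspark==3.1.2`; the exact command is an assumption, as the answer only states that upgrading fixed it.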

answered 2021-06-15T05:38:27.327