I am using Google Colab and trying to use the Stanford SUTime library inside a function called from PySpark.

The function takes one row of a given RDD and uses the sutime library to return a (sentence, frequency) pair.

def convert1(row):
  s = str(row.dosefrequency).lower()
  try:
    i = sutime.parse(s)  # Parses the input string and returns the frequency, e.g. P1D (per day).
    if len(i) > 0 and 'timex-value' in i[0]:
      return [s, i[0]['timex-value']]
    return []
  except Exception:
    return []

My input RDD looks like this:

rdd.take(3)
'''
[Row(practiceid=701, dosequantity='200', dosefrequency='take 2 tablet by oral route  every day', count_dosequantity=716, count_dosefrequency=1, count_patientuid=306, DM Current -hychqudose='200mg', DM Expected Value='400mg'),
 Row(practiceid=595, dosequantity='200', dosefrequency='take 1 tablet by oral route 2 times every day', count_dosequantity=327, count_dosefrequency=1, count_patientuid=230, DM Current -hychqudose='200mg', DM Expected Value='400mg'),
 Row(practiceid=623, dosequantity='200', dosefrequency='take 1 (200MG)  by oral route 2 times every day', count_dosequantity=339, count_dosefrequency=1, count_patientuid=180, DM Current -hychqudose='200mg', DM Expected Value='400mg')]
'''

This is how I call the function with flatMap:

details = rdd.flatMap(lambda row: convert1(row)).collect()

But it gives me the following error:

Traceback (most recent call last):
  File "/usr/lib/python3.6/pickle.py", line 916, in save_global
    __import__(module_name, level=0)
ModuleNotFoundError: No module named 'edu'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 841, in save_global
    return Pickler.save_global(self, obj, name=name)
  File "/usr/lib/python3.6/pickle.py", line 922, in save_global
    (obj, module_name, name))
_pickle.PicklingError: Can't pickle <java class 'edu.stanford.nlp.python.SUTimeWrapper'>: it's not found as edu.stanford.nlp.python.edu.stanford.nlp.python.SUTimeWrapper

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/pickle.py", line 916, in save_global
    __import__(module_name, level=0)
ModuleNotFoundError: No module named 'java'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 841, in save_global
    return Pickler.save_global(self, obj, name=name)
  File "/usr/lib/python3.6/pickle.py", line 922, in save_global
    (obj, module_name, name))
_pickle.PicklingError: Can't pickle <java class 'java.lang.Object'>: it's not found as java.lang.java.lang.Object

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/serializers.py", line 468, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
  File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 1097, in dumps
    cp.dump(obj)
  File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 357, in dump
    return Pickler.dump(self, obj)
  File "/usr/lib/python3.6/pickle.py", line 409, in dump
    self.save(obj)
  File "/usr/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.6/pickle.py", line 751, in save_tuple
    save(element)
  File "/usr/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 501, in save_function
    self.save_function_tuple(obj)
  File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 730, in save_function_tuple
    save(state)
  File "/usr/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.6/pickle.py", line 847, in _batch_setitems
    save(v)
  File "/usr/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.6/pickle.py", line 781, in save_list
    self._batch_appends(obj)
  File "/usr/lib/python3.6/pickle.py", line 808, in _batch_appends
    save(tmp[0])
  File "/usr/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 496, in save_function
    self.save_function_tuple(obj)
  File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 730, in save_function_tuple
    save(state)
  File "/usr/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.6/pickle.py", line 847, in _batch_setitems
    save(v)
  File "/usr/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.6/pickle.py", line 852, in _batch_setitems
    save(v)
  File "/usr/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 496, in save_function
    self.save_function_tuple(obj)
  File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 730, in save_function_tuple
    save(state)
  File "/usr/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.6/pickle.py", line 847, in _batch_setitems
    save(v)
  File "/usr/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.6/pickle.py", line 852, in _batch_setitems
    save(v)
  File "/usr/lib/python3.6/pickle.py", line 521, in save
    self.save_reduce(obj=obj, *rv)
  File "/usr/lib/python3.6/pickle.py", line 634, in save_reduce
    save(state)
  File "/usr/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.6/pickle.py", line 847, in _batch_setitems
    save(v)
  File "/usr/lib/python3.6/pickle.py", line 521, in save
    self.save_reduce(obj=obj, *rv)
  File "/usr/lib/python3.6/pickle.py", line 605, in save_reduce
    save(cls)
  File "/usr/lib/python3.6/pickle.py", line 490, in save
    self.save_global(obj)
  File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 850, in save_global
    return self.save_dynamic_class(obj)
  File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 662, in save_dynamic_class
    obj=obj)
  File "/usr/lib/python3.6/pickle.py", line 610, in save_reduce
    save(args)
  File "/usr/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.6/pickle.py", line 751, in save_tuple
    save(element)
  File "/usr/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.6/pickle.py", line 736, in save_tuple
    save(element)
  File "/usr/lib/python3.6/pickle.py", line 490, in save
    self.save_global(obj)
  File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 850, in save_global
    return self.save_dynamic_class(obj)
  File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 666, in save_dynamic_class
    save(clsdict)
  File "/usr/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.6/pickle.py", line 847, in _batch_setitems
    save(v)
  File "/usr/lib/python3.6/pickle.py", line 496, in save
    rv = reduce(self.proto)
TypeError: can't pickle _jpype._JMethod objects
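My understanding of the traceback is that cloudpickle is trying to serialize the JPype-backed sutime object (the `edu.stanford.nlp.python.SUTimeWrapper` JVM proxy) because convert1 references it, so it gets captured in the closure that Spark ships to the workers. A minimal stand-in reproduces the same class of failure — `FakeSUTime` is hypothetical and simply holds a thread lock in place of the non-picklable JVM handle:

```python
import pickle
import threading

class FakeSUTime:
    """Stand-in for the JPype-backed sutime client: it holds a
    resource (a lock here, a JVM proxy in the real case) that the
    pickle protocol cannot serialize."""
    def __init__(self):
        self._handle = threading.Lock()  # non-picklable, like _jpype._JMethod

    def parse(self, s):
        return []

client = FakeSUTime()

def convert1_demo(row):
    # Referencing the module-level `client` makes it part of the
    # closure state that Spark must pickle and send to executors.
    return client.parse(str(row).lower())

try:
    pickle.dumps(client)
except TypeError as e:
    print("pickling failed:", e)
```

This would fail the same way regardless of how well convert1 itself works locally, since the driver-side loop never needs to pickle anything.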

However, when I call the convert1 function explicitly with the following code:

rr = rdd.take(10)
for i in range(10):
  x = convert1(rr[i])
  print(x)

the code above works perfectly fine for me. It only fails with flatMap.
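One pattern I have seen suggested for this kind of error is to avoid closing over the driver-side client entirely and instead create it lazily inside each partition, so nothing JVM-backed ever needs to be pickled. This is only a sketch: `make_sutime` is a placeholder for whatever sets up the SUTime wrapper on a worker, and the stub parser inside it is fake:

```python
def make_sutime():
    """Placeholder for worker-side SUTime setup; the stub below just
    fakes a parse result so the control flow can be demonstrated."""
    class _Stub:
        def parse(self, s):
            return [{'timex-value': 'P1D'}] if 'every day' in s else []
    return _Stub()

def convert_partition(rows):
    # Created once per partition, on the worker, after deserialization,
    # so the client itself is never pickled.
    su = make_sutime()
    for row in rows:
        s = str(row).lower()
        parsed = su.parse(s)
        if parsed and 'timex-value' in parsed[0]:
            yield [s, parsed[0]['timex-value']]

# With Spark this would be called as:
# details = rdd.mapPartitions(convert_partition).collect()
```

I have not verified whether sutime can actually be initialized on Colab's Spark workers this way, so treat this as an untested direction rather than a fix.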

Please ask if you need any further details.