我遇到了以下问题,进行了一些测试,以证明 pyspark 中的纯 pyarrow UDF 与始终通过 pandas 相比的有用性。
import awkward
import numpy
import pandas
import pyarrow
counts = numpy.random.randint(0,20,size=200000)
content = numpy.random.normal(size=counts.sum())
test_jagged = awkward.JaggedArray.fromcounts(counts, content)
test_arrow = awkward.toarrow(test_jagged)
def awk_arrow(col):
jagged = awkward.fromarrow(col)
jagged2 = jagged**2
return awkward.toarrow(jagged2)
def pds_arrow(col):
pds = col.to_pandas()
pds2 = pds**2
return pyarrow.Array.from_pandas(pds2)
out1 = awk_arrow(test_arrow)
out2 = pds_arrow(test_arrow)
out3 = awkward.fromarrow(out1)
out4 = awkward.fromarrow(out2)
type(out3)
type(out4)
产量
<class 'awkward.array.jagged.JaggedArray'>
<class 'awkward.array.masked.BitMaskedArray'>
和
out3 == out4
产生(在堆栈跟踪的末尾):
AttributeError: no column named 'reshape'
查看数组:
print(out3);print();print(out4);
[[0.00736072240594475 0.055560612050914775 0.4094101942882973 ... 2.4428454924678533 0.07220045904440388 3.627270394986972] [0.16496227597707766 0.44899025266849046 1.314602433843517 ... 0.07384558862546337 0.5655043672418324 4.647396184088295] [0.04356259421421215 1.8983172440218923 0.10442121937532822 0.7222467989756899 0.03199694383894229 0.954281670741488] ... [0.23437909336737087 2.3050822727237272 0.10325064534860394 0.685018355096147] [0.8678765133108529 0.007214659054089928 0.3674379091794599 0.1891573101427716 2.1412651888713317 0.1461282900111415] [0.3315468986268042 2.7520115602119772 1.3905787720409803 ... 4.476255451581318 0.7237199572195625 0.8820112289563018]]
[[0.00736072240594475 0.055560612050914775 0.4094101942882973 ... 2.4428454924678533 0.07220045904440388 3.627270394986972] [0.16496227597707766 0.44899025266849046 1.314602433843517 ... 0.07384558862546337 0.5655043672418324 4.647396184088295] [0.04356259421421215 1.8983172440218923 0.10442121937532822 0.7222467989756899 0.03199694383894229 0.954281670741488] ... [0.23437909336737087 2.3050822727237272 0.10325064534860394 0.685018355096147] [0.8678765133108529 0.007214659054089928 0.3674379091794599 0.1891573101427716 2.1412651888713317 0.1461282900111415] [0.3315468986268042 2.7520115602119772 1.3905787720409803 ... 4.476255451581318 0.7237199572195625 0.8820112289563018]]
可以看到数组的内容和形状都是一样的,但是表面上没有可比性,非常反直觉。是否有充分的理由将没有 Null 的密集锯齿状结构表示为 BitMaskedArray?