r - How to use getItem(x, ...) in SparkR, and how to subset a specific value in a column?

Question

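I have a SparkR DataFrame, cust_sales, and I need to extract only the value CQ98901282 from the column cust_id. In plain R we would use cust_sales$cust_id[3].

My guess is that getItem(x, ...) can be used for this. If so, the argument "x" would be the column cust_sales$cust_id.

What would go in the "..." argument?

If my guess about getItem(x, ...) is wrong, what is it actually for, and how would I use it in my example?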

+----------+----------+-----------+
|   cust_id|      date|Total_trans|
+----------+----------+-----------+
|CQ98901280|2015-06-06|          1|
|CQ98901281|2015-05-01|          1|
|CQ98901282|2015-05-02|          1|
|CQ98901283|2015-05-03|          1|
|CQ98901284|2015-04-01|          6|
|CQ98901285|2015-04-02|          8|
|CQ98901286|2015-04-03|         13|
|CQ98901287|2015-04-04|          3|
|CQ98901288|2015-04-05|          3|
|CQ98901289|2015-04-08|         16|
+----------+----------+-----------+

TIA, Arun

score 1 · Accepted Answer

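Spark DataFrames do not support random row access, and you have the wrong idea about how the getItem function works. It is designed to extract data from non-atomic fields (such as maps or arrays):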

> writeLines('{"foo": [0, 1], "bar": {"x": 3, "y": 4}}', "example.json")
> df <- SparkR::jsonFile(sqlContext, "example.json")
> printSchema(df)
root
 |-- bar: struct (nullable = true)
 |    |-- x: long (nullable = true)
 |    |-- y: long (nullable = true)
 |-- foo: array (nullable = true)
 |    |-- element: long (containsNull = true)
> library(magrittr)  # provides the %>% pipe used below
> select(df, getItem(df$bar, "x"), getItem(df$bar, "y")) %>% head()
  bar[x] bar[y]
1      3      4

For some reason I could not get it to work with arrays in SparkR, but it does work in PySpark:

>>> df = sqlContext.read.json("example.json")
>>> df.select(df.foo.getItem(0)).show()
>>> df.select(df.foo.getItem(0), df.foo.getItem(1), df.bar.getItem("x")).show()
+------+------+------+
|foo[0]|foo[1]|bar[x]|
+------+------+------+
|     0|     1|     3|
+------+------+------+
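
The difference between the two operations can be sketched in plain Python. This is only a model of the semantics, not Spark code: the helper names `get_item` and `filter_rows` are hypothetical, and the rows are ordinary dicts copied from the example JSON record and the first three rows of the question's table. It shows that getItem looks inside a nested field for every row, while picking out one customer is a row filter on the column's value.

```python
# Plain-Python model of the two operations; this is NOT the Spark API.
# get_item and filter_rows are hypothetical helpers used only for illustration.

def get_item(rows, column, key):
    """Model of getItem: for EVERY row, look inside a nested field
    (a map key or an array index) and return the extracted values."""
    return [row[column][key] for row in rows]

def filter_rows(rows, column, value):
    """Model of a filter: keep only the rows whose column equals value."""
    return [row for row in rows if row[column] == value]

# The record from example.json in the answer above.
nested = [{"foo": [0, 1], "bar": {"x": 3, "y": 4}}]

# The first three rows of the cust_sales table from the question.
cust_sales = [
    {"cust_id": "CQ98901280", "date": "2015-06-06", "Total_trans": 1},
    {"cust_id": "CQ98901281", "date": "2015-05-01", "Total_trans": 1},
    {"cust_id": "CQ98901282", "date": "2015-05-02", "Total_trans": 1},
]

foo_first = get_item(nested, "foo", 0)    # models df.foo.getItem(0)
bar_x = get_item(nested, "bar", "x")      # models df.bar.getItem("x")
matches = filter_rows(cust_sales, "cust_id", "CQ98901282")

print(foo_first, bar_x)
print(matches[0]["cust_id"])
```

In Spark itself, the second operation would be a filter on the column rather than a positional index, for example something like `filter(cust_sales, cust_sales$cust_id == "CQ98901282")` in SparkR, followed by collecting the result.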
