apache-spark - Spark.sql: select rows that have no matching column in another table

Translated from: https://stackoverflow.com/questions/35761259 2016-03-03T01:06:39.710

Viewed 892 times

I have a DataFrame named edges that looks like this:

+------+------+-------------------+
|   src|   dst|      mean_affinity|
+------+------+-------------------+
|  [78]|  [81]|   0.78547141736462|
|  [98]| [102]| 0.8051602291309927|
|[2540]|[3195]| 0.7734367678994718|
|   [1]|[1367]|0.37372281429944215|
| [182]|[1602]| 0.3915882096267663|
|   [1]|  [77]| 0.6999457255005836|
|  [55]|  [78]| 0.4411667943000793|
+------+------+-------------------+

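I don't want any value to be repeated across the src and dst columns. For example, 78 is the src of the first row, so it cannot also be the dst of the last row. In other words, each vertex may appear only once in the table.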
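To make the requirement concrete, here is a minimal plain-Python sketch of one possible reading: walk the edges from highest to lowest mean_affinity and keep an edge only if neither of its vertices has been used yet. The greedy strategy, the descending order, the use of scalar ints in place of the single-element arrays, and the name `pick_disjoint_edges` are all assumptions made for illustration. It is not a Spark solution.

```python
from typing import List, Set, Tuple

Edge = Tuple[int, int, float]  # (src, dst, mean_affinity)


def pick_disjoint_edges(edges: List[Edge]) -> List[Edge]:
    """Greedily keep edges so that every vertex appears at most once.

    Edges are visited in descending mean_affinity order (an assumption),
    and an edge is skipped if either endpoint was already used.
    """
    used: Set[int] = set()
    kept: List[Edge] = []
    for src, dst, affinity in sorted(edges, key=lambda e: e[2], reverse=True):
        if src in used or dst in used:
            continue  # one endpoint already appears in a kept edge
        used.update((src, dst))
        kept.append((src, dst, affinity))
    return kept


# Sample rows copied from the table above, with [78] written as 78.
edges = [
    (78, 81, 0.78547141736462),
    (98, 102, 0.8051602291309927),
    (2540, 3195, 0.7734367678994718),
    (1, 1367, 0.37372281429944215),
    (182, 1602, 0.3915882096267663),
    (1, 77, 0.6999457255005836),
    (55, 78, 0.4411667943000793),
]

selected = pick_disjoint_edges(edges)
for row in selected:
    print(row)
```

On the sample rows this drops (55, 78), because 78 is already used by the first row, and (1, 1367), because 1 is already used by the stronger edge (1, 77).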

The table should also be sorted by mean_affinity. I started writing a query for this, but it doesn't seem to work:

sqlContext.sql("""select e.src, e.dst, e.mean_affinity 
                    from edges e
                    where not exists 
                   (select src from edges where src = e.dst)""").show()

Here is part of the stack trace:

 An error occurred while calling o111.sql.
: java.lang.RuntimeException: [3.46] failure: ``)'' expected but identifier src found

                    where not exists (select src from edges where src = e.dst)
                                             ^
    at scala.sys.package$.error(package.scala:27)
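The parser stops at the subquery, which suggests (this is an inference from the message, not something confirmed here) that this Spark version does not accept a subquery inside WHERE. A common way to express NOT EXISTS without a subquery is a left outer join followed by an IS NULL filter. The sketch below runs that rewritten query against Python's built-in sqlite3 so it can be tried without a cluster. Scalar ints stand in for the single-element arrays, the descending ORDER BY is an assumption, and whether `sqlContext.sql` accepts the same string for array-typed columns has not been verified.

```python
import sqlite3

# NOT EXISTS rewritten as an anti-join: keep rows of e with no partner s
# such that s.src = e.dst.
ANTI_JOIN_SQL = """
    select e.src, e.dst, e.mean_affinity
    from edges e
    left outer join edges s on s.src = e.dst
    where s.src is null
    order by e.mean_affinity desc
"""

conn = sqlite3.connect(":memory:")
conn.execute("create table edges (src integer, dst integer, mean_affinity real)")
conn.executemany(
    "insert into edges values (?, ?, ?)",
    [
        (78, 81, 0.78547141736462),
        (98, 102, 0.8051602291309927),
        (2540, 3195, 0.7734367678994718),
        (1, 1367, 0.37372281429944215),
        (182, 1602, 0.3915882096267663),
        (1, 77, 0.6999457255005836),
        (55, 78, 0.4411667943000793),
    ],
)

rows = conn.execute(ANTI_JOIN_SQL).fetchall()
conn.close()

for row in rows:
    print(row)
```

This reproduces only what the original query asks for: rows whose dst never appears as a src. It removes (55, 78) but still leaves vertex 1 as the src of two rows, so it does not by itself guarantee that every vertex appears once.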

Thanks!
