mysql - Numbering duplicates in R using sqldf

Question

I have a dataset containing duplicate rows, and I would like to number them as shown below.

Original dataset:

DF <- structure(list(pol_no = c(1L, 1L, 2L, 2L, 2L), os = c(23L, 33L, 
45L, 56L, 45L), paid = c(45L, 67L, 78L, 89L, 78L)), .Names = c("pol_no", 
"os", "paid"), class = "data.frame", row.names = c(NA, -5L))

It looks like this:

> DF
  pol_no os paid
1      1 23   45
2      1 33   67
3      2 45   78
4      2 56   89
5      2 45   78

I would like to number the duplicates within pol_no as follows:

pol_no   os   paid  count
1        23    45      1
1        33    67      2
2        45    78      1
2        56    89      2
2        45    78      3

Many thanks in advance.

Regards,

Mansi

Edit: Added dput() output to make the example reproducible, and fixed the formatting.

score 3 · Accepted Answer

sqldf with RPostgreSQL

PostgreSQL's SQL window functions make this kind of problem straightforward. For more on using PostgreSQL with sqldf, see FAQ #12 on the sqldf home page:

library(RPostgreSQL)
library(sqldf)
sqldf('select *, rank() over  (partition by "pol_no" order by CTID) count
       from "DF" 
       order by CTID ')

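sqldf with RSQLite

By default, sqldf uses SQLite via RSQLite. Although SQLite lacks PostgreSQL's windowing features (they were only added in SQLite 3.25.0, so older RSQLite builds do not have them), the overall installation is much simpler with SQLite: it is an ordinary package install with nothing extra to do, whereas PostgreSQL itself must be installed and configured separately. Without window functions the SQLite statement is more complicated, even though the two SQL statements are of similar length: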

# if RPostgreSQL was previously attached and loaded, detach and unload it
detach("package:RPostgreSQL", unload = TRUE)

sqldf("select a.*, count(*) count
       from DF a, DF b 
       where a.pol_no = b.pol_no and b.rowid <= a.rowid group by a.rowid"
)
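For each row, the self-join above counts the rows with the same pol_no that sit at or before it (b.rowid <= a.rowid). A minimal base R sketch of the same logic may make this easier to see. It assumes the DF from the question, and the helper name running_count is hypothetical, not part of sqldf:

```r
# DF as given in the question
DF <- data.frame(pol_no = c(1L, 1L, 2L, 2L, 2L),
                 os     = c(23L, 33L, 45L, 56L, 45L),
                 paid   = c(45L, 67L, 78L, 89L, 78L))

# Hypothetical helper: for row i, count the rows j <= i that share its key,
# mirroring "a.pol_no = b.pol_no and b.rowid <= a.rowid" in the SQL
running_count <- function(key) {
  vapply(seq_along(key),
         function(i) sum(key[seq_len(i)] == key[i]),
         integer(1))
}

DF$count <- running_count(DF$pol_no)
DF
```

Like the self-join, this compares every row with all earlier rows, so it is quadratic in the number of rows; it is meant as an illustration rather than something to use on large data.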

R ave

Finally, here is a solution that does not use sqldf at all, only core R functionality:

transform(DF, count = ave(pol_no, pol_no, FUN = seq_along))
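To unpack that one-liner: ave splits its first argument by the grouping vector, applies FUN to each group, and puts the results back in the original row order, so seq_along yields 1, 2, ... within each pol_no. A self-contained run against the question's data (the variable name result is only illustrative):

```r
# DF as given in the question
DF <- data.frame(pol_no = c(1L, 1L, 2L, 2L, 2L),
                 os     = c(23L, 33L, 45L, 56L, 45L),
                 paid   = c(45L, 67L, 78L, 89L, 78L))

# Number the rows within each pol_no group, keeping the original row order
result <- transform(DF, count = ave(pol_no, pol_no, FUN = seq_along))
print(result)
```

The count column comes out as 1, 2, 1, 2, 3, matching the desired output in the question.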
