0

这是线程在两个或多个字段中提取具有重复值但在另一个字段中具有不同值的行的后续内容 ,目标是使用 SQL 和 DuckDB 对大于 RAM 的数据执行相同的操作。上述线程中的dupKeyEx()函数代码仅适用于可轻松放入 RAM 的数据。

以下是该线程的要点:

library(data.table)

customers <- structure(list(
  NAME = c("GEETA SHYAM RAO", "B V RAMANA", "GONTU VENKATARAMANAIAH", 
           "DAMAT RAMAKRISHNA", "MARIAM SUDHAKAR", "VELPURI LAKSHMI SUJATHA", 
           "MOHAMMED LIYAKHAT ALI", "VENKATESHWARAN PONNAMBALAM", 
           "DEVARAKONDA SATISH BABU", "GEEDI RAMULU", "KANDU OBULESU",
           "B V RAMANA", "DOKKA RAJESH", "G TULASIRAM REDDY", 
           "MALLELA CHIRANJEEVI", "MANEPALLI VENKATA RAVAMMA", 
           "DOKKA JAGADEESHWAR", "K KRISHNA", "B SUDARSHAN", "B ANNAPURNA", 
           "CHAVVA SHIVA RAMULU", "BIKASH BAHADUR CHITRE", "DARBAR ASHOK", 
           "VEMULAPALLY SANGAMESHWAR RAO", "MOHAMMED ABDUL HAKEEM ANWAR", 
           "MANEPALLI SHIV SHANKAR RAO", "MOHD MISKEEN MOHIUDDIN",
           "KOTLA CHENNAMMA", "NAYAK SURYAKANTH", "GOPIREDDY INDIRA", 
           "MEKALA SREEDEVI", "K KRISHNA", "B V RAMANA", 
           "KUMMARI VENKATESHAM", "BHAVANI CONSRUCTIONS", 
           "UPPUTHOLLA KOTAIAH", "YEDIDHA NIRMALA DEVI", "MARIAM SUDHAKAR", 
           "B ANNAPURNA", "VELPURI LAKSHMI SUJATHA", "DARBAR ASHOK", 
           "AMMANA VISHNU VARDHAN REDDY", "ZAITOON BEE", "MOHD CHAND PASHA",
           "PALERELLA RAMESH", "GEEDI SRINIVAS", "RAMAIAH SADU",
           "BIMAN BALAIAH", "KOTLA CHENNAMMA", "VENKATESHWARAN PONNAMBALAM"),
  DOB = c("13-02-1971", "15-01-1960", "01-07-1970", "10-03-1977", 
          "24-01-1954", "28-06-1971", "26-01-1980", "14-04-1969", 
          "23-09-1978", "15-08-1954", "09-10-1984", "15-01-1960", 
          "29-09-1984", "03-03-1975", "26-01-1979", "01-01-1964", 
          "21-01-1954", "01-05-1964", "12-03-1975", "12-12-1962", 
          "10-03-1982", "14-05-1983", "03-01-1950", "04-03-1962", 
          "12-05-1966", "01-06-1960", "10-03-1964", "15-07-1958", 
          "26-06-1979", "02-04-1974", "10-01-1975", "01-05-1964",
          "15-01-1960", "08-08-1977", NA, "05-04-1981", "29-08-1971",
          "24-01-1954", "12-12-1962", "28-06-1971", "03-01-1950",
          "23-06-1970", "20-02-1960", "05-07-1975", "10-01-1979", 
          "31-08-1982", "10-08-1983", "10-03-1964", "15-07-1958",
          "14-04-1969"),
  ID = c(502969, 502902, 502985, 502981, 502475, 502267, 502976, 
         502272, 502977, 502973, 502986, 502910, 502989, 502998, 502967, 
         502971, 502988, 502737, 502995, 502878, 502972, 502984, 502639, 
         502968, 502975, 502970, 502997, 502466, 502991, 502982, 502980, 
         502737, 502902, 502999, 502994, 502987, 502990, 502047, 502877, 
         502251, 502548, 502992, 503000, 502993, 502983, 502974, 502996, 
         502979, 502467, 502290), 
  PIN = c(500082, 500032, 500032, 500032, 500032, 500084, 500032, 500032,
          500032, 500032, 500032, 500033, 500032, 500084, 500084, 500032,
          5e+05, 500050, 500032, 500084, 500032, 500032, 500032, 500050,
          500032, 500032, 500045, 500032, 500084, 500032, 500032, 500084,
          500035, 500084, 500032, 500032, 500032, 500032, 500084, 500032,
          500084, 500033, 500084, 500032, 500032, 500032, 500084, 500032,
          500032, 500032)),
  .Names = c("NAME", "DOB", "ID", "PIN"),
  row.names = c(NA, -50L), class = c("data.table", "data.frame"))

目标是从上表中提取具有以下内容的行:

  • 两个或多个字段中的相同值(此处为 NAME 和 DOB);但
  • 另一个字段中的不同值(此处为 ID);和
  • 其他列应保持原样

工作函数为:

dupKeyEx <- function(DT, dup_cols, unique_cols) {
  cols <-  c(dup_cols, unique_cols)
  chr_cols <- cols[sapply(DT[, ..cols], is.character)]
  DT[, (chr_cols) := lapply(.SD, stringr::str_trim), .SDcols=chr_cols]
  mDT <-  DT[!duplicated(DT, by=cols), .N, by=dup_cols][N > 1L]
  ans <- unique(DT[mDT[, !"N"], on=dup_cols], by=cols)
  setorderv(ans, c(dup_cols, unique_cols))
  return(ans)
}

这给出了预期的结果

dupKeyEx(customers, c("NAME", "DOB"), "ID")
    
                          NAME        DOB     ID    PIN
 1:                B ANNAPURNA 12-12-1962 502877 500084
 2:                B ANNAPURNA 12-12-1962 502878 500084
 3:                 B V RAMANA 15-01-1960 502902 500032
 4:                 B V RAMANA 15-01-1960 502910 500033
 5:               DARBAR ASHOK 03-01-1950 502548 500084
 6:               DARBAR ASHOK 03-01-1950 502639 500032
 7:            KOTLA CHENNAMMA 15-07-1958 502466 500032
 8:            KOTLA CHENNAMMA 15-07-1958 502467 500032
 9:            MARIAM SUDHAKAR 24-01-1954 502047 500032
10:            MARIAM SUDHAKAR 24-01-1954 502475 500032
11:    VELPURI LAKSHMI SUJATHA 28-06-1971 502251 500032
12:    VELPURI LAKSHMI SUJATHA 28-06-1971 502267 500084
13: VENKATESHWARAN PONNAMBALAM 14-04-1969 502272 500032
14: VENKATESHWARAN PONNAMBALAM 14-04-1969 502290 500032

但是,此功能不适用于无法轻松放入 RAM 的非常大的表。所以我想知道是否可以设计一些 SQL 查询来为非常大的表做同样的工作。来自https://duckdb.org/的DuckDB是一个非常高效的开源分析数据库,并且与 R 配合得非常好。所以我们尝试将上表导入 DuckDB 数据库,如下所示:

library(DBI)
library(duckdb)

dbFile <- "...path..to..database.../experimental_database.db"
con <- dbConnect(duckdb::duckdb(), dbFile)

# import the customers table into the database (in actual scenario, the very large bigger-than-RAM table will be imported into the database from a large csv or we may run SQL query directly on chunked parquet files, as mentioned at https://duckdb.org/docs/data/parquet) - for simplicity, here we just using the customers table to try to compose functional SQL query
dbWriteTable(con, "customers", customers)

# check if table is successfully imported
dbListTables(con)

# try reading the table
dbReadTable(con, "customers")

现在,需要编写一个功能性 SQL 查询,该查询将给出与上述dupKeEx()函数相同的结果。如果该 SQL 查询运行良好并且使用大于 RAM 的表给出有效结果,则该线程的目标将被视为已完成。

任何帮助将不胜感激。

4

1 回答 1

0

以下是如何在某些列中获取具有重复值的表行的方法: https ://chartio.com/learn/databases/how-to-find-duplicate-values-in-a-sql-table/

如果您有一个大 CSV 并且您已经知道将用于搜索/分组的列,您可以拆分 CSV,对块进行排序,然后合并排序并保存为镶木地板。

于 2021-11-05T10:34:18.207 回答