
I have a big dataset. I have to subset it (Big_data) using values stored in another dta file (Criteria_data). I will show you the problem first:

   **Big_data**                           **Criteria_data**
====================      ================================================
  lon        lat             4_digit_id   minlon  maxlon  minlat  maxlat
-76.22      44.27              0765       -78.44  -77.22  34.324  35.011
-67.55      33.19              6161       -66.11  -65.93  40.32   41.88
    .......                                   ........
 (over 1 million obs)                    (271 observations)        
====================      ================================================

I have to subset the big data as follows:

use Big_data

preserve
keep if (-78.44<lon<-77.22) & (34.324<lat<35.011)
save data_0765, replace
restore

preserve
keep if (-66.11<lon<-65.93) & (40.32<lat<41.88)
save data_6161, replace
restore

....

(1) What would be an efficient way to program this subsetting in Stata? (2) Are the inequalities written correctly?

3 Answers

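1) Subsetting the data

With 400,000 observations in the main file and 300 in the reference file, this takes about 1.5 minutes. I could not test it with double the observations in the main file because the lack of RAM brings my computer to a crawl.

The strategy involves creating as many variables as needed to hold the reference latitudes and longitudes (271*4 = 1084 in the OP's case; Stata IC and above can handle this. See help limits). This requires some reshaping and appending. We then check which observations of the big data file meet the conditions.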

clear all
set more off

*----- create example databases -----

tempfile bigdata reference

input ///
lon        lat   
-76.22      44.27
-66.0      40.85 // meets conditions
-77.10     34.8 // meets conditions
-66.00    42.0 
end

expand 100000

save "`bigdata'"
*list

clear all

input ///
str4 id   minlon  maxlon  minlat  maxlat
"0765"       -78.44  -75.22  34.324  35.011
"6161"       -66.11  -65.93  40.32   41.88
end

drop id
expand 150
gen id = _n

save "`reference'"
*list


*----- reshape original reference file -----

use "`reference'", clear

tempfile reference2

destring id, replace
levelsof id, local(lev)

gen i = 1
reshape wide minlon maxlon minlat maxlat, i(i) j(id) 

gen lat = .
gen lon = .

save "`reference2'"


*----- create working database -----

use "`bigdata'"

timer on 1
quietly {
    forvalues num = 1/300 {
        gen minlon`num' = .
        gen maxlon`num' = .
        gen minlat`num' = .
        gen maxlat`num' = .
    }
}
timer off 1

timer on 2
append using "`reference2'"
drop i
timer off 2

*----- flag observations for which conditions are met -----

timer on 3
gen byte flag = 0
foreach le of local lev {
    quietly replace flag = 1 if inrange(lon, minlon`le'[_N], maxlon`le'[_N]) & inrange(lat, minlat`le'[_N], maxlat`le'[_N])
}
timer off 3

*keep if flag
*keep lon lat

*list

timer list

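The inrange() function implies that the minimums and maximums must be adjusted beforehand to satisfy the OP's strict inequalities (the function tests <=, >=).

Perhaps some expansion using expand, correlatives, and by (so that the data are in long form) could speed things up. It is not entirely clear to me right now. I'm sure there is a better way in plain Stata mode, and Mata may be better still.

(joinby was also tested, but again RAM was a problem.)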
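
To make the difference concrete, here is a minimal sketch comparing inrange() with explicit strict comparisons. The three data rows are made up for illustration, and the variables are stored as double so that the stored values equal the literals used in the expressions.

```stata
clear
* hypothetical points: the first and third sit exactly on a longitude bound
input double(lon lat)
-78.44 34.5
-78.00 34.5
-77.22 35.0
end

* inrange() tests min <= x <= max, so points on a bound are counted
count if inrange(lon, -78.44, -77.22) & inrange(lat, 34.324, 35.011)
local n_inclusive = r(N)

* explicit strict comparisons leave out points that sit on a bound
count if lon > -78.44 & lon < -77.22 & lat > 34.324 & lat < 35.011
local n_strict = r(N)

display "inclusive: `n_inclusive'   strict: `n_strict'"
```

If the strict version is what is needed, writing out the four comparisons may be simpler than adjusting the bounds before calling inrange().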

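Edit

Doing the computations in chunks rather than on the complete database significantly improves the RAM issue. With a main file of 1.2 million observations and a reference file of 300 observations, the following code does all the work in about 1.5 minutes: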

set more off

*----- create example big data -----

clear all

set obs 1200000
set seed 13056

gen lat = runiform()*100
gen lon = runiform()*100

local sizebd `=_N' // to be used in computations

tempfile bigdata
save "`bigdata'"

*----- create example reference data -----

clear all

set obs 300
set seed 97532

gen minlat = runiform()*100
gen maxlat = minlat + runiform()*5

gen minlon = runiform()*100
gen maxlon = minlon + runiform()*5

gen id = _n

tempfile reference
save "`reference'"


*----- reshape original reference file -----

use "`reference'", clear

destring id, replace
levelsof id, local(lev)

gen i = 1
reshape wide minlon maxlon minlat maxlat, i(i) j(id) 
drop i

tempfile reference2
save "`reference2'"


*----- create file to save results -----

tempfile results
clear all
set obs 0

gen lon = .
gen lat = .

save "`results'"


*----- start computations -----

clear all

* local that controls # of observations in intermediate files
local step = 5000 // can't be larger than sizebd; rows past the last full step are skipped

timer clear

timer on 99
forvalues en = `step'(`step')`sizebd' {

    * load observations and join with references
    timer on 1
    local start = `en' - (`step' - 1)
    use in `start'/`en' using "`bigdata'", clear
    timer off 1

    timer on 2
    append using "`reference2'"
    timer off 2

    * flag observations that meet conditions
    timer on 3
    gen byte flag = 0
    foreach le of local lev {
        quietly replace flag = 1 if inrange(lon, minlon`le'[_N], maxlon`le'[_N]) & inrange(lat, minlat`le'[_N], maxlat`le'[_N])
    }
    timer off 3

    * append to result database
    timer on 4
    quietly {
        keep if flag
        keep lon lat
        append using "`results'"
        save "`results'", replace
    }
    timer off 4

}
timer off 99

timer list
display "total time is " `r(t99)'/60 " minutes"

use "`results'"
browse
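
The chunked loop keeps only lon and lat, so the matching id is lost. A possible long-form variant for a single chunk, offered only as a sketch: cross pairs every point in the chunk with every reference box, and the id of each matching box stays attached. The data sizes here are hypothetical miniatures of the example files, and this has not been timed against the code above.

```stata
clear all
set seed 13056

* hypothetical miniature reference file: 300 boxes
set obs 300
gen id = _n
gen minlat = runiform()*100
gen maxlat = minlat + runiform()*5
gen minlon = runiform()*100
gen maxlon = minlon + runiform()*5
tempfile reference
save "`reference'"

* hypothetical miniature big file: 20,000 points
clear
set obs 20000
gen lat = runiform()*100
gen lon = runiform()*100
tempfile bigdata
save "`bigdata'"

* one chunk: form all point-box pairs, then keep the pairs that match
use in 1/5000 using "`bigdata'", clear
cross using "`reference'"
keep if inrange(lon, minlon, maxlon) & inrange(lat, minlat, maxlat)
keep id lon lat
sort id
```

A chunk of 5,000 points crossed with 300 boxes gives 1.5 million rows before the keep, so the chunk size is what controls memory use here.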

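2) The inequalities

You ask whether your inequalities are correct. They are in fact legal, meaning that Stata will not complain, but the result is probably not what you expect.

The following result may seem surprising: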

. display  (66.11 < 100 < 67.93)
1

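How can the expression evaluate to true (i.e. 1)? Stata first evaluates 66.11 < 100, which is true, and then sees 1 < 67.93, which is of course also true.

The intended expression is (and Stata will now do what you want):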

. display  (66.11 < 100) & (100 < 67.93)
0

You can also rely on the function inrange().

The following example is consistent with the previous explanation:

. display  (66.11 < 100 < 0)
0

Stata sees 66.11 < 100, which is true (i.e. 1), and follows up with 1 < 0, which is false (i.e. 0).

answered 2014-04-09T00:18:16.907
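
Applied to the first subset in the question, the same logic can be checked on a few made-up points. This is a small sketch, not the OP's data: only the bounds are taken from the question.

```stata
clear
* hypothetical points; only the second lies inside the 0765 box
input double(lon lat)
-76.22 44.27
-78.00 34.50
-67.55 33.19
end

* chained form from the question: (-78.44 < lon) is 0 or 1, and neither
* is smaller than -77.22, so the longitude condition is never true
count if (-78.44 < lon < -77.22) & (34.324 < lat < 35.011)
local n_chained = r(N)

* intended form: each bound is compared with lon or lat directly
count if (lon > -78.44) & (lon < -77.22) & (lat > 34.324) & (lat < 35.011)
local n_intended = r(N)

display "chained: `n_chained'   intended: `n_intended'"
```

With negative longitude bounds the chained form silently selects nothing, which is arguably worse than an error message.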

This uses Roberto's data setup:

clear all

set obs 1200000
set seed 13056

gen lat = runiform()*100
gen lon = runiform()*100

local sizebd `=_N' // to be used in computations

tempfile bigdata
save "`bigdata'"

*----- create example reference data -----

clear all

set obs 300
set seed 97532

gen minlat = runiform()*100
gen maxlat = minlat + runiform()*5

gen minlon = runiform()*100
gen maxlon = minlon + runiform()*5

gen id = _n

tempfile reference
save "`reference'"


timer on 1
levelsof id, local(id_list)

foreach id of local id_list {
    sum minlat if id==`id', meanonly
    local minlat = r(min)
    sum maxlat if id==`id', meanonly
    local maxlat = r(max)

    sum minlon if id==`id', meanonly
    local minlon = r(min)
    sum maxlon if id==`id', meanonly
    local maxlon = r(max)

    preserve
        use if (inrange(lon,`minlon',`maxlon') & inrange(lat,`minlat',`maxlat')) using "`bigdata'", clear
        qui save data_`id', replace
    restore
}

timer off 1
timer list 1

answered 2014-04-10T00:53:03.313

I would try to avoid preserve-ing and restore-ing the "big" file. That is possible, but at the cost of losing the Stata format.

Using the same setup as Roberto and Dimitriy,

set more off

use "`bigdata'", clear
merge 1:1 _n using "`reference'"

* check for data consistency: 
* minlat, maxlat, minlon, maxlon are either all defined or all missing
assert inlist( mi(minlat) + mi(maxlat) + mi(minlon) + mi(maxlon), 0, 4)

* this will come handy later
gen byte touse = 0

* set up and cycle over the reference data
count if !missing(minlat)
forvalues n=1/`=r(N)' {
    replace touse = inrange(lat,minlat[`n'],maxlat[`n']) & inrange(lon,minlon[`n'],maxlon[`n'])
    local thisid = id[`n']
    outfile lat lon if touse using data_`thisid'.csv, replace comma
}

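Time it on your machine. You can do without touse and thisid and have just a single outfile inside the loop, but it would be less readable.

Later on you can infile lat lon using data_###.csv, clear. If you really need proper Stata files, you can convert the output with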
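
For reference, the more compact loop could look roughly like this. It is a sketch that assumes the same layout as the merged data above (reference rows sitting in the first observations); the four-row dataset is made up for illustration, and the files are written to the current working directory.

```stata
clear
* hypothetical merged layout: two reference boxes alongside four points
input double(lat lon minlat maxlat minlon maxlon) id
10 10   5 15   5 15   1
50 50  40 60  40 60   2
12 14   .  .   .  .   .
90 90   .  .   .  .   .
end

count if !missing(minlat)
forvalues n = 1/`=r(N)' {
    * the condition and the file name are both built inline for box `n'
    outfile lat lon using data_`=id[`n']'.csv ///
        if inrange(lat, minlat[`n'], maxlat[`n']) & inrange(lon, minlon[`n'], maxlon[`n']), ///
        replace comma
}
```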

clear
local allcsv : dir . files "*.csv"
foreach f of local allcsv {
   * change the filename
   local dtaname = subinstr(`"`f'"',".csv",".dta",.)
   infile lat lon using `"`f'"', clear
   if _N>0 save `"`dtaname'"', replace
}

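Time this as well. I guarded the save because some of the simulated datasets were empty. I think this was faster than 1.5 minutes on my machine, including the conversion.

answered 2014-04-10T14:20:04.413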