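I have a table buildings with 3.2 million rows. I need to expand this table to 11 different periods so that I can treat it as (balanced) panel data. This means that for each object there are 11 different years (2000 to 2010) to be observed. The periods should be named: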
2000
2001
...
2009
2010
Table definition
CREATE TABLE public.buildings
(
gid integer NOT NULL DEFAULT nextval('buildings_gid_seq'::regclass),
osm_id character varying(11),
name character varying(48),
type character varying(16),
geom geometry(MultiPolygon,4326),
centroid geometry(Point,4326),
gembez character varying(50),
gemname character varying(50),
krsbez character varying(50),
krsname character varying(50),
pv boolean,
gr smallint,
capac double precision,
instdate date,
pvid integer,
dist double precision,
gemewz integer,
n500 integer,
ibase double precision,
popden integer,
instp smallint,
b2000 double precision,
b2001 double precision,
b2002 double precision,
b2003 double precision,
b2004 double precision,
b2005 double precision,
b2006 double precision,
b2007 double precision,
b2008 double precision,
b2009 double precision,
b2010 double precision,
ibase_id integer[],
ibase_dist integer[],
CONSTRAINT buildings_pkey PRIMARY KEY (gid)
)
WITH (
OIDS=FALSE
);
ALTER TABLE public.buildings
OWNER TO postgres;
CREATE INDEX build_centroid_gix
ON public.buildings
USING gist
(st_transform(centroid, 31467));
CREATE INDEX buildings_geom_idx
ON public.buildings
USING gist
(geom);
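I want to use this data in R for a regression analysis.
ibase_id is an array of gids. ibase_dist is an array with the corresponding distances from those gids to the object. Both arrays always have the same length.
The gids in the arrays belong to records of buildings that lie within a radius of 500 m around the centroid of the object and have pv=TRUE (which implies that dist, instdate, instp, capac and pvid are NOT NULL).
The arrays were built with this query: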
SELECT a.gid AS buildid, array_agg(b.gid) AS ibase_id, array_agg(round(ST_Distance(ST_Transform(a.centroid, 31467), ST_Transform(b.centroid, 31467))::integer)) AS ibase_dist
FROM buildings a
LEFT JOIN (SELECT * FROM buildings WHERE pv=TRUE) AS b ON ST_DWithin(ST_Transform(a.centroid, 31467), ST_Transform(b.centroid, 31467), 500.0)
AND a.gid <> b.gid
GROUP BY a.gid
Example:
ibase_id: {3075528,409073,322311,226643,833798,322344,226609}
ibase_dist: {290,293,398,494,411,381,384}
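To illustrate how the two arrays line up, here is a small sketch that pairs each neighbour gid with its distance, using the example values above. It assumes the multi-argument form of unnest in the FROM clause (available from PostgreSQL 9.4); the column aliases neighbour_gid and dist_m are made up for illustration.

```sql
-- Pair the example arrays element by element:
-- one output row per (neighbour gid, distance in metres).
SELECT n.neighbour_gid, n.dist_m
FROM unnest(
       ARRAY[3075528,409073,322311,226643,833798,322344,226609],
       ARRAY[290,293,398,494,411,381,384]
     ) AS n(neighbour_gid, dist_m);
```

This should return seven rows, the first being 3075528 paired with 290.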
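-- Intended calculation of ibase (pseudo-SQL, does not run as written):
-- sum 1/distance over all neighbours installed before the observed year.
UPDATE buildings
SET ibase=SUM(1/s)
FROM unnest(SELECT ibase_dist FROM buildings WHERE (SELECT instp
FROM buildings
WHERE gid IN unnest(ibase_id))<year) s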
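For each period, only those array entries should be considered whose year lies before the observed period of the panel data. (The query above does not work yet, because I first need to join the arrays.) Right now the two arrays hold the information for all years. That is why I think they should be carried along into every period, so that after expanding to panel data I can calculate ibase for every record (11 x 3,200,000).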
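A rough sketch of the per-period calculation I am after, not a tested solution. It assumes that instp holds the installation year of a neighbour, that unnest with two array arguments (PostgreSQL 9.4+) pairs ibase_id and ibase_dist element by element, and that a building with no earlier PV neighbour should get ibase = 0 rather than NULL. The aliases p, n, nb, neighbour_gid and dist_m are made up for illustration.

```sql
-- One row per building and year; ibase sums 1/distance over those
-- neighbours whose installation year (instp, assumed) is before the year.
SELECT b.gid,
       p.year,
       (SELECT COALESCE(SUM(1.0 / n.dist_m), 0)
        FROM unnest(b.ibase_id, b.ibase_dist) AS n(neighbour_gid, dist_m)
        JOIN buildings nb ON nb.gid = n.neighbour_gid
        WHERE nb.instp < p.year
          AND n.dist_m > 0            -- skip zero distances to avoid division by zero
       ) AS ibase
FROM buildings b
CROSS JOIN generate_series(2000, 2010) AS p(year)
WHERE b.gid = 3075528;                -- arbitrary gid, only to try it on one building
```

Dropping the final WHERE clause would run it for all 11 x 3,200,000 combinations, which I expect to be slow without further tuning.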
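I don't need all columns for the regression analysis. If it would significantly improve the performance of the multiplication, we could stick to the following columns (basically omitting the geometry columns):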
gid integer NOT NULL DEFAULT nextval('buildings_gid_seq'::regclass),
gembez character varying(50),
gemname character varying(50),
krsbez character varying(50),
krsname character varying(50),
pv boolean,
gr smallint,
capac double precision,
dist double precision,
gemewz integer,
n500 integer,
ibase double precision,
popden integer,
instp smallint,
b2000 double precision,
b2001 double precision,
b2002 double precision,
b2003 double precision,
b2004 double precision,
b2005 double precision,
b2006 double precision,
b2007 double precision,
b2008 double precision,
b2009 double precision,
b2010 double precision,
ibase_id integer[],
ibase_dist integer[],
CONSTRAINT buildings_pkey PRIMARY KEY (gid)
)
WITH (
OIDS=FALSE
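Approach
My basic idea is to create a second table periods that contains the 11 different periods and to multiply that table with the buildings table. I don't know how to implement this. Unfortunately, I don't have much experience with R and have not used R's database interface.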
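What I imagine is something along these lines, as an untested sketch. The table name buildings_panel and the column name year are made up for illustration; the column list is only a subset of the reduced list above and can be extended in the same way.

```sql
-- Helper table with the 11 periods.
CREATE TABLE periods (year integer PRIMARY KEY);
INSERT INTO periods
SELECT g FROM generate_series(2000, 2010) AS g;

-- "Multiply" buildings by periods: a CROSS JOIN yields every
-- combination, i.e. 11 rows per building.
CREATE TABLE buildings_panel AS
SELECT p.year,
       b.gid,
       b.gemname,
       b.krsname,
       b.pv,
       b.capac,
       b.instp,
       b.ibase_id,
       b.ibase_dist
FROM buildings b
CROSS JOIN periods p;
```

The pair (gid, year) would then identify one observation in the panel.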
I am using PostgreSQL 9.5beta2 (compiled by Visual C++ build 1800, 64-bit) and R x64 3.2.1.