performance - Metrics and performance

Question

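I'm new to GeoTools and am facing this problem: I am inserting about 2 MB of shapefile data (roughly 5,800 entries) into PostGIS and, surprisingly, it takes about 6 minutes to complete! That is quite annoying, because my "real" dataset may be up to 25 MB per shapefile group (shp, dbf, ...), and there are 100 groups.

I was told this might be an index problem, because PostgreSQL updates the table's indexes on every INSERT. Is there a way to "disable" these indexes during my bulk insert and tell the database to build all of them at the end? Or is there a better way to do this?

Here is my code snippet: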

Map<String, Object> shpparams = new HashMap<String, Object>();
shpparams.put("url", "file://" + path);
FileDataStore shpStore = (FileDataStore) shpFactory.createDataStore(shpparams);
SimpleFeatureCollection features = shpStore.getFeatureSource().getFeatures();
if (schema == null) {
    // Copy schema and change name in order to refer to the same
    // global schema for all files
    SimpleFeatureType originalSchema = shpStore.getSchema();
    Name originalName = originalSchema.getName();
    NameImpl theName = new NameImpl(originalName.getNamespaceURI(), originalName.getSeparator(), POSTGIS_TABLENAME);
    schema = factory.createSimpleFeatureType(theName, originalSchema.getAttributeDescriptors(), originalSchema.getGeometryDescriptor(),
            originalSchema.isAbstract(), originalSchema.getRestrictions(), originalSchema.getSuper(), originalSchema.getDescription());
    pgStore.createSchema(schema);
}
// String typeName = shpStore.getTypeNames()[0];
SimpleFeatureStore featureStore = (SimpleFeatureStore) pgStore.getFeatureSource(POSTGIS_TABLENAME);

// Add the shapefile features to the PostGIS table
DefaultTransaction transaction = new DefaultTransaction("create");
featureStore.setTransaction(transaction);
try {
    featureStore.addFeatures(features);
    transaction.commit();
} catch (Exception problem) {
    LOGGER.error(problem.getMessage(), problem);
    transaction.rollback();
} finally {
    transaction.close();
}
shpStore.dispose();

Thanks for your help!

So I tested your solutions, but nothing helped much... the completion time is still the same. Here is my table definition:

FID serial 10
the_geom geometry 2147483647
xxx varchar 10
xxx int4 10
xxx varchar 3
xxx varchar 2
xxx float8 17
xxx float8 17
xxx float8 17

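So I don't think the problem is directly related to my code or to the database; it may be due to system limits (RAM, buffers, ...). I'll look into this over the next few days.

Do you have any more ideas?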

score 1 · Accepted Answer

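I'm back with the solution to this problem. After a lot of investigation, I found that the physical network was the problem: with a local database (local to the GeoTools application) there is no problem. The network added 200 or 300 milliseconds to each INSERT statement request. With a large amount of data being inserted into the database, the response time was very long!

So there was nothing wrong with the original PostGIS configuration or with my code snippet...

Thanks to everyone for participating.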
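To make the round-trip effect concrete, here is a rough back-of-the-envelope sketch. It assumes the whole 6 minutes reported in the question is spent on round trips (about 62 ms per row for 5,800 rows) and that one batch costs one round trip; the class name and the batch size of 500 are made up for illustration.

```java
class RoundTripEstimate {

    /** Number of client/server round trips when rows are sent in batches of batchSize. */
    static long roundTrips(long rows, int batchSize) {
        return (rows + batchSize - 1) / batchSize; // ceiling division
    }

    /** Time spent on round trips alone, in milliseconds. */
    static long latencyMillis(long rows, int batchSize, long millisPerTrip) {
        return roundTrips(rows, batchSize) * millisPerTrip;
    }

    public static void main(String[] args) {
        long rows = 5800;       // from the question
        long millisPerTrip = 62; // assumption: 6 min / 5800 rows, rounded

        // One INSERT per round trip: 5800 trips, 359600 ms (about 6 minutes)
        System.out.println(latencyMillis(rows, 1, millisPerTrip));

        // Hypothetical batches of 500 rows: 12 trips, 744 ms
        System.out.println(latencyMillis(rows, 500, millisPerTrip));
    }
}
```

The point is only that the cost scales with the number of statements sent over the network, not with the 2 MB of data.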

score 0 · Accepted Answer

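You can check whether indexes or PK/FK constraints in the database really are the bottleneck with the following steps:

1) Make sure the data is inserted in a single transaction (disable autocommit)

2) Drop all indexes and recreate them after the data import (you cannot disable an index)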

DROP INDEX my_index;
CREATE INDEX my_index ON my_table (my_column);

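3) Drop or disable the PK/FK constraints and recreate or re-enable them after the data import. You can skip the constraint checks during the data import without dropping the constraints. Note that in PostgreSQL the statement below disables the internal triggers that enforce foreign keys (which requires superuser privileges); a primary key is enforced through its unique index and stays active: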

ALTER TABLE my_table DISABLE trigger ALL;
-- data import
ALTER TABLE my_table ENABLE trigger ALL;

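The drawback of this approach is that the constraints are not checked for data inserted or updated while the checks are disabled. Of course, when you recreate constraints after the data import, they are also enforced on the existing data.

You can also defer the checking of PK/FK constraints until the end of the transaction. This is possible if and only if the PK/FK constraints are defined as deferrable (which is not the default):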
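For what it's worth, steps 1 and 2 could look roughly like this from plain JDBC. This is not the GeoTools API used in the question, only a sketch of the same idea. `my_table`, `my_index` and `my_column` come from the statements above; the `id` column, the class name and the batch size of 500 are assumptions.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Collections;
import java.util.List;

class BulkImportSketch {

    /** Builds e.g. "INSERT INTO t (a, b) VALUES (?, ?)" for the given columns. */
    static String insertSql(String table, String... columns) {
        String placeholders = String.join(", ", Collections.nCopies(columns.length, "?"));
        return "INSERT INTO " + table + " (" + String.join(", ", columns)
                + ") VALUES (" + placeholders + ")";
    }

    /** Drops the index, inserts all rows in one transaction in batches, then rebuilds the index. */
    static void importRows(Connection conn, List<Object[]> rows) throws SQLException {
        conn.setAutoCommit(false); // step 1: one transaction for the whole import
        try (Statement ddl = conn.createStatement();
             PreparedStatement insert =
                     conn.prepareStatement(insertSql("my_table", "id", "my_column"))) {
            ddl.execute("DROP INDEX IF EXISTS my_index"); // step 2: no per-row index updates
            int pending = 0;
            for (Object[] row : rows) {
                insert.setObject(1, row[0]);
                insert.setObject(2, row[1]);
                insert.addBatch();
                if (++pending == 500) { // hypothetical batch size
                    insert.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) {
                insert.executeBatch(); // flush the last partial batch
            }
            ddl.execute("CREATE INDEX my_index ON my_table (my_column)");
            conn.commit();
        } catch (SQLException e) {
            conn.rollback(); // DDL is transactional in PostgreSQL, so the index drop is undone too
            throw e;
        }
    }
}
```

You would call `importRows` with a connection obtained from `DriverManager.getConnection(...)` and the rows to insert. Batching also cuts down the number of statements sent one by one, which matters if the database is not on the same machine.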

ALTER TABLE my_table ADD PRIMARY KEY (id) DEFERRABLE INITIALLY DEFERRED;

START TRANSACTION;
-- data import
COMMIT; -- constraints are checked here

Or

ALTER TABLE my_table ADD PRIMARY KEY (id) DEFERRABLE INITIALLY IMMEDIATE;

START TRANSACTION;
SET CONSTRAINTS ALL DEFERRED;
-- data import
COMMIT; -- constraints are checked here

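Edit:

To narrow down the cause of the problem, you can import the data with your application, take a database dump (using INSERT statements), and import that dump again. This should give you an idea of how long a plain import takes and how much overhead the application adds.

Create a data-only dump of the database using INSERT statements (COPY statements would be faster, but your application also uses INSERTs, so this is better for comparison):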

pg_dump <database> --data-only --column-inserts -f data.sql

Recreate the empty database schema and import the data (baseline timing):

date; psql <database> --single-transaction -f data.sql > /dev/null; date

Maybe this gives you more insight into the problem.
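On the application side, the same comparison could be done with a small timing helper like the one below. It is only a sketch: the class and method names are made up, and the 20-second baseline in the usage note is a placeholder, not a measured value.

```java
import java.time.Duration;

class ImportTiming {

    /** Runs the task and returns how long it took (wall-clock time). */
    static Duration time(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return Duration.ofNanos(System.nanoTime() - start);
    }

    /** Application overhead = time of the application import minus the plain psql baseline. */
    static Duration overhead(Duration applicationImport, Duration baseline) {
        return applicationImport.minus(baseline);
    }
}
```

For example, if the application import takes the 6 minutes from the question and the psql baseline took, say, 20 seconds, `overhead(Duration.ofMinutes(6), Duration.ofSeconds(20))` gives 5 minutes 40 seconds that are spent somewhere other than in the plain inserts.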
