data-modeling - How to organize data marts in a data warehouse

Question

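I am building a new enterprise data warehouse for my company, using the Kimball approach (i.e., a collection of data marts). I would like to know the best practice (or the usual practice) for organizing my data marts.

1) Should each data mart be a separate database on the EDW server? Or, should each data mart be a schema of a single database?

2) For conformed dimensions (i.e., dimensions that apply to 2+ data marts / subject areas / business processes), should they live in a separate schema or database? Or, because we won't know in advance which dimensions will be conformed (since we are building one data mart at a time), should we simply identify the conformed dimensions in our enterprise bus matrix (Excel file) and make no effort to segregate them in the EDW?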

3)

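a) Should fact tables and dimension tables be identified at all in the EDW? For example, since I will be maintaining a diagram of each star schema that will be shared with self-service BI users, is there any value in identifying fact tables in the DB via some method, say prefixing the table name with 'Fact'?

b) If fact and dimension tables should be identified in the EDW, what should be the identification mechanism? Should it be via table name prefixing? Should it be via organizing the tables into separate 'Fact' and 'Dimension' schemas?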

score 3 · Accepted Answer

1) Should each data mart be a separate database on the EDW server? Or, should each data mart be a schema of a single database?

This also depends on which database software you are using and whether it imposes any limitations on, for example, querying data across multiple schemas.

In any case, you'll inevitably need to connect to, and query/join data from, different data marts, to address some business cases or even ETL processes. You may also need to segregate/secure access to specific data marts, load each data mart independently or using different schedules/methods, etc.

For these reasons, it is usually good enough to keep the data warehouse in one database organized into schemas: one schema per data mart plus specific schemas for shared objects (like conformed dimensions). This way you can still use data that is scattered across multiple data marts, easily control access to specific schemas / data marts, and it'll be easier for users to locate specific metrics/facts.
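A minimal sketch of that layout, assuming hypothetical schema names (`shared`, `sales`, `inventory`) and table names that are not from the question. SQLite has no `CREATE SCHEMA`, so attached in-memory databases stand in for schemas here; on a server database the same schema-qualified names would live inside a single database.

```python
import sqlite3

# Assumption: SQLite attached databases play the role of schemas.
# All schema and table names below are hypothetical.
conn = sqlite3.connect(":memory:")
for schema in ("shared", "sales", "inventory"):
    conn.execute(f"ATTACH DATABASE ':memory:' AS {schema}")

conn.executescript("""
    -- Conformed dimension lives in the shared schema.
    CREATE TABLE shared.D_date (date_key INTEGER PRIMARY KEY, calendar_date TEXT);

    -- One schema per data mart, each holding its own fact table.
    CREATE TABLE sales.F_orders (date_key INTEGER, amount REAL);
    CREATE TABLE inventory.F_stock (date_key INTEGER, units_on_hand INTEGER);

    INSERT INTO shared.D_date VALUES (1, '2024-01-01'), (2, '2024-01-02');
    INSERT INTO sales.F_orders VALUES (1, 100.0), (1, 50.0), (2, 75.0);
    INSERT INTO inventory.F_stock VALUES (1, 40), (2, 35);
""")

# Query across two data marts through the shared date dimension.
# Orders are aggregated to the date grain before joining.
rows = conn.execute("""
    SELECT d.calendar_date, o.total_amount, s.units_on_hand
    FROM shared.D_date AS d
    JOIN (SELECT date_key, SUM(amount) AS total_amount
          FROM sales.F_orders
          GROUP BY date_key) AS o ON o.date_key = d.date_key
    JOIN inventory.F_stock AS s ON s.date_key = d.date_key
    ORDER BY d.calendar_date
""").fetchall()

for row in rows:
    print(row)
```

Because everything sits behind one connection, a single query can combine facts from both marts, which is the main convenience of schemas over separate databases.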

2) For conformed dimensions (i.e., dimensions that apply to 2+ data marts / subject areas / business processes), should they live in a separate schema or database? Or, because we won't know in advance what dimensions will be conformed (since we are building a data mart at a time), should we simply identify the conformed dimensions in our enterprise bus matrix (Excel file) and make no effort to segregate them in the EDW?

If you organize data marts into schemas, it makes sense to have a specific schema to hold these conformed dimensions and other shared data. This way, different users that may have access only to specific data marts can still use the conformed/shared dimensions.
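One way to picture the access side of this is a small helper that generates grants so each role can read its own data mart plus the shared schema. This is only a sketch: the role and schema names are hypothetical, and the statements use PostgreSQL-style `GRANT` syntax, which may differ on other database software.

```python
def grant_statements(role_to_mart, shared_schema="shared"):
    """Build PostgreSQL-style grants: each role reads its mart and the shared schema.

    role_to_mart maps a role name to the one data mart schema it may read.
    All names are hypothetical examples.
    """
    statements = []
    for role, mart in sorted(role_to_mart.items()):
        for schema in (mart, shared_schema):
            statements.append(f"GRANT USAGE ON SCHEMA {schema} TO {role};")
            statements.append(
                f"GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO {role};"
            )
    return statements


grants = grant_statements({"sales_analyst": "sales", "stock_analyst": "inventory"})
for statement in grants:
    print(statement)
```

Neither role gets access to the other's data mart, yet both can join their facts to the same conformed dimensions.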

3)

a) Should fact tables and dimension tables be identified at all in the EDW? For example, since I will be maintaining a diagram of each star schema that will be shared with self-service BI users, is there any value in identifying fact tables in the DB via some method, say prefixing the table name with 'Fact'?

Yes. Prefixes make it easier to locate metrics (facts) and dimensions when users are browsing the data warehouse. Something like F_tableName or D_tableName would already go a long way.
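As a rough illustration of what the prefix convention buys you, a tool (or a user) can sort tables into facts and dimensions from the name alone. The helper and the table names below are hypothetical; only the `F_` / `D_` prefixes come from the suggestion above.

```python
def classify_table(name):
    """Classify a table by its prefix: F_ means fact, D_ means dimension."""
    prefix = name.split("_", 1)[0].upper()
    return {"F": "fact", "D": "dimension"}.get(prefix, "other")


def build_catalog(table_names):
    """Group table names by kind so users can browse facts and dimensions."""
    catalog = {"fact": [], "dimension": [], "other": []}
    for name in table_names:
        catalog[classify_table(name)].append(name)
    return catalog


# Hypothetical table names following the F_/D_ convention.
catalog = build_catalog(["F_orders", "D_date", "D_customer", "etl_log"])
print(catalog)
```

Tables without a recognized prefix fall into `other`, which makes staging or ETL housekeeping tables easy to spot.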

b) If fact and dimension tables should be identified in the EDW, what should be the identification mechanism? Should it be via table name prefixing? Should it be via organizing the tables into separate 'Fact' and 'Dimension' schemas?

Same as above: identify them with table name prefixes :)
