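I am building a DW for a somewhat creaky OLTP system.

One problem I'm facing is that there isn't much data integrity in the OLTP database. One example is the suburb field.

The suburb field is a free-text field in the OLTP UI, which means the column contains real values as well as empty strings and NULLs.

How would we normally handle this? The options I've come up with are:

  1. Import the data as-is (not ideal)
  2. In my ETL process, treat any empty string as NULL and replace it with the word "Unknown" in the DW
  3. Import both empty strings and NULLs as empty strings in the DW

FYI, I am using the Microsoft BI stack (SQL Server, SSIS, SSAS, SSRS).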

2 Answers

The short answer is, it depends on what NULL and empty strings mean in the source system.

This general question (handling NULL) has been discussed many times elsewhere. I think the most important point to remember is that a data warehouse is just a database; it may have a very specific type of schema and be designed for one purpose, but it's still just a database and any general advice on NULL still applies.

(As a side note, I sometimes prefer to talk about a "reporting database" rather than a "data warehouse", because it keeps things in perspective. Some DBAs and developers start making plans for huge server farms and multi-year ETL projects as soon as they hear the words "data warehouse", but in the end it's just a reporting database.)

Anyway, it isn't completely clear where you want to use NULL but it looks like it may be an attribute on a dimension.

I (probably) wouldn't use any of your three approaches, but it depends on the meaning of your data. Importing the data as-is is not useful because part of the value of a data warehouse is that the data has been cleaned and is consistent, which makes querying and comparing data along other dimensions much easier.

Replacing empty strings with 'Unknown' may or may not be correct: what does an empty string mean in the source system? There's a big difference between "it means there's no suburb" and "it means we don't know if there's a suburb". Assuming that an empty string means "no suburb" and NULL means "unknown", I would import the empty strings as they are but replace NULL with 'Unknown'. The main reason is that if the Suburb field is used as a filter condition in a report, it's easier for users (and possibly your reporting tool) to work with a non-NULL value like 'Unknown'.

If there is no consistency in the source system and you don't know what empty strings and NULLs mean, then you need to clarify that first and ideally fix the source system too (another benefit of a DWH is that it helps to identify inconsistencies and data handling errors in source systems).
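To make that rule concrete, here is a minimal sketch in Python. It only illustrates the mapping; in SSIS the same logic would live in the data flow or in the source query. It assumes the semantics described above (NULL means "unknown", empty string means "no suburb"). The function name `clean_suburb`, the sample rows, and the decision to trim whitespace-only values are illustrative assumptions, not something taken from your system.

```python
from typing import Optional

# Placeholder label for unknown suburbs; the exact wording is an assumption.
UNKNOWN = "Unknown"


def clean_suburb(value: Optional[str]) -> str:
    """Map a source suburb value to the value stored in the warehouse.

    Assumes NULL (None) means "unknown" and "" means "no suburb".
    """
    if value is None:
        return UNKNOWN
    # Assumption: whitespace-only input is treated the same as an empty string.
    return value.strip()


# Hypothetical sample of source values, including NULL and empty strings.
rows = [None, "", "  ", "Newtown"]
cleaned = [clean_suburb(v) for v in rows]
print(cleaned)
```

If it turns out that empty strings also mean "unknown" in your source, the only change would be to return `UNKNOWN` for empty values as well.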

Your last idea to convert NULLs to empty strings is the same issue: what does a NULL actually mean in the source system? If it means "no suburb" then replacing it with an empty string is probably a good idea, but if it means something else then you should handle it as something else.

So to summarize, my preference would be to import empty strings as-is and convert NULL to 'Unknown', but I can't be sure that this actually makes sense in your case. There's no single answer to this question because it all depends on your specific data and what it means. But there's no problem with using NULL in a data warehouse (or any other database) as long as you do it consistently and with a clear understanding of how the source systems handle data.

answered 2013-04-18T14:23:13.537

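Semantically, NULL usually means undefined or unknown, whereas an empty string "" means the value is known to be empty. In your suburb example, NULL could mean that it is not known whether the record has a suburb, while "" could mean that the record definitely has no suburb.

If NULL and "" mean the same thing in your case, it is best to normalize both to a single value (for example "") before importing into the DW, so that reporting is easier later (otherwise you end up with NULL = 50 and "" = 34 and have to add the two together).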
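A small Python sketch of why normalizing first helps, using the 50 and 34 counts from the example above. It assumes NULL and "" really do mean the same thing in the source; the helper name `normalize_missing` and the extra "Newtown" rows are made up for illustration.

```python
from collections import Counter
from typing import Optional


def normalize_missing(value: Optional[str]) -> str:
    """Collapse NULL (None) into "" so both are reported as one group.

    Only valid under the assumption that NULL and "" mean the same thing.
    """
    return "" if value is None else value


# Hypothetical source column: 50 NULLs, 34 empty strings, 3 real values.
source = [None] * 50 + [""] * 34 + ["Newtown"] * 3

# Without normalization, the missing values split into two groups.
raw_counts = Counter(source)

# After normalization, they form a single group.
normalized_counts = Counter(normalize_missing(v) for v in source)

print(raw_counts[None], raw_counts[""])  # two separate groups
print(normalized_counts[""])             # one combined group
```

Doing this once in the ETL step means report writers never have to remember to combine the two groups themselves.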

answered 2013-04-18T03:22:27.627