The short answer is, it depends on what NULL
and empty strings mean in the source system.
This general question (handling NULL
) has been discussed a lot, e.g. here, here, here etc. I think the most important point to remember is that a data warehouse is just a database; it may have a very specific type of schema and be designed for one purpose, but it's still just a database and any general advice on NULL
still applies.
(As a side note, I sometimes prefer to talk about a "reporting database" rather than a "data warehouse", because it keeps things in perspective. Some DBAs and developers start making plans for huge server farms and multi-year ETL projects as soon as they hear the words "data warehouse", but in the end it's just a reporting database.)
Anyway, it isn't completely clear where you want to use NULL
but it looks like it may be an attribute on a dimension.
I (probably) wouldn't use any of your three approaches, but it depends on the meaning of your data. Importing the data as-is is not useful because part of the value of a data warehouse is that the data has been cleaned and is consistent, which makes querying and comparing data along other dimensions much easier.
Replacing empty strings with 'Unknown' may or may not be correct: what does an empty string mean in the source system? There's a big difference between "it means there's no suburb" and "it means we don't know if there's a suburb". Assuming that an empty string means "no suburb" and NULL
means "unknown" then I would import the empty strings as they are, but replace NULL
with 'Unknown'. The main reason for doing that is that if the Suburb field will be used as a filter condition in a report, it's easier for users (and possibly your reporting tool) to work with a non-NULL
value like 'UNKNOWN'. And if there is no consistency in the source system and you don't know what empty strings and NULLs mean, then you need to clarify that first and ideally fix the source system too (another benefit of a DWH is that it helps to identify inconsistencies and data handling errors in source systems).
Your last idea to convert NULL
s to empty strings is the same issue: what does a NULL
actually mean in the source system? If it means "no suburb" then replacing it with an empty string is probably a good idea, but if it means something else then you should handle it as something else.
So to summarize, my preference would be to import empty strings as-is, and convert NULL
to 'UNKNOWN', but I can't be sure that this actually makes sense in your case. There's no single answer to this question because it all depends on your specific data and what it means. But there's no problem with using NULL
in a data warehouse (or any other database) as long as you do it consistently and with a clear understanding of how the source systems handle data.