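I've been given a DB2 data export (about 7 GB) with the associated DB2 control files. My goal is to load all of the data into an Oracle database. I've almost succeeded: I converted the control files into SQL*Loader CTL files, and that works for the most part.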

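However, I've found that some of the data files contain terminators and junk data in some columns. This gets loaded into the database and causes obvious problems when matching on that data. For example, a column that should contain '9930027130' will show LENGTH(TRIM(col)) = 14, i.e. 4 bytes of junk data.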

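My question is: what is the best way to get rid of this junk data? I'm hoping for a simple addition to the CTL file that lets it replace the junk with blanks. Otherwise, I can only think of writing a script that analyzes the data and replaces nulls/junk with blanks before running SQL*Loader.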

2 Answers

What, exactly, is your definition of "junk"?

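If you know that a column should only contain 10 characters of data, for example, you can null out longer values from your control file. SQL*Loader's NULLIF clause only compares a field with a literal or BLANKS, so a length test belongs in the field's SQL expression, e.g. "CASE WHEN LENGTH(:<<column>>) > 10 THEN NULL ELSE :<<column>> END". If you know that the column should only contain numeric characters (or alphanumerics), you can write a custom data cleansing function (e.g. STRIP_NONNUMERIC) and call that from your control file, e.g.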

COLUMN_NAME  position(1:14)  CHAR "STRIP_NONNUMERIC(:COLUMN_NAME)",
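
STRIP_NONNUMERIC is not a built-in; it has to exist in the target schema before the load runs. A minimal sketch of what such a function might look like, assuming "junk" means anything other than the digits 0-9 (the parameter name and the regular expression are illustrative assumptions, not something given in the question):

```sql
-- Hypothetical cleansing helper: returns the input with every
-- character other than the digits 0-9 removed.
CREATE OR REPLACE FUNCTION strip_nonnumeric (p_value IN VARCHAR2)
  RETURN VARCHAR2
  DETERMINISTIC
IS
BEGIN
  -- Replacing each non-digit match with an empty string deletes it.
  RETURN REGEXP_REPLACE(p_value, '[^0-9]', '');
END strip_nonnumeric;
/

-- Quick check against a literal before wiring it into the control file.
SELECT strip_nonnumeric('99300-27130 x') AS cleaned FROM dual;
```

Note that this only helps if the junk bytes are never digits themselves; if they can be, a length- or position-based rule is the safer choice.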

Depending on your requirements, these cleansing functions and the cleansing logic can get rather complicated. In data warehouses that load and cleanse large amounts of data every night, data is generally moved through a series of staging tables as successive rounds of cleansing and validation rules are applied, rather than loaded and cleansed in a single step.

A common approach would be, for example, to load all the data into VARCHAR2(4000) columns with no cleansing via SQL*Loader (or external tables). Then a separate process moves the data to a staging table that has the proper data types, NULL-ing out data that couldn't be converted (e.g. non-numeric data in a NUMBER column, impossible dates, etc.). Another process then moves the data to another staging table where you apply domain rules: a social security number has to be 9 digits, a latitude has to be between -90 and 90 degrees, a state code has to be in the state lookup table. Depending on the complexity of the validations, you may have more processes that move the data to additional staging tables to apply ever stricter sets of validation rules.
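
As a rough illustration of the first two stages, the sketch below loads everything as text and then converts it, NULL-ing out values that do not fit. All table and column names here (raw_load, stage_typed, account_no, quantity) are hypothetical, and the 10-digit rule is only assumed from the example value in the question:

```sql
-- Stage 1 (hypothetical): SQL*Loader fills this table with no cleansing.
CREATE TABLE raw_load (
  account_no  VARCHAR2(4000),
  quantity    VARCHAR2(4000)
);

-- Stage 2 (hypothetical): same rows, proper data types.
CREATE TABLE stage_typed (
  account_no  VARCHAR2(10),
  quantity    NUMBER
);

-- Values that fail the pattern fall through the CASE and become NULL,
-- so TO_NUMBER is only ever applied to strings that are plain integers.
INSERT INTO stage_typed (account_no, quantity)
SELECT CASE
         WHEN REGEXP_LIKE(TRIM(account_no), '^[0-9]{10}$')
         THEN TRIM(account_no)
       END,
       CASE
         WHEN REGEXP_LIKE(TRIM(quantity), '^-?[0-9]+$')
         THEN TO_NUMBER(TRIM(quantity))
       END
FROM   raw_load;
```

Domain rules such as range checks or lookups against reference tables would then run as a further step against stage_typed, where the data types can already be trusted.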

answered 2011-02-04T00:45:19.857

"a column that should contain '9930027130' will show LENGTH(TRIM(col)) = 14, i.e. 4 bytes of junk data."

Do a SELECT DUMP(col) to identify the odd characters. Then decide whether they are always invalid, valid in some cases, or valid but being misinterpreted.
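
A query along these lines would show the raw bytes for the affected rows. The table name my_table is a placeholder, and the 10-character threshold is taken from the example in the question:

```sql
-- Format 1016 makes DUMP return the byte values in hexadecimal
-- together with the character set name.
SELECT col,
       LENGTH(TRIM(col)) AS trimmed_len,
       DUMP(col, 1016)   AS raw_bytes
FROM   my_table
WHERE  LENGTH(TRIM(col)) > 10;
```

If the trailing bytes turn out to be something like 0 (NUL), that would point to terminators or padding from the export rather than real data, but that has to be confirmed from the actual output.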

answered 2011-02-04T02:06:09.743