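I have a data set in which I need to filter on dates that are stored as strings (changing the source column to DateTime is not an option; this data comes from a third-party source that I cannot control).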

One of the dates is malformed, so if I run the following query I get one result:

select ClientID, StartDate from boarding_appts where isdate(StartDate) = 0

ClientID   StartDate
---------- --------------------
5160       5/6/210 12:00:00

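If I run cast(StartDate as datetime) on that value, I get "Arithmetic overflow error converting expression to data type datetime.", which is what I expect. If I filter on IsDate alone, everything works fine: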

select ClientID, cast(StartDate as dateTime) as StartDateCast, datediff(year, cast(StartDate as dateTime), getdate()) as age  from boarding_appts where isdate(StartDate) = 1

ClientID   StartDate               age
---------- ----------------------- ----------
10207      2012-06-09 12:00:00.000 1
2843       2012-06-23 12:00:00.000 1
2843       2012-06-23 12:00:00.000 1
8292       2012-05-11 12:00:00.000 1
7935       2012-04-24 12:00:00.000 1
... (1000's of more rows) ...

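Here is my problem:

I want to filter the records so that only those one year old or newer are shown, but no matter how I apply the filter, every query gives me an arithmetic overflow error.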

select ClientID, cast(StartDate as dateTime) as StartDateCast, datediff(year, cast(StartDate as dateTime), getdate()) as age  
from boarding_appts 
where isdate(StartDate) = 1 
    and datediff(year, cast(StartDate as dateTime), getdate()) < 1 --If you comment out this line it works fine

select * 
from (select ClientID, cast(StartDate as dateTime) as StartDateCast, datediff(year, cast(StartDate as dateTime), getdate()) as age  from boarding_appts where isdate(StartDate) = 1) as Filtered
where age < 1 --If you comment out this line it works fine

select * 
from (select ClientID, cast(StartDate as dateTime) as StartDateCast  from boarding_appts where isdate(StartDate) = 1) as Filtered
where datediff(year, StartDateCast, getdate()) < 1 --If you comment out this line it works fine

;with Filtered as
(select ClientID, cast(StartDate as dateTime) as StartDateCast  from boarding_appts where isdate(StartDate) = 1)
select * from Filtered
where datediff(year, StartDateCast, getdate()) < 1 --If you comment out this line it works fine

;with Filtered as
(select ClientID, cast(StartDate as dateTime) as StartDateCast, datediff(year, cast(StartDate as dateTime), getdate()) as age  from boarding_appts where isdate(StartDate) = 1)
select * from Filtered
where age < 1 --If you comment out this line it works fine

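Here is a set of test data on SQL Fiddle for you to try any solutions against. I am out of ideas on how to solve this. The only workable solution I can think of is to select into a temp table first and then select from that: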

select ClientID, StartDate, cast(StartDate as dateTime) as StartDateCast, datediff(year, cast(StartDate as dateTime), getdate()) as age  
into #t
from boarding_appts 
where isdate(StartDate) = 1 

select * from #t where age < 1 --Works.

2 Answers

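SQL is a declarative language. The SQL optimizer is free to rearrange parts of the where clause, as long as it preserves the original meaning. So it may run datediff before isdate, even if you specify isdate first. A subquery or CTE provides no guaranteed relief, because that can be rewritten as well.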

Aaron Bertrand's second suggestion in the comments:

WHERE   CASE ISDATE(StartDate) 
        WHEN 1 THEN StartDate 
        ELSE '19000101' 
        END >= DATEADD(YEAR, -1, GETDATE());

makes it unlikely that SQL Server will try to convert StartDate for rows where ISDATE = 0. This seems to be the best solution.
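Applied to the query from the question, the same CASE guard might look like the sketch below. The table and column names come from the question; wrapping the cast in the select list as well is an extra precaution assumed here, not something the comment spelled out.

```sql
-- Sketch only: boarding_appts, ClientID and StartDate come from the question.
-- The CASE guard means the cast is only attempted when ISDATE() = 1;
-- all other rows yield NULL, which the comparison then filters out.
select ClientID,
       case when isdate(StartDate) = 1
            then cast(StartDate as datetime)
       end as StartDateCast
from boarding_appts
where case when isdate(StartDate) = 1
           then cast(StartDate as datetime)
      end >= dateadd(year, -1, getdate())
```

Note that this filter is a rolling one-year window. The original datediff(year, ..., getdate()) < 1 counts calendar-year boundaries crossed, so it keeps only dates in the current calendar year or later, which is a slightly different condition.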

I have marked this answer community wiki. If Aaron Bertrand posts an answer, please accept that one :)

answered 2013-05-10T17:12:24.743

SQL Server's DateTime has the domain 1753-01-01 00:00:00.000 ≤ x ≤ 9999-12-31 23:59:59.997. The year 210 CE is outside that domain. Hence the problem.

If you were using SQL Server 2008 or later, you could cast it to a DateTime2 datatype and you'd be golden (its domain is 0001-01-01 00:00:00.0000000 ≤ x ≤ 9999-12-31 23:59:59.9999999). But with SQL Server 2005, you're pretty much SOL.
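For anyone on SQL Server 2012 or later (an assumption; this thread only discusses 2005 and 2008), TRY_CONVERT is another way to sidestep the overflow: it returns NULL when a conversion fails instead of raising an error. A minimal sketch, using the table and columns from the question:

```sql
-- Assumes SQL Server 2012+, where TRY_CONVERT exists.
-- A value such as '5/6/210 12:00:00' should come back as NULL rather than
-- raising an arithmetic overflow, so the comparison simply drops that row.
select ClientID,
       try_convert(datetime, StartDate) as StartDateCast
from boarding_appts
where try_convert(datetime, StartDate) >= dateadd(year, -1, getdate())
```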

This is really a problem of data cleaning. My inclination in cases like this is to load the 3rd party data into a staging table with each field as character strings. Then cleanse the data in place, replacing, for instance, invalid dates with NULL. Once cleansed, then do the necessary conversion work to move it to its final destination.
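A minimal sketch of that staging approach follows. The temp table names and column widths are hypothetical, and the load step is left as a placeholder because it depends on how the third-party data arrives.

```sql
-- Hypothetical staging table: every field is loaded as a character string.
create table #boarding_appts_staging
(
  ClientID  varchar(20) null ,
  StartDate varchar(64) null
)

-- ... bulk load the third-party data into #boarding_appts_staging here ...

-- Cleanse in place: anything ISDATE() rejects becomes NULL.
update #boarding_appts_staging
set    StartDate = null
where  StartDate is not null
  and  isdate(StartDate) = 0

-- After cleansing, every remaining value converts without error.
select ClientID ,
       cast(StartDate as datetime) as StartDate
into   #boarding_appts_clean
from   #boarding_appts_staging
```

Because the bad values are gone before any conversion runs, the optimizer's choice of evaluation order no longer matters for the final select.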

Another approach is to use pattern matching and do the date filtering without converting anything to datetime. ISO 8601 date/time values are character strings that have the laudable property of being (A) human-readable and (B) collating and comparing properly.

What I've done in the past is some analytical work to identify all the patterns in the datetime field by replacing decimal digits with a 'd' and then running group by to compute the counts of each different pattern found. Once you have that you can create some pattern tables to guide you. Something like these:

create table #datePattern
(
  pattern varchar(64) not null primary key clustered ,
  monPos  int         not null ,
  monLen  int         not null ,
  dayPos  int         not null ,
  dayLen  int         not null ,
  yearPos int         not null ,
  yearLen int         not null ,
)

insert #datePattern values ( '[0-9]/[0-9]/[0-9] %'                          ,1,1,3,1,5,1)
insert #datePattern values ( '[0-9]/[0-9]/[0-9][0-9] %'                     ,1,1,3,1,5,2)
insert #datePattern values ( '[0-9]/[0-9]/[0-9][0-9][0-9] %'                ,1,1,3,1,5,3)
insert #datePattern values ( '[0-9]/[0-9]/[0-9][0-9][0-9][0-9] %'           ,1,1,3,1,5,4)
insert #datePattern values ( '[0-9]/[0-9][0-9]/[0-9] %'                     ,1,1,3,2,6,1)
insert #datePattern values ( '[0-9]/[0-9][0-9]/[0-9][0-9] %'                ,1,1,3,2,6,2)
insert #datePattern values ( '[0-9]/[0-9][0-9]/[0-9][0-9][0-9] %'           ,1,1,3,2,6,3)
insert #datePattern values ( '[0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9] %'      ,1,1,3,2,6,4)
insert #datePattern values ( '[0-9][0-9]/[0-9]/[0-9] %'                     ,1,2,4,1,6,1)
insert #datePattern values ( '[0-9][0-9]/[0-9]/[0-9][0-9] %'                ,1,2,4,1,6,2)
insert #datePattern values ( '[0-9][0-9]/[0-9]/[0-9][0-9][0-9] %'           ,1,2,4,1,6,3)
insert #datePattern values ( '[0-9][0-9]/[0-9]/[0-9][0-9][0-9][0-9] %'      ,1,2,4,1,6,4)
insert #datePattern values ( '[0-9][0-9]/[0-9][0-9]/[0-9] %'                ,1,2,4,2,7,1)
insert #datePattern values ( '[0-9][0-9]/[0-9][0-9]/[0-9][0-9] %'           ,1,2,4,2,7,2)
insert #datePattern values ( '[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9] %'      ,1,2,4,2,7,3)
insert #datePattern values ( '[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9] %' ,1,2,4,2,7,4)

create table #timePattern
(
  pattern varchar(64) not null primary key clustered ,
  hhPos int not null ,
  hhLen int not null ,
  mmPos int not null ,
  mmLen int not null ,
  ssPos int not null ,
  ssLen int not null ,
)
insert #timePattern values ( '[0-9]:[0-9]:[0-9]'                ,1,1,3,1,5,1 )
insert #timePattern values ( '[0-9]:[0-9]:[0-9][0-9]'           ,1,1,3,1,5,2 )
insert #timePattern values ( '[0-9]:[0-9][0-9]:[0-9]'           ,1,1,3,2,6,1 )
insert #timePattern values ( '[0-9]:[0-9][0-9]:[0-9][0-9]'      ,1,1,3,2,6,2 )
insert #timePattern values ( '[0-9][0-9]:[0-9]:[0-9]'           ,1,2,4,1,6,1 )
insert #timePattern values ( '[0-9][0-9]:[0-9]:[0-9][0-9]'      ,1,2,4,1,6,2 )
insert #timePattern values ( '[0-9][0-9]:[0-9][0-9]:[0-9]'      ,1,2,4,2,7,1 )
insert #timePattern values ( '[0-9][0-9]:[0-9][0-9]:[0-9][0-9]' ,1,2,4,2,7,2 )

You could combine these two tables into one. That greatly simplifies the query, but the number of pattern combinations tends to explode.
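The analysis step mentioned earlier (masking digits with 'd' and counting each shape) might be sketched like this, assuming the boarding_appts table and StartDate column from the question. Nested REPLACE calls are used so it also runs on older versions of SQL Server.

```sql
-- Mask every decimal digit with 'd', then count how often each shape occurs.
-- Example: '5/6/210 12:00:00' becomes 'd/d/ddd dd:dd:dd'.
select masked.pattern ,
       count(*) as rowCnt
from ( select pattern =
              replace(replace(replace(replace(replace(
              replace(replace(replace(replace(replace(
                StartDate
              ,'0','d'),'1','d'),'2','d'),'3','d'),'4','d')
              ,'5','d'),'6','d'),'7','d'),'8','d'),'9','d')
       from boarding_appts
     ) as masked
group by masked.pattern
order by rowCnt desc
```

The distinct shapes this returns are what would go into #datePattern and #timePattern.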

Once you have that, the query is [fairly] easy, given that SQL is not exactly the world's best language choice for string processing:

---------------------------------------------------------------------
-- first, get your lower bound in ISO 8601 format yyyy-mm-dd hh:mm:ss
-- This will compare/collate properly
---------------------------------------------------------------------
declare @dtLowerBound varchar(255)
set @dtLowerBound = convert(varchar,dateadd(year,-1,current_timestamp),121)

-----------------------------------------------------------------
-- select rows with a start date more recent than the lower bound
-----------------------------------------------------------------
select isoDate =         right( '0000' + substring( t.startDate , coalesce(dt.yearPos,1) , coalesce(dt.YearLen,0) ) , 4 )
                 + '-' + right(   '00' + substring( t.startDate , coalesce(dt.monPos,1)  , coalesce(dt.MonLen,0)  ) , 2 )
                 + '-' + right(   '00' + substring( t.startDate , coalesce(dt.dayPos,1)  , coalesce(dt.dayLen,0)  ) , 2 )
                 + case
                   when tm.pattern is not null then
                       ' ' + right( '00' + substring(ltrim(rtrim( substring(t.startDate,dt.YearPos+dt.YearLen,1+len(t.startDate)-(dt.YearPos+dt.YearLen) ) ) ), tm.hhPos , tm.hhLen ) , 2 )
                     + ':' + right( '00' + substring(ltrim(rtrim( substring(t.startDate,dt.YearPos+dt.YearLen,1+len(t.startDate)-(dt.YearPos+dt.YearLen) ) ) ), tm.mmPos , tm.mmLen ) , 2 )
                     + ':' + right( '00' + substring(ltrim(rtrim( substring(t.startDate,dt.YearPos+dt.YearLen,1+len(t.startDate)-(dt.YearPos+dt.YearLen) ) ) ), tm.ssPos , tm.ssLen ) , 2 )
                   else ''
                   end
,*
from someTableWithBadData t
left join #datePattern dt on t.startDate like dt.pattern
left join #timePattern tm on ltrim(rtrim( substring(t.startDate,dt.YearPos+dt.YearLen,1+len(t.startDate)-(dt.YearPos+dt.YearLen) ) ) )
                             like tm.pattern
where @dtLowerBound <=   right( '0000' + substring( t.startDate , coalesce(dt.yearPos,1) , coalesce(dt.YearLen,0) ) , 4 )
                 + '-' + right(   '00' + substring( t.startDate , coalesce(dt.monPos,1)  , coalesce(dt.MonLen,0)  ) , 2 )
                 + '-' + right(   '00' + substring( t.startDate , coalesce(dt.dayPos,1)  , coalesce(dt.dayLen,0)  ) , 2 )
                 + case
                   when tm.pattern is not null then
                       ' ' + right( '00' + substring(ltrim(rtrim( substring(t.startDate,dt.YearPos+dt.YearLen,1+len(t.startDate)-(dt.YearPos+dt.YearLen) ) ) ), tm.hhPos , tm.hhLen ) , 2 )
                     + ':' + right( '00' + substring(ltrim(rtrim( substring(t.startDate,dt.YearPos+dt.YearLen,1+len(t.startDate)-(dt.YearPos+dt.YearLen) ) ) ), tm.mmPos , tm.mmLen ) , 2 )
                     + ':' + right( '00' + substring(ltrim(rtrim( substring(t.startDate,dt.YearPos+dt.YearLen,1+len(t.startDate)-(dt.YearPos+dt.YearLen) ) ) ), tm.ssPos , tm.ssLen ) , 2 )
                   else ''
                   end

Like I said, SQL is not the best choice for munging strings.

This should get you ... 90% there. Experience tells me that you'll still find more bad dates: months less than 1 or greater than 12, days less than 1 or greater than 31, days out of range for that month (nothing like February 31st to make the computer whine), etc. Old COBOL programs, in particular, loved to use a field of all 9s to indicate missing data (though that is an easy case to deal with).

My preferred technique is to write a Perl script to scrub the data and bulk load it into SQL Server, using Perl's BCP facilities. That's exactly the sort of problem space Perl is designed for.

answered 2013-05-10T19:23:08.717