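I have a dataset with a "Date" column that contains dates in many formats, including:
- 2018.05.07
- 01-Jun-2018
- Reported 01 Jun 2018
- Jun 2018
- 2018
- before 1970
- 1941-1945
- Ca. 1960
There are also invalid dates, such as:
- 190Feb-2010
I am trying to find the entries that have an exact date (day, month, and year) and convert them to datetime. I also need to exclude dates that have "Reported" in the field. Is there a way to filter this kind of data without working out every possible date format beforehand?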
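Use the dateutil library.
The if statement checks whether any part of the date (day, month, or year) is missing, and skips the entry if so.
Use fuzzy=True if you want to extract dates from strings such as "Reported 01 Jun 2018".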
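import datetime
import dateutil.parser

dates = ["2018.05.07","01-Jun-2018","Reported 01 Jun 2018","Jun 2018","2018","before 1970","1941-1945","Ca. 1960","190Feb-2010"]
formated_date = []
for date in dates:
    try:
        # Parse twice with different defaults. Any missing part (day, month, or year)
        # is filled in from the default, so incomplete dates give two different results.
        first = dateutil.parser.parse(date, fuzzy=False, default=datetime.datetime(2015, 1, 1))
        second = dateutil.parser.parse(date, fuzzy=False, default=datetime.datetime(2016, 2, 2))
        if first == second:
            formated_date.append(first)
    except (ValueError, OverflowError):
        # Not parseable as a date; skip it.
        continue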
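To illustrate the fuzzy option mentioned above, here is a minimal sketch that wraps the two-defaults check in a helper. The name parse_full_date and its fuzzy parameter are my own and not part of dateutil, and the sketch assumes python-dateutil is installed.

```python
import datetime

import dateutil.parser

# Two defaults that differ in year, month, and day. dateutil fills any
# missing field from the default, so an incomplete date parses differently.
DEFAULT_A = datetime.datetime(2015, 1, 1)
DEFAULT_B = datetime.datetime(2016, 2, 2)


def parse_full_date(text, fuzzy=False):
    """Hypothetical helper: return a datetime only if day, month, and year are all present."""
    try:
        first = dateutil.parser.parse(text, fuzzy=fuzzy, default=DEFAULT_A)
        second = dateutil.parser.parse(text, fuzzy=fuzzy, default=DEFAULT_B)
    except (ValueError, OverflowError):
        return None  # not parseable as a date
    return first if first == second else None


strict = parse_full_date("Reported 01 Jun 2018")
loose = parse_full_date("Reported 01 Jun 2018", fuzzy=True)
print(strict, loose)
```

With fuzzy=False the "Reported" entry is rejected, which is what the question asks for. With fuzzy=True the surrounding words are ignored and the date is kept.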
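Another solution. This is a brute-force approach that checks every date against every format. Keep adding formats to cover more date layouts. It is a time-consuming approach, though.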
import datetime

dates = ["2018.05.07","01-Jun-2018","Reported 01 Jun 2018","Jun 2018","2018","before 1970","1941-1945","Ca. 1960","190Feb-2010"]
# %b and %B match abbreviated and full month names (%a and %A are weekday names).
formats = ["%Y%m%d","%Y.%m.%d","%Y-%m-%d","%Y/%m/%d","%Y%b%d","%Y.%b.%d","%Y-%b-%d","%Y%B%d","%Y.%B.%d","%Y-%B-%d",
           "%d-%m-%Y","%d.%m.%Y","%d%m%Y","%d/%m/%Y","%d-%b-%Y","%d%b%Y","%d.%b.%Y","%d/%b/%Y"]
formated_date = []
for date in dates:
    for fmt in formats:
        try:
            dt = datetime.datetime.strptime(date, fmt)
        except ValueError:
            continue  # this format does not match; try the next one
        formated_date.append(dt)
        break  # stop at the first matching format to avoid duplicates
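The brute-force idea can also be packaged as a small function that returns None for anything that is not a full date. This is only a sketch using the standard library: the name parse_exact_date and the short FULL_DATE_FORMATS list are hypothetical, and month names such as "Jun" are assumed to be parsed under an English (or default C) locale.

```python
import datetime

# Hypothetical, deliberately short list of full-date formats; extend as needed.
FULL_DATE_FORMATS = ("%Y.%m.%d", "%Y-%m-%d", "%d-%b-%Y", "%d %b %Y")


def parse_exact_date(text, formats=FULL_DATE_FORMATS):
    """Return a datetime if the whole string matches one of the formats, else None."""
    for fmt in formats:
        try:
            return datetime.datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # no match for this format; try the next one
    return None


dates = ["2018.05.07", "01-Jun-2018", "Reported 01 Jun 2018", "Jun 2018", "2018",
         "before 1970", "1941-1945", "Ca. 1960", "190Feb-2010"]

# Map every input to its parsed value so rejected entries stay visible.
results = {text: parse_exact_date(text) for text in dates}
exact = [dt for dt in results.values() if dt is not None]
print(exact)
```

Because strptime has to match the entire string, entries like "Reported 01 Jun 2018" are rejected without any special handling.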
Another option is the datefinder package, which finds dates embedded in free text:
In [1]: string_with_dates = """entries are due by January 4th, 2017 at 8:00pm created 01/15/2005 by ACME Inc. and associates."""
In [2]: import datefinder
In [3]: matches = datefinder.find_dates(string_with_dates)
In [4]: for match in matches:
   ...:     print(match)
2017-01-04 20:00:00
2005-01-15 00:00:00
Hope this helps you find dates in strings that contain them.