# date is a string from the CSV file.
if len(date) == 8 and date.isdigit():
    # Then either the year comes first or it comes last.
    candidates = [
        "-".join([date[0:4], date[4:6], date[6:8]]),  # year first
        "-".join([date[0:2], date[2:4], date[4:8]]),  # year last
    ]
# Only one of these should be a possible date. If both are, and the month
# values don't settle it, then it's too ambiguous anyway.
# I'll do something similar for a string of 6 digits. After this step, every
# digits-only string has been split into separated fields.
# I'll also figure out which field is the year (or make a reasonable guess),
# expand the string to 8 digits, and insert the separators.
date = date.lower() # or an equivalent if it screws with symbols and numbers
for month in full_length_month_list:
    if month in date:
        pass  # I know I can parse this.
for month in three_letter_month_list:
    if month in date:
        pass  # I know I can parse this.
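month_days = {'first': '01', 'second': '02', 'third': '03', ... , 'thirty first': '31'}
for string, number in month_days.items():
    # str.replace returns a new string, so the result has to be assigned back.
    date = date.replace(string, number)
# Remove 'the' before the suffixes; stripping 'th' first would leave a stray 'e'.
date = date.replace('the', '')
for shorthand in ['st', 'nd', 'rd', 'th']:
    date = date.replace(shorthand, '')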
# Then I use a regex matcher to get 3 sets of digits. The longest set is the
# year. The two shorter sets are the day and the month. I get their order
# from the matcher. Then, by looping through, I can check whether either of
# them is ever greater than 12. If so, that one is the day and I know the
# order. I could also track the replacements, so that if jan/january was in
# the original string I already know which field is the month.
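A minimal sketch of the regex step described in the comments above. The names `digit_groups` and `guess_year_index` are hypothetical, not part of the roadmap. It also assumes that ordinal suffixes are stripped only when they directly follow a digit, which is a swapped-in alternative to the plain `replace` calls and leaves words such as "august" intact.

```python
import re

# Hypothetical helpers for illustration only.
# Match st/nd/rd/th only when the suffix directly follows a digit.
ORDINAL_SUFFIX = re.compile(r"(?<=\d)(st|nd|rd|th)\b")
DIGIT_GROUPS = re.compile(r"\d+")


def digit_groups(text):
    """Return the digit groups of a date string in order of appearance,
    or None if the string does not contain exactly three of them."""
    cleaned = ORDINAL_SUFFIX.sub("", text.lower())
    groups = DIGIT_GROUPS.findall(cleaned)
    return groups if len(groups) == 3 else None


def guess_year_index(groups):
    """Return the index of the unique longest group (assumed to be the year),
    or None when two or more groups tie for longest."""
    lengths = [len(g) for g in groups]
    longest = max(lengths)
    if lengths.count(longest) != 1:
        return None
    return lengths.index(longest)
```

When all three groups have the same length, for example "11/03/15", the sketch returns `None` for the year index instead of guessing.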
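This is the roadmap I've come up with. Does it seem reasonable enough? Are there any major flaws or obvious drawbacks? Since it is meant to handle unknown strings, I could use some help turning it into something relatively robust.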
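I can then keep a set of candidate positions, {0, 1, 2}, for each of year, month, and day. For each element I work out the positions it cannot occupy and remove them with set subtraction. If any set becomes {}, the date cannot be read. If any set still has more than one element at the end, I make a local choice between day and month based on a priority rule (perhaps assume that the position with the lower variance is the month?). If only one combination remains, I store it so that I can use it later to read the dates and pass it to the reader.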
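A rough sketch of the set-subtraction idea, under some assumptions that are mine: each row is already split into three digit strings, the year is the longest field in its row, a month is 1 to 12, and a day is 1 to 31. The names `impossible_positions` and `infer_order` are hypothetical. The variance-based tie-break is not implemented; ambiguous cases are returned as sets with more than one element.

```python
def impossible_positions(groups):
    """For one row, map each element to the positions it cannot occupy."""
    impossible = {"year": set(), "month": set(), "day": set()}
    longest = max(len(text) for text in groups)
    for pos, text in enumerate(groups):
        value = int(text)
        if len(text) != longest:  # assumption: the year is the longest field
            impossible["year"].add(pos)
        if len(text) > 2 or not 1 <= value <= 12:
            impossible["month"].add(pos)
        if len(text) > 2 or not 1 <= value <= 31:
            impossible["day"].add(pos)
    return impossible


def infer_order(rows):
    """Return the surviving positions per element, or None if unreadable."""
    candidates = {name: {0, 1, 2} for name in ("year", "month", "day")}
    for groups in rows:
        for name, bad in impossible_positions(groups).items():
            candidates[name] -= bad
    # A position claimed by a resolved element is unavailable to the others.
    changed = True
    while changed:
        changed = False
        for name, positions in candidates.items():
            if len(positions) != 1:
                continue
            for other in candidates:
                if other != name and positions <= candidates[other]:
                    candidates[other] -= positions
                    changed = True
    if any(not positions for positions in candidates.values()):
        return None  # some element has no possible position
    return candidates
```

For example, `infer_order([["2015", "11", "03"], ["2015", "12", "25"]])` resolves every element to a single position, because the 25 in the second row can only be a day.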