
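I am using dateutil.parser.parse(date), which works for individual dates, but the problem is that if I have two dates (e.g. 20/04/2032 and 05/04/1991), it reads them as the 20th of April and the 4th of May respectively.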

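I will be reading columns whose format is consistent within each column, but I won't know that format in advance, because I want to support CSVs from multiple sources.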

Suppose I have lists like these (though in practice they will probably be larger):

  1. ["01/02/2012", "01/04/2012", "01/05/2012", "01/05/2012", "01/25/2012"]
  2. ["01022012", "01042012", "01052012", "01052012", "01252012"]
  3. ["01/2nd/2012", etc...]

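By looking at the list I can see that 01 is definitely the month, but I need this check to be automatic. I can't simply look for a number greater than 12, because that might accidentally catch a substring of the year.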

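The information I need to extract is whether the day comes before or after the month, so that when I go back and parse the strings again later, I know what to grab.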

I can't seem to come up with a clean and efficient way to do this check.


2 Answers


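If you don't know the format, you can't parse it. It really is that simple. Both of these (5/4/91 and 5/4/1991) will pass any check for a valid date; but only you can know whether it means May 4th or April 5th, and you can only know that if you know what the expected format is.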

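In the end, the best you can hope for is a list of validly parsed dates and a list of unparsed (but possibly valid) dates. You then have to check by hand whether the dates in both lists make sense.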

To get these lists:

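import dateutil.parser

maybe_valid, probably_invalid = [], []
for some_date in dates:  # dates: the raw strings from one CSV column
    try:
        maybe_valid.append(dateutil.parser.parse(some_date))
    except ValueError:
        probably_invalid.append(some_date)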
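If a short list of plausible formats is acceptable, the point about needing to know the expected format can be turned around: try each candidate against the whole column and keep only the formats that parse every value. The following is a minimal sketch using the standard library's datetime.strptime rather than dateutil; the names CANDIDATE_FORMATS and formats_matching_column are hypothetical, and the format list is an assumption to be extended for the sources you expect.

```python
from datetime import datetime

# Hypothetical candidate formats (an assumption, not an exhaustive list).
CANDIDATE_FORMATS = ["%m/%d/%Y", "%d/%m/%Y", "%m%d%Y", "%d%m%Y"]


def formats_matching_column(values, formats=CANDIDATE_FORMATS):
    """Return the candidate formats that parse every value in the column."""
    matching = []
    for fmt in formats:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except ValueError:
            # At least one value does not fit this format; discard it.
            continue
        matching.append(fmt)
    return matching
```

If exactly one format survives, the column can be parsed with it. If several survive, the column is genuinely ambiguous and still needs a human decision, which is the situation described above.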
answered 2013-08-04T06:10:06.240
# date is a string from the CSV file.
if len(date) == 8 and date.isdigit():
    # Then the year comes either first or last.
    candidates = ["-".join([date[0:4], date[4:6], date[6:8]]),
                  "-".join([date[0:2], date[2:4], date[4:8]])]
    # Only one should be a possible date. If both are and the month
    # combinations don't sort it out, then it's too ambiguous anyway.

# I'll do something similar for a string of 6 digits. At this point I know that
# all strings consisting only of digits have been separated.
# I'll also figure out which part is the year (or make a reasonable guess and
# expand the number to 8 digits and insert the separators).

date = date.lower()  # or an equivalent if it interferes with symbols and numbers

month_is_named = False
for month in full_length_month_list:
    if month in date:
        # I know I can parse this.
        month_is_named = True

for month in three_letter_month_list:
    if month in date:
        # I know I can parse this.
        month_is_named = True

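import re

month_days = {'first': '01', 'second': '02', 'third': '03',
              # ... (remaining ordinals elided)
              'thirty first': '31'}
# str.replace returns a new string, so the result must be assigned back.
# Longer names go first so that 'thirty first' is not clobbered by 'first'.
for string, number in sorted(month_days.items(), key=lambda kv: -len(kv[0])):
    date = date.replace(string, number)

# Strip ordinal suffixes only when they follow a digit, so 'august' survives.
date = re.sub(r'(?<=\d)(st|nd|rd|th)\b', '', date)

date = date.replace('the', '')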

# Then I use a regex matcher to get 3 sets of digits: the longest set is the
# year. The two shortest sets are the day and the month. I can get their order
# from the matcher. Then I can loop through the column to see whether one of
# them is ever greater than 12. If so, I know the order. I could also track
# the replacements, so that if jan/january was in the original string I
# already know which part is the month anyway.
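
The regex step described in the comment above could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name detect_day_month_order is hypothetical, every value is assumed to contain exactly three digit groups, and the year is assumed to be the only four-digit group (two-digit years are not handled).

```python
import re


def detect_day_month_order(values):
    """Return (day_index, month_index) among the three digit groups,
    or None if the column is malformed or ambiguous."""
    short_positions = None   # positions of the two non-year groups
    day_positions = set()    # positions where a value above 12 was seen
    for value in values:
        groups = re.findall(r"\d+", value)
        years = [i for i, g in enumerate(groups) if len(g) == 4]
        if len(groups) != 3 or len(years) != 1:
            return None
        short = [i for i in range(3) if i != years[0]]
        if short_positions is None:
            short_positions = short
        elif short != short_positions:
            return None  # the year changed position between rows
        day_positions.update(i for i in short if int(groups[i]) > 12)
    if short_positions is None or len(day_positions) != 1:
        return None  # empty column, no value above 12, or a contradiction
    day = day_positions.pop()
    month = next(i for i in short_positions if i != day)
    return day, month
```

For the first example list in the question this yields day at position 1 and month at position 0, while a column containing only 05/04/1991 stays undecided and returns None.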

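This is the potential roadmap I have come up with. Does it seem reasonable enough? Are there any major flaws or immediate drawbacks? Given that it is meant to work on unknown strings, I could use some help turning it into something relatively robust.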

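I could then keep a set {0,1,2} of possible positions for each of year, month and day, go through the column to find the positions that are impossible for a given element (year, month, day), and remove them by set subtraction. If any set reaches {}, the column cannot be read. If any set still has more than one element at the end, I would make a local choice between day and month based on priority (perhaps assuming that whichever has the lower variance is the month?). If only one combination remains, I would store it so that I can later read the dates and pass it on to the reader.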
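
The set-subtraction idea might be sketched as follows. The names ROLES, possible_positions and consistent_assignments are hypothetical, each row is assumed to be already split into three integers, and no value constraint is placed on the year (so two-digit years remain possible); the variance-based tie-break is left out.

```python
from itertools import permutations

ROLES = ("year", "month", "day")


def possible_positions(rows):
    """Start with {0, 1, 2} for each role and subtract impossible positions."""
    possible = {role: {0, 1, 2} for role in ROLES}
    for row in rows:
        for position, number in enumerate(row):
            if not 1 <= number <= 12:
                possible["month"] -= {position}
            if not 1 <= number <= 31:
                possible["day"] -= {position}
    return possible


def consistent_assignments(possible):
    """List every role-to-position mapping that uses each position once."""
    return [
        dict(zip(ROLES, perm))
        for perm in permutations(range(3))
        if all(pos in possible[role] for role, pos in zip(ROLES, perm))
    ]
```

An empty result means the column is unreadable, a single mapping can be stored and handed to the reader, and several mappings mean a priority rule or a manual decision is still needed.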

answered 2013-08-05T02:25:39.817