python - Parsing a CSV file and unable to parse dates

Question

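I'm using dateutil.parser.parse(date), which works for individual dates, but the problem is that if I have two dates (e.g. 20/04/2032 and 05/04/1991), they will be read as April 20 and May 4 respectively.

I will be reading columns whose format is consistent, but I won't know the format of a column in advance, because I want to support CSVs from multiple sources.

Suppose I have lists like these (though in practice they may be much larger):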

["01/02/2012", "01/04/2012", "01/05/2012", "01/05/2012", "01/25/2012"]
["01022012", "01042012", "01052012", "01052012", "01252012"]
["01/2nd/2012", etc...]

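By looking at the list I can see that 01 is definitely the month, but I need this check to be automatic. I can't just check for a number greater than 12, because that might accidentally catch a substring of the year.

The data I need to extract from this is whether the day comes before or after the month. That way, when I go back later and parse the strings again, I know what to grab.

I can't seem to come up with a clean and efficient way to do this check.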

score 2 · Accepted Answer

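If you don't know the format, you can't parse it. It really is that simple. Both of these (5/4/91 and 5/4/1991) will pass any check for a valid date; but only you can know whether it means May 4 or April 5, and you can know that only if you know what the expected format is.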
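A small illustration of that ambiguity, using the standard library's datetime.strptime instead of dateutil so it runs without extra packages. The two format strings are assumptions chosen for the example; the point is only that the same text is a valid date under both, so a validity check alone cannot pick one.

```python
from datetime import datetime

text = "5/4/1991"

# Both calls succeed, but they disagree on what the string means.
day_first = datetime.strptime(text, "%d/%m/%Y")    # read as 5 April 1991
month_first = datetime.strptime(text, "%m/%d/%Y")  # read as 4 May 1991

print(day_first.date(), month_first.date())
```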

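In the end, the best you can hope for is one list of dates that parsed successfully and one list of dates that did not parse (but may still be valid). You then have to check by hand whether the dates in both lists make sense.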

To get these lists:

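import dateutil.parser

maybe_valid, probably_invalid = [], []
for some_date in dates:  # dates: the raw strings from one CSV column
    try:
        maybe_valid.append(dateutil.parser.parse(some_date))
    except (ValueError, OverflowError):
        probably_invalid.append(some_date)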

score 0 · Accepted Answer

# date is a string from the CSV file.
if len(date) == 8 and date.isdigit():
    # Then either the year comes first or it comes last.
    candidates = [
        "-".join([date[0:4], date[4:6], date[6:8]]),  # year first
        "-".join([date[0:2], date[2:4], date[4:8]]),  # year last
    ]
    # Only one should be a possible date. If both are and the month
    # combinations don't sort it out, then it's too ambiguous anyway.

# I'll do something similar for a string of 6 digits. But now I know that all
# strings that are digits only are separated.
# I'll also figure out which part is the year (or make a reasonable guess,
# expand the number to 8 digits, and add the separators).

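import re

date = date.lower()  # or an equivalent if it interferes with symbols and numbers

for month in full_length_month_list:
    if month in date:
        pass  # I know I can parse this.

for month in three_letter_month_list:
    if month in date:
        pass  # I know I can parse this.

# Spelled-out days; the elided entries follow the same pattern.
month_days = {'first': '01', 'second': '02', 'third': '03',  # ... ,
              'thirty first': '31'}
# Replace longer phrases first so 'thirty first' does not become 'thirty 01'.
for string, number in sorted(month_days.items(), key=lambda item: -len(item[0])):
    date = date.replace(string, number)  # str.replace returns a new string

# Strip ordinal suffixes only after a digit, so 'august' keeps its 'st'.
date = re.sub(r'(?<=\d)(st|nd|rd|th)\b', '', date)

date = re.sub(r'\bthe\b', '', date)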

# Then I use a regex matcher to get 3 sets of digits: the longest set is the
# year. The two shorter sets are the day and the month. I can get their order
# from the matcher. Then I can figure out, by looping through the column,
# whether one of them is ever greater than 12. If so, I know the order.
# I could also track the replacements, so that if jan/january was in the
# original string I already know which set is the month.

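This is the potential roadmap I came up with. Does it seem reasonable enough? Are there any major flaws or immediate drawbacks? Given that it is meant for unknown strings, I could use some help turning it into something relatively robust.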

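I can then keep a set {0,1,2} of candidate positions for each of year, month, and day, work out which positions are impossible for a given element (year, month, day), and remove them with set subtraction. If any set reaches {}, the column can't be read. If any set still has more than one element at the end, I'll make a local choice between day and month based on some priority (perhaps assuming that whichever has the lower variance is the month?). If only one combination is left, I'll store it so that I can pass it to the reader when the dates are read later.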
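A minimal sketch of that elimination idea, assuming every string in the column contains exactly three runs of digits and that a year is written with either two or four digits. The function name infer_orders is made up for the example. After the per-position sets are narrowed, the remaining assignments are enumerated with itertools.permutations, so an empty result means the column is unreadable, one result is the stored order, and several results mean a tie-break is still needed.

```python
import re
from itertools import permutations

def infer_orders(column):
    """Return every (year_pos, month_pos, day_pos) consistent with the column."""
    possible = {"year": {0, 1, 2}, "month": {0, 1, 2}, "day": {0, 1, 2}}
    for text in column:
        fields = re.findall(r"\d+", text)
        if len(fields) != 3:
            raise ValueError(f"expected three numeric fields in {text!r}")
        for position, digits in enumerate(fields):
            value = int(digits)
            # A month is 1..12 and a day is 1..31, each at most two digits long.
            if len(digits) > 2 or not 1 <= value <= 12:
                possible["month"].discard(position)
            if len(digits) > 2 or not 1 <= value <= 31:
                possible["day"].discard(position)
            # Assumption: a year is written with two or four digits.
            if len(digits) not in (2, 4):
                possible["year"].discard(position)

    # Keep only assignments that give each element its own allowed position.
    return [
        (year, month, day)
        for year, month, day in permutations(range(3))
        if year in possible["year"]
        and month in possible["month"]
        and day in possible["day"]
    ]

column = ["01/02/2012", "01/04/2012", "01/05/2012", "01/05/2012", "01/25/2012"]
print(infer_orders(column))          # → [(2, 0, 1)]
print(infer_orders(["01/02/2012"]))  # → [(2, 0, 1), (2, 1, 0)]
```

With the first list from the question, the single 25 in the second position rules it out as a month, which leaves month first, day second, year last. A column that never contains a value above 12 stays ambiguous, which is where the priority rule would have to step in.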
