5

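I am trying to determine how best to store my data and work with it, since my application deals with lots of data from different sources, in different time zones, formats, and so on.

For example, should I store everything as UTC? That means when I fetch data I need to determine which time zone it is currently in and, if it is not UTC, do the necessary conversion to make it so. (Note: I'm in EST.)

Then, when performing computations on the data, should I extract it (say it's UTC) and convert it into my time zone (EST), so it makes sense when I'm looking at it? Or should I keep it in UTC and do all my calculations there?

A lot of this data is time series and will be graphed, and the graphs will be in EST.

This is a Python project, so let's say I have a data structure like this: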

"id1": {
    "interval": 60,                            <-- seconds, subDict['interval']
    "last": "2013-01-29 02:11:11.151996+00:00" <-- UTC, subDict['last']
},

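I need to operate on this by determining whether the current time (now()) is greater than last + interval, that is, whether the 60-second duration has elapsed. So in code: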

import datetime
import dateutil.parser
from dateutil import tz

lastTime = dateutil.parser.parse(subDict['last'])
utcNow = datetime.datetime.now(tz.tzutc())  # timezone-aware "now" in UTC

if lastTime + datetime.timedelta(seconds=subDict['interval']) < utcNow:
    print("Time elapsed, do something!")

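Does that make sense? I'm working with UTC everywhere, both for storage and for computation...

Also, if anyone has links to good write-ups on how to work with timestamps in software, I'd love to read them. Something like a Joel on Software article for timestamp usage in applications?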


4 Answers

3

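It seems to me that you're already doing things "the right way". Users will probably expect to interact in their local time zone (input and output), but it's normal to store normalized dates in UTC so that they are unambiguous and calculations are simpler. So, normalize to UTC as early as possible, and localize as late as possible.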
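A minimal sketch of "normalize early, localize late", using only the standard library. The helper names `to_utc` and `for_display` are made up for illustration, and the sketch assumes Python 3.9+ with a system time zone database available to `zoneinfo`:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # assumption: Python 3.9+ and a tz database on the system


def to_utc(text):
    """Parse an ISO 8601 string that carries an offset and normalize it to UTC."""
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Without an offset we cannot know which instant is meant.
        raise ValueError("timestamp has no UTC offset: %r" % text)
    return parsed.astimezone(timezone.utc)


def for_display(moment, zone_name="America/New_York"):
    """Convert an aware datetime to the viewer's zone just before output."""
    return moment.astimezone(ZoneInfo(zone_name))


# Two inputs from different sources that describe the same instant.
a = to_utc("2013-01-29 02:11:11+00:00")
b = to_utc("2013-01-28 21:11:11-05:00")

print(a == b)                      # True
print(for_display(a).isoformat())  # 2013-01-28T21:11:11-05:00
```

Everything between `to_utc` and `for_display` can then work on UTC values only.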

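A small amount of information about Python and time zone handling can be found here:

My current preference is to store dates as Unix timestamp tv_sec values in backend storage and convert them to Python datetime.datetime objects during processing. Processing is usually done with datetime objects in the UTC time zone, which are converted to the local user's time zone just before output. I find that having a rich object such as datetime.datetime helps with debugging.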
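As a rough sketch of that round trip (the function names and the stored value are hypothetical, not taken from any particular backend), whole Unix seconds can be turned into aware UTC `datetime` objects for processing and back again for storage:

```python
from datetime import datetime, timedelta, timezone


def from_storage(epoch_seconds):
    """Turn stored Unix seconds into an aware UTC datetime for processing."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


def to_storage(moment):
    """Turn an aware datetime back into whole Unix seconds for storage."""
    return int(moment.timestamp())


stored_last = 1359425471  # hypothetical stored value: 2013-01-29 02:11:11 UTC
last = from_storage(stored_last)
next_due = last + timedelta(seconds=60)

print(repr(last))                          # rich object, readable while debugging
print(to_storage(next_due) - stored_last)  # 60
```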

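Time zones are a nuisance to deal with, and you probably need to decide on a case-by-case basis whether it's worth the effort to support them correctly.

For example, let's say you're calculating daily counts of bandwidth used. Some questions that may arise are:

  1. What happens on a daylight saving boundary? Should you just assume that a day is always 24 hours for ease of calculation, or do you need to check, for every daily calculation, whether the day has fewer or more hours because of a daylight saving change?
  2. When presenting a localized time, does it matter if a time is repeated? For example, if you display an hourly report in local time without a time zone attached, will it confuse the user to see a missing hour of data, or a repeated hour of data, around daylight saving changes?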
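To make the first question concrete, here is a small sketch that measures how long a local calendar day really is. The function name `local_day_hours` is made up for this example, and it assumes Python 3.9+ with a tz database available to `zoneinfo`. Note that subtracting two aware datetimes that share the same tzinfo object ignores their offsets, so both ends are converted to UTC first:

```python
from datetime import date, datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo  # assumption: Python 3.9+ and a tz database on the system


def local_day_hours(day, zone_name="America/New_York"):
    """Return the real length, in hours, of a local calendar day."""
    zone = ZoneInfo(zone_name)
    start = datetime.combine(day, time.min, tzinfo=zone)
    end = datetime.combine(day + timedelta(days=1), time.min, tzinfo=zone)
    # Convert to UTC before subtracting; same-tzinfo subtraction is wall-clock only.
    elapsed = end.astimezone(timezone.utc) - start.astimezone(timezone.utc)
    return elapsed.total_seconds() / 3600


print(local_day_hours(date(2013, 1, 29)))  # 24.0 (ordinary day)
print(local_day_hours(date(2013, 3, 10)))  # 23.0 (clocks go forward)
print(local_day_hours(date(2013, 11, 3)))  # 25.0 (clocks go back)
```

A daily bandwidth total bucketed by local date therefore covers 23 or 25 hours twice a year, which may or may not matter for the report.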
answered 2013-01-30T04:31:35.600
2

Since, as far as I can see, you don't seem to have any implementation problems, I'll focus on design aspects rather than on code and timestamp formats. I have experience participating in the design of network support for a navigation system implemented as a distributed system on a local network. The nature of that system is such that there is a lot of data (often conflicting) coming from different sources, so resolving conflicts and preserving data integrity is rather tricky. Here are some thoughts based on that experience.

Timestamping data, even in a distributed system with many computers, is usually not a problem as long as you don't need higher resolution than the system time functions provide, or better time synchronization accuracy than your OS components provide.

In the simplest case using UTC is quite reasonable, and for most tasks it's enough. However, it's important to understand the purpose of timestamps in your system from the very beginning of the design. Time values (whether Unix time or formatted UTC strings) may sometimes be equal. If you have to resolve data conflicts based on timestamps (that is, always select the newer (or older) value among several received from different sources), you need to decide whether an incorrectly resolved conflict (usually one that can be resolved in more than one way because the timestamps are equal) is a fatal problem for your system design. The probable options are:

  1. If 99.99% of conflicts are resolved the same way on all nodes, and you don't care about the remaining 0.01% because they don't break data integrity, you may safely continue using something like UTC.

  2. If strict resolution of every conflict is a must, you have to design your own timestamping system. Timestamps may include a time (maybe not system time, but some higher-resolution timer), a sequence number (to produce unique timestamps even when the time resolution is not sufficient), and a node identifier (to let different nodes of your system generate completely unique timestamps).

  3. Finally, what you need may not be time-based timestamps at all. Do you really need to calculate the time difference between a pair of timestamps? Isn't it enough to be able to order timestamps, without tying them to real moments in time? If you only need comparisons, not time calculations, timestamps based on sequential counters rather than on real time are a good choice (see Lamport time for more details).
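For the third option, a Lamport clock can be sketched in a few lines. This is a simplified illustration (the class and method names are my own), showing only the two rules: increment on a local event, and on receipt jump past the sender's counter:

```python
class LamportClock:
    """Logical clock: orders events without reference to real time."""

    def __init__(self):
        self.counter = 0

    def tick(self):
        """Advance for a local event (including sending a message)."""
        self.counter += 1
        return self.counter

    def receive(self, remote_counter):
        """Advance past both our own counter and the sender's counter."""
        self.counter = max(self.counter, remote_counter) + 1
        return self.counter


node_a = LamportClock()
node_b = LamportClock()

sent = node_a.tick()             # node A sends a message stamped 1
received = node_b.receive(sent)  # node B stamps the receipt 2
reply = node_b.tick()            # node B replies with stamp 3
print(sent, received, reply, node_a.receive(reply))  # 1 2 3 4
```

The counters say nothing about elapsed time, but a receive always gets a larger stamp than the corresponding send.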

If you need strict conflict resolution, or if you need very high time resolution, you will probably have to write your own timestamp service.
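One possible shape for such a service, as a sketch only: a stamp made of time, sequence number, and node identifier, compared as a tuple. The names `Stamp` and `StampSource` are hypothetical, and the clock is injectable so the equal-time case can be demonstrated with a frozen clock:

```python
import itertools
import time
from typing import NamedTuple


class Stamp(NamedTuple):
    """Compared field by field: time first, then sequence, then node."""
    time_ns: int
    sequence: int
    node_id: str


class StampSource:
    """Issues stamps that stay unique even when the clock value repeats."""

    def __init__(self, node_id, clock=time.time_ns):
        self.node_id = node_id
        self._clock = clock
        self._sequence = itertools.count()

    def next(self):
        return Stamp(self._clock(), next(self._sequence), self.node_id)


# A frozen clock simulates timer resolution that is too coarse.
frozen = lambda: 1000
node_a = StampSource("node-a", clock=frozen)
node_b = StampSource("node-b", clock=frozen)

first, second = node_a.next(), node_a.next()
other = node_b.next()
print(first < second)  # True: the sequence number breaks the tie on one node
print(first != other)  # True: the node id keeps stamps distinct across nodes
```

This gives every node the same deterministic tie-break, but it does not protect against a clock that jumps backwards; that would need extra handling.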

Many ideas and clues can be borrowed from A. Tanenbaum's book "Distributed Systems: Principles and Paradigms". When I faced such problems it helped me a lot, and it has a separate chapter dedicated to timestamp generation.

answered 2013-01-30T06:09:30.990
1

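I think the best approach is to store all timestamp data as UTC. When you read data in, convert it to UTC immediately; right before display, convert from UTC to your local time zone.

You might even want your code to print every timestamp twice, once in local time and once in UTC... it depends on how much data you need to fit on the screen at once.

I am a big fan of the RFC 3339 timestamp format. It is unambiguous to both humans and machines. The best thing about it is that almost nothing is optional, so it always looks the same: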

2013-01-29T19:46:00.00-08:00
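As an illustration, the timestamp above can be parsed with the standard library alone. This is a sketch under a stated assumption: the format string below only accepts timestamps that include fractional seconds, as this one does, and it needs Python 3.7+ for `%z` to accept an offset with a colon:

```python
from datetime import datetime, timezone

# Assumption: fractional seconds are always present in the input.
RFC3339_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def parse_rfc3339(text):
    """Parse an RFC 3339 timestamp into a timezone-aware datetime."""
    return datetime.strptime(text, RFC3339_FORMAT)


stamp = parse_rfc3339("2013-01-29T19:46:00.00-08:00")
print(stamp.utcoffset())                            # -1 day, 16:00:00
print(stamp.astimezone(timezone.utc).isoformat())   # 2013-01-30T03:46:00+00:00
```

Because the offset is part of the string, the conversion to UTC needs no outside knowledge about where the timestamp came from.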

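I prefer to convert timestamps to single float values for storage and computation, and then convert back to a datetime format for display. I wouldn't keep money in floats, but timestamp values are well within the precision of a float!

Working with time floats makes a lot of code very easy: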

if time_now() >= last_time + interval:
    print("interval has elapsed")

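It looks like you are already doing it pretty much this way, so I can't suggest any dramatic improvements.

I wrote some library functions to parse timestamps into Python time float values, and to convert time float values back into timestamp strings. Maybe something in here will be useful to you: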

http://home.blarg.net/~steveha/pyfeed.html

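I suggest you look at feed.date.rfc3339. It is BSD licensed, so you can just use the code if you like.

EDIT: Question: How does this help with time zones?

Answer: If every timestamp you store is kept in UTC as a Python time float value (seconds since the epoch, with an optional fractional part), you can compare them directly, subtract one from another to find the interval between them, and so on. If you use RFC 3339 timestamps, every timestamp string carries its time zone offset right in the string, so your code can convert it to UTC correctly. If you convert from a float value to a timestamp string right before displaying it, the time zone will be correct for local time.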
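A small sketch of that idea, using the standard library rather than the feed.date.rfc3339 module (the helper names here are made up for illustration): two strings written in different offsets become plain floats that can be subtracted and compared directly.

```python
from datetime import datetime, timezone


def to_time_float(text):
    """Convert an offset-carrying ISO 8601 string to seconds since the epoch."""
    return datetime.fromisoformat(text).timestamp()


def to_timestamp_string(value, tz=timezone.utc):
    """Convert a time float back to a timestamp string in the given zone."""
    return datetime.fromtimestamp(value, tz).isoformat()


# Written with different offsets, one minute apart in real time.
earlier = to_time_float("2013-01-29T19:46:00-08:00")
later = to_time_float("2013-01-30T04:47:00+01:00")

print(later - earlier)               # 60.0
print(later >= earlier + 60)         # True
print(to_timestamp_string(earlier))  # 2013-01-30T03:46:00+00:00
```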

Also, as I said, it looks like the asker is already pretty much doing this, so I don't think I can offer any amazing advice.

answered 2013-01-30T03:54:07.027
1

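Personally I use the Unix time standard. It is very convenient for storage because of its simple representation: it is merely a sequence of digits. Since it internally represents UTC time, you have to make sure to generate it properly (converting from other timestamp formats) before storing it, and then format it for whatever time zone you want.

Once you have a common timestamp format in the backend data (tz aware), plotting the data is very easy: it is just a matter of setting the destination TZ.

As an example: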

import time
import datetime
import pytz

# A Unix timestamp (seconds since the epoch, UTC) as stored in the backend
example = {"id1": {
                   "interval": 60,
                   "last": 1359521160.62
                   }
           }
# Format the timestamp in the system's local time zone
print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(example['id1']['last'])))
# Look up a time zone by ISO country code
countrytz = pytz.country_timezones['BR'][0]
it = pytz.timezone(countrytz)
# Convert from UTC into that zone; localize() on a naive UTC value would
# only relabel the wall-clock time instead of converting it
print(datetime.datetime.fromtimestamp(example['id1']['last'], it))
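For the plotting case, a sketch of turning stored Unix timestamps into axis labels for a chosen destination zone. It swaps in the standard-library `zoneinfo` module instead of pytz (assumption: Python 3.9+ with a tz database available), and `axis_labels` is a hypothetical helper name:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # stdlib alternative to pytz, Python 3.9+


def axis_labels(epoch_values, zone_name):
    """Format Unix timestamps as chart labels in the destination time zone."""
    zone = ZoneInfo(zone_name)
    return [datetime.fromtimestamp(value, zone).strftime("%Y-%m-%d %H:%M")
            for value in epoch_values]


samples = [1359521160.62, 1359521220.62]
print(axis_labels(samples, "America/New_York"))  # ['2013-01-29 23:46', '2013-01-29 23:47']
print(axis_labels(samples, "UTC"))               # ['2013-01-30 04:46', '2013-01-30 04:47']
```

The stored values never change; only the zone name passed at display time does.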
answered 2013-01-30T05:19:24.643