python - Python 3.0: tokenize & BytesIO

Question

When I try to tokenize a string in Python 3.0, why do I get a leading 'utf-8' token before the actual tokens start?

According to the Python 3 docs, tokenize should now be used like this:

g = tokenize(BytesIO(s.encode('utf-8')).readline)

However, when I try this in the terminal, the following happens:

>>> from tokenize import tokenize
>>> from io import BytesIO
>>> g = tokenize(BytesIO('foo'.encode()).readline)
>>> next(g)
(57, 'utf-8', (0, 0), (0, 0), '')
>>> next(g)
(1, 'foo', (1, 0), (1, 3), 'foo')
>>> next(g)
(0, '', (2, 0), (2, 0), '')
>>> next(g)

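What is the 'utf-8' token that comes before the other tokens? Is this supposed to happen? If so, should I always skip the first token?

[Edit]

I found out that token type 57 is tokenize.ENCODING, which can easily be filtered out of the token stream if needed.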
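
For illustration, here is a minimal sketch of that filtering. The helper name tokens_without_encoding is made up for this example and is not part of the tokenize module; tokens are indexed by position so the code works whether they are plain tuples or named tuples.

```python
from io import BytesIO
from tokenize import ENCODING, tokenize


def tokens_without_encoding(source):
    """Yield the tokens of `source` (a str), skipping the ENCODING token.

    Hypothetical helper for illustration only.
    """
    readline = BytesIO(source.encode('utf-8')).readline
    for tok in tokenize(readline):
        # tok[0] is the token type; ENCODING is always emitted first.
        if tok[0] != ENCODING:
            yield tok


toks = list(tokens_without_encoding('foo'))
for tok in toks:
    print(tok)
```

Comparing against the ENCODING constant rather than the literal 57 is probably safer, since the numeric value of a token type may differ between Python versions.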

score 2 · Accepted Answer

That's the encoding cookie of the source. You can specify one explicitly:

# -*- coding: utf-8 -*-
do_it()

Otherwise Python assumes the default encoding, which is utf-8 in Python 3.
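
As a rough sketch of how the cookie affects what gets reported, tokenize.detect_encoding can be called on sources with and without one. The two byte strings below are made-up inputs for this example, and latin-1 is chosen only to differ from the default.

```python
from io import BytesIO
from tokenize import detect_encoding

# Made-up sample sources: one with an explicit encoding cookie, one without.
with_cookie = b"# -*- coding: latin-1 -*-\ndo_it()\n"
without_cookie = b"do_it()\n"

# detect_encoding returns (encoding_name, lines_it_had_to_read).
enc_with, _ = detect_encoding(BytesIO(with_cookie).readline)
enc_without, _ = detect_encoding(BytesIO(without_cookie).readline)

print(enc_with)
print(enc_without)  # falls back to 'utf-8' when no cookie or BOM is present
```

Note that the reported name is normalized, so it may not match the cookie's spelling exactly. tokenize() reports the same detected name as the string of its first (ENCODING) token.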
