python - Reading the contents of a tarfile into Python - "seeking backwards is not allowed"

Question

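I'm new to Python. I'm having trouble reading the contents of a tarfile into Python.

The data are the contents of a journal article (hosted at PubMed Central). See the information below, including a link to the tarfile I want to read into Python.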

http://www.pubmedcentral.nih.gov/utils/oa/oa.fcgi?id=PMC13901
ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz

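I have a list of similar .tar.gz files that I eventually want to read in as well. I think (know) that every tarfile has an .nxml file associated with it. The contents of that .nxml file are what I'm actually interested in extracting and reading. I'm open to any suggestions on the best way to do this...

If I save the tarfile to my PC, this is what I have. Everything runs as expected.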

import tarfile

tarfile_name = "F:/PMC_OA_TextMining/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
tfile = tarfile.open(tarfile_name)

tfile_members = tfile.getmembers()

tfile_members1 = []
for i in range(len(tfile_members)):
    tfile_members_name = tfile_members[i].name
    tfile_members1.append(tfile_members_name)

tfile_members2 = []
for i in range(len(tfile_members1)):
    if tfile_members1[i].endswith('.nxml'):
        tfile_members2.append(tfile_members1[i])

tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()

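I learned today that in order to access the tarfile directly from PubMed Central's FTP site, I have to use urllib. Below is the revised code (and a link to the Stack Overflow answer I received):

Read contents of .tar.gz file from website into a python 3.x object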

tarfile_name = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
ftpstream = urllib.request.urlopen(tarfile_name)
tfile = tarfile.open(fileobj=ftpstream, mode="r|gz")

However, when I run the rest of the code (below), I get an error message ("seeking backwards is not allowed"). How come?

tfile_members = tfile.getmembers()

tfile_members1 = []
for i in range(len(tfile_members)):
    tfile_members_name = tfile_members[i].name
    tfile_members1.append(tfile_members_name)

tfile_members2 = []
for i in range(len(tfile_members1)):
    if tfile_members1[i].endswith('.nxml'):
        tfile_members2.append(tfile_members1[i])

tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()

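The code fails on the last line, where I try to read the .nxml content associated with my tarfile. Below is the actual error message I receive. What does it mean? What is the best workaround for reading/accessing the contents of these .nxml files, which are all embedded in tarfiles?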

Traceback (most recent call last):
  File "F:\PMC_OA_TextMining\test2.py", line 135, in <module>
    tfile_extract1_text = tfile_extract1.read()
  File "C:\Python30\lib\tarfile.py", line 804, in read
    buf += self.fileobj.read()
  File "C:\Python30\lib\tarfile.py", line 715, in read
    return self.readnormal(size)
  File "C:\Python30\lib\tarfile.py", line 722, in readnormal
    self.fileobj.seek(self.offset + self.position)
  File "C:\Python30\lib\tarfile.py", line 531, in seek
    raise StreamError("seeking backwards is not allowed")
tarfile.StreamError: seeking backwards is not allowed

Thanks in advance for your help. Chris

score 13 · Accepted Answer

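What went wrong: Tar files are stored interleaved. They come in the order header, data, header, data, header, data, and so on. When you enumerated the files with getmembers(), you had already read through the entire file to get the headers. Then, when you asked the tarfile object to read the data, it tried to seek backward from the last header to the first data. But you can't seek backward in a network stream without closing and reopening the urllib request.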
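
The failure can be reproduced without a network connection. The following is a minimal sketch, not the poster's code: it builds a small gzipped archive in memory (the member names `figure.txt` and `article.nxml` are made up for illustration), opens it in stream mode `r|gz`, and calls `getmembers()` before reading. Reopening the same seekable buffer in `r:gz` mode succeeds.

```python
import io
import tarfile


def build_archive() -> bytes:
    """Return a gzipped tar archive with two small, made-up members."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, payload in [("figure.txt", b"figure data"),
                              ("article.nxml", b"<article/>")]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


archive = build_archive()

# Stream mode ("r|gz"): the data can only be consumed front to back.
stream_error = None
with tarfile.open(fileobj=io.BytesIO(archive), mode="r|gz") as tar:
    # getmembers() reads every header, which moves past all the member data.
    names = [m.name for m in tar.getmembers()]
    try:
        # Reading a member now would require going back in the stream.
        tar.extractfile("article.nxml").read()
    except tarfile.StreamError as exc:
        stream_error = str(exc)

# Random-access mode ("r:gz") on a seekable object: going back is fine.
with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
    tar.getmembers()
    nxml_bytes = tar.extractfile("article.nxml").read()

print(stream_error)
print(nxml_bytes)
```

The only difference between the two `open()` calls is the separator in the mode string, which is why the same archive fails in one case and works in the other.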

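How to work around it: You need to download the file, save a temporary copy to disk or to an in-memory buffer (BytesIO in Python 3), enumerate the files in this temporary copy, and then extract the files you want.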

#!/usr/bin/env python3
from io import BytesIO
import urllib.request
import tarfile

tarfile_url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
ftpstream = urllib.request.urlopen(tarfile_url)

# BytesIO creates an in-memory temporary file.
# See the Python manual: http://docs.python.org/3/library/io.html
tmpfile = BytesIO()
while True:
    # Download a piece of the file from the connection
    s = ftpstream.read(16384)

    # Once the entire file has been downloaded, read() returns b''
    # (the empty bytes object), which is a falsy value
    if not s:
        break

    # Otherwise, write the piece of the file to the temporary file.
    tmpfile.write(s)
ftpstream.close()

# Now that the FTP stream has been downloaded to the temporary file,
# we can ditch the FTP stream and have the tarfile module work with
# the temporary file.  Begin by seeking back to the beginning of the
# temporary file.
tmpfile.seek(0)

# Now tell the tarfile module that you're using a file object
# that supports seeking backward.
# r|gz forbids seeking backward; r:gz allows seeking backward
tfile = tarfile.open(fileobj=tmpfile, mode="r:gz")

# You want to limit it to the .nxml files
tfile_members2 = [filename
                  for filename in tfile.getnames()
                  if filename.endswith('.nxml')]

tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()

# And when you're done extracting members:
tfile.close()
tmpfile.close()
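
If holding the whole archive in memory is undesirable, stream mode can still work as long as each member is read at the moment it is encountered, before moving on to the next header. The sketch below is an alternative to the code above, not part of the original answer. The helper name `first_nxml_from_stream` is hypothetical, and a `BytesIO` stands in for the `urlopen()` response so that the example runs offline.

```python
import io
import tarfile


def first_nxml_from_stream(fileobj):
    """Return (name, bytes) of the first .nxml member, or None if absent.

    fileobj only needs a read() method; it is never asked to seek backward.
    """
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        for member in tar:  # yields members in archive order
            if member.isfile() and member.name.endswith(".nxml"):
                # Read now, while the stream is positioned at this member.
                return member.name, tar.extractfile(member).read()
    return None


# Stand-in for the network stream: a tiny archive built in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    for name, payload in [("fig1.txt", b"figure"),
                          ("paper.nxml", b"<article/>")]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
buf.seek(0)

result = first_nxml_from_stream(buf)
print(result)
```

With a real download, the object returned by `urllib.request.urlopen(tarfile_url)` would presumably be passed in place of `buf`. The trade-off is that members can only be visited once and in order, so this suits the "one .nxml per archive" case better than repeated lookups.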

score 0 · Accepted Answer

I got the same error when trying to requests.get the file, so I extracted everything to a tmp directory instead of using BytesIO or extractfile(member):

import os
import tarfile
from lzma import LZMAFile

# stream is the file-like object obtained via requests.get
inputs = [tarfile.open(fileobj=LZMAFile(stream), mode='r|')]
t = "/tmp"
for tarfileobj in inputs:
    tarfileobj.extractall(path=t, members=None)
for fn in os.listdir(t):
    with open(os.path.join(t, fn)) as payload:
        print(payload.read())
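
The same idea can be kept tidier with a dedicated temporary directory, so that only the extracted files are listed and everything is cleaned up afterwards. This is a sketch under assumptions, not the answerer's code: the helper `extract_stream_to_dir` is hypothetical, the sample archive is gzip-compressed rather than LZMA-compressed, and a `BytesIO` stands in for the HTTP response.

```python
import io
import os
import tarfile
import tempfile


def extract_stream_to_dir(fileobj, dest):
    """Extract every member of a gzipped tar stream into dest, in order."""
    # Use the "data" extraction filter where this Python version supports it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        tar.extractall(path=dest, **kwargs)


# Stand-in for the downloaded stream: a one-member archive built in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    payload = b"<article/>"
    info = tarfile.TarInfo("paper.nxml")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
buf.seek(0)

contents = {}
with tempfile.TemporaryDirectory() as tmpdir:
    extract_stream_to_dir(buf, tmpdir)
    for fn in sorted(os.listdir(tmpdir)):
        # Read as bytes so no text encoding has to be assumed.
        with open(os.path.join(tmpdir, fn), "rb") as extracted:
            contents[fn] = extracted.read()

print(contents)
```

Extraction works in stream mode because `extractall()` writes the members out in archive order, so nothing has to be revisited.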

score 0 · Accepted Answer

A very simple solution is to change how tarfile reads the file. Instead of:

tfile = tarfile.open(tarfile_name)

change it to:

with tarfile.open(fileobj=f, mode='r:*') as tar:

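The important part is putting the ':' in the mode. A ':' selects random-access reading, whereas '|' selects stream mode, which cannot seek backward; random access requires f to be a seekable file object.

You can also see this answer for more information.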
