python - How do I read the contents of files returned by os.walk?

Question

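Using Python 3, I want to os.walk a directory of files, read them into binary objects (strings?), and do some further processing on them. The first step, though: how do I read the files that os.walk returns?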

# NOTE: Execute with python3.2.2

import os
import sys

path = "/home/user/my-files"

count = 0
successcount = 0
errorcount = 0
i = 0

# for directory in dirs
for (root, dirs, files) in os.walk(path):
    # print(path)
    print(dirs)
    # print(files)

    for file in files:

        base, ext = os.path.splitext(file)
        fullpath = os.path.join(root, file)

        # Read the file into binary? --------
        input = open(fullpath, "r")
        content = input.read()
        length = len(content)
        count += 1
        print("    file: ---->", base, " / ", ext, " [count:", count, "]", "[length:", length, "]")
        print("fullpath: ---->", fullpath)

Error:

Traceback (most recent call last):
  File "myFileReader.py", line 41, in <module>
    content = input.read()
  File "/usr/lib/python3.2/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 11: invalid continuation byte

score 9 · Accepted Answer

To read a binary file, you must open the file in binary mode. Change

input = open(fullpath, "r")

to

input = open(fullpath, "rb")

The result of read() will then be a bytes object.
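
Putting that change into the loop from the question, a minimal sketch might look like the following. The helper name iter_file_bytes is made up for illustration, and the with statement is an addition here so that each file is closed after reading.

```python
import os


def iter_file_bytes(top):
    """Yield (fullpath, content) for every file under top.

    iter_file_bytes is a hypothetical helper name; content is a bytes
    object because each file is opened in binary mode ("rb").
    """
    for root, dirs, files in os.walk(top):
        for name in files:
            fullpath = os.path.join(root, name)
            # "rb" skips text decoding entirely, so no UnicodeDecodeError
            with open(fullpath, "rb") as f:
                yield fullpath, f.read()
```

Calling len() on each yielded content then gives the file size in bytes rather than a count of decoded characters.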

score 3 · Accepted Answer

Because some of your files are binary, they cannot be decoded into the Unicode characters that Python 3 uses for all strings. A major change between Python 2 and Python 3 is that str went from a sequence of bytes to a sequence of Unicode code points, so a character can no longer be treated as a single byte. (Yes, text strings can take more memory than in Python 2: Python 3.2 stores each character in 2 or 4 bytes depending on how the interpreter was built, and Python 3.3 and later use 1, 2, or 4 bytes per character depending on the string's contents.)

You thus have a number of options that will depend upon your project:

Ignore binary files, filtering by file extension.
Read the binary files and either catch the decoding exception if and when it occurs and skip the file, or use one of the methods described in the thread "How can I detect if a file is binary (non-text) in python?"
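
As a rough sketch of the filtering and detection options, the two helpers below show an extension check and a content check. The names has_text_extension, looks_binary, and TEXT_EXTENSIONS are hypothetical, the extension list is only an example, and the NUL-byte test is one common heuristic rather than a reliable detector (UTF-16 text, for instance, contains NUL bytes).

```python
import os

# Hypothetical whitelist; adjust to the file types in your own tree.
TEXT_EXTENSIONS = {".txt", ".py", ".md", ".csv"}


def has_text_extension(filename):
    """Return True if the file extension is in the whitelist."""
    ext = os.path.splitext(filename)[1].lower()
    return ext in TEXT_EXTENSIONS


def looks_binary(fullpath, sample_size=1024):
    """Heuristic (an assumption, not a guarantee): treat a file as
    binary if its first sample_size bytes contain a NUL byte."""
    with open(fullpath, "rb") as f:
        return b"\0" in f.read(sample_size)
```

Either check can be called inside the os.walk loop before opening a file in text mode.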

In this vein, you may edit your solution to simply catch the UnicodeDecodeError and skip the file.
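
A minimal sketch of the catch-and-skip approach, assuming the files are expected to be UTF-8 (the function name read_text_files and the returned pair are made up for illustration):

```python
import os


def read_text_files(top, encoding="utf-8"):
    """Read every decodable file under top as text.

    Returns (texts, skipped): a dict mapping path to decoded content,
    and a list of paths that raised UnicodeDecodeError.
    """
    texts = {}
    skipped = []
    for root, dirs, files in os.walk(top):
        for name in files:
            fullpath = os.path.join(root, name)
            try:
                with open(fullpath, "r", encoding=encoding) as f:
                    texts[fullpath] = f.read()
            except UnicodeDecodeError:
                # Not valid text in this encoding: record it and move on
                skipped.append(fullpath)
    return texts, skipped
```

Keeping the list of skipped paths makes it easy to see afterwards which files were treated as binary.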

Regardless of your decision, note that if the files on your system use a wide range of character encodings, you will need to specify the encoding explicitly: in text mode, open() defaults to the locale's preferred encoding, which is UTF-8 on your system judging by the traceback.
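
For illustration, passing the encoding explicitly could look like this. The wrapper name read_with_encoding is hypothetical; encoding and errors are real parameters of the built-in open().

```python
def read_with_encoding(fullpath, encoding="utf-8", errors="strict"):
    """Read a text file with an explicit encoding.

    errors="strict" raises UnicodeDecodeError on bad bytes;
    errors="replace" substitutes U+FFFD for them instead.
    """
    with open(fullpath, "r", encoding=encoding, errors=errors) as f:
        return f.read()
```

A file saved as Latin-1, for example, can be read with read_with_encoding(fullpath, encoding="latin-1"). Passing errors="replace" is a lossy fallback for when the real encoding is unknown.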

As a reference, a great presentation on Python 3 I/O: http://www.dabeaz.com/python3io/MasteringIO.pdf
