python - Extracting data from a file in Python

Question

I have a text file that contains, say, just this:

text1 text2 text text
text text text text

I want to first count the number of strings in the file (all of them separated by spaces) and then output the first two (text1 text2).

Any ideas?

Thanks in advance for your help.

Edit: this is what I have so far:

>>> f=open('test.txt')
>>> for line in f:
    print line
ï»¿text1 text2 text text text text hello
>>> words=line.split()
>>> words
['\xef\xbb\xbftext1', 'text2', 'text', 'text', 'text', 'text', 'hello']
>>> len(words)
7
>>> if len(words) > 2:
    print "there are more than 2 words"

The first problem I ran into is that my text file contains: text1 text2 text text text

But when I print the words, I get: ['\xef\xbb\xbftext1', 'text2', 'text', 'text', 'text', 'text', 'hello']

Where does the '\xef\xbb\xbf' come from?

score 17 · Accepted Answer

To read a file line by line, just loop over the open file object in a for loop:

for line in open(filename):
    # do something with line

To split a line on whitespace into a list of separate words, use str.split():

words = line.split()
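
As a small illustration of why the no-argument form is the convenient one here (the sample line below is made up, not taken from the asker's file):

```python
# Hypothetical sample line, as it would come out of a file loop
# (note the repeated spaces, the tab and the trailing newline).
line = "text1 text2   text\ttext\n"

# With no arguments, split() breaks on any run of whitespace and
# ignores leading and trailing whitespace, newline included.
words = line.split()
print(words)

# Splitting on a single space instead keeps empty strings, the tab
# and the trailing newline in the result.
naive = line.split(" ")
print(naive)
```

So there is no need to strip the newline from each line before splitting.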

To count the number of items in a Python list, use len(yourlist):

count = len(words)

To select the first two items from a Python list, use slicing:

firsttwo = words[:2]
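
One property of slicing that may be useful here: unlike indexing, it does not fail on a list that is too short. A quick sketch with made-up lists:

```python
words = ["text1", "text2", "text", "text"]
firsttwo = words[:2]          # the first two items

# A slice simply returns whatever is available, so a short or empty
# list does not raise IndexError the way words[1] would.
short = ["only"]
short_slice = short[:2]       # one item
empty_slice = [][:2]          # no items

print(firsttwo, short_slice, empty_slice)
```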

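I'll leave building the complete program to you, but you won't need much more than the above, plus an if statement to check whether you already have your two words.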
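
For readers who want to see the pieces put together, here is one possible sketch. It is written for Python 3, the function name count_and_first_two is made up for this example, and it assumes the file is UTF-8 text without a BOM:

```python
def count_and_first_two(path):
    """Return (total number of words, first two words) for a text file.

    Hypothetical helper; assumes UTF-8 text with no BOM at the start.
    """
    words = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Collect the whitespace-separated words of every line.
            words.extend(line.split())
    return len(words), words[:2]
```

Calling count_and_first_two('test.txt') would then give you the count and the first two words in one go. Collecting every word in a list is fine for small files; for a very large file you could keep a running count and stop storing words once you have two.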

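The three extra bytes you see at the start of your file are the UTF-8 BOM (Byte Order Mark); it marks your file as UTF-8 encoded, but it is redundant and only really used on Windows.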

You can remove it with:

import codecs
if line.startswith(codecs.BOM_UTF8):
    line = line[3:]
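
The snippet above relies on Python 2, where lines read from a file are byte strings. On Python 3, codecs.BOM_UTF8 is a bytes object, so the same check only works on data read in binary mode. A rough equivalent, with a made-up helper name strip_utf8_bom:

```python
import codecs

def strip_utf8_bom(raw):
    """Drop a leading UTF-8 BOM from a bytes object, if there is one.

    Hypothetical helper for illustration only.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw

# Bytes as they would be read from a file opened with mode 'rb'.
raw_line = b"\xef\xbb\xbftext1 text2 text\n"
words = strip_utf8_bom(raw_line).decode("utf-8").split()
print(words)
```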

You probably want to decode your strings to unicode using that encoding:

line = line.decode('utf-8')

You can also open the file with codecs.open():

file = codecs.open(filename, encoding='utf-8')

Note that codecs.open() will not strip the BOM for you; the easiest way to do that is to use .lstrip():

import codecs
BOM = codecs.BOM_UTF8.decode('utf8')
with codecs.open(filename, encoding='utf-8') as f:
    for line in f:
        line = line.lstrip(BOM)
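
An alternative that is not part of the answer above: Python also ships a 'utf-8-sig' codec, which skips a leading BOM while decoding, so no manual stripping is needed. A minimal Python 3 sketch, where read_words is a made-up name:

```python
def read_words(path):
    """Return all whitespace-separated words in a UTF-8 file.

    Hypothetical helper; 'utf-8-sig' skips a BOM at the start of the
    file if present and otherwise behaves like plain 'utf-8'.
    """
    words = []
    with open(path, encoding="utf-8-sig") as f:
        for line in f:
            words.extend(line.split())
    return words
```

Because the codec also accepts files without a BOM, the same call should work whether or not the editor that saved the file added one.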
