I have two documents. The first is a .txt file containing a dictionary laid out like this:
Box OB
Table OB
Tiger AN
Lion AN
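The second document is a .txt file containing a long text, like this one:
In a box. In that box there are a lion and a tiger.
I want to list how many times each word from my dictionary occurs in my text.
Something like this: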
Box: 2
Lion: 1
Tiger: 1
This is what I did:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import codecs
file = codecs.open("MYtext.txt",'r','utf-8')
text = file.readlines()
line_list = []
for line in text:
    line.rstrip('\n')
    line_list.append(line)
d = {}
import nltk
with open("MYdict.txt",) as mydict:
    for line in mydict:
        (key, val) = line.split()
    dictionary = dict(line.strip().split(None, 1) for line in mydict)
line_counter = 0
for line in line_list:
    line_counter = line_counter + 1
    for word in line.split():
        if word in line_list in dictionary.keys():
            line_list = dictionary[word]
            line_list.append(line_counter)
            dictionary[word] = line_list
        else:
            line_list = []
            line_list.append(line_counter)
            dictionary[word] = line_list
for key in sorted(dictionary.keys()):
    print key, len(dictionary[key])
I get this error:
$ /var/folders/3h/w3_12zfs7hs6zcrlnpk8gdg40000gn/T/Cleanup\ At\ Startup/test\ 44-405955317.432.py.command ; exit;
Traceback (most recent call last):
File "/private/var/folders/3h/w3_12zfs7hs6zcrlnpk8gdg40000gn/T/Cleanup At Startup/test 44-405955317.367.py", line 33, in <module>
for key in sorted(dictionary.keys()):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
logout
[Process completed]
Can you help? I'm new to this. I'm a linguist, not a programmer.
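A note on the likely cause, judging from the traceback: under Python 2, codecs.open returns Unicode strings for the text, while the plain open() call returns byte strings for the dictionary file, so the keys of dictionary end up as a mix of both types. Sorting that mix makes Python 2 try to decode the byte strings as ASCII, which fails on the first non-ASCII byte (0xc3 is the lead byte of many accented letters in UTF-8).

The sketch below is one possible way to do the count, not the only one. It assumes both files are UTF-8, that matching should ignore case (so "box" counts toward "Box"), and that words with no hits are left out, as in the desired output above. The names load_dictionary and count_dictionary_words are made up for this illustration. It is written for Python 3 and should also run on Python 2.7.

```python
from __future__ import print_function, unicode_literals

import io
import re
from collections import Counter


def load_dictionary(path, encoding="utf-8"):
    """Read 'word TAG' lines into a dict {word: tag}, decoded to Unicode."""
    entries = {}
    with io.open(path, "r", encoding=encoding) as handle:
        for line in handle:
            parts = line.split(None, 1)
            if len(parts) == 2:  # skip blank or malformed lines
                entries[parts[0]] = parts[1].strip()
    return entries


def count_dictionary_words(text, words):
    """Count case-insensitive occurrences of each dictionary word in text."""
    # \w+ with re.UNICODE keeps accented letters inside a single token
    tokens = Counter(re.findall(r"\w+", text.lower(), re.UNICODE))
    # Keep only dictionary words that occur at least once
    return {word: tokens[word.lower()]
            for word in words if tokens[word.lower()]}


sample = "In a box. In that box there are a lion and a tiger."
counts = count_dictionary_words(sample, ["Box", "Table", "Tiger", "Lion"])
for word in sorted(counts):
    print("{}: {}".format(word, counts[word]))
# Box: 2
# Lion: 1
# Tiger: 1
```

With the real files, the same function could be called as count_dictionary_words(io.open("MYtext.txt", encoding="utf-8").read(), load_dictionary("MYdict.txt")). Because everything is decoded to Unicode on the way in, sorting the keys no longer mixes byte strings with Unicode strings.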