
I have 2 documents. The first is a .txt file containing a dictionary laid out like this:

Box OB
Table OB
Tiger AN
Lion AN

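The second document is a .txt file containing a long text, like this one:

In a box. In that box there is a lion and a tiger.

I want to list how many times each word from my dictionary occurs in my text.

Something like this: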

Box: 2 
Lion: 1 
Tiger: 1

This is what I did:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import codecs


file = codecs.open("MYtext.txt",'r','utf-8')
text = file.readlines()
line_list = []

for line in text:
    line.rstrip('\n')
    line_list.append(line)

d = {}
import nltk 
with open("MYdict.txt",) as mydict:
    for line in mydict:
        (key, val) = line.split()
        dictionary = dict(line.strip().split(None, 1) for line in mydict)

line_counter = 0
for line in line_list:
    line_counter = line_counter + 1

for word in line.split():
    if word in line_list in dictionary.keys():
        line_list = dictionary[word]
        line_list.append(line_counter)
        dictionary[word] = line_list
    else:
        line_list = []
        line_list.append(line_counter)
        dictionary[word] = line_list
for key in sorted(dictionary.keys()):
    print key, len(dictionary[key])

I get this error:

    $ /var/folders/3h/w3_12zfs7hs6zcrlnpk8gdg40000gn/T/Cleanup\ At\ Startup/test\ 44-405955317.432.py.command ; exit;
Traceback (most recent call last):
  File "/private/var/folders/3h/w3_12zfs7hs6zcrlnpk8gdg40000gn/T/Cleanup At Startup/test 44-405955317.367.py", line 33, in <module>
    for key in sorted(dictionary.keys()):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
logout

[Process completed]

Can you help? I'm new to this. I'm a linguist, not a programmer.


2 Answers


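The error you're getting is probably related to "MYdict.txt". I think you can solve it by opening that file with codecs.open and the 'utf-8' flag, just as you did for the other file.

If I understand correctly what you'd like to do, I think I would approach it like this: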
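As a rough illustration of where that message comes from (a sketch with an invented dictionary entry, not your actual data): byte 0xc3 is the first byte of many accented letters in UTF-8, and the ascii codec cannot decode it. The word "Löwe" below is purely hypothetical.

```python
# Hypothetical dictionary line; "L\u00f6we" (Löwe) is an invented example entry.
raw = u"L\u00f6we AN\n".encode("utf-8")   # the bytes as they would sit on disk

try:
    raw.decode("ascii")                   # decoding as ASCII fails on byte 0xc3
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding as UTF-8 is the step codecs.open(..., 'utf-8') performs for you.
decoded = raw.decode("utf-8")
key, tag = decoded.split()

print(ascii_error)
print(key, tag)
```

Once both files are decoded the same way, the dictionary keys and the words from the text are the same kind of string and can be compared and sorted safely.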

import codecs

with codecs.open('MYdict.txt',  'r', 'utf-8') as f:
    wordslist = [line.split()[0].lower() for line in f]

with codecs.open('MYtext.txt',  'r', 'utf-8') as f:
    text = f.read().lower()

counts = {}
for word in wordslist:
    counts[word] = text.count(word)


# alternatively instead of the last 3 lines
# you can use a "dictionary comprehension"
counts = {word: text.count(word) for word in wordslist}

To pretty-print the output, you can use:

import pprint

pprint.pprint(counts)
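One caveat with text.count: it counts substrings, so "lion" is also found inside "lions". If you need whole words only, here is a minimal sketch using re and collections.Counter. The helper name count_whole_words and the sample sentence are my own, for illustration only.

```python
import re
from collections import Counter


def count_whole_words(text, words):
    """Count whole-word, case-insensitive occurrences of each word in text."""
    # \w+ splits the text into runs of letters/digits, dropping punctuation.
    tokens = re.findall(r"\w+", text.lower(), re.UNICODE)
    totals = Counter(tokens)
    # Counter returns 0 for words that never occur.
    return {word: totals[word.lower()] for word in words}


sample = "In a box. In that box there is a lion and a tiger. Two lions!"
counts = count_whole_words(sample, ["Box", "Table", "Tiger", "Lion"])
print(counts)
```

Here "Lion" is counted once, whereas sample.lower().count("lion") would report 2 because of "lions".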
answered 2013-11-12T13:50:19.027

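There are several problems in your code that are not, at first, related to the error you're getting.

You should group your imports at the top of the file. The import nltk line should not be in the middle of the code.

You should process the dictionary first. On that point, you have an outer loop (for line in mydict) and then another loop inside it (actually a generator expression) over the same file. That's not good: the outer loop takes the first line, the inner expression consumes all the remaining lines, and the first entry never makes it into the dictionary. You can simply use: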

with open("MYdict.txt",) as mydict:
    dictionary = dict(line.strip().split(None, 1) for line in mydict)

However, it's better to store the strings in lowercase:

with open("MYdict.txt",) as mydict:
    dictionary = {x[0].lower(): x[1] for x in [line.strip().split(None, 1) for line in mydict]}
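To see what goes wrong with the nested version from the question, here is a small sketch. io.StringIO stands in for MYdict.txt (an assumption so the example runs without any files), and the variable names nested and single are mine.

```python
import io

# In-memory stand-in for MYdict.txt, using the entries from the question.
sample = u"Box OB\nTable OB\nTiger AN\nLion AN\n"

# Pattern from the question: a generator over the file inside a loop over the same file.
mydict = io.StringIO(sample)
for line in mydict:
    # The outer loop has already taken "Box OB"; this consumes every remaining line.
    nested = dict(l.strip().split(None, 1) for l in mydict)

# Single pass over the file, storing lowercase keys.
mydict = io.StringIO(sample)
single = {key.lower(): tag for key, tag in (l.split() for l in mydict)}

print(sorted(nested))
print(sorted(single))
```

The nested version ends up with three entries and no "Box", while the single pass keeps all four.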

To read, strip, and store the lines of the text, you can use the string method splitlines, like this:

with codecs.open("MYtext.txt",'r','utf-8') as mytext:
    line_list = mytext.read().splitlines()

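However, it's better to process the file line by line rather than storing all the lines.

There's no need for a for loop to count the lines. Just use len(line_list).

I don't quite understand what you're doing in the last part of your code. You seem to be mixing up some earlier variables (for example, line from the previous loop) and overwriting the line_list variable.

Here is one way to do it: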

#!/usr/bin/python
# -*- coding: utf-8 -*-

import codecs

# Read the dictionary as UTF-8 too, so its keys are unicode like the words in the text.
with codecs.open("MYdict.txt", 'r', 'utf-8') as mydict:
    dictionary = {x[0].lower(): x[1] for x in [line.strip().split(None, 1) for line in mydict]}

word_count = {}

with codecs.open("MYtext.txt", 'r', 'utf-8') as mytext:
    for line in mytext:
        for word in line.strip().split():
            # Drop trailing punctuation and lowercase to match the dictionary keys.
            word = word.rstrip('.,').lower()
            if word in dictionary:
                word_count[word] = word_count.get(word, 0) + 1

# Most frequent words first.
for key in sorted(word_count, key=word_count.get, reverse=True):
    print("%s : %i" % (key, word_count[key]))

You can of course merge the two for loops into one, for example with for word in (w for line in mytext for w in line.split())
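A minimal sketch of that merged loop, combined with collections.Counter for the counting. The dictionary literal and the io.StringIO text are stand-ins for your two files (assumptions so the example runs on its own).

```python
import io
from collections import Counter

# Stand-ins for MYdict.txt and MYtext.txt.
dictionary = {u"box": u"OB", u"table": u"OB", u"tiger": u"AN", u"lion": u"AN"}
mytext = io.StringIO(u"In a box.\nIn that box there is a Lion and a tiger.\n")

# One flat stream of cleaned, lowercased words across all lines.
words = (w.rstrip(".,").lower() for line in mytext for w in line.split())

# Keep only the words that appear in the dictionary and count them.
word_count = Counter(w for w in words if w in dictionary)

# most_common() lists the entries from highest count to lowest.
for key, n in word_count.most_common():
    print("%s : %i" % (key, n))
```

Counter replaces the manual word_count.get(word, 0) + 1 bookkeeping, and most_common() replaces the sorted(...) call.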

answered 2013-11-12T13:21:18.717