python - Linking two lists of words and categories to my own corpus with Python

Question

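OK, I have thought about this over and over, but I am only a beginner in Python and I have not found a solution. This is what I need to do: I have a text file from LIWC containing Dutch words, each followed by one or more numbers: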

aaien 12 13 32
aan 10
aanbad 12 13 14 57 58 38
...

Then I have a second text file from LIWC in which each line has a number followed by a category:

01:Pronoun
02:I
03:We
04:Self
05:You
06:Other
...

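Now I am supposed to link my own corpus of Dutch words to these categories. So first I have to link my Dutch words to the numbers that follow the Dutch words in the LIWC word list, and then I have to link those numbers to the categories. I thought it would be useful to make a dictionary from each of the two LIWC lists. This is what I have so far: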

with open('LIWC_words.txt', 'rU') as document:
    answer = {}
    for line in document:
        line = line.split()
        if not line:  # empty line
            continue
        # word -> list of category numbers, stored as ints so that
        # they match the int keys of categoriesLIWC below
        answer[line[0]] = [int(n) for n in line[1:]]

with open('LIWC_categories.txt', 'rU') as document1:
    categoriesLIWC = {}
    for line in document1:
        line = line.strip()
        if not line:
            continue
        key, value = line.split(':')
        if key.isdigit():
            categoriesLIWC[int(key)] = value  # '01' -> 1
        else:
            categoriesLIWC[key] = value

So now I have two dictionaries, but now I am stuck. Does anyone know what I should do next? (I am using Python 2.6.5 because I mostly work with NLTK.)

score 0 · Accepted Answer

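I'm not sure exactly what final format you are trying to create. For example, you could make a dictionary where dict['pronoun'] contains all the lines in document that have '01' in them.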

# for example, starting from this format
dic = {'word1': [1, 2, 3], 'word2': [3, 2]}
ref = {1: 'pronoun', 2: 'I', 3: 'you'}

out = {}

# invert dic: map each number to the list of words that carry it
for word in dic:
  for entry in dic[word]:
    if entry not in out:
      out[entry] = []
    out[entry].append(word)

print(out)
# {1: ['word1'], 2: ['word1', 'word2'], 3: ['word1', 'word2']}
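
If the goal is a result keyed by category name rather than by number, as in the dict['pronoun'] idea, the same inversion can go through ref in one pass. A minimal sketch, reusing the toy dic and ref from the example; words_by_category is a name made up here, and collections.defaultdict stands in for the explicit membership check:

```python
from collections import defaultdict

# Toy data in the same shape as the example above.
dic = {'word1': [1, 2, 3], 'word2': [3, 2]}
ref = {1: 'pronoun', 2: 'I', 3: 'you'}

# Hypothetical name: maps each category label to the words tagged with it.
words_by_category = defaultdict(list)

for word in sorted(dic):  # sorted() only keeps the word order stable
    for number in dic[word]:
        # Translate the number to its label and file the word under it.
        words_by_category[ref[number]].append(word)

print(dict(words_by_category))
```

A number that is missing from ref would raise KeyError here, which may or may not be what you want for an incomplete category file.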

Alternatively, you could replace the numbers in document with the corresponding entries from document1.

# for example, starting from this format
dic = {'word1': [1, 2, 3], 'word2': [3, 2]}
ref = {1: 'pronoun', 2: 'I', 3: 'you'}

# replace every number in dic, in place, with its label from ref
for word in dic:
  for indx in range(len(dic[word])):
    dic[word][indx] = ref[dic[word][indx]]

print(dic)
# {'word1': ['pronoun', 'I', 'you'], 'word2': ['you', 'I']}
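
The loop above overwrites dic in place and raises KeyError if a number has no entry in ref. A sketch of a variant that builds a new dictionary and leaves unknown numbers untouched; labelled is a hypothetical name, and keeping unknown numbers as they are is an assumption about the desired behaviour, not something the question specifies:

```python
dic = {'word1': [1, 2, 3], 'word2': [3, 2, 99]}  # 99 has no category on purpose
ref = {1: 'pronoun', 2: 'I', 3: 'you'}

# Hypothetical name: same keys as dic, numbers swapped for labels where known.
labelled = {}
for word, numbers in dic.items():
    # ref.get(n, n) falls back to the number itself when it is not in ref.
    labelled[word] = [ref.get(n, n) for n in numbers]

print(labelled['word1'])
print(labelled['word2'])
```

Because dic itself is not modified, the numeric version stays available for other lookups.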

Otherwise, have you considered a database?
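
For the database route, Python's standard sqlite3 module is one option. The sketch below is only an illustration: the table and column names (word_codes, categories, code, label) are invented for this example, and the rows are the toy data from above rather than real LIWC entries.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE word_codes (word TEXT, code INTEGER)')
conn.execute('CREATE TABLE categories (code INTEGER PRIMARY KEY, label TEXT)')

# Toy data from the answer, flattened to one row per (word, code) pair.
dic = {'word1': [1, 2, 3], 'word2': [3, 2]}
ref = {1: 'pronoun', 2: 'I', 3: 'you'}
conn.executemany('INSERT INTO word_codes VALUES (?, ?)',
                 [(word, code) for word in dic for code in dic[word]])
conn.executemany('INSERT INTO categories VALUES (?, ?)', list(ref.items()))

# The join does the work of the nested loops: one row per word/label pair.
rows = conn.execute(
    'SELECT w.word, c.label FROM word_codes AS w '
    'JOIN categories AS c ON c.code = w.code '
    'ORDER BY w.word, c.code').fetchall()

for word, label in rows:
    print('%s %s' % (word, label))
conn.close()
```

Whether this is worth it over two dictionaries probably depends on how large the word list is and how often it is queried.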

score 0 · Accepted Answer

Here is one way to get the data into that format.

dic = {}
ref = {}

# read both files and split them into lines; 'with' closes the files again
with open('dic.txt', 'r') as f:
  tempdic = f.read().split('\n')
with open('ref.txt', 'r') as f:
  tempref = f.read().split('\n')

for line in tempdic:
  if line:
    line = line.split()
    dic[line[0]] = [int(n) for n in line[1:]]  # '12' -> 12
for line in tempref:
  if line:
    line = line.split(':')
    ref[int(line[0])] = line[1].strip()  # '01' -> 1, so it matches dic
# dic = {'word1': [1, 2, 3], 'word2': [2, 3], ...}
# ref = {1: 'ref1', 2: 'ref2', ...}
for word in dic:
  for indx in range(len(dic[word])):  # for each number after the word
    dic[word][indx] = ref[dic[word][indx]]

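Suppose we start with {'apple': [1, 2, 3]}. dic['apple'][0] resolves to 1, so the right-hand side is ref[1], which could be 'pronoun'. That leaves us with {'apple': ['pronoun', 2, 3]}, and the remaining numbers are replaced in the following iterations.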
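
To get from there to the question's actual goal, linking your own corpus to the categories, a lookup per token is enough. A rough sketch under several assumptions: the corpus is already tokenized into words that are spelled as in the word list, dic already holds category labels as in the result above, tokens missing from the word list are simply skipped, and count_categories is a hypothetical helper name. The tokens are placeholders, not real LIWC data.

```python
from collections import defaultdict

def count_categories(tokens, dic):
    """Count how often each category label occurs across the tokens."""
    counts = defaultdict(int)
    for token in tokens:
        # dic.get returns [] for words not in the word list, so they are skipped.
        for label in dic.get(token, []):
            counts[label] += 1
    return dict(counts)

# Toy word list after the numbers have been replaced by labels.
dic = {'word1': ['pronoun', 'I', 'you'], 'word2': ['you', 'I']}
tokens = ['word1', 'word2', 'word2', 'unknown']

print(count_categories(tokens, dic))
```

Returning a plain dict of counts keeps the result easy to print or to feed into whatever tallying tool you already use.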
