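OK, I'm trying to carry a list of values along with information about that list of values, and I'm trying to do this while processing the data. Let me show you what's happening: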

worddictlist2 = []
for innertweet in namelist:
        worddictlist = []
        for tweet in innertweet[0]:
                worddict = {word: tweet.count(word) for word in wordlist}
                worddictlist.append(worddict)
                worddictlist2.append(worddictlist)

namelist is a variable that holds data in the following form:

[[['blah blah blah string blah blah blah blah blah blah', 'another string, blah blah blah, string string', 'string string string'], category], [['string string another string, blah', 'more words, more words, etc', 'yet again, here we go'], category2]]

I'm counting how many times specific words occur in each phrase, but I still want to keep the category assignment somehow.

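I've tried appending to different lists in the various loops, and I've tried different list comprehensions, but I just can't get the result I want, which looks like this: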

[[{word1: 0, word2: 7, word3: 12, word4: 6}, category], [{word1: 3, word2: 9, word3: 1, word4: 2}, category2]]

How can I get this output? Am I doing this inefficiently? The way I'm torturing this data makes me think my approach is inefficient.

3 Answers

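First of all, in your current code worddict is created anew for every tweet, which is probably not what you want. In addition, by using str.count() you run the risk of counting a word when it only appears as part of another word: for example, 'as is the case'.count('as') is 2, not 1, because as also occurs as a substring of the word case. I would suggest splitting each tweet on whitespace and then either iterating over the unique words in that split, e.g. words = tweet.split() followed by {word: words.count(word) for word in set(words)}, or simply iterating over the words and incrementing a count in the dictionary for each occurrence. I'm not sure which of the two is more efficient.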
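
To make the substring problem concrete, here is a small sketch. It reuses the sample sentence from the paragraph above; the variable names are only illustrative. It compares str.count() on the raw string with counting on the split word list, and checks that the two counting strategies mentioned above agree:

```python
text = 'as is the case'

# str.count() matches substrings, so the 'as' inside 'case' is counted too
substring_hits = text.count('as')

# splitting first means only whole whitespace-separated tokens are compared
words = text.split()
token_hits = words.count('as')

# strategy 1: call list.count() once per unique word
counts_by_set = {word: words.count(word) for word in set(words)}

# strategy 2: a single pass that increments a running total
counts_by_loop = {}
for word in words:
    counts_by_loop[word] = counts_by_loop.get(word, 0) + 1

print(substring_hits, token_hits)
print(counts_by_set == counts_by_loop)
```

The single pass touches each word once, while the first strategy rescans the whole list for every unique word, so the loop should scale better on long inputs.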

So, my suggestion is

worddictlist2 = []
for innertweet in namelist:
    # one dictionary per category, shared by all of its tweets
    worddict = {}
    for tweet in innertweet[0]:
        words = tweet.split()
        for word in words:
            if word not in worddict:
                worddict[word] = 1
            else:
                worddict[word] += 1
    # keep the counts together with the category label
    worddictlist2.append([worddict, innertweet[1]])

Given the input

namelist = [[['blah blah blah string blah blah blah blah blah blah', 'another string, blah blah blah, string string', 'string string string'], 'category'], [['string string another string, blah', 'more words, more words, etc', 'yet again, here we go'], 'category2']]

this code produces

[[{'blah,': 1, 'blah': 11, 'string,': 1, 'string': 6, 'another': 1}, 'category'], [{'string,': 1, 'string': 2, 'again,': 1, 'etc': 1, 'we': 1, 'here': 1, 'blah': 1, 'words,': 2, 'another': 1, 'go': 1, 'yet': 1, 'more': 2}, 'category2']]

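To get rid of the words with commas attached, you will probably want to remove punctuation before counting the words, e.g. by adding tweet = re.sub(r'[^a-zA-Z0-9]', ' ', tweet) to the code above: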

import re

worddictlist2 = []
for innertweet in namelist:
    worddict = {}
    for tweet in innertweet[0]:
        # replace every non-alphanumeric character with a space
        tweet = re.sub(r'[^a-zA-Z0-9]', ' ', tweet)
        words = tweet.split()
        for word in words:
            if word not in worddict:
                worddict[word] = 1
            else:
                worddict[word] += 1
    worddictlist2.append([worddict, innertweet[1]])

print(worddictlist2)

which produces

[[{'blah': 12, 'string': 7, 'another': 1}, 'category'], [{'again': 1, 'we': 1, 'string': 3, 'etc': 1, 'here': 1, 'blah': 1, 'another': 1, 'words': 2, 'go': 1, 'yet': 1, 'more': 2}, 'category2']]
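
As a possible alternative, the same counting can be sketched with collections.Counter from the standard library. This is only an illustration under the assumption that namelist has the shape shown above; the helper name count_words is hypothetical and not part of the original code. re.findall collects the runs of letters and digits directly, which has the same effect as replacing punctuation with spaces and then splitting:

```python
import re
from collections import Counter

namelist = [
    [['blah blah blah string blah blah blah blah blah blah',
      'another string, blah blah blah, string string',
      'string string string'], 'category'],
    [['string string another string, blah',
      'more words, more words, etc',
      'yet again, here we go'], 'category2'],
]

def count_words(namelist):
    """Hypothetical helper: turn [[tweets, category], ...] into
    [[counts_dict, category], ...]."""
    result = []
    for tweets, category in namelist:
        counts = Counter()
        for tweet in tweets:
            # findall keeps runs of letters/digits and drops punctuation
            counts.update(re.findall(r'[a-zA-Z0-9]+', tweet))
        # convert to a plain dict so the result prints like the output above
        result.append([dict(counts), category])
    return result

worddictlist2 = count_words(namelist)
print(worddictlist2)
```

Counter.update adds to the existing totals rather than replacing them, so the explicit if/else on missing keys is no longer needed.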
answered 2012-07-31T10:49:08.867

Given the data:

category = "C"
category2 = "C2"

namelist = [
  [['blah blah blah string blah blah blah blah blah blah', 'another string, blah blah blah, string string', 'string string string'],
   category
  ],
  [['string string another string, blah', 'more words, more words, etc', 'yet again, here we go'],
   category2
  ]
]

wordlist = "blah string words".split()

then this should work as described:

from collections import defaultdict

worddictlist2 = []
for innertweet in namelist:
    worddict = defaultdict(int)  # missing words start at 0
    category = innertweet[1]
    for tweet in innertweet[0]:
        for word in wordlist:
            worddict[word] += tweet.count(word)

    # optional - convert the defaultdict to a plain dict so it prints cleanly
    worddictlist2.append([dict(worddict), category])

print(worddictlist2)

It outputs:

[[{'blah': 12, 'string': 7, 'words': 0}, 'C'], [{'blah': 1, 'string': 3, 'words': 2}, 'C2']]
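
Because str.count() also matches inside longer words, a variant that counts only whole words from wordlist might look like the sketch below. It assumes the same namelist and wordlist as above; the helper name count_listed_words and the loop variable names are hypothetical. Every listed word starts at 0, so words that never occur still appear in the result:

```python
import re

category = "C"
category2 = "C2"

namelist = [
    [['blah blah blah string blah blah blah blah blah blah',
      'another string, blah blah blah, string string',
      'string string string'], category],
    [['string string another string, blah',
      'more words, more words, etc',
      'yet again, here we go'], category2],
]

wordlist = "blah string words".split()

def count_listed_words(namelist, wordlist):
    """Hypothetical helper: whole-word counts for the words in wordlist,
    paired with each category label."""
    result = []
    for tweets, label in namelist:
        counts = dict.fromkeys(wordlist, 0)  # every listed word starts at 0
        for tweet in tweets:
            # \w+ matches runs of letters, digits and underscores
            for token in re.findall(r'\w+', tweet):
                if token in counts:
                    counts[token] += 1
        result.append([counts, label])
    return result

worddictlist2 = count_listed_words(namelist, wordlist)
print(worddictlist2)
```

For this particular data the totals are the same as with str.count(), since none of the listed words occurs inside a longer word here; a token such as 'strings' would be counted by str.count('string') but not by this version.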
answered 2012-07-31T10:58:49.830

Maybe something like this:

worddictlist2 = []
for innertweet, cat in namelist:
    wdlist = {}  # fresh counts for each category
    for i in innertweet:
        for j in i.split():
            j = j.strip(',')  # strip commas
            wdlist.setdefault(j, 0)  # start at 0 if 'j' is a new key
            wdlist[j] += 1
    # list.append() takes a single argument, so wrap the pair in a list
    worddictlist2.append([wdlist, cat])

print(worddictlist2)

Gives:

[
 [{'another': 1, 'blah': 12, 'string': 7}, 'category'],
 [{'again': 1, 'another': 1, 'blah': 1, 'etc': 1, 'go': 1, 'here': 1, 'more': 2, 'string': 3, 'we': 1, 'words': 2, 'yet': 1}, 'category2']
]
answered 2012-07-31T11:22:57.077