I want to create a 2D array in Python like this:
    n1  n2  n3  n4  n5
w1   1   4   0   1  10
w2   3   0   7   0   3
w3   0  12   9   5   4
w4   9   0   0   9   7
where w1, w2, ... are different words and n1, n2, n3, ... are different blogs.
How can I achieve this?
Assuming that the text of each blog is available as a string, and that you have a list of such strings in blogs, this is how you would create the matrix.
import re
# Sample input for the following code.
blogs = ["This is a blog.", "This is another blog.", "Cats? Cats are awesome."]
# This is a list that will contain one dictionary of word counts per blog.
wordcount = []
# This is a list of all unique words in all blogs.
wordlist = []
# Consider each blog sequentially
for blog in blogs:
    # Remove all the non-alphanumeric, non-whitespace characters,
    # and then split the string at all whitespace after converting to lowercase.
    # eg: "That's not mine." -> "Thats not mine" -> ["thats","not","mine"]
    words = re.sub(r"[^\w\s]", "", blog).lower().split()
    # Add a new dictionary to the list. As it is at the end,
    # it can be referred to by wordcount[-1]
    wordcount.append({})
    # Consider each word in the list generated above.
    for word in words:
        # If that word has been encountered before, increment the count
        if word in wordcount[-1]:
            wordcount[-1][word] += 1
        # Else, create a new entry in the dictionary
        else:
            wordcount[-1][word] = 1
        # If it is not already in the list of unique words, add it.
        if word not in wordlist:
            wordlist.append(word)
# We now have wordlist, which is a list of all unique words in all blogs,
# and wordcount, which contains len(blogs) dictionaries of word counts.
# Matrix is the table of word counts that you need. The number of rows will be
# equal to the number of unique words, and the number of columns = no. of blogs.
matrix = []
# Consider each word in the unique list of words (corresponding to each row)
for word in wordlist:
    # Add as many columns as there are blogs, all initialized to zero.
    matrix.append([0] * len(wordcount))
    # Consider each blog one by one
    for i in range(len(wordcount)):
        # Check if the currently selected word appears in that blog
        if word in wordcount[i]:
            # If yes, add its count to the cell for that blog/column
            matrix[-1][i] += wordcount[i][word]
# For printing matrix, first generate the column headings
temp = "\t"
for i in range(len(blogs)):
    temp += "Blog " + str(i + 1) + "\t"
print(temp)
# Then generate each row, with the word at the start, and tabs between numbers.
for i in range(len(matrix)):
    temp = wordlist[i] + "\t"
    for j in matrix[i]:
        temp += str(j) + "\t"
    print(temp)
Now, matrix[i][j] will contain the number of times the word wordlist[i] appears in the blog blogs[j].
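The same counting can be written more compactly with collections.Counter from the standard library. What follows is only a sketch under the same assumption as above (blogs is a list of strings); the helper name build_matrix is made up for illustration and is not part of the original answer.

```python
import re
from collections import Counter


def build_matrix(blogs):
    """Return (wordlist, matrix); matrix[i][j] counts wordlist[i] in blogs[j]."""
    # One Counter per blog; punctuation is stripped and the text lowercased.
    counts = [Counter(re.sub(r"[^\w\s]", "", blog).lower().split())
              for blog in blogs]
    # Collect the unique words in order of first appearance.
    wordlist = []
    seen = set()
    for counter in counts:
        for word in counter:
            if word not in seen:
                seen.add(word)
                wordlist.append(word)
    # A Counter returns 0 for missing keys, so no membership test is needed.
    matrix = [[counter[word] for counter in counts] for word in wordlist]
    return wordlist, matrix


blogs = ["This is a blog.", "This is another blog.", "Cats? Cats are awesome."]
wordlist, matrix = build_matrix(blogs)
for word, row in zip(wordlist, matrix):
    print(word, *row, sep="\t")
```

The rows come out in the order in which the words are first seen, which matches the behaviour of the longer version above.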
If tuples in lists or dictionaries don't work for you, consider using pandas:
from pandas import DataFrame
In [554]: print(DataFrame({'n1':[1,3,0,9], 'n2':[4,0,12,0], 'n3':[0,7,9,0], 'n4':[1,0,5,9], 'n5':[10,3,4,7]}, index=['w1','w2','w3','w4']))
    n1  n2  n3  n4  n5
w1   1   4   0   1  10
w2   3   0   7   0   3
w3   0  12   9   5   4
w4   9   0   0   9   7
I would not create any lists or a 2D array at all, but rather a dict keyed by tuples of the x and y headers. For example:
data["w1", "n1"] = 1
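This can be thought of as a kind of "sparse matrix" representation. Depending on what operations you want to perform on the data, you might instead want a dict of dicts, where the keys of the outer dict are either the x headers or the y headers, and the keys of the inner dicts are the other one.
Assuming the tuples-as-keys representation, and taking your data table as input: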
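To make the dict-of-dicts alternative concrete, here is a small sketch. The sample values and the name by_word are made up for illustration; the keys are ordered as (word, blog), as in the one-line example above.

```python
from collections import defaultdict

# Hypothetical sample in the tuple-keyed form, keyed as (word, blog).
data = {("w1", "n1"): 1, ("w1", "n2"): 4, ("w2", "n1"): 3, ("w2", "n3"): 7}

# Regroup into a dict of dicts: outer key is the word, inner key is the blog.
by_word = defaultdict(dict)
for (word, blog), count in data.items():
    by_word[word][blog] = count

print(by_word["w1"])        # all counts recorded for w1
print(by_word["w2"]["n3"])  # a single cell
```

This layout is handy when you mostly look things up one word at a time; swapping the two keys gives the per-blog view instead.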
text = """\
n1 n2 n3 n4 n5
w1 1 4 0 1 10
w2 3 0 7 0 3
w3 0 12 9 5 4
w4 9 0 0 9 7
"""
data = {}
lines = text.splitlines()
# The first line holds the column (x) headers.
xheaders = lines.pop(0).split()
for line in lines:
    # Skip blank lines.
    if not line.strip():
        continue
    elems = line.split()
    # The first element of each row is the row (y) header.
    yheader = elems[0]
    for (xheader, datum) in zip(xheaders, elems[1:]):
        data[xheader, yheader] = int(datum)
print(data)
print(sorted(data.items()))
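This prints the following (the order of the entries in the first line depends on the Python version and may differ):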
{('n3', 'w4'): 0, ('n4', 'w2'): 0, ('n2', 'w2'): 0, ('n1', 'w4'): 9, ('n3', 'w3'): 9, ('n2', 'w3'): 12, ('n3', 'w2'): 7, ('n2', 'w4'): 0, ('n5', 'w3'): 4, ('n2', 'w1'): 4, ('n4', 'w1'): 1, ('n5', 'w2'): 3, ('n5', 'w1'): 10, ('n4', 'w3'): 5, ('n4', 'w4'): 9, ('n1', 'w3'): 0, ('n1', 'w2'): 3, ('n5', 'w4'): 7, ('n1', 'w1'): 1, ('n3', 'w1'): 0}
[(('n1', 'w1'), 1), (('n1', 'w2'), 3), (('n1', 'w3'), 0), (('n1', 'w4'), 9), (('n2', 'w1'), 4), (('n2', 'w2'), 0), (('n2', 'w3'), 12), (('n2', 'w4'), 0), (('n3', 'w1'), 0), (('n3', 'w2'), 7), (('n3', 'w3'), 9), (('n3', 'w4'), 0), (('n4', 'w1'), 1), (('n4', 'w2'), 0), (('n4', 'w3'), 5), (('n4', 'w4'), 9), (('n5', 'w1'), 10), (('n5', 'w2'), 3), (('n5', 'w3'), 4), (('n5', 'w4'), 7)]
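Going the other way, the tuple-keyed dict can be turned back into a table. The sketch below assumes the same (xheader, yheader) key order as the parsing code above and uses a made-up subset of the data in which zero cells are simply left out; the helper cell is hypothetical.

```python
# Hypothetical subset of the parsed data, keyed as (xheader, yheader).
# Zero cells are omitted, which is what makes the representation "sparse".
data = {("n1", "w1"): 1, ("n2", "w1"): 4, ("n1", "w2"): 3, ("n3", "w2"): 7}

# Recover the headers from the keys themselves.
xheaders = sorted({x for x, _ in data})
yheaders = sorted({y for _, y in data})


def cell(x, y):
    # Missing keys are treated as a count of zero.
    return data.get((x, y), 0)


rows = [[cell(x, y) for x in xheaders] for y in yheaders]

# Print a tab-separated table with a header row.
print("\t" + "\t".join(xheaders))
for y, row in zip(yheaders, rows):
    print(y + "\t" + "\t".join(str(v) for v in row))
```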
One way is to use numpy:
>>> from numpy import array
>>> array( [ (1,4,0,1,10), (3,0,7,0,3), (0,12,9,5,4), (9,0,0,9,7) ] )
array([[ 1,  4,  0,  1, 10],
       [ 3,  0,  7,  0,  3],
       [ 0, 12,  9,  5,  4],
       [ 9,  0,  0,  9,  7]])
If you just want the 2D array without any parsing, you can write it like this:
a = [
    [1, 4, 0, 1, 10],
    [3, 0, 7, 0, 3],
    [0, 12, 9, 5, 4],
    [9, 0, 0, 9, 7]
]
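If the values are not known up front, for example because the counts are filled in later, a zero-filled list of lists can be built with a list comprehension. The sketch below (with made-up sizes) also shows a common pitfall: multiplying the outer list repeats one and the same row object.

```python
rows, cols = 4, 5

# Each row is a separate list, so rows can be modified independently.
a = [[0] * cols for _ in range(rows)]
a[0][4] = 10

# Pitfall: this repeats the very same row object `rows` times.
b = [[0] * cols] * rows
b[0][4] = 10

print(a[1][4])  # 0
print(b[1][4])  # 10, because every row of b is the same list
```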