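I'm trying to use the Million Song Dataset available on AWS to find the correlation between a track's loudness and its popularity. I followed the basic tutorial (http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/) to get the data for each track, and set up my project using MRJob and Python. Now I'm lost on how to compute the correlation across tracks when working with a mapper and a reducer. Here is my code so far: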
from mrjob.job import MRJob
import track

YIELD_ALL = True


class MRDensity(MRJob):

    def mapper(self, _, line):
        t = track.load_track(line)
        if t:
            if t['tempo'] > 0:
                loudness = t['loudness']
                #print loudness
                hotness = t['song_hotttnesss']
                xy = loudness * hotness
                x2 = loudness * loudness
                y2 = hotness * hotness
                counter = counter + 1
                yield (counter, (loudness, hotness, xy, x2, y2))

    def reducer(self, key, val):
        sumx2 = 0
        sumy2 = 0
        sumxy = 0
        sumh = 0
        suml = 0
        for l, h, xy, x2, y2 in val:
            suml = suml + l
            sumh += h
            sumxy += xy
            sumx2 += x2
            sumy2 += y2
        yield key, suml


if __name__ == '__main__':
    MRDensity.run()
This code doesn't really work, because it produces this output:
1 -10.142
1 -10.212
1 -11.137
1 -11.197
1 -13.496
1 -15.568
1 -15.607
1 -17.302
1 -22.262
1 -3.383
1 -3.809
1 -5.816
1 -5.902
1 -6.671
1 -7.24
1 -7.591
1 -8.729
1 -9.689
1 -9.738
1 -9.863
I need help writing the rest of the code to compute the correlation between the loudness and hotness variables of the MSD dataset.
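As far as I understand, the value I'm after is Pearson's r, which should be computable from the sums my reducer accumulates plus a count of the tracks (which my code does not keep yet). Here is a minimal sketch of that final step in plain Python, outside of MRJob. The names `pearson_from_sums` and `accumulate` are hypothetical helpers of my own, and the sample pairs are made up rather than taken from the dataset:

```python
import math


def pearson_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Pearson's r computed from running sums over n (x, y) pairs.

    Returns None when r is undefined (fewer than two pairs, or one of
    the variables has zero variance).
    """
    if n < 2:
        return None
    # Numerator and variance terms, each scaled by n so no means are needed.
    numerator = n * sum_xy - sum_x * sum_y
    var_x = n * sum_x2 - sum_x * sum_x
    var_y = n * sum_y2 - sum_y * sum_y
    if var_x <= 0 or var_y <= 0:
        return None
    return numerator / math.sqrt(var_x * var_y)


def accumulate(pairs):
    """Build the same sums as the reducer loop, plus the pair count."""
    n = 0
    sum_x = sum_y = sum_xy = sum_x2 = sum_y2 = 0.0
    for x, y in pairs:
        n += 1
        sum_x += x
        sum_y += y
        sum_xy += x * y
        sum_x2 += x * x
        sum_y2 += y * y
    return n, sum_x, sum_y, sum_xy, sum_x2, sum_y2


# Made-up (loudness, hotness) pairs, for illustration only.
sample = [(-10.0, 0.2), (-8.0, 0.4), (-6.0, 0.6), (-4.0, 0.8)]
r = pearson_from_sums(*accumulate(sample))
print(r)
```

The sample pairs lie on a straight line with positive slope, so `r` should come out at about 1.0. What I don't know is how to get all the tracks into a single reducer call so that these sums cover the whole dataset.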
Thanks!