I am starting development with R and I am still having "beginner problems" with the language. I would like to do the following:

  1. I have a matrix (data frame := user) with ~900 columns, each of which is named after a band (Nirvana, Green Day, Daft-Punk, etc.).
  2. Each row holds a user and that user's music taste (Nirvana = 10, Green Day = 5, Daft-Punk = 0).
  3. I would like to query another data frame (:= artists, which holds each artist's music tags) and replace the band names with their genre tag (Nirvana --> Rock, Green Day --> Rock, Daft-Punk --> Techno). There are ~120 tags for music taste (120 < 900).
  4. Finally, I would like to "aggregate" the values over all columns to avoid duplicated columns. In the example from (3), with the aggregation function "SUM", the row would have only 2 entries and not 3: (Rock = 15, Techno = 0).

Any clues on how to do that with R? Thanks in advance for any help!

Data:

user: pastebin.com/4gVe004T

artists: pastebin.com/dm7weLMG

1 Answer

I have a matrix (data frame := user) with ~900 columns, each of which is named after a band (Nirvana, Green Day, Daft-Punk, etc.).
Each row holds a user and that user's music taste (Nirvana = 10, Green Day = 5, Daft-Punk = 0).

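This is what is called a "wide" format. For most tasks it is better to reshape it into a narrow (long) format, i.e. a single data.frame with one column identifying the user and another identifying the band, plus a column holding the rating. There are several tools for this, and several questions about it here on SO; look for the relevant tag in particular.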

There is also a package called reshape that can help here. The process I have described is called "melting" the data.
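As a minimal sketch of the melting step, the snippet below converts a toy wide table to long format using only base R, so it runs without extra packages. The data frame contents and the column names `user`, `band` and `value` are invented for illustration; the real `user` data from the question may be laid out differently.

```r
# Toy stand-in for the `user` data frame (values and column names are assumptions).
user <- data.frame(
  user = c("u1", "u2"),
  Nirvana = c(10, 0),
  `Green Day` = c(5, 3),
  `Daft-Punk` = c(0, 8),
  check.names = FALSE,       # keep band names with spaces and hyphens intact
  stringsAsFactors = FALSE
)

# Every column except the user id is treated as a band column.
band_cols <- setdiff(names(user), "user")

# One row per (user, band) pair: the user ids are repeated once per band
# column, and the rating columns are stacked on top of each other.
user_long <- data.frame(
  user  = rep(user$user, times = length(band_cols)),
  band  = rep(band_cols, each = nrow(user)),
  value = unlist(user[band_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

print(user_long)
```

With the reshape package loaded, `melt(user, id.vars = "user")` should produce an equivalent long table, with the band held in a column named `variable`.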

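I would like to query another data frame (:= artists, which holds each artist's music tags) and replace the band names with their genre tag (Nirvana --> Rock, Green Day --> Rock, Daft-Punk --> Techno). There are ~120 tags for music taste (120 < 900).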

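You can use merge, with the band name as the merge key, to combine multiple data frames. This is why you want the band names to be values rather than column names.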
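A small sketch of that merge, assuming the ratings are already in long format and that the artists table has one row per band. The column names `band` and `tag` are hypothetical, and so is the toy data.

```r
# Long-format ratings (toy data; column names are assumptions).
user_long <- data.frame(
  user  = c("u1", "u1", "u1", "u2", "u2", "u2"),
  band  = c("Nirvana", "Green Day", "Daft-Punk",
            "Nirvana", "Green Day", "Daft-Punk"),
  value = c(10, 5, 0, 0, 3, 8),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table from band name to genre tag.
artists <- data.frame(
  band = c("Nirvana", "Green Day", "Daft-Punk"),
  tag  = c("Rock", "Rock", "Techno"),
  stringsAsFactors = FALSE
)

# Join on the band name; each rating row gains its genre tag.
tagged <- merge(user_long, artists, by = "band")

print(tagged)
```

By default merge keeps only bands present in both tables. Passing `all.x = TRUE` would keep ratings for bands that have no tag, with `NA` in the `tag` column.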

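Finally, I would like to "aggregate" the values over all columns to avoid duplicated columns. In the example from (3), with the aggregation function "SUM", the row would have only 2 entries and not 3: (Rock = 15, Techno = 0).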

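When you use reshape to "cast" the data back to wide format, you can supply an aggregation function that is used to combine values. You can use sum for this.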
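To illustrate casting back to wide format with a sum, here is a sketch that uses base R's `xtabs` in place of the reshape package, so that it runs without extra dependencies. The toy data and the column names `user`, `tag` and `value` are assumptions.

```r
# Long-format ratings that already carry a genre tag (toy data).
tagged <- data.frame(
  user  = c("u1", "u1", "u1", "u2", "u2", "u2"),
  tag   = c("Rock", "Rock", "Techno", "Rock", "Rock", "Techno"),
  value = c(10, 5, 0, 0, 3, 8),
  stringsAsFactors = FALSE
)

# xtabs sums `value` within each (user, tag) cell, so the result has
# one row per user and one column per tag.
tag_table <- xtabs(value ~ user + tag, data = tagged)

# Convert the contingency table to an ordinary wide data frame.
user_by_tag <- as.data.frame.matrix(tag_table)

print(user_by_tag)
```

For user `u1` this gives Rock = 15 and Techno = 0, matching the example in the question. With reshape2 installed, `dcast(tagged, user ~ tag, fun.aggregate = sum, value.var = "value")` should give a comparable result.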

Answered 2013-07-02T11:22:41.110