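I am very new to R and am having some trouble creating a heatmap with the geom_raster() function. I'm working on this week's tidytuesday challenge, and I want to create a heatmap that shows whether hosting a race gives the hosting team an advantage. The variables I'm looking at are team_name and pole, which serve as the x and y values respectively. I then fill the plot using the host variable to see whether there is any trend between each team, its finish placement, and whether it was the host of the race.

Below are the code snippet I used to create the heatmap and the heatmap itself. I tidied the data before this point, which is the reason for the funky data name.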

pole_position <- c("P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8", "P9", "P10", "P11", "P12", "P13", "P14", "P15", "P16")

ggplot(data = clean_marbles_2, mapping = aes(x = team_name, y = pole, fill = host)) +
  geom_raster() +
  scale_y_discrete(limits = pole_position) +
  coord_flip() +
  labs(x = "Team name", y = "Finish placement", title = "Does hosting the race affect finish placement?")

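The code above produces this plot.

At first I thought this was a pretty cool plot, but I quickly realized that it is missing some of the "Yes" hosts. There should be 16 different teal boxes in this plot, but there are only 11.

I then faceted the plot to check whether it was picking up the data I had entered. Below are the code and a picture of the resulting plot. The value of pole_position does not change between the two plots.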

ggplot(data = clean_marbles_2, mapping = aes(x = team_name, y = pole, fill = host)) +
  geom_raster() +
  scale_y_discrete(limits = pole_position) +
  coord_flip() +
  labs(x = "Team name", y = "Finish placement", title = "Does hosting the race affect finish placement?") +
  facet_wrap(~host)

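[Image: faceted plot]

As you can see, all sixteen blue tiles appear in the "Yes" panel. I don't understand why the previous plot only showed 11 of the 16 blue tiles.

My question is: why don't all of the blue tiles appear in the first plot?

Any help and/or constructive criticism is appreciated. Thanks!

Here is a link to the tidytuesday GitHub repository: here.

Edit:

Here is what I did to tidy the data. Please don't roast me for anything I did wrong; I would love to learn any way to make my code more efficient.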

# Read in the data from the github repo
marbles <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-06-02/marbles.csv')

# Set the correct point & pole values
marbles$points[marbles$pole == 'P1'] = 25
marbles$pole[marbles$points == 25] = 'P1'
marbles$points[marbles$pole == 'P2'] = 18
marbles$pole[marbles$points == 18] = 'P2'
marbles$points[marbles$pole == 'P3'] = 15
marbles$pole[marbles$points == 15] = 'P3'
marbles$points[marbles$pole == 'P4'] = 12
marbles$pole[marbles$points == 12] = 'P4'
marbles$points[marbles$pole == 'P5'] = 10
marbles$pole[marbles$points == 10] = 'P5'
marbles$points[marbles$pole == 'P6'] = 8
marbles$pole[marbles$points == 8] = 'P6'
marbles$points[marbles$pole == 'P7'] = 6
marbles$pole[marbles$points == 6] = 'P7'
marbles$points[marbles$pole == 'P8'] = 4
marbles$pole[marbles$points == 4] = 'P8'
marbles$points[marbles$pole == 'P9'] = 2
marbles$pole[marbles$points == 2] = 'P9'
marbles$points[marbles$pole == 'P10'] = 1
marbles$pole[marbles$points == 1] = 'P10'
marbles$points[marbles$pole == 'P11'] = 0
marbles$pole[marbles$points == 0] = 'P11'

# replace any excess and incorrect pole/point values to align with my scale.
marbles[186, 8] = 'P10'
marbles[186, 9] = 1

# Replace the pole values for the 0 point scores
# This was done for many more values than what is seen here.
marbles[252,8] = 'P12'
marbles[253,8] = 'P13'
marbles[254,8] = 'P14'
marbles[255,8] = 'P15'
marbles[256,8] = 'P16'

# Remove the notes and source sections of the tidy data
clean_marbles = subset(marbles, select = -c(notes, source))

# Create a clean subset without any NA values
clean_marbles_2 = na.omit(clean_marbles)

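I know this is very tedious. You can see the corresponding values of points and pole in the code I included above. I tried to make the data more uniform, thinking it would be easier to visualize afterwards, but I guess not.

2 Answers

Here is an approach that uses geom_tile instead of geom_raster, with filter and two calls to geom_tile: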

library(dplyr)    # for %>% and filter()
library(ggplot2)

# Draw the "No" tiles first, then add the "Yes" tiles as a second layer on top
ggplot(data = clean_marbles_2 %>% filter(host == "No"), mapping = aes(x = team_name, y = pole)) +
  geom_tile(fill = "#F8766D") +
  geom_tile(data = clean_marbles_2 %>% filter(host == "Yes"), fill = "#00BFC4") +
  scale_y_discrete(limits = pole_position) +
  coord_flip() +
  labs(x = "Team name", y = "Finish placement", title = "Does hosting the race affect finish placement?")

We need to use geom_tile because geom_raster shifts the rows around.
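As a rough check for hidden tiles: if several rows share the same team_name and pole, their tiles occupy the same cell and only the tile drawn on top is visible. The sketch below looks for such cells in base R. The data frame `toy` is invented for illustration (it is not the marbles data), and `cell`, `n_hosts` and `mixed` are hypothetical helper names.

```r
# Toy data, invented for illustration; not the real marbles data.
toy <- data.frame(
  team_name = c("A", "A", "B", "B", "C"),
  pole      = c("P1", "P1", "P2", "P3", "P1"),
  host      = c("Yes", "No", "No", "Yes", "No"),
  stringsAsFactors = FALSE
)

# One key per (team_name, pole) cell of the heatmap.
cell <- paste(toy$team_name, toy$pole, sep = " / ")

# For each cell, count how many distinct host values land in it.
n_hosts <- tapply(toy$host, cell, function(h) length(unique(h)))

# Cells holding both "Yes" and "No" rows: one tile will cover the other.
mixed <- names(n_hosts)[as.vector(n_hosts > 1)]
print(mixed)
```

Running the same check with clean_marbles_2 in place of `toy` should list any cells where a "No" tile could be covering a "Yes" tile.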

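For tidying the data, here is an approach that uses dplyr::recode. The !!! operator expands the list into arguments to be passed to the function. This is required because recode needs separate arguments.

We can use ifelse to replace only the NAs in pole. Since we aren't using points, I didn't bother recoding that one, but you could easily do the reverse.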

clean_marbles_2 <- marbles %>% 
  mutate(pole = 
           ifelse(is.na(pole),
                  # pole is missing: derive it from the points column of the same row
                  recode(points,
                         !!!c(`26` = "P1", `25` = "P1", `19` = "P2",
                              `18` = "P2", `16` = "P3", `15` = "P3",
                              `13` = "P4", `12` = "P4", `11` = "P5",
                              `10` = "P5", `8` = "P6", `6` = "P7",
                              `4` = "P8", `2` = "P9", `1` = "P10",
                              `0` = "P11")),
                  # pole is present: keep it unchanged
                  pole)) %>%
  dplyr::select(-notes, -source)
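
For comparison, the same fill-in can be sketched in base R with a named lookup vector. This is only an illustration on made-up rows: `toy` and `lookup` are hypothetical names, and the mapping covers just a few of the point values.

```r
# Made-up rows for illustration; not the real marbles data.
toy <- data.frame(
  pole   = c("P3", NA, NA, "P1"),
  points = c(NA, 25, 0, NA),
  stringsAsFactors = FALSE
)

# Points-to-pole lookup, keyed by the points value as a string.
lookup <- c("25" = "P1", "18" = "P2", "15" = "P3", "0" = "P11")

# Fill pole from points only where pole is missing; keep existing values.
toy$pole <- ifelse(is.na(toy$pole),
                   unname(lookup[as.character(toy$points)]),
                   toy$pole)
print(toy$pole)
```

A points value that is absent from `lookup` yields NA here, so such rows are easy to spot afterwards.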
answered 2020-06-04T23:14:28.860

There seems to be a problem with the way you tidied your data. If we plot the raw data in this representation, your error does not occur:

library(ggplot2)

url <- paste0("https://raw.githubusercontent.com/rfordatascience/",
              "tidytuesday/master/data/2020/2020-06-02/marbles.csv")

raw_marbles  <- read.csv(url)
pole_position <- paste0("P", 1:16)

p <- ggplot(raw_marbles, aes(x = team_name, y = pole, fill = host)) +
  geom_raster() +
  scale_y_discrete(limits = pole_position) +
  coord_flip() +
  labs(x = "Team name", y = "Finish placement", 
       title = "Does hosting the race affect finish placement?")

p

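It looks as though some tiles are "missing", but that is because they have no position assigned in the raw data. We can also confirm that the correct number of blue squares is shown here: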

p + facet_wrap(.~host)

So I guess the question is "what did you do to the raw data?". Showing clean_marbles_2 in your question might allow us to work this out.
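
A quick way to see how many tiles each host value can contribute is to tabulate rows with and without a pole value, since a row with a missing pole has no position on the pole axis. A minimal sketch on invented rows (not the marbles data); `toy` and `counts` are hypothetical names:

```r
# Invented rows for illustration; not the real marbles data.
toy <- data.frame(
  host = c("Yes", "Yes", "No", "No", "No"),
  pole = c("P1", NA, "P2", NA, NA),
  stringsAsFactors = FALSE
)

# Cross-tabulate host against whether a pole value is present.
counts <- table(host = toy$host, has_pole = !is.na(toy$pole))
print(counts)
```

Comparing this table for the raw and the tidied data would show how many rows gained a position during tidying.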

Incidentally, there does seem to be an effect of hosting versus not hosting. You can run a Wilcoxon rank sum test to show it:

NoYes <- lapply(split(raw_marbles$pole, raw_marbles$host), 
                function(x) na.omit(as.numeric(substr(x, 2, 3))))

wilcox.test(NoYes[[1]], NoYes[[2]])

#>  Wilcoxon rank sum test with continuity correction
#> 
#> data:  NoYes[[1]] and NoYes[[2]]
#> W = 280, p-value = 0.04911
#> alternative hypothesis: true location shift is not equal to 0

So the hosts seem to have significantly higher pole numbers (i.e. closer to P16).
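
The two-sided test output does not by itself show which group sits higher, so comparing the group medians alongside it can help. A small sketch on invented pole labels (not the marbles data); `toy`, `pole_num` and `medians` are hypothetical names:

```r
# Invented pole labels for illustration; not the real marbles data.
toy <- data.frame(
  host = c("No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"),
  pole = c("P1", "P2", "P3", "P5", "P9", "P12", "P14", "P16"),
  stringsAsFactors = FALSE
)

# Strip the leading "P" and convert the remainder to a number.
pole_num <- as.numeric(sub("^P", "", toy$pole))

# Median pole number per host group gives the direction of the difference.
medians <- tapply(pole_num, toy$host, median)
print(medians)

# Two-sided Wilcoxon rank sum test on the same toy values.
res <- wilcox.test(pole_num[toy$host == "No"], pole_num[toy$host == "Yes"])
print(res$p.value)
```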

answered 2020-06-04T22:04:10.800