Load the data

The excel file “olympics.xlsx” has 4 sheets that I have to load separately. The following code load the “athletes” sheet.

athletes <- read_excel("olympics.xlsx", sheet = "athletes") %>%
  clean_names() %>% 
  glimpse()
## Rows: 135,571
## Columns: 5
## $ id     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, …
## $ name   <chr> "A Dijiang", "A Lamusi", "Gunnar Nielsen Aaby", "Edgar Lindenau…
## $ sex    <chr> "M", "M", "M", "M", "F", "M", "M", "F", "M", "M", "M", "M", "F"…
## $ height <dbl> 180, 170, NA, NA, 185, 188, 183, 168, 186, NA, 182, 172, 159, 1…
## $ weight <dbl> 80.0, 60.0, NA, NA, 82.0, 75.0, 72.0, NA, 96.0, NA, 76.5, 70.0,…

The following code load the “country” sheet.

country <- read_excel("olympics.xlsx", sheet = "country") %>% 
  clean_names() %>% 
  glimpse()
## Rows: 187,452
## Columns: 4
## $ athlete_id <dbl> 1, 2, 3, 4, 5, 5, 5, 6, 6, 7, 7, 8, 9, 10, 11, 12, 13, 13, …
## $ age        <dbl> 24, 23, 24, 34, 21, 25, 27, 31, 33, 31, 33, 18, 26, 26, 22,…
## $ noc        <chr> "CHN", "CHN", "DEN", "DEN", "NED", "NED", "NED", "USA", "US…
## $ games      <chr> "1992 Summer", "2012 Summer", "1920 Summer", "1900 Summer",…

The following code load the “games” sheet.

games <- read_excel("olympics.xlsx", sheet = "games") %>% 
  clean_names() %>% 
  glimpse()
## Rows: 52
## Columns: 4
## $ games  <chr> "1992 Summer", "2012 Summer", "1920 Summer", "1900 Summer", "19…
## $ year   <dbl> 1992, 2012, 1920, 1900, 1988, 1992, 1994, 1932, 2002, 1952, 198…
## $ season <chr> "Summer", "Summer", "Summer", "Summer", "Winter", "Winter", "Wi…
## $ city   <chr> "Barcelona", "London", "Antwerpen", "Paris", "Calgary", "Albert…

I am asked to check the year 1956.

games %>% 
  filter(year == 1956)
Comment

Each year usually has 2 games: 1 for summer and 1 for winter. In this case according to the data, there was 3 games, 2 summer games took place in different cities: Melbourne and Stockholm. According to the IOC website (https://www.olympics.com/en/olympic-games/melbourne-1956) the 1956 summer games were held in Melbourne. Even though one equestrian competition was held in Stockholm which can be the cause of this anomaly. In the future analysis I will consider that the 1956 summer game was held in Melbourne.

The following code load the “medals” sheet.

medals <- read_excel("olympics.xlsx", sheet = "medals") %>% 
  clean_names() %>% 
  glimpse()
## Rows: 271,116
## Columns: 6
## $ athlete_id <dbl> 1, 2, 3, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7,…
## $ games      <chr> "1992 Summer", "2012 Summer", "1920 Summer", "1900 Summer",…
## $ team       <chr> "China", "China", "Denmark", "Denmark/Sweden", "Netherlands…
## $ sport      <chr> "Basketball", "Judo", "Football", "Tug-Of-War", "Speed Skat…
## $ event      <chr> "Basketball Men's Basketball", "Judo Men's Extra-Lightweigh…
## $ medal      <chr> NA, NA, NA, "Gold", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Data exploration

I will use vis_miss() for as many tibble as possible to check the missing values. And then check for duplicates for the “id” columns especially.

vis_miss(athletes)

Comment

The above vizualization shows that there are 25% and 26% of missing values in the “height” and “weight” column in the “athlete” tibble. The rest of the columns don’t have any missing values.

athletes %>% 
  count(id) %>% 
  filter(n>1)
Comment

There is no duplicates in the “id” column of the “athletes” tibble.

athletes %>%
  drop_na() %>% 
  summarise(mean_weight = mean(weight), 
            min_weight = min(weight), 
            max_weight = max(weight),
            mean_height = mean(height),
            min_height = min(height),
            max_height = max(height))
Comment

The values from the weight and height seem realistic except for the maximum height of 226 cm and the minimum weight of 25 kg.

vis_miss(country)

Comment

The above vizualization shows that there are 4% of missing values in the “age” column in the “country” tibble. The rest of the columns don’t have any missing values.

country %>% 
  count(athlete_id) %>% 
  filter(n>1) %>% 
  arrange(desc(n)) %>% 
  paged_table()
Comment

There are several “athlete_id” duplicates in the “country” column. The reason is probably because the same athletes have participated in different games and maybe even for different countries.

vis_miss(games)

Comment

The above vizualization shows that there no missing values in the “games” tibble.

games %>% 
  count(year) %>% 
  filter(n>2)
Comment

1956 is the only year which is repeated three times.

games %>% 
  count(year) %>% 
  filter(n<2) %>% 
  paged_table()
Comment

On the other hand, several years are only mentioned once, as I assume there was only a game per year.

medals %>% 
  count(athlete_id) %>% 
  filter(n>1) %>% 
  arrange(desc(n)) %>% 
  paged_table()
Comment

There are over 57000 duplicates for the “athlete_id” in the “medal” tibble, this is because over the years many athletes have won several medals. I note that the number of occurrences of the id doesn’t mean the equivalent number of medals. Since there are some NAs in the “medal” column, as I will observe later.

medals %>% 
  filter(is.na(athlete_id))

Comment

Since the “medal” tibble is too heavy I cannot use vis_miss(). There fore I filtered by NA. I notice that there is no missing value in the “athlete_id” column.

medals %>%
  tabyl(medal) %>% 
  kable()
medal n percent valid_percent
Bronze 13295 0.0490381 0.3341880
Gold 13372 0.0493221 0.3361235
Silver 13116 0.0483778 0.3296886
NA 231333 0.8532621 NA
Comment

Still regarding the “medals” tibble, I can see that 85% of the values from the “medal” column are missing, 4.8% are silver medals, 4.9% are gold medals, 4.9% are bronze medals.

Data analysis

Athletes info

Part 1

Instructions

Have some athletes competed for different countries over time?

Use inline code to print the result and comment your findings. How would you explain these results?

Explanation

To answer to this question I mostly need the “country” tibble. I can first look for the unique combinations for id and country (the “noc” column) to identify the athletes who competed for different countries by using distinct(). Now each row is an athlete competing for distinct countries. Next, I can group by the id, and count the number of rows per athlete id, which will indicate the number of country changes. To only keep the athletes who changed more than once, I use the filter function. The result is tibble “diff_country”.

Since there are more than 10 athletes who changed countries, I want to extract the athletes who changed the most times with the “top10_diff_country” thanks to slice_max(). I also have to use distinct() because there are several rows with the same id. At the end I create a new tibble called “names” with the athletes’ names and id in order to merge it with the last tibble I created.

diff_country <- country %>% 
  distinct(athlete_id, noc) %>%
  group_by(athlete_id) %>% 
  mutate(n_country_changes = n()) %>% 
  ungroup() %>% 
  filter(n_country_changes >1)

top10_diff_country <- diff_country %>% 
  distinct(athlete_id, n_country_changes) %>% 
  slice_max(order_by = n_country_changes, n= 10, with_ties = FALSE)

names <- athletes %>% 
  select(id, name)

names_top10_diff_country <- top10_diff_country %>% 
  left_join(names, by=c("athlete_id" = "id"))

kable(names_top10_diff_country, caption="Athletes who competed for different countries over time")
Athletes who competed for different countries over time
athlete_id n_country_changes name
59021 4 Makharbek Khazbiyevich Khadartsev
66953 4 Irina Lashko (-Furler-)
72433 4 Ilija Lupulesku
108383 4 Jasna ekari (Brajkovi-)
111969 4 Micha liwiski
2425 3 Yamil Aldama Pozo (-Dodds)
4677 3 Dmitry Borisovich Apanasenko
5447 3 Loedmila Arzjannikova
5839 3 Anzhelika Petrovna “Anzhela” Atroshchenko (-Kinet)
7054 3 Andrey Pavlovich Bakhvalov
Comment

The number of athletes who competed for different countries is 3242. In the top 10, I see that Makharbek Khazbiyevich Khadartsev changed countries 4 times.

Part 2

Instructions

Who are the ten athletes that took part in most games?

Explanation

For this exercise I can mostly use the “medals” tibble. This time, I’m interested in the number of games the athletes participated in. This is the unique combination I have to look for, again I can use distinct() for the id and the games. Then, group by the id, and create a column that counts the number of rows per athlete which equals to the number of games they participated in.

To only get the top 10 I create another tibble “games_athlete10” where I use distinct() on the id and number of games columns to have one row per athlete and use slice_max().

As a final step, I have to merge this tibble with the “names” that contains the athletes’ names and matching ids, the result is the tibble “games_athlete10_name”,

games_athlete <- medals %>% 
  distinct(athlete_id, games) %>% 
  group_by(athlete_id) %>% 
  mutate(n_games_athlete = n()) %>% 
  ungroup()

games_athlete10 <- games_athlete %>% 
  distinct(athlete_id, n_games_athlete) %>% 
  slice_max(order_by = n_games_athlete, n= 10, with_ties = FALSE)

games_athlete10_name <- games_athlete10 %>% 
  left_join(names, by=c("athlete_id" = "id")) 

kable(games_athlete10_name, caption="Top ten athletes who took part in most games")
Top ten athletes who took part in most games
athlete_id n_games_athlete name
79855 10 Ian Millar
65378 9 Afanasijs Kuzmins
99155 9 Hubert Raudaschl
14388 8 Francisco Boza Dibos
26880 8 Rajmond Debevec
28051 8 Piero D’Inzeo
28052 8 Raimondo D’Inzeo
32458 8 Paul Bert Elvstrm
51618 8 Josefa Idem-Guerrini
61572 8 Durward Randolph Knowles
Comment

The athlete Ian Millar is the athlete who participated in most games, with a score of 10 games. With 9 games, Afanasijs Kuzmins and Hubert Raudaschl are the two athletes who come in second position. The rest of the top 10 participated in 8 games.

Part 3

Instructions

Who are the ten athletes that competed in the most events? Did some athletes take part in more than one event during a game?

Explanation

I’m interested to know the number of events the athletes participated overall. For this exercise I want to count the unique combinations of athletes id and events. The same event only counts once. To isolate the unique combinations of athlete id-event, I use distinct() and then use group__by() the id to get the number of rows per id. This number is the number of events per athlete and is computed in the “n_events_athletes” column. The result of this operation is tibble “athletes_events”.

Next, I want to get one row per athlete and extract the top ten athlete that competed in most events. I use the distinct() function and then slice_max().

Finally I merge it with the “names” tibble.

athletes_events <- medals %>% 
  distinct(athlete_id, event) %>% 
  group_by(athlete_id) %>% 
  mutate(n_events_athletes = n()) %>% 
  ungroup()

athletes_events_10 <- athletes_events %>% 
  distinct(athlete_id, n_events_athletes) %>% 
  slice_max(order_by = n_events_athletes, n=10, with_ties = FALSE)

athletes_events_10_names <- athletes_events_10 %>% 
  left_join(names, by= c("athlete_id" = "id"))

kable(athletes_events_10_names, caption="Athletes who competed in most events")
Athletes who competed in most events
athlete_id n_events_athletes name
119591 33 Ioannis Theofilakis
119590 28 Alexandros Theofilakis
18255 24 Gustaf Eric Carlberg
18257 22 Gustaf Vilhelm Carlberg
76886 22 Frangiskos D. Mavrommatis
124156 20 Paul Van Asbroeck
73260 19 Lars Jrgen Madsen
81899 19 Lon Ernest Moreaux
25946 18 Marie Joseph “Raoul” le Borgne de Boigne
55421 18 Lon douard Johnson
Comment

Ioannis Theofilakis is the athlete who competed in the most events (33), after him comes Alexandros Theofilakis who participated in 28 events. These events are counted regardless of the game, so the second question remains to be answered.

Explanation

For this part, I’m interested in the number of events the athletes participated in during each game. This time, I have to first group by the athletes’ id and the games. Within these groups, I now want to isolate the unique combinations of athletes and events with distinct(). Then, create a column “n_event_per_game” that counts the number of rows for these unique combinations within the groups. The result is the tibble “athlete_event_game”.

Next, I want each row to be a distinct athlete, game and number of event per game and filter out the athletes who did one event per game. The result is tibble “athlete_event_game_diff”.

Finally, I merge it with the “names” tibble.

athlete_event_game <- medals %>% 
  group_by(athlete_id, games) %>% 
  distinct(athlete_id, event) %>% 
  mutate(n_event_per_game = n()) %>% 
  ungroup()
  
athlete_event_game_diff <- athlete_event_game %>% 
  distinct(athlete_id, games, n_event_per_game) %>% 
  filter(n_event_per_game > 1) %>% 
  arrange(desc(n_event_per_game))

athlete_event_game_diff_names <- athlete_event_game_diff %>% 
  left_join(names, by=c("athlete_id" = "id"))

athlete_event_game_diff_names %>% 
  head(10) %>% 
  kable()
athlete_id games n_event_per_game name
68189 1920 Summer 15 Willis Augustus Lee, Jr.
113935 1920 Summer 13 Lloyd Spencer Spooner
25946 1906 Summer 12 Marie Joseph “Raoul” le Borgne de Boigne
34282 1906 Summer 12 Maurice Faure
36489 1906 Summer 12 Jean Fouconnier
107613 1896 Summer 12 Carl Schuhmann
119591 1912 Summer 12 Ioannis Theofilakis
73260 1920 Summer 11 Lars Jrgen Madsen
76886 1912 Summer 11 Frangiskos D. Mavrommatis
78929 1906 Summer 11 Sidney Louis Walter Merlin
Comment

During the games of 1920 Summer, Willis Augustus Lee, Jr. participated in 15 events. In second position, comes Lloyd Spencer Spooner who participated in 13 events during the games of 1920 Summer. In third position, comes Marie Joseph “Raoul” le Borgne de Boigne who participated in 1906 Summer events during the games of 1906 Summer.

Countries info

Part 4

Instructions

Create a new table showing the number of medals per country (rows) and per year (column). Keep only the 15 countries with the most medals overall.

Explanations

I need to merge the “medals” tibble and the “country” tibble because the “team” column is not as precise as the “noc” column from the “country” tibble (I noticed that sometimes the team column has more than one country, or other values that are not countries).

After joining “country” and “medals”, I have to keep the 15 countries with the most medals overall, which means I can discard all rows with no medals, and then count the number of rows per country. The result it tibble “medals_count”.

Finally I use slice_max() to only obtain the first 15 countries with most medals overall. The result is tibble “medals_count_15”.

medals_count <- medals %>% 
  left_join(country) %>% 
  filter(!is.na(medal)) %>% 
  count(noc, sort = TRUE)

medals_count_15 <- medals_count %>% 
  slice_max(n, n=15) %>% 
  rename(country=noc)

  kable(medals_count_15, caption = "Top 15 countries with most medals overall")
Top 15 countries with most medals overall
country n
USA 5637
URS 2503
GER 2165
GBR 2068
FRA 1777
ITA 1637
SWE 1536
CAN 1352
AUS 1320
RUS 1165
HUN 1135
NED 1040
NOR 1033
GDR 1005
CHN 989
Comment

This table show the 15 countries with most medals overall. Yet, this table lost many data points that interest me, such as the years and number of medals per years. I’m also still missing the number of medals per country, per year.

I can compute the number of medals per country and year in a new column. To make sure I don’t build on previous errors, I decide to start again with the original tibble “medals” and join it with “country”. Next, I crate a “year” column by extracting every 4 digits from the column “games”. Finally, I can use group_by() on the country and year and create a new column “n_medals_country_year”.

medals_count_year <- medals %>% 
  left_join(country) %>% 
  filter(!is.na(medal)) %>% 
  mutate(year = str_extract(games, "\\b\\d{4}\\b")) %>% 
  group_by(year, noc) %>% 
  mutate(n_medals_country_year = n(), year, noc) %>% 
  ungroup()

Now, I can join the top 15 country selection with the most medals overall tibble: “medals_count_15” with “medals_count_year” and have country = noc as a key. I will then have to use the pivot function to have one year per column, and the number of medals per year, per country. In this question there is no distinction between a glod, silver or bronze medal, all types of medals are taken into account and summed.

top_15_countries_medals_year <- medals_count_15 %>% 
  left_join(medals_count_year, by= c("country" = "noc")) %>% 
  select(country, year, n_medals_country_year) %>% 
  distinct(country, year, n_medals_country_year) %>% 
  pivot_wider(names_from = "year", values_from = "n_medals_country_year", values_fill =0) %>%
  select(country, sort(names(.)[-1], decreasing = TRUE)) %>% 
  mutate(total = rowSums(across(-country)))

library(kableExtra)
top_15_countries_medals_year %>% 
  kable(format = "html", caption = "Number of medals per year for top 15 country with most medals overall") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width = "100%", height = "400px")
Number of medals per year for top 15 country with most medals overall
country 2016 2014 2012 2010 2008 2006 2004 2002 2000 1998 1996 1994 1992 1988 1984 1980 1976 1972 1968 1964 1960 1956 1952 1948 1936 1932 1928 1924 1920 1912 1908 1906 1904 1900 1896 total
USA 264 64 248 97 317 52 263 84 242 34 259 19 238 214 361 30 175 195 173 177 152 149 164 168 112 223 102 194 194 107 65 24 394 63 20 5637
URS 0 0 0 0 0 0 0 0 0 0 0 0 0 366 56 496 342 259 229 221 211 206 117 0 0 0 0 0 0 0 0 0 0 0 0 2503
GER 159 36 94 54 99 54 149 61 118 44 124 40 236 0 0 0 0 0 0 126 98 54 52 0 231 57 82 0 0 53 21 30 16 45 32 2165
GBR 145 10 126 1 81 1 57 6 54 4 26 3 50 54 73 48 33 29 19 30 28 46 32 63 53 35 59 95 111 170 368 39 2 108 9 2068
FRA 96 18 82 14 77 15 53 15 66 15 51 11 69 31 70 30 21 25 36 38 15 33 41 77 45 42 52 120 141 26 40 64 2 235 11 1777
ITA 72 14 68 5 42 25 104 21 65 17 71 32 66 37 65 40 35 34 41 61 89 55 56 68 76 77 72 51 84 25 8 56 0 5 0 1637
SWE 28 51 22 18 7 64 12 26 32 7 31 24 42 46 64 43 12 26 25 43 16 49 93 88 45 35 53 81 154 190 84 21 0 4 0 1536
CAN 69 86 55 90 35 69 17 74 31 49 50 37 80 29 91 2 26 12 30 12 30 39 21 22 35 59 45 29 16 8 52 2 48 2 0 1352
AUS 82 3 114 3 149 2 157 2 183 1 132 4 57 35 52 13 23 20 51 44 46 67 20 16 1 5 4 11 7 0 0 3 4 6 3 1320
RUS 115 68 140 25 142 41 189 38 187 52 115 36 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14 3 0 0 0 0 1165
HUN 22 0 26 0 27 0 40 0 53 0 43 0 45 44 0 63 55 81 81 56 66 66 104 64 43 35 21 22 0 30 20 13 4 5 6 1135
NED 47 29 69 11 61 13 77 8 79 11 73 4 37 52 41 10 20 15 23 33 5 0 21 42 37 13 57 27 50 26 16 6 0 27 0 1040
NOR 19 36 17 39 22 23 7 41 43 40 23 30 47 33 19 15 16 25 28 15 11 7 32 31 46 10 24 39 137 80 42 25 2 9 0 1033
GDR 0 0 0 0 0 0 0 0 0 0 0 0 0 213 33 303 227 171 58 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1005
CHN 113 12 125 19 184 13 94 16 79 14 106 3 85 52 74 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 989

Part 5

Instructions

Is there a relationship between country and the probability of winning a medal?

We know that some countries win more medals than others. In this part we want to work out whether some countries win more medals simply because they have more athletes competing for more events, or whether their athletes are in fact performing better in general.

For this part, look only at the top 15 countries that you established in part 6.

  • Create a horizontal barchart listing for each of the fifteen countries the percentage of medals won out of all medals competed for by that country.

Explanation

For this exercise, I can’t ignore the NAs in the “medal” column, because to compute the percentage of medals won, I need the total medals competed for. The percentage of medal won per country = medals won * 100 /total medals competed for.

As a first step I count the rows per country in the “medals1” tibble, because it already has a year and country column. The rows per country are the number of athletes competing for each event, i.e. the number of medals the country is competing for.

medals_total <- medals %>%
  left_join(country) %>% 
  count(noc) %>% 
  rename(n_total = n)

As a second step, I create another tibble that only contains the medals won. I filter the NAs medals out and count again the rows per country. This tibble “medals_won” will only have 2 columns “noc” and “n”.

medals_won <- medals %>%
  left_join(country) %>% 
  filter(!is.na(medal)) %>% 
  count(noc)

As a third step, I will compute the percentage by first joining the tibble with the medals won only “medals_won” with the tibble that has the total of medals competed for, “medals_total”.

medal_percentage <- medals_won %>% 
  left_join(medals_total, by= c("noc")) %>%
  group_by(noc) %>% 
  mutate(percentage_m_won = n/n_total*100) %>% 
  ungroup() %>% 
  rename(country = noc)

The fourth step is to only keep the top 15 country I have selected earlier. To do so I am merging the top 15 countries with the tibble I just created with the percentage of won medals per country “medal_percentage”.

top_15 <- medals_count_15 %>% 
  select(-n) # to have a clean joined tibble since "n" exists in both medals_count_15 and medal_percentage

medal_percentage_15 <- top_15 %>% 
  left_join(medal_percentage) %>% 
  mutate(percentage_m_won = round(percentage_m_won, 1))

Creating a bar plot

As a next step, I have to create a bar plot that will show the percentage of medals won per country over the years.

medal_percentage_15 %>% 
  ggplot(aes(x= reorder(country, percentage_m_won),
              y=percentage_m_won,
             fill = country)) +
  geom_col() +
  theme_minimal()+
  labs(title = "Percentage of won medals for the top 15 countries",
       x = "countries",
       y = "percentage")

medals_total %>% 
  filter(noc == "GDR") %>% 
  pull(n_total) %>% 
  first()
## [1] 2645
medals_total %>% 
  filter(noc == "URS") %>% 
  pull(n_total) %>% 
  first()
## [1] 5685
Comment

I notice that overall countries with the most athletes competing for most events also win the most medals: URS, USA, RUS are in the top 4. An interesting point is that USA was the country with most events competed for with 18853 but has 30% of won medals versus 38% for GDR (East Germany) that had 2645 total competitors and URS has close to 45% of won medals for 5685 a total that is also lower than for the USA.

Explanation

Chi-Squared test

To further confirm whether or not there is a relationship between the country and the probability to win medals, I have to perform an analysis with a chi-squared test. The value of the chi-squared statistic will determine the strength of the relationship between the 2 categorical variables. If it is high, this means the relationship is strong, and vice versa. The two categorical values are: country and medal status (won/lost).

Hypothesis: Countries competing with most events have a higher probability to win medals. Null hypothesis: Regardless of the number of competed events, the probability to win medals is equivalent for countries.

First I have to take the original tibble “medals” join it with “country” to obtain the column for the countries which is the explanatory variable. Next I only filter the top 15 countries and create the column “status” with “won” and “lost” values depending on whether a medal was won for an event or not. This is the response variable.

medal_status <- medals %>% 
  left_join(country) %>% 
  rename(country = noc) %>% 
  filter(country %in% top_15$country) %>% 
  mutate(status = if_else(is.na(medal), "lost", "won"))

Finally, I perform a chi-squared test.

test_country_medal<- chisq_test(medal_status, formula = status ~ country)

test_country_medal

Comment

The p-value is 0. This value is below 0.05, this means that the null hypothesis can be rejected. As a consequence I can’t reject the hypothesis that there is a relationship between the countries - number of events competed for - and the probability to win medals for the top 15 countries.

Athletes and Sports

Part 6

Instructions

Create a scatterplot showing the average height versus the average weight of competitors per sport (one dot per sport). Add labels with the sport names for:

  • the largest average height
  • the largest average weight
  • the smallest average height
  • the smallest average weight
  • the largest average BMI
  • the smallest average BMI

It might be that the same dot qualifies for multiple labels. The formula to calculate Body Mass Index (BMI) is:

weight in kg / (height in meters)^2

Explanation

I need columns for:

  • average height

  • average weight

  • bmi (although not necessary I prefer to compute to keep an intermediary step)

  • average bmi

As a first step, I create a column with the height in meters. Then, the average will have to be the result of competitors per sport, so I have to group by sport to obtain those averages and create a column for each average. I also need to avoid the NAs.

athletes_medals <- athletes %>% 
  left_join(medals, by=c("id" = "athlete_id"))

athletes_medals_av_bmi <- athletes_medals %>%
  mutate(height_m = height /100) %>% 
  group_by(sport) %>% 
  summarise(average_weight = mean(weight, na.rm = TRUE),
         average_height_m = mean(height_m, na.rm = TRUE),
         bmi = weight/(height_m^2),
         average_bmi = mean(bmi, na.rm = TRUE)) %>% 
  ungroup()

To label the sports with:

  • the largest average height

  • the largest average weight

  • the smallest average height

  • the smallest average weight

  • the largest average BMI

  • the smallest average BMI

I have to create a separate data set with only those values. This tibble will be used in ggplot2. I take the tibble I just created “athletes_medals_av_bmi” with all the necessary columns and use distinct() to have unique values per sport, average weight, height and bmi, and then filter the extreme values for each category with max() and min().

sport_1 <- athletes_medals_av_bmi %>% 
  distinct(sport, average_weight, average_height_m, average_bmi) %>% 
  filter(
    average_weight == min(average_weight, na.rm = TRUE)|
      average_weight == max(average_weight, na.rm = TRUE)|
      average_bmi == min(average_bmi, na.rm= TRUE)|
      average_bmi == max(average_bmi, na.rm = TRUE)|
      average_height_m == min(average_height_m, na.rm = TRUE)|
      average_height_m == max(average_height_m, na.rm = TRUE))

Before using “sport_1” to highlight the extreme values in BMI, average height and weight, I add another column named “label” to specify why these data points are of interest.

sport_labels<- sport_1 %>% 
  mutate(label= case_when(
    sport == "Basketball"~ "Basketball: Max Height",
    sport == "Rhythmic Gymnastics" ~ "Rhythmic Gymnastics: Min Weight and BMI",
    sport == "Tug-Of-War" ~ "Tug-Of-War: Max Weight",
    sport == "Gymnastics" ~ "Gymnastics: Min Height",
    sport == "Weightlifting" ~ "Weightlifting: Max BMI"
  ))

Creating a scatter plot

This plot shows the average height per average weight. Each dot is a sport, the labelled ones are the extreme averages.

athletes_medals_av_bmi %>% 
  distinct(sport, average_weight, average_height_m, average_bmi) %>%
  ggplot(aes(x = average_height_m, y = average_weight)) +
  geom_point() +  
  geom_text_repel( 
    data = sport_labels, 
    aes(label = label),
    color = "purple",
    size = 2) +
  labs(
    title = "Average height vs weight by Sport (Labeling Extremes)",
    x = "Average height (m)",
    y = "Average weight (kg)"
  ) +
  theme_minimal()

Comment

The Rhythmic Gymnastics has the smallest average weight and bmi. Tug-Of-War has the largest average weight. Basketball has the largest average height. Gymnastics has the smallest average height. Weightlifting has the largest average bmi.

Part 7

Instructions

Create a line plot showing the number of medals given by year (one line for Gold, one line for Silver and one line for Bronze). Does it change over time? Use facets to separate the medals at Summer games and Winter games.

Comment the results below the plot.

Explanation

I will mostly rely on the “medals” tibble for this exercise. To create the line plot I need a column with the years and a column with the seasons (winter/summer). I will first create the 2 columns by separating the “games” column and make sure the year column is considered as an integer.

medals_7 <- medals %>% 
  separate(games, into = c("year", "season"), sep = " ") %>% 
  mutate(year = as.integer(year))

As a second step, I will count the number of gold, silver and bronze medals by creating a column for each medal type. Then, I will be able to group by year and count the number of gold, silver and bronze medals per year, again one column per medal type.

medals_n_types <- medals_7 %>% 
  mutate(n_gold = if_else(medal == "Gold", 1L, 0L), # 1L and 0L return integers
         n_silver = if_else(medal == "Silver", 1L, 0L),
         n_bronze = if_else(medal == "Bronze", 1L, 0L)) %>% 
  group_by(year, season) %>% 
  mutate(gold = sum(n_gold, na.rm = TRUE),
         silver = sum(n_silver, na.rm = TRUE),
         bronze = sum(n_bronze, na.rm = TRUE)) %>% 
  ungroup()

As a third step, I will pivot_longer() the columns counting the number of gold, silver, and bronze per year, so that they become rows, and their values fill the column “n_medals” that will match the years, season and medal type. The “medal_type” column is important because it will be useful to have one line per medal type in my plot.

medals_n_types_long <- medals_n_types %>% 
  pivot_longer(cols= c(gold, silver, bronze),
               names_to = "medal_type",
               values_to = "n_medals") %>% 
  distinct(year, season, medal_type, n_medals) %>% 
  paged_table()

Creating a line plot

This plot represents the number of medals per year. Each line is medal type. Each facet is a season.

medals_n_types_long %>% 
  ggplot(aes(x = year,
             y = n_medals))+
  geom_line(aes(group=medal_type, color = medal_type))+
  facet_wrap(vars(season))+
  theme_minimal() +
  labs(title = "Number of medals per year",
       subtitle = "Using facets to separate the Summer and Winter games",
       x = "Years",
       y = "Number of medals", 
       color = "Medal type" )

Comment

The lines start around 1900 for the summer games, whereas the lines start after 1920 for the winter games. I assume the first games were only in summer. The medals for summer games have been steadily increasing from 100 in 1900 to 700 early 2010. There was a important increase in 1920, after that the number of medals decreased. The winter games show a different trend: the medals increased progressively but slower than for the summer games. In 1930, there were around 50 medals per medal type and in 2010, there were only 200.