The excel file “olympics.xlsx” has 4 sheets that I have to load separately. The following code load the “athletes” sheet.
athletes <- read_excel("olympics.xlsx", sheet = "athletes") %>%
clean_names() %>%
glimpse()
## Rows: 135,571
## Columns: 5
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, …
## $ name <chr> "A Dijiang", "A Lamusi", "Gunnar Nielsen Aaby", "Edgar Lindenau…
## $ sex <chr> "M", "M", "M", "M", "F", "M", "M", "F", "M", "M", "M", "M", "F"…
## $ height <dbl> 180, 170, NA, NA, 185, 188, 183, 168, 186, NA, 182, 172, 159, 1…
## $ weight <dbl> 80.0, 60.0, NA, NA, 82.0, 75.0, 72.0, NA, 96.0, NA, 76.5, 70.0,…
The following code load the “country” sheet.
country <- read_excel("olympics.xlsx", sheet = "country") %>%
clean_names() %>%
glimpse()
## Rows: 187,452
## Columns: 4
## $ athlete_id <dbl> 1, 2, 3, 4, 5, 5, 5, 6, 6, 7, 7, 8, 9, 10, 11, 12, 13, 13, …
## $ age <dbl> 24, 23, 24, 34, 21, 25, 27, 31, 33, 31, 33, 18, 26, 26, 22,…
## $ noc <chr> "CHN", "CHN", "DEN", "DEN", "NED", "NED", "NED", "USA", "US…
## $ games <chr> "1992 Summer", "2012 Summer", "1920 Summer", "1900 Summer",…
The following code load the “games” sheet.
games <- read_excel("olympics.xlsx", sheet = "games") %>%
clean_names() %>%
glimpse()
## Rows: 52
## Columns: 4
## $ games <chr> "1992 Summer", "2012 Summer", "1920 Summer", "1900 Summer", "19…
## $ year <dbl> 1992, 2012, 1920, 1900, 1988, 1992, 1994, 1932, 2002, 1952, 198…
## $ season <chr> "Summer", "Summer", "Summer", "Summer", "Winter", "Winter", "Wi…
## $ city <chr> "Barcelona", "London", "Antwerpen", "Paris", "Calgary", "Albert…
I am asked to check the year 1956.
games %>%
filter(year == 1956)
Each year usually has 2 games: 1 for summer and 1 for winter. In this case according to the data, there was 3 games, 2 summer games took place in different cities: Melbourne and Stockholm. According to the IOC website (https://www.olympics.com/en/olympic-games/melbourne-1956) the 1956 summer games were held in Melbourne. Even though one equestrian competition was held in Stockholm which can be the cause of this anomaly. In the future analysis I will consider that the 1956 summer game was held in Melbourne.
The following code load the “medals” sheet.
medals <- read_excel("olympics.xlsx", sheet = "medals") %>%
clean_names() %>%
glimpse()
## Rows: 271,116
## Columns: 6
## $ athlete_id <dbl> 1, 2, 3, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7,…
## $ games <chr> "1992 Summer", "2012 Summer", "1920 Summer", "1900 Summer",…
## $ team <chr> "China", "China", "Denmark", "Denmark/Sweden", "Netherlands…
## $ sport <chr> "Basketball", "Judo", "Football", "Tug-Of-War", "Speed Skat…
## $ event <chr> "Basketball Men's Basketball", "Judo Men's Extra-Lightweigh…
## $ medal <chr> NA, NA, NA, "Gold", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
I will use vis_miss() for as many tibble as possible to check the missing values. And then check for duplicates for the “id” columns especially.
vis_miss(athletes)
There is no duplicates in the “id” column of the “athletes” tibble.
athletes %>%
drop_na() %>%
summarise(mean_weight = mean(weight),
min_weight = min(weight),
max_weight = max(weight),
mean_height = mean(height),
min_height = min(height),
max_height = max(height))
The values from the weight and height seem realistic except for the maximum height of 226 cm and the minimum weight of 25 kg.
vis_miss(country)
The above vizualization shows that there are 4% of missing values in the “age” column in the “country” tibble. The rest of the columns don’t have any missing values.
country %>%
count(athlete_id) %>%
filter(n>1) %>%
arrange(desc(n)) %>%
paged_table()
There are several “athlete_id” duplicates in the “country” column. The reason is probably because the same athletes have participated in different games and maybe even for different countries.
vis_miss(games)
The above vizualization shows that there no missing values in the “games” tibble.
games %>%
count(year) %>%
filter(n>2)
1956 is the only year which is repeated three times.
games %>%
count(year) %>%
filter(n<2) %>%
paged_table()
On the other hand, several years are only mentioned once, as I assume there was only a game per year.
medals %>%
count(athlete_id) %>%
filter(n>1) %>%
arrange(desc(n)) %>%
paged_table()
There are over 57000 duplicates for the “athlete_id” in the “medal” tibble, this is because over the years many athletes have won several medals. I note that the number of occurrences of the id doesn’t mean the equivalent number of medals. Since there are some NAs in the “medal” column, as I will observe later.
medals %>%
filter(is.na(athlete_id))
Since the “medal” tibble is too heavy I cannot use vis_miss(). There fore I filtered by NA. I notice that there is no missing value in the “athlete_id” column.
medals %>%
tabyl(medal) %>%
kable()
| medal | n | percent | valid_percent |
|---|---|---|---|
| Bronze | 13295 | 0.0490381 | 0.3341880 |
| Gold | 13372 | 0.0493221 | 0.3361235 |
| Silver | 13116 | 0.0483778 | 0.3296886 |
| NA | 231333 | 0.8532621 | NA |
Still regarding the “medals” tibble, I can see that 85% of the values from the “medal” column are missing, 4.8% are silver medals, 4.9% are gold medals, 4.9% are bronze medals.
Have some athletes competed for different countries over time?
Use inline code to print the result and comment your findings. How would you explain these results?
To answer to this question I mostly need the “country” tibble. I can first look for the unique combinations for id and country (the “noc” column) to identify the athletes who competed for different countries by using distinct(). Now each row is an athlete competing for distinct countries. Next, I can group by the id, and count the number of rows per athlete id, which will indicate the number of country changes. To only keep the athletes who changed more than once, I use the filter function. The result is tibble “diff_country”.
Since there are more than 10 athletes who changed countries, I want to extract the athletes who changed the most times with the “top10_diff_country” thanks to slice_max(). I also have to use distinct() because there are several rows with the same id. At the end I create a new tibble called “names” with the athletes’ names and id in order to merge it with the last tibble I created.
diff_country <- country %>%
distinct(athlete_id, noc) %>%
group_by(athlete_id) %>%
mutate(n_country_changes = n()) %>%
ungroup() %>%
filter(n_country_changes >1)
top10_diff_country <- diff_country %>%
distinct(athlete_id, n_country_changes) %>%
slice_max(order_by = n_country_changes, n= 10, with_ties = FALSE)
names <- athletes %>%
select(id, name)
names_top10_diff_country <- top10_diff_country %>%
left_join(names, by=c("athlete_id" = "id"))
kable(names_top10_diff_country, caption="Athletes who competed for different countries over time")
| athlete_id | n_country_changes | name |
|---|---|---|
| 59021 | 4 | Makharbek Khazbiyevich Khadartsev |
| 66953 | 4 | Irina Lashko (-Furler-) |
| 72433 | 4 | Ilija Lupulesku |
| 108383 | 4 | Jasna ekari (Brajkovi-) |
| 111969 | 4 | Micha liwiski |
| 2425 | 3 | Yamil Aldama Pozo (-Dodds) |
| 4677 | 3 | Dmitry Borisovich Apanasenko |
| 5447 | 3 | Loedmila Arzjannikova |
| 5839 | 3 | Anzhelika Petrovna “Anzhela” Atroshchenko (-Kinet) |
| 7054 | 3 | Andrey Pavlovich Bakhvalov |
The number of athletes who competed for different countries is 3242. In the top 10, I see that Makharbek Khazbiyevich Khadartsev changed countries 4 times.
Who are the ten athletes that took part in most games?
For this exercise I can mostly use the “medals” tibble. This time, I’m interested in the number of games the athletes participated in. This is the unique combination I have to look for, again I can use distinct() for the id and the games. Then, group by the id, and create a column that counts the number of rows per athlete which equals to the number of games they participated in.
To only get the top 10 I create another tibble “games_athlete10” where I use distinct() on the id and number of games columns to have one row per athlete and use slice_max().
As a final step, I have to merge this tibble with the “names” that contains the athletes’ names and matching ids, the result is the tibble “games_athlete10_name”,
games_athlete <- medals %>%
distinct(athlete_id, games) %>%
group_by(athlete_id) %>%
mutate(n_games_athlete = n()) %>%
ungroup()
games_athlete10 <- games_athlete %>%
distinct(athlete_id, n_games_athlete) %>%
slice_max(order_by = n_games_athlete, n= 10, with_ties = FALSE)
games_athlete10_name <- games_athlete10 %>%
left_join(names, by=c("athlete_id" = "id"))
kable(games_athlete10_name, caption="Top ten athletes who took part in most games")
| athlete_id | n_games_athlete | name |
|---|---|---|
| 79855 | 10 | Ian Millar |
| 65378 | 9 | Afanasijs Kuzmins |
| 99155 | 9 | Hubert Raudaschl |
| 14388 | 8 | Francisco Boza Dibos |
| 26880 | 8 | Rajmond Debevec |
| 28051 | 8 | Piero D’Inzeo |
| 28052 | 8 | Raimondo D’Inzeo |
| 32458 | 8 | Paul Bert Elvstrm |
| 51618 | 8 | Josefa Idem-Guerrini |
| 61572 | 8 | Durward Randolph Knowles |
The athlete Ian Millar is the athlete who participated in most games, with a score of 10 games. With 9 games, Afanasijs Kuzmins and Hubert Raudaschl are the two athletes who come in second position. The rest of the top 10 participated in 8 games.
Who are the ten athletes that competed in the most events? Did some athletes take part in more than one event during a game?
I’m interested to know the number of events the athletes participated overall. For this exercise I want to count the unique combinations of athletes id and events. The same event only counts once. To isolate the unique combinations of athlete id-event, I use distinct() and then use group__by() the id to get the number of rows per id. This number is the number of events per athlete and is computed in the “n_events_athletes” column. The result of this operation is tibble “athletes_events”.
Next, I want to get one row per athlete and extract the top ten athlete that competed in most events. I use the distinct() function and then slice_max().
Finally I merge it with the “names” tibble.
athletes_events <- medals %>%
distinct(athlete_id, event) %>%
group_by(athlete_id) %>%
mutate(n_events_athletes = n()) %>%
ungroup()
athletes_events_10 <- athletes_events %>%
distinct(athlete_id, n_events_athletes) %>%
slice_max(order_by = n_events_athletes, n=10, with_ties = FALSE)
athletes_events_10_names <- athletes_events_10 %>%
left_join(names, by= c("athlete_id" = "id"))
kable(athletes_events_10_names, caption="Athletes who competed in most events")
| athlete_id | n_events_athletes | name |
|---|---|---|
| 119591 | 33 | Ioannis Theofilakis |
| 119590 | 28 | Alexandros Theofilakis |
| 18255 | 24 | Gustaf Eric Carlberg |
| 18257 | 22 | Gustaf Vilhelm Carlberg |
| 76886 | 22 | Frangiskos D. Mavrommatis |
| 124156 | 20 | Paul Van Asbroeck |
| 73260 | 19 | Lars Jrgen Madsen |
| 81899 | 19 | Lon Ernest Moreaux |
| 25946 | 18 | Marie Joseph “Raoul” le Borgne de Boigne |
| 55421 | 18 | Lon douard Johnson |
Ioannis Theofilakis is the athlete who competed in the most events (33), after him comes Alexandros Theofilakis who participated in 28 events. These events are counted regardless of the game, so the second question remains to be answered.
For this part, I’m interested in the number of events the athletes participated in during each game. This time, I have to first group by the athletes’ id and the games. Within these groups, I now want to isolate the unique combinations of athletes and events with distinct(). Then, create a column “n_event_per_game” that counts the number of rows for these unique combinations within the groups. The result is the tibble “athlete_event_game”.
Next, I want each row to be a distinct athlete, game and number of event per game and filter out the athletes who did one event per game. The result is tibble “athlete_event_game_diff”.
Finally, I merge it with the “names” tibble.
athlete_event_game <- medals %>%
group_by(athlete_id, games) %>%
distinct(athlete_id, event) %>%
mutate(n_event_per_game = n()) %>%
ungroup()
athlete_event_game_diff <- athlete_event_game %>%
distinct(athlete_id, games, n_event_per_game) %>%
filter(n_event_per_game > 1) %>%
arrange(desc(n_event_per_game))
athlete_event_game_diff_names <- athlete_event_game_diff %>%
left_join(names, by=c("athlete_id" = "id"))
athlete_event_game_diff_names %>%
head(10) %>%
kable()
| athlete_id | games | n_event_per_game | name |
|---|---|---|---|
| 68189 | 1920 Summer | 15 | Willis Augustus Lee, Jr. |
| 113935 | 1920 Summer | 13 | Lloyd Spencer Spooner |
| 25946 | 1906 Summer | 12 | Marie Joseph “Raoul” le Borgne de Boigne |
| 34282 | 1906 Summer | 12 | Maurice Faure |
| 36489 | 1906 Summer | 12 | Jean Fouconnier |
| 107613 | 1896 Summer | 12 | Carl Schuhmann |
| 119591 | 1912 Summer | 12 | Ioannis Theofilakis |
| 73260 | 1920 Summer | 11 | Lars Jrgen Madsen |
| 76886 | 1912 Summer | 11 | Frangiskos D. Mavrommatis |
| 78929 | 1906 Summer | 11 | Sidney Louis Walter Merlin |
During the games of 1920 Summer, Willis Augustus Lee, Jr. participated in 15 events. In second position, comes Lloyd Spencer Spooner who participated in 13 events during the games of 1920 Summer. In third position, comes Marie Joseph “Raoul” le Borgne de Boigne who participated in 1906 Summer events during the games of 1906 Summer.
Create a new table showing the number of medals per country (rows) and per year (column). Keep only the 15 countries with the most medals overall.
I need to merge the “medals” tibble and the “country” tibble because the “team” column is not as precise as the “noc” column from the “country” tibble (I noticed that sometimes the team column has more than one country, or other values that are not countries).
After joining “country” and “medals”, I have to keep the 15 countries with the most medals overall, which means I can discard all rows with no medals, and then count the number of rows per country. The result it tibble “medals_count”.
Finally I use slice_max() to only obtain the first 15 countries with most medals overall. The result is tibble “medals_count_15”.
medals_count <- medals %>%
left_join(country) %>%
filter(!is.na(medal)) %>%
count(noc, sort = TRUE)
medals_count_15 <- medals_count %>%
slice_max(n, n=15) %>%
rename(country=noc)
kable(medals_count_15, caption = "Top 15 countries with most medals overall")
| country | n |
|---|---|
| USA | 5637 |
| URS | 2503 |
| GER | 2165 |
| GBR | 2068 |
| FRA | 1777 |
| ITA | 1637 |
| SWE | 1536 |
| CAN | 1352 |
| AUS | 1320 |
| RUS | 1165 |
| HUN | 1135 |
| NED | 1040 |
| NOR | 1033 |
| GDR | 1005 |
| CHN | 989 |
This table show the 15 countries with most medals overall. Yet, this table lost many data points that interest me, such as the years and number of medals per years. I’m also still missing the number of medals per country, per year.
I can compute the number of medals per country and year in a new column. To make sure I don’t build on previous errors, I decide to start again with the original tibble “medals” and join it with “country”. Next, I crate a “year” column by extracting every 4 digits from the column “games”. Finally, I can use group_by() on the country and year and create a new column “n_medals_country_year”.
medals_count_year <- medals %>%
left_join(country) %>%
filter(!is.na(medal)) %>%
mutate(year = str_extract(games, "\\b\\d{4}\\b")) %>%
group_by(year, noc) %>%
mutate(n_medals_country_year = n(), year, noc) %>%
ungroup()
Now, I can join the top 15 country selection with the most medals overall tibble: “medals_count_15” with “medals_count_year” and have country = noc as a key. I will then have to use the pivot function to have one year per column, and the number of medals per year, per country. In this question there is no distinction between a glod, silver or bronze medal, all types of medals are taken into account and summed.
top_15_countries_medals_year <- medals_count_15 %>%
left_join(medals_count_year, by= c("country" = "noc")) %>%
select(country, year, n_medals_country_year) %>%
distinct(country, year, n_medals_country_year) %>%
pivot_wider(names_from = "year", values_from = "n_medals_country_year", values_fill =0) %>%
select(country, sort(names(.)[-1], decreasing = TRUE)) %>%
mutate(total = rowSums(across(-country)))
library(kableExtra)
top_15_countries_medals_year %>%
kable(format = "html", caption = "Number of medals per year for top 15 country with most medals overall") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
scroll_box(width = "100%", height = "400px")
| country | 2016 | 2014 | 2012 | 2010 | 2008 | 2006 | 2004 | 2002 | 2000 | 1998 | 1996 | 1994 | 1992 | 1988 | 1984 | 1980 | 1976 | 1972 | 1968 | 1964 | 1960 | 1956 | 1952 | 1948 | 1936 | 1932 | 1928 | 1924 | 1920 | 1912 | 1908 | 1906 | 1904 | 1900 | 1896 | total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| USA | 264 | 64 | 248 | 97 | 317 | 52 | 263 | 84 | 242 | 34 | 259 | 19 | 238 | 214 | 361 | 30 | 175 | 195 | 173 | 177 | 152 | 149 | 164 | 168 | 112 | 223 | 102 | 194 | 194 | 107 | 65 | 24 | 394 | 63 | 20 | 5637 |
| URS | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 366 | 56 | 496 | 342 | 259 | 229 | 221 | 211 | 206 | 117 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2503 |
| GER | 159 | 36 | 94 | 54 | 99 | 54 | 149 | 61 | 118 | 44 | 124 | 40 | 236 | 0 | 0 | 0 | 0 | 0 | 0 | 126 | 98 | 54 | 52 | 0 | 231 | 57 | 82 | 0 | 0 | 53 | 21 | 30 | 16 | 45 | 32 | 2165 |
| GBR | 145 | 10 | 126 | 1 | 81 | 1 | 57 | 6 | 54 | 4 | 26 | 3 | 50 | 54 | 73 | 48 | 33 | 29 | 19 | 30 | 28 | 46 | 32 | 63 | 53 | 35 | 59 | 95 | 111 | 170 | 368 | 39 | 2 | 108 | 9 | 2068 |
| FRA | 96 | 18 | 82 | 14 | 77 | 15 | 53 | 15 | 66 | 15 | 51 | 11 | 69 | 31 | 70 | 30 | 21 | 25 | 36 | 38 | 15 | 33 | 41 | 77 | 45 | 42 | 52 | 120 | 141 | 26 | 40 | 64 | 2 | 235 | 11 | 1777 |
| ITA | 72 | 14 | 68 | 5 | 42 | 25 | 104 | 21 | 65 | 17 | 71 | 32 | 66 | 37 | 65 | 40 | 35 | 34 | 41 | 61 | 89 | 55 | 56 | 68 | 76 | 77 | 72 | 51 | 84 | 25 | 8 | 56 | 0 | 5 | 0 | 1637 |
| SWE | 28 | 51 | 22 | 18 | 7 | 64 | 12 | 26 | 32 | 7 | 31 | 24 | 42 | 46 | 64 | 43 | 12 | 26 | 25 | 43 | 16 | 49 | 93 | 88 | 45 | 35 | 53 | 81 | 154 | 190 | 84 | 21 | 0 | 4 | 0 | 1536 |
| CAN | 69 | 86 | 55 | 90 | 35 | 69 | 17 | 74 | 31 | 49 | 50 | 37 | 80 | 29 | 91 | 2 | 26 | 12 | 30 | 12 | 30 | 39 | 21 | 22 | 35 | 59 | 45 | 29 | 16 | 8 | 52 | 2 | 48 | 2 | 0 | 1352 |
| AUS | 82 | 3 | 114 | 3 | 149 | 2 | 157 | 2 | 183 | 1 | 132 | 4 | 57 | 35 | 52 | 13 | 23 | 20 | 51 | 44 | 46 | 67 | 20 | 16 | 1 | 5 | 4 | 11 | 7 | 0 | 0 | 3 | 4 | 6 | 3 | 1320 |
| RUS | 115 | 68 | 140 | 25 | 142 | 41 | 189 | 38 | 187 | 52 | 115 | 36 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 3 | 0 | 0 | 0 | 0 | 1165 |
| HUN | 22 | 0 | 26 | 0 | 27 | 0 | 40 | 0 | 53 | 0 | 43 | 0 | 45 | 44 | 0 | 63 | 55 | 81 | 81 | 56 | 66 | 66 | 104 | 64 | 43 | 35 | 21 | 22 | 0 | 30 | 20 | 13 | 4 | 5 | 6 | 1135 |
| NED | 47 | 29 | 69 | 11 | 61 | 13 | 77 | 8 | 79 | 11 | 73 | 4 | 37 | 52 | 41 | 10 | 20 | 15 | 23 | 33 | 5 | 0 | 21 | 42 | 37 | 13 | 57 | 27 | 50 | 26 | 16 | 6 | 0 | 27 | 0 | 1040 |
| NOR | 19 | 36 | 17 | 39 | 22 | 23 | 7 | 41 | 43 | 40 | 23 | 30 | 47 | 33 | 19 | 15 | 16 | 25 | 28 | 15 | 11 | 7 | 32 | 31 | 46 | 10 | 24 | 39 | 137 | 80 | 42 | 25 | 2 | 9 | 0 | 1033 |
| GDR | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 213 | 33 | 303 | 227 | 171 | 58 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1005 |
| CHN | 113 | 12 | 125 | 19 | 184 | 13 | 94 | 16 | 79 | 14 | 106 | 3 | 85 | 52 | 74 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 989 |
Is there a relationship between country and the probability of winning a medal?
We know that some countries win more medals than others. In this part we want to work out whether some countries win more medals simply because they have more athletes competing for more events, or whether their athletes are in fact performing better in general.
For this part, look only at the top 15 countries that you established in part 6.
For this exercise, I can’t ignore the NAs in the “medal” column, because to compute the percentage of medals won, I need the total medals competed for. The percentage of medal won per country = medals won * 100 /total medals competed for.
As a first step I count the rows per country in the “medals1” tibble, because it already has a year and country column. The rows per country are the number of athletes competing for each event, i.e. the number of medals the country is competing for.
medals_total <- medals %>%
left_join(country) %>%
count(noc) %>%
rename(n_total = n)
As a second step, I create another tibble that only contains the medals won. I filter the NAs medals out and count again the rows per country. This tibble “medals_won” will only have 2 columns “noc” and “n”.
medals_won <- medals %>%
left_join(country) %>%
filter(!is.na(medal)) %>%
count(noc)
As a third step, I will compute the percentage by first joining the tibble with the medals won only “medals_won” with the tibble that has the total of medals competed for, “medals_total”.
medal_percentage <- medals_won %>%
left_join(medals_total, by= c("noc")) %>%
group_by(noc) %>%
mutate(percentage_m_won = n/n_total*100) %>%
ungroup() %>%
rename(country = noc)
The fourth step is to only keep the top 15 country I have selected earlier. To do so I am merging the top 15 countries with the tibble I just created with the percentage of won medals per country “medal_percentage”.
top_15 <- medals_count_15 %>%
select(-n) # to have a clean joined tibble since "n" exists in both medals_count_15 and medal_percentage
medal_percentage_15 <- top_15 %>%
left_join(medal_percentage) %>%
mutate(percentage_m_won = round(percentage_m_won, 1))
As a next step, I have to create a bar plot that will show the percentage of medals won per country over the years.
medal_percentage_15 %>%
ggplot(aes(x= reorder(country, percentage_m_won),
y=percentage_m_won,
fill = country)) +
geom_col() +
theme_minimal()+
labs(title = "Percentage of won medals for the top 15 countries",
x = "countries",
y = "percentage")
medals_total %>%
filter(noc == "GDR") %>%
pull(n_total) %>%
first()
## [1] 2645
medals_total %>%
filter(noc == "URS") %>%
pull(n_total) %>%
first()
## [1] 5685
I notice that overall countries with the most athletes competing for most events also win the most medals: URS, USA, RUS are in the top 4. An interesting point is that USA was the country with most events competed for with 18853 but has 30% of won medals versus 38% for GDR (East Germany) that had 2645 total competitors and URS has close to 45% of won medals for 5685 a total that is also lower than for the USA.
To further confirm whether or not there is a relationship between the country and the probability to win medals, I have to perform an analysis with a chi-squared test. The value of the chi-squared statistic will determine the strength of the relationship between the 2 categorical variables. If it is high, this means the relationship is strong, and vice versa. The two categorical values are: country and medal status (won/lost).
Hypothesis: Countries competing with most events have a higher probability to win medals. Null hypothesis: Regardless of the number of competed events, the probability to win medals is equivalent for countries.
First I have to take the original tibble “medals” join it with “country” to obtain the column for the countries which is the explanatory variable. Next I only filter the top 15 countries and create the column “status” with “won” and “lost” values depending on whether a medal was won for an event or not. This is the response variable.
medal_status <- medals %>%
left_join(country) %>%
rename(country = noc) %>%
filter(country %in% top_15$country) %>%
mutate(status = if_else(is.na(medal), "lost", "won"))
Finally, I perform a chi-squared test.
test_country_medal<- chisq_test(medal_status, formula = status ~ country)
test_country_medal
The p-value is 0. This value is below 0.05, this means that the null hypothesis can be rejected. As a consequence I can’t reject the hypothesis that there is a relationship between the countries - number of events competed for - and the probability to win medals for the top 15 countries.
Create a scatterplot showing the average height versus the average weight of competitors per sport (one dot per sport). Add labels with the sport names for:
It might be that the same dot qualifies for multiple labels. The formula to calculate Body Mass Index (BMI) is:
weight in kg / (height in meters)^2
I need columns for:
average height
average weight
bmi (although not necessary I prefer to compute to keep an intermediary step)
average bmi
As a first step, I create a column with the height in meters. Then, the average will have to be the result of competitors per sport, so I have to group by sport to obtain those averages and create a column for each average. I also need to avoid the NAs.
athletes_medals <- athletes %>%
left_join(medals, by=c("id" = "athlete_id"))
athletes_medals_av_bmi <- athletes_medals %>%
mutate(height_m = height /100) %>%
group_by(sport) %>%
summarise(average_weight = mean(weight, na.rm = TRUE),
average_height_m = mean(height_m, na.rm = TRUE),
bmi = weight/(height_m^2),
average_bmi = mean(bmi, na.rm = TRUE)) %>%
ungroup()
To label the sports with:
the largest average height
the largest average weight
the smallest average height
the smallest average weight
the largest average BMI
the smallest average BMI
I have to create a separate data set with only those values. This tibble will be used in ggplot2. I take the tibble I just created “athletes_medals_av_bmi” with all the necessary columns and use distinct() to have unique values per sport, average weight, height and bmi, and then filter the extreme values for each category with max() and min().
sport_1 <- athletes_medals_av_bmi %>%
distinct(sport, average_weight, average_height_m, average_bmi) %>%
filter(
average_weight == min(average_weight, na.rm = TRUE)|
average_weight == max(average_weight, na.rm = TRUE)|
average_bmi == min(average_bmi, na.rm= TRUE)|
average_bmi == max(average_bmi, na.rm = TRUE)|
average_height_m == min(average_height_m, na.rm = TRUE)|
average_height_m == max(average_height_m, na.rm = TRUE))
Before using “sport_1” to highlight the extreme values in BMI, average height and weight, I add another column named “label” to specify why these data points are of interest.
sport_labels<- sport_1 %>%
mutate(label= case_when(
sport == "Basketball"~ "Basketball: Max Height",
sport == "Rhythmic Gymnastics" ~ "Rhythmic Gymnastics: Min Weight and BMI",
sport == "Tug-Of-War" ~ "Tug-Of-War: Max Weight",
sport == "Gymnastics" ~ "Gymnastics: Min Height",
sport == "Weightlifting" ~ "Weightlifting: Max BMI"
))
This plot shows the average height per average weight. Each dot is a sport, the labelled ones are the extreme averages.
athletes_medals_av_bmi %>%
distinct(sport, average_weight, average_height_m, average_bmi) %>%
ggplot(aes(x = average_height_m, y = average_weight)) +
geom_point() +
geom_text_repel(
data = sport_labels,
aes(label = label),
color = "purple",
size = 2) +
labs(
title = "Average height vs weight by Sport (Labeling Extremes)",
x = "Average height (m)",
y = "Average weight (kg)"
) +
theme_minimal()
The Rhythmic Gymnastics has the smallest average weight and bmi. Tug-Of-War has the largest average weight. Basketball has the largest average height. Gymnastics has the smallest average height. Weightlifting has the largest average bmi.
Create a line plot showing the number of medals given by year (one line for Gold, one line for Silver and one line for Bronze). Does it change over time? Use facets to separate the medals at Summer games and Winter games.
Comment the results below the plot.
I will mostly rely on the “medals” tibble for this exercise. To create the line plot I need a column with the years and a column with the seasons (winter/summer). I will first create the 2 columns by separating the “games” column and make sure the year column is considered as an integer.
medals_7 <- medals %>%
separate(games, into = c("year", "season"), sep = " ") %>%
mutate(year = as.integer(year))
As a second step, I will count the number of gold, silver and bronze medals by creating a column for each medal type. Then, I will be able to group by year and count the number of gold, silver and bronze medals per year, again one column per medal type.
medals_n_types <- medals_7 %>%
mutate(n_gold = if_else(medal == "Gold", 1L, 0L), # 1L and 0L return integers
n_silver = if_else(medal == "Silver", 1L, 0L),
n_bronze = if_else(medal == "Bronze", 1L, 0L)) %>%
group_by(year, season) %>%
mutate(gold = sum(n_gold, na.rm = TRUE),
silver = sum(n_silver, na.rm = TRUE),
bronze = sum(n_bronze, na.rm = TRUE)) %>%
ungroup()
As a third step, I will pivot_longer() the columns counting the number of gold, silver, and bronze per year, so that they become rows, and their values fill the column “n_medals” that will match the years, season and medal type. The “medal_type” column is important because it will be useful to have one line per medal type in my plot.
medals_n_types_long <- medals_n_types %>%
pivot_longer(cols= c(gold, silver, bronze),
names_to = "medal_type",
values_to = "n_medals") %>%
distinct(year, season, medal_type, n_medals) %>%
paged_table()
This plot represents the number of medals per year. Each line is medal type. Each facet is a season.
medals_n_types_long %>%
ggplot(aes(x = year,
y = n_medals))+
geom_line(aes(group=medal_type, color = medal_type))+
facet_wrap(vars(season))+
theme_minimal() +
labs(title = "Number of medals per year",
subtitle = "Using facets to separate the Summer and Winter games",
x = "Years",
y = "Number of medals",
color = "Medal type" )
The lines start around 1900 for the summer games, whereas the lines start after 1920 for the winter games. I assume the first games were only in summer. The medals for summer games have been steadily increasing from 100 in 1900 to 700 early 2010. There was a important increase in 1920, after that the number of medals decreased. The winter games show a different trend: the medals increased progressively but slower than for the summer games. In 1930, there were around 50 medals per medal type and in 2010, there were only 200.
Comment
The above vizualization shows that there are 25% and 26% of missing values in the “height” and “weight” column in the “athlete” tibble. The rest of the columns don’t have any missing values.