Load the data

The excel file “olympics.xlsx” has 4 sheets that I have to load separately. The following code load the “athletes” sheet.

athletes <- read_excel("olympics.xlsx", sheet = "athletes") %>%
  clean_names() %>% 
  glimpse()

## Rows: 135,571
## Columns: 5
## $ id     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, …
## $ name   <chr> "A Dijiang", "A Lamusi", "Gunnar Nielsen Aaby", "Edgar Lindenau…
## $ sex    <chr> "M", "M", "M", "M", "F", "M", "M", "F", "M", "M", "M", "M", "F"…
## $ height <dbl> 180, 170, NA, NA, 185, 188, 183, 168, 186, NA, 182, 172, 159, 1…
## $ weight <dbl> 80.0, 60.0, NA, NA, 82.0, 75.0, 72.0, NA, 96.0, NA, 76.5, 70.0,…

The following code load the “country” sheet.

country <- read_excel("olympics.xlsx", sheet = "country") %>% 
  clean_names() %>% 
  glimpse()

## Rows: 187,452
## Columns: 4
## $ athlete_id <dbl> 1, 2, 3, 4, 5, 5, 5, 6, 6, 7, 7, 8, 9, 10, 11, 12, 13, 13, …
## $ age        <dbl> 24, 23, 24, 34, 21, 25, 27, 31, 33, 31, 33, 18, 26, 26, 22,…
## $ noc        <chr> "CHN", "CHN", "DEN", "DEN", "NED", "NED", "NED", "USA", "US…
## $ games      <chr> "1992 Summer", "2012 Summer", "1920 Summer", "1900 Summer",…

The following code load the “games” sheet.

games <- read_excel("olympics.xlsx", sheet = "games") %>% 
  clean_names() %>% 
  glimpse()

## Rows: 52
## Columns: 4
## $ games  <chr> "1992 Summer", "2012 Summer", "1920 Summer", "1900 Summer", "19…
## $ year   <dbl> 1992, 2012, 1920, 1900, 1988, 1992, 1994, 1932, 2002, 1952, 198…
## $ season <chr> "Summer", "Summer", "Summer", "Summer", "Winter", "Winter", "Wi…
## $ city   <chr> "Barcelona", "London", "Antwerpen", "Paris", "Calgary", "Albert…

I am asked to check the year 1956.

games %>% 
  filter(year == 1956)

Comment

Each year usually has 2 games: 1 for summer and 1 for winter. In this case according to the data, there was 3 games, 2 summer games took place in different cities: Melbourne and Stockholm. According to the IOC website (https://www.olympics.com/en/olympic-games/melbourne-1956) the 1956 summer games were held in Melbourne. Even though one equestrian competition was held in Stockholm which can be the cause of this anomaly. In the future analysis I will consider that the 1956 summer game was held in Melbourne.

The following code load the “medals” sheet.

medals <- read_excel("olympics.xlsx", sheet = "medals") %>% 
  clean_names() %>% 
  glimpse()

## Rows: 271,116
## Columns: 6
## $ athlete_id <dbl> 1, 2, 3, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7,…
## $ games      <chr> "1992 Summer", "2012 Summer", "1920 Summer", "1900 Summer",…
## $ team       <chr> "China", "China", "Denmark", "Denmark/Sweden", "Netherlands…
## $ sport      <chr> "Basketball", "Judo", "Football", "Tug-Of-War", "Speed Skat…
## $ event      <chr> "Basketball Men's Basketball", "Judo Men's Extra-Lightweigh…
## $ medal      <chr> NA, NA, NA, "Gold", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Data exploration

I will use vis_miss() for as many tibble as possible to check the missing values. And then check for duplicates for the “id” columns especially.

vis_miss(athletes)

Comment

The above vizualization shows that there are 25% and 26% of missing values in the “height” and “weight” column in the “athlete” tibble. The rest of the columns don’t have any missing values.

athletes %>% 
  count(id) %>% 
  filter(n>1)

Comment

There is no duplicates in the “id” column of the “athletes” tibble.

athletes %>%
  drop_na() %>% 
  summarise(mean_weight = mean(weight), 
            min_weight = min(weight), 
            max_weight = max(weight),
            mean_height = mean(height),
            min_height = min(height),
            max_height = max(height))

Comment

The values from the weight and height seem realistic except for the maximum height of 226 cm and the minimum weight of 25 kg.

vis_miss(country)

Comment

The above vizualization shows that there are 4% of missing values in the “age” column in the “country” tibble. The rest of the columns don’t have any missing values.

country %>% 
  count(athlete_id) %>% 
  filter(n>1) %>% 
  arrange(desc(n)) %>% 
  paged_table()

Comment

There are several “athlete_id” duplicates in the “country” column. The reason is probably because the same athletes have participated in different games and maybe even for different countries.

vis_miss(games)

Comment

The above vizualization shows that there no missing values in the “games” tibble.

games %>% 
  count(year) %>% 
  filter(n>2)

Comment

1956 is the only year which is repeated three times.

games %>% 
  count(year) %>% 
  filter(n<2) %>% 
  paged_table()

Comment

On the other hand, several years are only mentioned once, as I assume there was only a game per year.

medals %>% 
  count(athlete_id) %>% 
  filter(n>1) %>% 
  arrange(desc(n)) %>% 
  paged_table()

Comment

There are over 57000 duplicates for the “athlete_id” in the “medal” tibble, this is because over the years many athletes have won several medals. I note that the number of occurrences of the id doesn’t mean the equivalent number of medals. Since there are some NAs in the “medal” column, as I will observe later.

medals %>% 
  filter(is.na(athlete_id))

Comment

Since the “medal” tibble is too heavy I cannot use vis_miss(). There fore I filtered by NA. I notice that there is no missing value in the “athlete_id” column.

medals %>%
  tabyl(medal) %>% 
  kable()

medal	n	percent	valid_percent
Bronze	13295	0.0490381	0.3341880
Gold	13372	0.0493221	0.3361235
Silver	13116	0.0483778	0.3296886
NA	231333	0.8532621	NA

Comment

Still regarding the “medals” tibble, I can see that 85% of the values from the “medal” column are missing, 4.8% are silver medals, 4.9% are gold medals, 4.9% are bronze medals.

Data analysis

Athletes info

Part 1

Instructions

Have some athletes competed for different countries over time?

Use inline code to print the result and comment your findings. How would you explain these results?

Explanation

To answer to this question I mostly need the “country” tibble. I can first look for the unique combinations for id and country (the “noc” column) to identify the athletes who competed for different countries by using distinct(). Now each row is an athlete competing for distinct countries. Next, I can group by the id, and count the number of rows per athlete id, which will indicate the number of country changes. To only keep the athletes who changed more than once, I use the filter function. The result is tibble “diff_country”.

Since there are more than 10 athletes who changed countries, I want to extract the athletes who changed the most times with the “top10_diff_country” thanks to slice_max(). I also have to use distinct() because there are several rows with the same id. At the end I create a new tibble called “names” with the athletes’ names and id in order to merge it with the last tibble I created.

diff_country <- country %>% 
  distinct(athlete_id, noc) %>%
  group_by(athlete_id) %>% 
  mutate(n_country_changes = n()) %>% 
  ungroup() %>% 
  filter(n_country_changes >1)

top10_diff_country <- diff_country %>% 
  distinct(athlete_id, n_country_changes) %>% 
  slice_max(order_by = n_country_changes, n= 10, with_ties = FALSE)

names <- athletes %>% 
  select(id, name)

names_top10_diff_country <- top10_diff_country %>% 
  left_join(names, by=c("athlete_id" = "id"))

kable(names_top10_diff_country, caption="Athletes who competed for different countries over time")

Athletes who competed for different countries over time
athlete_id	n_country_changes	name
59021	4	Makharbek Khazbiyevich Khadartsev
66953	4	Irina Lashko (-Furler-)
72433	4	Ilija Lupulesku
108383	4	Jasna ekari (Brajkovi-)
111969	4	Micha liwiski
2425	3	Yamil Aldama Pozo (-Dodds)
4677	3	Dmitry Borisovich Apanasenko
5447	3	Loedmila Arzjannikova
5839	3	Anzhelika Petrovna “Anzhela” Atroshchenko (-Kinet)
7054	3	Andrey Pavlovich Bakhvalov

Comment

The number of athletes who competed for different countries is 3242. In the top 10, I see that Makharbek Khazbiyevich Khadartsev changed countries 4 times.

Part 2

Instructions

Who are the ten athletes that took part in most games?

Explanation

For this exercise I can mostly use the “medals” tibble. This time, I’m interested in the number of games the athletes participated in. This is the unique combination I have to look for, again I can use distinct() for the id and the games. Then, group by the id, and create a column that counts the number of rows per athlete which equals to the number of games they participated in.

To only get the top 10 I create another tibble “games_athlete10” where I use distinct() on the id and number of games columns to have one row per athlete and use slice_max().

As a final step, I have to merge this tibble with the “names” that contains the athletes’ names and matching ids, the result is the tibble “games_athlete10_name”,

games_athlete <- medals %>% 
  distinct(athlete_id, games) %>% 
  group_by(athlete_id) %>% 
  mutate(n_games_athlete = n()) %>% 
  ungroup()

games_athlete10 <- games_athlete %>% 
  distinct(athlete_id, n_games_athlete) %>% 
  slice_max(order_by = n_games_athlete, n= 10, with_ties = FALSE)

games_athlete10_name <- games_athlete10 %>% 
  left_join(names, by=c("athlete_id" = "id")) 

kable(games_athlete10_name, caption="Top ten athletes who took part in most games")

Top ten athletes who took part in most games
athlete_id	n_games_athlete	name
79855	10	Ian Millar
65378	9	Afanasijs Kuzmins
99155	9	Hubert Raudaschl
14388	8	Francisco Boza Dibos
26880	8	Rajmond Debevec
28051	8	Piero D’Inzeo
28052	8	Raimondo D’Inzeo
32458	8	Paul Bert Elvstrm
51618	8	Josefa Idem-Guerrini
61572	8	Durward Randolph Knowles

Comment

The athlete Ian Millar is the athlete who participated in most games, with a score of 10 games. With 9 games, Afanasijs Kuzmins and Hubert Raudaschl are the two athletes who come in second position. The rest of the top 10 participated in 8 games.

Part 3

Instructions

Who are the ten athletes that competed in the most events? Did some athletes take part in more than one event during a game?

Explanation

I’m interested to know the number of events the athletes participated overall. For this exercise I want to count the unique combinations of athletes id and events. The same event only counts once. To isolate the unique combinations of athlete id-event, I use distinct() and then use group__by() the id to get the number of rows per id. This number is the number of events per athlete and is computed in the “n_events_athletes” column. The result of this operation is tibble “athletes_events”.

Next, I want to get one row per athlete and extract the top ten athlete that competed in most events. I use the distinct() function and then slice_max().

Finally I merge it with the “names” tibble.

athletes_events <- medals %>% 
  distinct(athlete_id, event) %>% 
  group_by(athlete_id) %>% 
  mutate(n_events_athletes = n()) %>% 
  ungroup()

athletes_events_10 <- athletes_events %>% 
  distinct(athlete_id, n_events_athletes) %>% 
  slice_max(order_by = n_events_athletes, n=10, with_ties = FALSE)

athletes_events_10_names <- athletes_events_10 %>% 
  left_join(names, by= c("athlete_id" = "id"))

kable(athletes_events_10_names, caption="Athletes who competed in most events")

Athletes who competed in most events
athlete_id	n_events_athletes	name
119591	33	Ioannis Theofilakis
119590	28	Alexandros Theofilakis
18255	24	Gustaf Eric Carlberg
18257	22	Gustaf Vilhelm Carlberg
76886	22	Frangiskos D. Mavrommatis
124156	20	Paul Van Asbroeck
73260	19	Lars Jrgen Madsen
81899	19	Lon Ernest Moreaux
25946	18	Marie Joseph “Raoul” le Borgne de Boigne
55421	18	Lon douard Johnson

Comment

Ioannis Theofilakis is the athlete who competed in the most events (33), after him comes Alexandros Theofilakis who participated in 28 events. These events are counted regardless of the game, so the second question remains to be answered.

Explanation

For this part, I’m interested in the number of events the athletes participated in during each game. This time, I have to first group by the athletes’ id and the games. Within these groups, I now want to isolate the unique combinations of athletes and events with distinct(). Then, create a column “n_event_per_game” that counts the number of rows for these unique combinations within the groups. The result is the tibble “athlete_event_game”.

Next, I want each row to be a distinct athlete, game and number of event per game and filter out the athletes who did one event per game. The result is tibble “athlete_event_game_diff”.

Finally, I merge it with the “names” tibble.

athlete_event_game <- medals %>% 
  group_by(athlete_id, games) %>% 
  distinct(athlete_id, event) %>% 
  mutate(n_event_per_game = n()) %>% 
  ungroup()
  
athlete_event_game_diff <- athlete_event_game %>% 
  distinct(athlete_id, games, n_event_per_game) %>% 
  filter(n_event_per_game > 1) %>% 
  arrange(desc(n_event_per_game))

athlete_event_game_diff_names <- athlete_event_game_diff %>% 
  left_join(names, by=c("athlete_id" = "id"))

athlete_event_game_diff_names %>% 
  head(10) %>% 
  kable()

athlete_id	games	n_event_per_game	name
68189	1920 Summer	15	Willis Augustus Lee, Jr.
113935	1920 Summer	13	Lloyd Spencer Spooner
25946	1906 Summer	12	Marie Joseph “Raoul” le Borgne de Boigne
34282	1906 Summer	12	Maurice Faure
36489	1906 Summer	12	Jean Fouconnier
107613	1896 Summer	12	Carl Schuhmann
119591	1912 Summer	12	Ioannis Theofilakis
73260	1920 Summer	11	Lars Jrgen Madsen
76886	1912 Summer	11	Frangiskos D. Mavrommatis
78929	1906 Summer	11	Sidney Louis Walter Merlin

Comment

During the games of 1920 Summer, Willis Augustus Lee, Jr. participated in 15 events. In second position, comes Lloyd Spencer Spooner who participated in 13 events during the games of 1920 Summer. In third position, comes Marie Joseph “Raoul” le Borgne de Boigne who participated in 1906 Summer events during the games of 1906 Summer.

Countries info

Part 4

Instructions

Create a new table showing the number of medals per country (rows) and per year (column). Keep only the 15 countries with the most medals overall.

Explanations

I need to merge the “medals” tibble and the “country” tibble because the “team” column is not as precise as the “noc” column from the “country” tibble (I noticed that sometimes the team column has more than one country, or other values that are not countries).

After joining “country” and “medals”, I have to keep the 15 countries with the most medals overall, which means I can discard all rows with no medals, and then count the number of rows per country. The result it tibble “medals_count”.

Finally I use slice_max() to only obtain the first 15 countries with most medals overall. The result is tibble “medals_count_15”.

medals_count <- medals %>% 
  left_join(country) %>% 
  filter(!is.na(medal)) %>% 
  count(noc, sort = TRUE)

medals_count_15 <- medals_count %>% 
  slice_max(n, n=15) %>% 
  rename(country=noc)

  kable(medals_count_15, caption = "Top 15 countries with most medals overall")

Top 15 countries with most medals overall
country	n
USA	5637
URS	2503
GER	2165
GBR	2068
FRA	1777
ITA	1637
SWE	1536
CAN	1352
AUS	1320
RUS	1165
HUN	1135
NED	1040
NOR	1033
GDR	1005
CHN	989

Comment

This table show the 15 countries with most medals overall. Yet, this table lost many data points that interest me, such as the years and number of medals per years. I’m also still missing the number of medals per country, per year.

I can compute the number of medals per country and year in a new column. To make sure I don’t build on previous errors, I decide to start again with the original tibble “medals” and join it with “country”. Next, I crate a “year” column by extracting every 4 digits from the column “games”. Finally, I can use group_by() on the country and year and create a new column “n_medals_country_year”.

medals_count_year <- medals %>% 
  left_join(country) %>% 
  filter(!is.na(medal)) %>% 
  mutate(year = str_extract(games, "\\b\\d{4}\\b")) %>% 
  group_by(year, noc) %>% 
  mutate(n_medals_country_year = n(), year, noc) %>% 
  ungroup()

Now, I can join the top 15 country selection with the most medals overall tibble: “medals_count_15” with “medals_count_year” and have country = noc as a key. I will then have to use the pivot function to have one year per column, and the number of medals per year, per country. In this question there is no distinction between a glod, silver or bronze medal, all types of medals are taken into account and summed.

top_15_countries_medals_year <- medals_count_15 %>% 
  left_join(medals_count_year, by= c("country" = "noc")) %>% 
  select(country, year, n_medals_country_year) %>% 
  distinct(country, year, n_medals_country_year) %>% 
  pivot_wider(names_from = "year", values_from = "n_medals_country_year", values_fill =0) %>%
  select(country, sort(names(.)[-1], decreasing = TRUE)) %>% 
  mutate(total = rowSums(across(-country)))

library(kableExtra)
top_15_countries_medals_year %>% 
  kable(format = "html", caption = "Number of medals per year for top 15 country with most medals overall") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width = "100%", height = "400px")

Number of medals per year for top 15 country with most medals overall
country	2016	2014	2012	2010	2008	2006	2004	2002	2000	1998	1996	1994	1992	1988	1984	1980	1976	1972	1968	1964	1960	1956	1952	1948	1936	1932	1928	1924	1920	1912	1908	1906	1904	1900	1896	total
USA	264	64	248	97	317	52	263	84	242	34	259	19	238	214	361	30	175	195	173	177	152	149	164	168	112	223	102	194	194	107	65	24	394	63	20	5637
URS	0	0	0	0	0	0	0	0	0	0	0	0	0	366	56	496	342	259	229	221	211	206	117	0	0	0	0	0	0	0	0	0	0	0	0	2503
GER	159	36	94	54	99	54	149	61	118	44	124	40	236	0	0	0	0	0	0	126	98	54	52	0	231	57	82	0	0	53	21	30	16	45	32	2165
GBR	145	10	126	1	81	1	57	6	54	4	26	3	50	54	73	48	33	29	19	30	28	46	32	63	53	35	59	95	111	170	368	39	2	108	9	2068
FRA	96	18	82	14	77	15	53	15	66	15	51	11	69	31	70	30	21	25	36	38	15	33	41	77	45	42	52	120	141	26	40	64	2	235	11	1777
ITA	72	14	68	5	42	25	104	21	65	17	71	32	66	37	65	40	35	34	41	61	89	55	56	68	76	77	72	51	84	25	8	56	0	5	0	1637
SWE	28	51	22	18	7	64	12	26	32	7	31	24	42	46	64	43	12	26	25	43	16	49	93	88	45	35	53	81	154	190	84	21	0	4	0	1536
CAN	69	86	55	90	35	69	17	74	31	49	50	37	80	29	91	2	26	12	30	12	30	39	21	22	35	59	45	29	16	8	52	2	48	2	0	1352
AUS	82	3	114	3	149	2	157	2	183	1	132	4	57	35	52	13	23	20	51	44	46	67	20	16	1	5	4	11	7	0	0	3	4	6	3	1320
RUS	115	68	140	25	142	41	189	38	187	52	115	36	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	14	3	0	0	0	0	1165
HUN	22	0	26	0	27	0	40	0	53	0	43	0	45	44	0	63	55	81	81	56	66	66	104	64	43	35	21	22	0	30	20	13	4	5	6	1135
NED	47	29	69	11	61	13	77	8	79	11	73	4	37	52	41	10	20	15	23	33	5	0	21	42	37	13	57	27	50	26	16	6	0	27	0	1040
NOR	19	36	17	39	22	23	7	41	43	40	23	30	47	33	19	15	16	25	28	15	11	7	32	31	46	10	24	39	137	80	42	25	2	9	0	1033
GDR	0	0	0	0	0	0	0	0	0	0	0	0	0	213	33	303	227	171	58	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1005
CHN	113	12	125	19	184	13	94	16	79	14	106	3	85	52	74	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	989

Part 5

Instructions

Is there a relationship between country and the probability of winning a medal?

We know that some countries win more medals than others. In this part we want to work out whether some countries win more medals simply because they have more athletes competing for more events, or whether their athletes are in fact performing better in general.

For this part, look only at the top 15 countries that you established in part 6.

Create a horizontal barchart listing for each of the fifteen countries the percentage of medals won out of all medals competed for by that country.

Explanation

For this exercise, I can’t ignore the NAs in the “medal” column, because to compute the percentage of medals won, I need the total medals competed for. The percentage of medal won per country = medals won * 100 /total medals competed for.

As a first step I count the rows per country in the “medals1” tibble, because it already has a year and country column. The rows per country are the number of athletes competing for each event, i.e. the number of medals the country is competing for.

medals_total <- medals %>%
  left_join(country) %>% 
  count(noc) %>% 
  rename(n_total = n)

As a second step, I create another tibble that only contains the medals won. I filter the NAs medals out and count again the rows per country. This tibble “medals_won” will only have 2 columns “noc” and “n”.

medals_won <- medals %>%
  left_join(country) %>% 
  filter(!is.na(medal)) %>% 
  count(noc)

As a third step, I will compute the percentage by first joining the tibble with the medals won only “medals_won” with the tibble that has the total of medals competed for, “medals_total”.

medal_percentage <- medals_won %>% 
  left_join(medals_total, by= c("noc")) %>%
  group_by(noc) %>% 
  mutate(percentage_m_won = n/n_total*100) %>% 
  ungroup() %>% 
  rename(country = noc)

The fourth step is to only keep the top 15 country I have selected earlier. To do so I am merging the top 15 countries with the tibble I just created with the percentage of won medals per country “medal_percentage”.

top_15 <- medals_count_15 %>% 
  select(-n) # to have a clean joined tibble since "n" exists in both medals_count_15 and medal_percentage

medal_percentage_15 <- top_15 %>% 
  left_join(medal_percentage) %>% 
  mutate(percentage_m_won = round(percentage_m_won, 1))

Creating a bar plot

As a next step, I have to create a bar plot that will show the percentage of medals won per country over the years.

medal_percentage_15 %>% 
  ggplot(aes(x= reorder(country, percentage_m_won),
              y=percentage_m_won,
             fill = country)) +
  geom_col() +
  theme_minimal()+
  labs(title = "Percentage of won medals for the top 15 countries",
       x = "countries",
       y = "percentage")

medals_total %>% 
  filter(noc == "GDR") %>% 
  pull(n_total) %>% 
  first()

## [1] 2645

medals_total %>% 
  filter(noc == "URS") %>% 
  pull(n_total) %>% 
  first()

## [1] 5685

Comment

I notice that overall countries with the most athletes competing for most events also win the most medals: URS, USA, RUS are in the top 4. An interesting point is that USA was the country with most events competed for with 18853 but has 30% of won medals versus 38% for GDR (East Germany) that had 2645 total competitors and URS has close to 45% of won medals for 5685 a total that is also lower than for the USA.

Explanation

Chi-Squared test

To further confirm whether or not there is a relationship between the country and the probability to win medals, I have to perform an analysis with a chi-squared test. The value of the chi-squared statistic will determine the strength of the relationship between the 2 categorical variables. If it is high, this means the relationship is strong, and vice versa. The two categorical values are: country and medal status (won/lost).

Hypothesis: Countries competing with most events have a higher probability to win medals. Null hypothesis: Regardless of the number of competed events, the probability to win medals is equivalent for countries.

First I have to take the original tibble “medals” join it with “country” to obtain the column for the countries which is the explanatory variable. Next I only filter the top 15 countries and create the column “status” with “won” and “lost” values depending on whether a medal was won for an event or not. This is the response variable.

medal_status <- medals %>% 
  left_join(country) %>% 
  rename(country = noc) %>% 
  filter(country %in% top_15$country) %>% 
  mutate(status = if_else(is.na(medal), "lost", "won"))

Finally, I perform a chi-squared test.

test_country_medal<- chisq_test(medal_status, formula = status ~ country)

test_country_medal

Comment

The p-value is 0. This value is below 0.05, this means that the null hypothesis can be rejected. As a consequence I can’t reject the hypothesis that there is a relationship between the countries - number of events competed for - and the probability to win medals for the top 15 countries.

Athletes and Sports

Part 6

Instructions

Create a scatterplot showing the average height versus the average weight of competitors per sport (one dot per sport). Add labels with the sport names for:

the largest average height
the largest average weight
the smallest average height
the smallest average weight
the largest average BMI
the smallest average BMI

It might be that the same dot qualifies for multiple labels. The formula to calculate Body Mass Index (BMI) is:

weight in kg / (height in meters)^2

Explanation

I need columns for:

average height
average weight
bmi (although not necessary I prefer to compute to keep an intermediary step)
average bmi

As a first step, I create a column with the height in meters. Then, the average will have to be the result of competitors per sport, so I have to group by sport to obtain those averages and create a column for each average. I also need to avoid the NAs.

athletes_medals <- athletes %>% 
  left_join(medals, by=c("id" = "athlete_id"))

athletes_medals_av_bmi <- athletes_medals %>%
  mutate(height_m = height /100) %>% 
  group_by(sport) %>% 
  summarise(average_weight = mean(weight, na.rm = TRUE),
         average_height_m = mean(height_m, na.rm = TRUE),
         bmi = weight/(height_m^2),
         average_bmi = mean(bmi, na.rm = TRUE)) %>% 
  ungroup()

To label the sports with:

the largest average height
the largest average weight
the smallest average height
the smallest average weight
the largest average BMI
the smallest average BMI

I have to create a separate data set with only those values. This tibble will be used in ggplot2. I take the tibble I just created “athletes_medals_av_bmi” with all the necessary columns and use distinct() to have unique values per sport, average weight, height and bmi, and then filter the extreme values for each category with max() and min().

sport_1 <- athletes_medals_av_bmi %>% 
  distinct(sport, average_weight, average_height_m, average_bmi) %>% 
  filter(
    average_weight == min(average_weight, na.rm = TRUE)|
      average_weight == max(average_weight, na.rm = TRUE)|
      average_bmi == min(average_bmi, na.rm= TRUE)|
      average_bmi == max(average_bmi, na.rm = TRUE)|
      average_height_m == min(average_height_m, na.rm = TRUE)|
      average_height_m == max(average_height_m, na.rm = TRUE))

Before using “sport_1” to highlight the extreme values in BMI, average height and weight, I add another column named “label” to specify why these data points are of interest.

sport_labels<- sport_1 %>% 
  mutate(label= case_when(
    sport == "Basketball"~ "Basketball: Max Height",
    sport == "Rhythmic Gymnastics" ~ "Rhythmic Gymnastics: Min Weight and BMI",
    sport == "Tug-Of-War" ~ "Tug-Of-War: Max Weight",
    sport == "Gymnastics" ~ "Gymnastics: Min Height",
    sport == "Weightlifting" ~ "Weightlifting: Max BMI"
  ))

Creating a scatter plot

This plot shows the average height per average weight. Each dot is a sport, the labelled ones are the extreme averages.

athletes_medals_av_bmi %>% 
  distinct(sport, average_weight, average_height_m, average_bmi) %>%
  ggplot(aes(x = average_height_m, y = average_weight)) +
  geom_point() +  
  geom_text_repel( 
    data = sport_labels, 
    aes(label = label),
    color = "purple",
    size = 2) +
  labs(
    title = "Average height vs weight by Sport (Labeling Extremes)",
    x = "Average height (m)",
    y = "Average weight (kg)"
  ) +
  theme_minimal()

Comment

The Rhythmic Gymnastics has the smallest average weight and bmi. Tug-Of-War has the largest average weight. Basketball has the largest average height. Gymnastics has the smallest average height. Weightlifting has the largest average bmi.

Part 7

Instructions

Create a line plot showing the number of medals given by year (one line for Gold, one line for Silver and one line for Bronze). Does it change over time? Use facets to separate the medals at Summer games and Winter games.

Comment the results below the plot.

Explanation

I will mostly rely on the “medals” tibble for this exercise. To create the line plot I need a column with the years and a column with the seasons (winter/summer). I will first create the 2 columns by separating the “games” column and make sure the year column is considered as an integer.

medals_7 <- medals %>% 
  separate(games, into = c("year", "season"), sep = " ") %>% 
  mutate(year = as.integer(year))

As a second step, I will count the number of gold, silver and bronze medals by creating a column for each medal type. Then, I will be able to group by year and count the number of gold, silver and bronze medals per year, again one column per medal type.

medals_n_types <- medals_7 %>% 
  mutate(n_gold = if_else(medal == "Gold", 1L, 0L), # 1L and 0L return integers
         n_silver = if_else(medal == "Silver", 1L, 0L),
         n_bronze = if_else(medal == "Bronze", 1L, 0L)) %>% 
  group_by(year, season) %>% 
  mutate(gold = sum(n_gold, na.rm = TRUE),
         silver = sum(n_silver, na.rm = TRUE),
         bronze = sum(n_bronze, na.rm = TRUE)) %>% 
  ungroup()

As a third step, I will pivot_longer() the columns counting the number of gold, silver, and bronze per year, so that they become rows, and their values fill the column “n_medals” that will match the years, season and medal type. The “medal_type” column is important because it will be useful to have one line per medal type in my plot.

medals_n_types_long <- medals_n_types %>% 
  pivot_longer(cols= c(gold, silver, bronze),
               names_to = "medal_type",
               values_to = "n_medals") %>% 
  distinct(year, season, medal_type, n_medals) %>% 
  paged_table()

Creating a line plot

This plot represents the number of medals per year. Each line is medal type. Each facet is a season.

medals_n_types_long %>% 
  ggplot(aes(x = year,
             y = n_medals))+
  geom_line(aes(group=medal_type, color = medal_type))+
  facet_wrap(vars(season))+
  theme_minimal() +
  labs(title = "Number of medals per year",
       subtitle = "Using facets to separate the Summer and Winter games",
       x = "Years",
       y = "Number of medals", 
       color = "Medal type" )

Comment

The lines start around 1900 for the summer games, whereas the lines start after 1920 for the winter games. I assume the first games were only in summer. The medals for summer games have been steadily increasing from 100 in 1900 to 700 early 2010. There was a important increase in 1920, after that the number of medals decreased. The winter games show a different trend: the medals increased progressively but slower than for the summer games. In 1930, there were around 50 medals per medal type and in 2010, there were only 200.

Things you did not know about the Olympic Games

Load the data

Comment

Data exploration

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Data analysis

Athletes info

Part 1

Instructions

Explanation

Comment

Part 2

Instructions

Explanation

Comment

Part 3

Instructions

Explanation

Comment

Explanation

Comment

Countries info

Part 4

Instructions

Explanations

Comment

Part 5

Instructions

Explanation

Creating a bar plot

Comment

Explanation

Chi-Squared test

Comment

Athletes and Sports

Part 6

Instructions

Explanation

Creating a scatter plot

Comment

Part 7

Instructions

Explanation

Creating a line plot

Comment