11 min read

Fun with Statistics - Is Usain Bolt really the fastest man on earth?

If you search for the phrase “fastest man on earth” in Google, chances are that it will return the answer “Usain Bolt”. It certainly does so for me, even though the results might be different if Google decides to personalize the results for you. This is because currently he holds the world record for being the quickest (9.58s) to run a 100 metres race. Justin Gatlin once did it in 9.45s, but that does not count as the record because he was assisted by the wind.

However, as statisticians, we know that a single data point does not necessarily prove anything and we should look at the distribution of timings for the top runners to check if Usain Bolt is truly the fastest man on earth. In this article, we will try to find an answer to this question using data.

Data

Our first step is to find data on the 100m race timings for all top runners. Unfortunately, I could not find any public source of this data. The best option we have is to use data from the excellent website Alltime Athletics maintained by Peter Larsson. This page provides the details of the all-time men’s best 100m race timings. We can answer a slightly different question using this data. Out of all runners who appear in this list, is there any evidence to suggest that there is a statistical difference among the top runners?

Our first step will be to extract and store the data in a R data frame. We use the rvest and readr R packages to do this.

library(rvest)
## Warning: package 'xml2' was built under R version 3.6.1
library(readr)
male_100_html <- read_html("http://www.alltime-athletics.com/m_100ok.htm")

An examination of the HTML source reveals that the data has been stored within HTML pre-tags. There are also seven tables available as part of the page, out of which we are interested in the table “All-time men’s best 100m”. We use the function html_nodes from the rvest package using an XPath selector. The next step is to extract the HTML text using the function html_text, and restrict to the first element.

male_100_pres <- male_100_html %>%
  html_nodes(xpath = "//pre")
male_100_htext <- male_100_pres %>%
  html_text()
male_100_htext <- male_100_htext[[1]]

The data is stored in fixed-width format in the variable male_100_htext. The read_fwf function from the readr package can be used to store the data in a data frame. The exact arguments to the function were decided using a bit of trial and error.

male_100 <- readr::read_fwf(male_100_htext, skip = 1, n_max = 3178,
                            col_types = cols(.default = col_character()),
                            col_positions = fwf_positions(
                              c(1, 16, 27, 35, 66, 74, 86, 93, 123),
                              c(15, 26, 34, 65, 73, 85, 92, 122, 132)
                            ))
str(male_100)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 3178 obs. of  9 variables:
##  $ X1: chr  "1" "2" "3" "3" ...
##  $ X2: chr  "9.58" "9.63" "9.69" "9.69" ...
##  $ X3: chr  "+0.9" "+1.5" "±0.0" "+2.0" ...
##  $ X4: chr  "Usain Bolt" "Usain Bolt" "Usain Bolt" "Tyson Gay" ...
##  $ X5: chr  "JAM" "JAM" "JAM" "USA" ...
##  $ X6: chr  "21.08.86" "21.08.86" "21.08.86" "09.08.82" ...
##  $ X7: chr  "1" "1" "1" "1" ...
##  $ X8: chr  "Berlin" "London" "Beijing" "Shanghai" ...
##  $ X9: chr  "16.08.2009" "05.08.2012" "16.08.200" "20.09.2009" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   .default = col_character(),
##   ..   X1 = col_character(),
##   ..   X2 = col_character(),
##   ..   X3 = col_character(),
##   ..   X4 = col_character(),
##   ..   X5 = col_character(),
##   ..   X6 = col_character(),
##   ..   X7 = col_character(),
##   ..   X8 = col_character(),
##   ..   X9 = col_character()
##   .. )

For this analysis, we are only interested in the name of the runner and the race timing. So we restrict the data to these two columns, and give each column an appropriate name.

library(dplyr)
male_100 <- male_100 %>%
  select(X2, X4)
colnames(male_100) <- c("timing", "runner")

We observe that some of the timings have an “A” against the time. The Wikipedia page Athletics abbreviations informs us that this indicates a mark set at altitude. We simply remove the “A” and also convert the variable timing into a numeric variable.

male_100 <- male_100 %>%
  mutate(timing = gsub("A", "", timing),
         timing = as.numeric(timing))
male_100
## # A tibble: 3,178 x 2
##    timing runner       
##     <dbl> <chr>        
##  1   9.58 Usain Bolt   
##  2   9.63 Usain Bolt   
##  3   9.69 Usain Bolt   
##  4   9.69 Tyson Gay    
##  5   9.69 Yohan Blake  
##  6   9.71 Tyson Gay    
##  7   9.72 Usain Bolt   
##  8   9.72 Asafa Powell 
##  9   9.74 Asafa Powell 
## 10   9.74 Justin Gatlin
## # ... with 3,168 more rows

Exploration

As with any statistical analysis, the first step is to explore and understand the data we are working with. We see that there are 318 unique runners in the data.

n_distinct(male_100$runner)
## [1] 340
runner_cnt <- male_100 %>%
  group_by(runner) %>%
  summarise(n_rec = n()) %>%
  arrange(desc(n_rec))
runner_cnt
## # A tibble: 340 x 2
##    runner           n_rec
##    <chr>            <int>
##  1 Asafa Powell       145
##  2 Mike Rodgers       103
##  3 Justin Gatlin      102
##  4 Maurice Greene      83
##  5 Nesta Carter        69
##  6 Usain Bolt          69
##  7 Tyson Gay           67
##  8 Yohan Blake         66
##  9 Ato Boldon          65
## 10 Frank Fredericks    62
## # ... with 330 more rows

As expected, some of the well-known names dominate the list with the maximum number of records. Our next step is to look at the distribution of the number of records.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.1
ggplot(runner_cnt, aes(n_rec)) +
  geom_histogram(binwidth = 5, fill = "lightblue", colour = "black") +
  xlab(NULL) +
  ylab(NULL) +
  ggtitle("Distribution of number of records by runner") +
  theme_bw() +
  theme(
    panel.border = element_blank(),
    plot.title = element_text(hjust = 0.5, vjust = 0.5)
  )

As expected, this distribution is heavily skewed as the top runners will tend to dominate the list. We first restrict our data to those who have at least 50 records each.

male_100_2 <- male_100 %>%
  inner_join(
    select(filter(runner_cnt, n_rec >= 50), runner),
    by = c("runner" = "runner")
  )

There are 16 runners left in this data. Let us have a look at the distribution of their race timings as well as the mean and median race timings for these runners.

ggplot(male_100_2, aes(timing)) +
  geom_density() +
  facet_wrap(~runner, nrow = 4, ncol = 4) +
  xlab(NULL) +
  ylab(NULL) +
  ggtitle("Distribution of race timings by runner") +
  theme_bw() +
  theme(
    panel.border = element_blank(),
    plot.title = element_text(hjust = 0.5, vjust = 0.5)
  )

male_100_means <- male_100_2 %>%
  group_by(runner) %>%
  summarise(mean_timing = mean(timing)) %>%
  arrange(mean_timing)
male_100_means
## # A tibble: 15 x 2
##    runner           mean_timing
##    <chr>                  <dbl>
##  1 Usain Bolt              9.90
##  2 Asafa Powell            9.94
##  3 Tyson Gay               9.95
##  4 Justin Gatlin           9.96
##  5 Yohan Blake             9.96
##  6 Maurice Greene          9.97
##  7 Ato Boldon             10.00
##  8 Mike Rodgers           10.00
##  9 Jimmy Vicaut           10.0 
## 10 Frank Fredericks       10.0 
## 11 Nesta Carter           10.0 
## 12 Carl Lewis             10.0 
## 13 Linford Christie       10.0 
## 14 Dennis Mitchell        10.0 
## 15 Kim Collins            10.0
male_100_medians <- male_100_2 %>%
  group_by(runner) %>%
  summarise(median_timing = median(timing)) %>%
  arrange(median_timing)
male_100_medians
## # A tibble: 15 x 2
##    runner           median_timing
##    <chr>                    <dbl>
##  1 Usain Bolt                9.89
##  2 Asafa Powell              9.95
##  3 Justin Gatlin             9.97
##  4 Maurice Greene            9.97
##  5 Tyson Gay                 9.97
##  6 Yohan Blake               9.98
##  7 Ato Boldon               10   
##  8 Mike Rodgers             10   
##  9 Nesta Carter             10.0 
## 10 Frank Fredericks         10.0 
## 11 Jimmy Vicaut             10.0 
## 12 Linford Christie         10.0 
## 13 Dennis Mitchell          10.0 
## 14 Carl Lewis               10.0 
## 15 Kim Collins              10.0

There are 6 runners who are at the top using both the mean or the median. Also, for all of them, the mean and the median timings are less than 10 seconds. So we will further restrict the data to these 6 runners, and see if their race timings are significantly different.

male_100_3 <- male_100_2 %>%
  filter(runner %in% c("Usain Bolt", "Asafa Powell", "Yohan Blake",
                       "Justin Gatlin", "Maurice Greene", "Tyson Gay"))

Analysis

Now that we have the data ready, our next step is to try and answer the original question. We will follow two different approaches. Our first approach will be the non-parametric Kruskall-Wallis rank sum test. We will also do a multiple comparison test for the Kruskall-Wallis. The next approach will be a pairwise comparison of the runners using simulations. This approach was popularized in an article by Allen Downey, which said that there is only one statistical test. Another article which implements the same idea, and provides code using the infer package, can be found in this link.

kt <- kruskal.test(timing ~ runner, data = male_100_3)
kt
## 
##  Kruskal-Wallis rank sum test
## 
## data:  timing by runner
## Kruskal-Wallis chi-squared = 23.325, df = 5, p-value = 0.0002925

Since the p-value is significant at the 99% level, it indicates that at least one of the runners is different from the others.

library(pgirmess)
ktph <- kruskalmc(male_100_3$timing, male_100_3$runner)
ktph$dif.com
##                                 obs.dif critical.dif difference
## Asafa Powell-Justin Gatlin    27.379784     58.30836      FALSE
## Asafa Powell-Maurice Greene   49.491525     62.10274      FALSE
## Asafa Powell-Tyson Gay        24.277355     66.65196      FALSE
## Asafa Powell-Usain Bolt       57.032834     65.98797      FALSE
## Asafa Powell-Yohan Blake      39.856165     66.99643      FALSE
## Justin Gatlin-Maurice Greene  22.111741     66.69805      FALSE
## Justin Gatlin-Tyson Gay        3.102429     70.95325      FALSE
## Justin Gatlin-Usain Bolt      84.412617     70.32987       TRUE
## Justin Gatlin-Yohan Blake     12.476381     71.27693      FALSE
## Maurice Greene-Tyson Gay      25.214170     74.10297      FALSE
## Maurice Greene-Usain Bolt    106.524358     73.50631       TRUE
## Maurice Greene-Yohan Blake     9.635360     74.41295      FALSE
## Tyson Gay-Usain Bolt          81.310188     77.38805       TRUE
## Tyson Gay-Yohan Blake         15.578810     78.24973      FALSE
## Usain Bolt-Yohan Blake        96.888999     77.68493       TRUE

The only two cases where there seems to be a difference are - “Justin Gatlin-Usain Bolt” and “Maurice Green-Usain Bolt”.

We now do the pairwise comparison of athletes using simulations. To do that, we will write a function to compare two individual runners. The first step will be to calculate the difference in medians in the observed data. This is accomplished using the specify and calculate functions in the infer package. specify is used to specify the response and explanatory variables. calculate is used to calculate the test statistic, which is the difference in medians in this case. The second step is similar, except that we include a null hypothesis that there is no difference in the median timings between the runners. We also perform 10,000 simulations - where each simulation permutes the runners in the existing data, and then calculate the difference in medians between the two groups. The infer package also provides a visualize function to visualize the distribution of the simulated test statistic.

library(infer)
set.seed(2154)
compare_runners <- function(runner1, runner2, visualize = FALSE) {
  
  diff_med <- male_100_3 %>%
    filter(runner %in% c(runner1, runner2)) %>%
    arrange(runner) %>%
    specify(timing ~ runner) %>%
    calculate("diff in medians",
              order = c(runner1, runner2))
  
  diff_med_null <- male_100_3 %>%
    filter(runner %in% c(runner1, runner2)) %>%
    arrange(runner) %>%
    specify(timing ~ runner) %>%
    hypothesize(null = "independence") %>%
    generate(reps = 10000, type = "permute") %>%
    calculate("diff in medians",
              order = c(runner1, runner2))
  
  if (visualize) {
    diff_med_plot <- diff_med_null %>%
      visualize(method = "simulation") +
      shade_p_value(obs_stat = diff_med, direction = "less") +
      xlab("Difference in median run timings") +
      ylab(NULL) +
      ggtitle(paste0("Comparing ", runner1, " and ", runner2)) +
      scale_x_continuous(labels = scales::number_format(accuracy = 0.01)) +
      theme_bw() +
      theme(
        panel.border = element_blank(),
        plot.title = element_text(hjust = 0.5, vjust = 0.5)
      )
    print(diff_med_plot)
  }
  
  diff_med_null %>%
    get_pvalue(obs_stat = diff_med, direction = "less")
    
}

The function is applied to each pair of runners. Rather than looking at the results for all the 15 pairs of runners, we do it based on their rankings by median times. There is only one exception, where we also look at the pair Usain Bolt-Yohan Blake, who have the 1st and 3rd least median times respectively. The results for each pair, as well as the p-value generated by the simulation, are shown in the graphs and tables below.

compare_runners("Usain Bolt", "Asafa Powell", TRUE)

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1  0.0036
compare_runners("Asafa Powell", "Yohan Blake", TRUE)

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.108
compare_runners("Yohan Blake", "Justin Gatlin", TRUE)

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.698
compare_runners("Justin Gatlin", "Maurice Greene", TRUE)

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.705
compare_runners("Maurice Greene", "Tyson Gay", TRUE)

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.698
compare_runners("Usain Bolt", "Yohan Blake", TRUE)

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1       0

There is only a 1.85% chance of seeing a difference as large as the observed difference if there is actually no difference between the timings of Usain Bolt and Asafa Powell. The chance for the pair of Usain Bolt and Yohan Blake is even lower at 0.27%. For all the remaining pairs, we fail to reject the null hypothesis and do not find sufficient evidence to conclude that the timings for these pairs are statistically different. It appears that Usain Bolt is better than the next best in the list which is Asafa Powell and Yohan Blake. Given the data we have, we do have reason to believe that Usain Bolt is the fastest man on earth (as measured by 100m race timings).

The reproducible code for this analysis can be found in Github.