If you search for the phrase “fastest man on earth” in Google, chances are that it will return the answer “Usain Bolt”. It certainly does so for me, even though the results might be different if Google decides to personalize the results for you. This is because currently he holds the world record for being the quickest (9.58s) to run a 100 metres race. Justin Gatlin once did it in 9.45s, but that does not count as the record because he was assisted by the wind.

However, as statisticians, we know that a single data point does not necessarily prove anything and we should look at the distribution of timings for the top runners to check if Usain Bolt is truly the fastest man on earth. In this article, we will try to find an answer to this question using data.

### Data

Our first step is to find data on the 100m race timings for all top runners. Unfortunately, I could not find any public source of this data. The best option we have is to use data from the excellent website Alltime Athletics maintained by Peter Larsson. This page provides the details of the all-time men’s best 100m race timings. We can answer a slightly different question using this data. Out of all runners who appear in this list, is there any evidence to suggest that there is a statistical difference among the top runners?

Our first step will be to extract and store the data in a R data frame. We use the *rvest* and *readr* R packages to do this.

`library(rvest)`

`## Warning: package 'xml2' was built under R version 3.6.1`

```
library(readr)
male_100_html <- read_html("http://www.alltime-athletics.com/m_100ok.htm")
```

An examination of the HTML source reveals that the data has been stored within HTML pre-tags. There are also seven tables available as part of the page, out of which we are interested in the table “All-time men’s best 100m”. We use the function *html_nodes* from the *rvest* package using an XPath selector. The next step is to extract the HTML text using the function *html_text*, and restrict to the first element.

```
male_100_pres <- male_100_html %>%
html_nodes(xpath = "//pre")
male_100_htext <- male_100_pres %>%
html_text()
male_100_htext <- male_100_htext[[1]]
```

The data is stored in fixed-width format in the variable *male_100_htext*. The *read_fwf* function from the *readr* package can be used to store the data in a data frame. The exact arguments to the function were decided using a bit of trial and error.

```
male_100 <- readr::read_fwf(male_100_htext, skip = 1, n_max = 3178,
col_types = cols(.default = col_character()),
col_positions = fwf_positions(
c(1, 16, 27, 35, 66, 74, 86, 93, 123),
c(15, 26, 34, 65, 73, 85, 92, 122, 132)
))
str(male_100)
```

```
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 3178 obs. of 9 variables:
## $ X1: chr "1" "2" "3" "3" ...
## $ X2: chr "9.58" "9.63" "9.69" "9.69" ...
## $ X3: chr "+0.9" "+1.5" "±0.0" "+2.0" ...
## $ X4: chr "Usain Bolt" "Usain Bolt" "Usain Bolt" "Tyson Gay" ...
## $ X5: chr "JAM" "JAM" "JAM" "USA" ...
## $ X6: chr "21.08.86" "21.08.86" "21.08.86" "09.08.82" ...
## $ X7: chr "1" "1" "1" "1" ...
## $ X8: chr "Berlin" "London" "Beijing" "Shanghai" ...
## $ X9: chr "16.08.2009" "05.08.2012" "16.08.200" "20.09.2009" ...
## - attr(*, "spec")=
## .. cols(
## .. .default = col_character(),
## .. X1 = col_character(),
## .. X2 = col_character(),
## .. X3 = col_character(),
## .. X4 = col_character(),
## .. X5 = col_character(),
## .. X6 = col_character(),
## .. X7 = col_character(),
## .. X8 = col_character(),
## .. X9 = col_character()
## .. )
```

For this analysis, we are only interested in the name of the runner and the race timing. So we restrict the data to these two columns, and give each column an appropriate name.

```
library(dplyr)
male_100 <- male_100 %>%
select(X2, X4)
colnames(male_100) <- c("timing", "runner")
```

We observe that some of the timings have an “A” against the time. The Wikipedia page Athletics abbreviations informs us that this indicates a mark set at altitude. We simply remove the “A” and also convert the variable *timing* into a numeric variable.

```
male_100 <- male_100 %>%
mutate(timing = gsub("A", "", timing),
timing = as.numeric(timing))
male_100
```

```
## # A tibble: 3,178 x 2
## timing runner
## <dbl> <chr>
## 1 9.58 Usain Bolt
## 2 9.63 Usain Bolt
## 3 9.69 Usain Bolt
## 4 9.69 Tyson Gay
## 5 9.69 Yohan Blake
## 6 9.71 Tyson Gay
## 7 9.72 Usain Bolt
## 8 9.72 Asafa Powell
## 9 9.74 Asafa Powell
## 10 9.74 Justin Gatlin
## # ... with 3,168 more rows
```

### Exploration

As with any statistical analysis, the first step is to explore and understand the data we are working with. We see that there are 318 unique runners in the data.

`n_distinct(male_100$runner)`

`## [1] 340`

```
runner_cnt <- male_100 %>%
group_by(runner) %>%
summarise(n_rec = n()) %>%
arrange(desc(n_rec))
runner_cnt
```

```
## # A tibble: 340 x 2
## runner n_rec
## <chr> <int>
## 1 Asafa Powell 145
## 2 Mike Rodgers 103
## 3 Justin Gatlin 102
## 4 Maurice Greene 83
## 5 Nesta Carter 69
## 6 Usain Bolt 69
## 7 Tyson Gay 67
## 8 Yohan Blake 66
## 9 Ato Boldon 65
## 10 Frank Fredericks 62
## # ... with 330 more rows
```

As expected, some of the well-known names dominate the list with the maximum number of records. Our next step is to look at the distribution of the number of records.

`library(ggplot2)`

`## Warning: package 'ggplot2' was built under R version 3.6.1`

```
ggplot(runner_cnt, aes(n_rec)) +
geom_histogram(binwidth = 5, fill = "lightblue", colour = "black") +
xlab(NULL) +
ylab(NULL) +
ggtitle("Distribution of number of records by runner") +
theme_bw() +
theme(
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5, vjust = 0.5)
)
```

As expected, this distribution is heavily skewed as the top runners will tend to dominate the list. We first restrict our data to those who have at least 50 records each.

```
male_100_2 <- male_100 %>%
inner_join(
select(filter(runner_cnt, n_rec >= 50), runner),
by = c("runner" = "runner")
)
```

There are 16 runners left in this data. Let us have a look at the distribution of their race timings as well as the mean and median race timings for these runners.

```
ggplot(male_100_2, aes(timing)) +
geom_density() +
facet_wrap(~runner, nrow = 4, ncol = 4) +
xlab(NULL) +
ylab(NULL) +
ggtitle("Distribution of race timings by runner") +
theme_bw() +
theme(
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5, vjust = 0.5)
)
```

```
male_100_means <- male_100_2 %>%
group_by(runner) %>%
summarise(mean_timing = mean(timing)) %>%
arrange(mean_timing)
male_100_means
```

```
## # A tibble: 15 x 2
## runner mean_timing
## <chr> <dbl>
## 1 Usain Bolt 9.90
## 2 Asafa Powell 9.94
## 3 Tyson Gay 9.95
## 4 Justin Gatlin 9.96
## 5 Yohan Blake 9.96
## 6 Maurice Greene 9.97
## 7 Ato Boldon 10.00
## 8 Mike Rodgers 10.00
## 9 Jimmy Vicaut 10.0
## 10 Frank Fredericks 10.0
## 11 Nesta Carter 10.0
## 12 Carl Lewis 10.0
## 13 Linford Christie 10.0
## 14 Dennis Mitchell 10.0
## 15 Kim Collins 10.0
```

```
male_100_medians <- male_100_2 %>%
group_by(runner) %>%
summarise(median_timing = median(timing)) %>%
arrange(median_timing)
male_100_medians
```

```
## # A tibble: 15 x 2
## runner median_timing
## <chr> <dbl>
## 1 Usain Bolt 9.89
## 2 Asafa Powell 9.95
## 3 Justin Gatlin 9.97
## 4 Maurice Greene 9.97
## 5 Tyson Gay 9.97
## 6 Yohan Blake 9.98
## 7 Ato Boldon 10
## 8 Mike Rodgers 10
## 9 Nesta Carter 10.0
## 10 Frank Fredericks 10.0
## 11 Jimmy Vicaut 10.0
## 12 Linford Christie 10.0
## 13 Dennis Mitchell 10.0
## 14 Carl Lewis 10.0
## 15 Kim Collins 10.0
```

There are 6 runners who are at the top using both the mean or the median. Also, for all of them, the mean and the median timings are less than 10 seconds. So we will further restrict the data to these 6 runners, and see if their race timings are significantly different.

```
male_100_3 <- male_100_2 %>%
filter(runner %in% c("Usain Bolt", "Asafa Powell", "Yohan Blake",
"Justin Gatlin", "Maurice Greene", "Tyson Gay"))
```

### Analysis

Now that we have the data ready, our next step is to try and answer the original question. We will follow two different approaches. Our first approach will be the non-parametric Kruskall-Wallis rank sum test. We will also do a multiple comparison test for the Kruskall-Wallis. The next approach will be a pairwise comparison of the runners using simulations. This approach was popularized in an article by Allen Downey, which said that there is only one statistical test. Another article which implements the same idea, and provides code using the *infer* package, can be found in this link.

```
kt <- kruskal.test(timing ~ runner, data = male_100_3)
kt
```

```
##
## Kruskal-Wallis rank sum test
##
## data: timing by runner
## Kruskal-Wallis chi-squared = 23.325, df = 5, p-value = 0.0002925
```

Since the p-value is significant at the 99% level, it indicates that at least one of the runners is different from the others.

```
library(pgirmess)
ktph <- kruskalmc(male_100_3$timing, male_100_3$runner)
ktph$dif.com
```

```
## obs.dif critical.dif difference
## Asafa Powell-Justin Gatlin 27.379784 58.30836 FALSE
## Asafa Powell-Maurice Greene 49.491525 62.10274 FALSE
## Asafa Powell-Tyson Gay 24.277355 66.65196 FALSE
## Asafa Powell-Usain Bolt 57.032834 65.98797 FALSE
## Asafa Powell-Yohan Blake 39.856165 66.99643 FALSE
## Justin Gatlin-Maurice Greene 22.111741 66.69805 FALSE
## Justin Gatlin-Tyson Gay 3.102429 70.95325 FALSE
## Justin Gatlin-Usain Bolt 84.412617 70.32987 TRUE
## Justin Gatlin-Yohan Blake 12.476381 71.27693 FALSE
## Maurice Greene-Tyson Gay 25.214170 74.10297 FALSE
## Maurice Greene-Usain Bolt 106.524358 73.50631 TRUE
## Maurice Greene-Yohan Blake 9.635360 74.41295 FALSE
## Tyson Gay-Usain Bolt 81.310188 77.38805 TRUE
## Tyson Gay-Yohan Blake 15.578810 78.24973 FALSE
## Usain Bolt-Yohan Blake 96.888999 77.68493 TRUE
```

The only two cases where there seems to be a difference are - “Justin Gatlin-Usain Bolt” and “Maurice Green-Usain Bolt”.

We now do the pairwise comparison of athletes using simulations. To do that, we will write a function to compare two individual runners. The first step will be to calculate the difference in medians in the observed data. This is accomplished using the *specify* and *calculate* functions in the *infer* package. *specify* is used to specify the response and explanatory variables. *calculate* is used to calculate the test statistic, which is the difference in medians in this case. The second step is similar, except that we include a null hypothesis that there is no difference in the median timings between the runners. We also perform 10,000 simulations - where each simulation permutes the runners in the existing data, and then calculate the difference in medians between the two groups. The *infer* package also provides a *visualize* function to visualize the distribution of the simulated test statistic.

```
library(infer)
set.seed(2154)
compare_runners <- function(runner1, runner2, visualize = FALSE) {
diff_med <- male_100_3 %>%
filter(runner %in% c(runner1, runner2)) %>%
arrange(runner) %>%
specify(timing ~ runner) %>%
calculate("diff in medians",
order = c(runner1, runner2))
diff_med_null <- male_100_3 %>%
filter(runner %in% c(runner1, runner2)) %>%
arrange(runner) %>%
specify(timing ~ runner) %>%
hypothesize(null = "independence") %>%
generate(reps = 10000, type = "permute") %>%
calculate("diff in medians",
order = c(runner1, runner2))
if (visualize) {
diff_med_plot <- diff_med_null %>%
visualize(method = "simulation") +
shade_p_value(obs_stat = diff_med, direction = "less") +
xlab("Difference in median run timings") +
ylab(NULL) +
ggtitle(paste0("Comparing ", runner1, " and ", runner2)) +
scale_x_continuous(labels = scales::number_format(accuracy = 0.01)) +
theme_bw() +
theme(
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5, vjust = 0.5)
)
print(diff_med_plot)
}
diff_med_null %>%
get_pvalue(obs_stat = diff_med, direction = "less")
}
```

The function is applied to each pair of runners. Rather than looking at the results for all the 15 pairs of runners, we do it based on their rankings by median times. There is only one exception, where we also look at the pair Usain Bolt-Yohan Blake, who have the 1st and 3rd least median times respectively. The results for each pair, as well as the p-value generated by the simulation, are shown in the graphs and tables below.

`compare_runners("Usain Bolt", "Asafa Powell", TRUE)`

```
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.0036
```

`compare_runners("Asafa Powell", "Yohan Blake", TRUE)`

```
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.108
```

`compare_runners("Yohan Blake", "Justin Gatlin", TRUE)`

```
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.698
```

`compare_runners("Justin Gatlin", "Maurice Greene", TRUE)`

```
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.705
```

`compare_runners("Maurice Greene", "Tyson Gay", TRUE)`

```
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.698
```

`compare_runners("Usain Bolt", "Yohan Blake", TRUE)`

```
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0
```

There is only a 1.85% chance of seeing a difference as large as the observed difference if there is actually no difference between the timings of Usain Bolt and Asafa Powell. The chance for the pair of Usain Bolt and Yohan Blake is even lower at 0.27%. For all the remaining pairs, we fail to reject the null hypothesis and do not find sufficient evidence to conclude that the timings for these pairs are statistically different. It appears that Usain Bolt is better than the next best in the list which is Asafa Powell and Yohan Blake. Given the data we have, we do have reason to believe that Usain Bolt is the fastest man on earth (as measured by 100m race timings).

The reproducible code for this analysis can be found in Github.