Book Reviews

The Book of R - Tilman M. Davies

In my day job, I use the software and programming language SAS®. For a long time, I had intended to learn R; however, most of the well-known books were primarily about statistics and not about R as a programming language. In 2012, “The Art of R Programming” was released, which partly met my needs. However, given the lack of exercises in the book, I could never get beyond the first couple of chapters, as I tended to “forget” the material quickly.

“The Book of R” is a perfect beginner’s R book. Borrowing the author’s own words, it provides a precursor and supplement to “The Art of R Programming”. It is divided into five parts. The first two parts provide the basics of R as a programming language. The third and fourth parts teach you some statistics and statistical inference, along with how to do these using R. The fifth part teaches you about advanced plotting techniques available in R. Each section is accompanied by a set of exercises which reinforce the concepts learned in the section. The end of each chapter also provides a list of the important code used in the chapter, which can be handy in a future review of the material. This book is strongly recommended if you are using R for the first time and need something deeper than the quick online tutorials available on many websites.

R for Data Science - Hadley Wickham & Garrett Grolemund

While this is a very good book, I would not recommend it to beginners. The book teaches you how to extract, transform, load, visualize and model data using the tidyverse and related R packages. The tidyverse is a collection of packages which have an underlying design philosophy and maintain consistency between them. These packages provide a lot of functionality already available in base R in addition to much more. The advantage of using them is that they iron out the rough parts which were not apparent in the early design of base R packages. They also make your code much more consistent and easy to comprehend once you have understood the underlying philosophy. However, the trade-off is that there is some mental overhead required in using them.

The book is divided into five main sections - Explore, Wrangle, Program, Model and Communicate. Each section progressively builds on tools which you might use for an actual data science project. I am strongly of the opinion that you should only read this book once you have read “The Book of R” or “The Art of R Programming”. It is important to understand base R functionality; only then can one appreciate how the tidyverse functions improve upon it. Of course, in actual projects, I would encourage people to use the tidyverse once they have mastered it. This book is recommended once you have learnt a bit of R and want to improve your programming skills so that they can be applied to real-life projects.

The Art of R Programming - Norman Matloff

This book was one of the first to focus on the programming aspects of R. It teaches you all the basics of the R language, with numerous worked out examples which the author refers to as “Extended Examples”.

However, I would not recommend this book to someone who is absolutely new to programming. The best option for such readers would be to first read “The Book of R” and then use this book to strengthen their knowledge of the language. Of course, if you are already good at programming in some other language like Python, this book can serve as an introduction to R. It also provides some knowledge of what happens in the background for those with some R experience. One example would be the distinction between S3 and S4 classes, which many experienced R programmers are not aware of. Another example would be functions related to debugging - even though I have been frequently debugging R code at work using RStudio, this book provides the details of the actual R functions which are used in the debugging process. My only criticism of the book is that a few of the chapters appear rushed - for example, the chapter on string manipulation could have been done better. Overall, a good book for people with some programming experience to learn R.

The Visual Display of Quantitative Information - Edward R. Tufte

This book is a classic on data visualization. It comprises eight chapters and can be read over a weekend. If there is one key takeaway from the book, it is this: above all, “Show the data, and show it honestly”. This implies that graphics must not have unnecessary clutter, must be designed so as to maximize the share of ink used to display the data and must not mislead the reader. In principle, most of these ideas sound commonsensical; however, he provides multiple real-life examples where these principles were not adhered to and shows how they can be improved.

A possibly wrong takeaway from the book is that data graphics must adhere to a “minimalist” principle. I don’t believe Tufte intended that to be the message. However, we often come across graphics with fancy decorations and colours which have been used just for the sake of using them and do not provide any additional information or insights on the data at all. On the contrary, they might distract the reader into focusing on unimportant details. Tufte refers to this as “chartjunk” and urges readers not to use it while designing a data graphic. In fact, Charles Minard’s graphic on Napoleon’s march - often considered to be one of the best visualizations of all time - is not “minimalist” in any sense.

I would also caution readers against blindly following all the advice in the book, just because “Tufte said so”. For example, I personally find the standard box plot to be much more intuitive to understand than Tufte’s quartile plot (example here). The principles Tufte advocates should be adhered to in a general sense without being pedantic about them. To conclude, I would definitely rate this a must-read book for all data visualization enthusiasts.

ggplot2: Elegant Graphics for Data Analysis - Hadley Wickham

When I first started using ggplot2 in my own work, my natural instinct was to search for how to create a particular graphic, which would generally lead me to StackOverflow. There I came across suggestions like - use scale_fill_manual along with a named vector - to obtain, say, an appropriate legend. While I was always able to get my desired result after a bit of trial and error, I never really understood why it worked.
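
To make that kind of suggestion concrete, here is a minimal sketch of using scale_fill_manual with a named vector. The data frame, the group names and the colour choices are all made up for illustration; only the ggplot2 functions themselves come from the package.

```r
library(ggplot2)

# Hypothetical data: three groups with one value each
df <- data.frame(
  group = c("a", "b", "c"),
  value = c(3, 5, 2)
)

# Named vector: each name is a level of `group`, each value is its colour.
# Because matching is done by name, the order of the entries does not matter.
fill_colours <- c(c = "grey50", a = "steelblue", b = "darkorange")

p <- ggplot(df, aes(x = group, y = value, fill = group)) +
  geom_col() +
  scale_fill_manual(values = fill_colours, name = "Group")

# print(p) would draw the bar chart with the legend titled "Group"
```

The named vector ties each colour to a level explicitly, so the legend stays correct even if the data are reordered or a level is missing.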

This book will teach you the underlying “layered grammar of graphics” on which ggplot2 is based. It is not meant to be used as a book where you see the sample code and modify it for your needs. Rather, this book is meant for you to understand how the grammar works and how to use ggplot2 to construct graphics using this grammar. After reading this book, you should be able to understand what data, layers, scales, coordinate systems, facets and themes are. You should be able to understand why Hadley wrote this answer in 2010.

My only complaint about this book concerns Part III. Rather than give a very brief overview of some of the other important tidyverse packages, he could have used this section to work through some complex visualizations and show how one can use ggplot2 to create them. The material in this part is not very good, and some of the example code will not work unless you are already a bit familiar with the tidyverse. Nevertheless, this is a must-read book for any data visualization practitioner who wishes to use R in their work.

Advanced R - Hadley Wickham

Read this book if you have already used R and want to level up your R programming skills. This is definitely not a beginner’s book and should be read once you have some experience working in R. It also provides pointers on working with different programming paradigms like object-oriented or functional programming using R.

This book is divided into four main parts - Foundations, Functional Programming, Metaprogramming and Performant Code. The first part teaches you the intricacies of R, and helps clarify frustrations with the language often faced by beginners. After reading this part, you should be able to understand how the same function called on different objects behaves differently.
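
As a rough illustration of the last point, the sketch below uses S3 dispatch in base R, which is one common reason the same function behaves differently on different objects. The generic `describe` and its methods are hypothetical names invented for this example, not something taken from the book.

```r
# A generic function: UseMethod() picks a method based on the class of `x`
describe <- function(x, ...) UseMethod("describe")

# Method used for numeric vectors (both doubles and integers)
describe.numeric <- function(x, ...) {
  sprintf("numeric vector with mean %.1f", mean(x))
}

# Method used for character vectors
describe.character <- function(x, ...) {
  sprintf("character vector with %d elements", length(x))
}

# Fallback when no more specific method exists
describe.default <- function(x, ...) {
  sprintf("object of class %s", class(x)[1])
}

num_desc <- describe(c(1, 2, 3))   # dispatches to describe.numeric
chr_desc <- describe(c("a", "b"))  # dispatches to describe.character
lgl_desc <- describe(TRUE)         # falls through to describe.default
```

Built-in functions such as print() and summary() rely on the same mechanism.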

In R, functions are first-class objects and can be passed to and returned from functions. The second part teaches you functional programming using R and shows you how to effectively use this programming paradigm to solve real-world problems. The third part teaches you metaprogramming, which generally refers to programs that write or manipulate other programs. After reading this part, you should be able to understand, for example, why you need to use quosures when programming with dplyr. The final part shows you how to profile your code, improve the performance of your code and use C++ to further improve performance when R is not sufficient for your needs.
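
A small sketch of what “first-class” means in practice, using only base R. The helper names `make_power` and `compose2` are invented for this illustration and do not come from the book.

```r
# A function that returns a function (a closure over `exponent`)
make_power <- function(exponent) {
  function(x) x^exponent
}

square <- make_power(2)
cube <- make_power(3)

# A function passed as an argument to another function
squares <- vapply(1:4, square, numeric(1))

# A function that takes two functions and returns their composition
compose2 <- function(f, g) {
  function(x) g(f(x))
}

square_then_cube <- compose2(square, cube)
result <- square_then_cube(2)  # (2^2)^3
```

Here `square` and `cube` are ordinary objects: they can be stored in variables, handed to vapply(), or combined into new functions.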

Overall, this is a great book and a must read if you want to seriously program in R as well as develop R packages.

R Packages - Hadley Wickham

This book is most useful if you are intending to develop an R package. This book should not be read like a textbook, unlike R for Data Science or Advanced R. Instead I would suggest that you skim through the book quickly and then refer back to the appropriate chapter during the development process.

All aspects of developing an R package are explained in the book, from the structure of your directory and code to metadata, documentation and testing. The final section has three chapters on best practices, which I believe are the most important chapters of the book. After reading this section, you will be familiar with the development process using GitHub, continuous integration and testing, and submitting to CRAN. Overall not a book which is “fun” to read, but very important for aspiring package developers who wish to see their package on CRAN.

R Graphics Cookbook - Winston Chang

When I first started using ggplot2, my main source of information was StackOverflow. I would have a particular problem, search for it and get the syntax from StackOverflow, which I would gladly copy and paste into my code. After some time, I could detect some common patterns but still did not understand how the overall code worked. Finally, I read the ggplot2 book by Hadley Wickham, which explained the underlying grammar, and I could make sense of a lot of the code which I had used.

However, now I wish that I had bought a copy of this book by Winston Chang earlier (I was in a sense lucky, because the 2nd edition came out in late 2018 and I got the latest edition instead). Like all programming cookbooks, it provides a set of recipes. Each recipe comprises a problem that you might have, along with a solution for it. In addition, the author provides a detailed discussion of each solution as well as alternatives which may be useful.

Most of the book uses ggplot2, except for one chapter which demonstrates the base plot functions. There is also a chapter on miscellaneous graphs which provides a few examples of graphs for which more specialised packages are required - for example, the igraph package for creating network graphs. After working through this book, the reader should be familiar with a wide variety of techniques to achieve a particular graphical output. The last chapter also provides an overview of some of the tidyverse packages used to transform the data into a shape which is acceptable for ggplot2. This book can be used both as a text and a reference - I recommend that readers work through most of the examples quickly in order to get a taste of the different kinds of charts they can produce and how to change their appearance. In the future, it can be used as a reference book to look up solutions for particular problems. Overall, strongly recommended if your job involves creating data visualisations using R.

R Markdown: The Definitive Guide - Yihui Xie, J.J. Allaire & Garrett Grolemund

This is a fairly comprehensive introduction to the kind of work you can do using knitr, rmarkdown and related packages. Originally meant to create reports, the R Markdown ecosystem can now be used for creating reports, dashboards, presentations, notebooks, books, websites and interactive applications. In fact, this website is itself generated using blogdown, an R package which uses the static site generator Hugo to help you build your own website.

Similar to the R Graphics Cookbook, this is meant to be a reference and not a textbook. However, unlike the graphics book, it does not make sense to read this book cover to cover and work through all the examples. Instead, once you have a specific problem which you need to solve, you should find the relevant chapter in the book to understand the basics and refer to more advanced sources for the details. Also note that the topics are not covered in great depth, so you will need to refer to external sources once you have decided to use a particular infrastructure mentioned in this book. For example, I was recently using the ioslides_presentation output format available in R Markdown to create a presentation. However, given my specific requirements, there were lots of customisations involved using CSS, which are only available via StackOverflow answers or blog posts.

I personally prefer paper copies of every book I own and own a copy of this one. Depending on your needs and preferences, the online version of this book may be enough to serve as a reference.

Fundamentals of Data Visualization - Claus O. Wilke

This is a must-read book on data visualization which covers both theory and practice. While it describes some general principles for good visualizations, it also provides good explanations of why the author chose to do what he did, along with plenty of practical examples. This is in contrast to books like “The Visual Display of Quantitative Information”, which provide some general principles but do not provide enough information for a data scientist to readily apply these principles in a professional setting.

The book is divided into three parts. The first part describes the grammar of graphics, followed by a directory of visualizations. The grammar is described in a general way, without reference to any specific implementation such as the popular R package ggplot2. One chapter provides an overview of the various plots and charts which can be used to visualize different types of data, followed by detailed chapters on each type of visualization. The author not only describes what to do, but also gives his rationale for a particular choice and the tradeoffs between various choices wherever applicable, along with plenty of examples. In some cases, he also explains the statistical background on which he has based a particular choice. The second part focuses more on design issues like colors, fonts, axis labels and other practical guidelines. The third part covers an introduction to file types, visualization software and some basic principles of good visual storytelling.

I like that the book is opinionated. The author does not hesitate to call out the shortcomings in the work of well-known people, and provides logical arguments for why he thinks so. The treatment of designing for people who suffer from colour blindness is better than any other material I have read so far. While the author states in the Preface that there is no need to read the book cover to cover, I would strongly advise all readers to read it cover to cover once and then come back to specific chapters for reference as needed. It is a fairly quick read and can be covered in a few hours. I wish I had had such a book when I was first getting into data science; this should be the first visualization book for aspiring data scientists and a useful reference for those with more experience.