Data visualization can be beautiful, humorous, and sometimes a waste of ink, but it is an important tool for understanding and explaining what the data are trying to tell us. One of the better justifications for graphical exploration of data was given by Francis Anscombe, who created a series of data sets and accompanying graphs known as Anscombe’s Quartet.
Anscombe was a statistician during the early days of statistical computing, and stressed that “a computer should make both calculations and graphs”. To make his point, Anscombe created four data sets, each consisting of two measurements on eleven objects (e.g., eleven kids, each measured for x = height and y = weight). He made sure that all four data sets have identical (to two decimal places) values of the typical statistics used to characterize data – mean, variance, fitted regression line, and correlation coefficient. But when the data are plotted on a graph of x versus y – no more complex than the average bar napkin – they are strikingly different (above).
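You can verify Anscombe’s claim yourself with nothing but the Python standard library. The sketch below uses the quartet’s published values (Anscombe, 1973) and computes the usual summary statistics for each set; the `summary` helper is my own naming, not part of any library.

```python
from statistics import mean, variance

# Anscombe's four data sets (1973). Sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summary(xs, ys):
    """Means, variances, Pearson correlation, and least-squares line for one set."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    r = sxy / (sxx * syy) ** 0.5      # correlation coefficient
    slope = sxy / sxx                  # fitted line: y = intercept + slope * x
    intercept = my - slope * mx
    return (mx, variance(xs), round(my, 2), round(variance(ys), 2),
            round(r, 2), round(slope, 2), round(intercept, 2))

for name, (xs, ys) in quartet.items():
    print(name, summary(xs, ys))
```

Every row prints the same numbers (to two decimal places): mean of x = 9, variance of x = 11, mean of y = 7.5, variance of y ≈ 4.12, correlation ≈ 0.82, and fitted line y = 3 + 0.5x – which is exactly why the statistics alone cannot tell the four sets apart.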
Anscombe’s point was that, if you don’t plot the data, you could be missing something important and arrive at the wrong conclusion. For set one (upper left), there’s some scatter around the line, but generally speaking, y increases with x (e.g., weight increases with height), and a line fitted through the points would give you a good guess of y for any x. For the other sets…well, each has its own quirks: set two (upper right) follows a smooth curve rather than a straight line, and sets three (lower left) and four (lower right) each contain an outlier – an odd value that could be something really important (there are aliens among us) or a simple error (where did that decimal point go again?).
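Reproducing the four-panel figure takes only a few lines. The sketch below is one way to do it, assuming matplotlib is installed; the filename `anscombe.png` is arbitrary, and the shared fitted line y = 3 + 0.5x is drawn in each panel to make the contrast with the raw points obvious.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the script runs headless
import matplotlib.pyplot as plt

# Anscombe's four data sets (1973). Sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(8, 6))
for ax, (name, (xs, ys)) in zip(axes.flat, quartet.items()):
    ax.scatter(xs, ys)
    # All four sets share (to two decimals) the fitted line y = 3 + 0.5x.
    ax.plot([2, 20], [3 + 0.5 * 2, 3 + 0.5 * 20], color="gray")
    ax.set_title(f"Set {name}")
fig.savefig("anscombe.png")
```

Four identical regression lines, four very different clouds of points – the whole argument in one screenful of code.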