How to read a box plot
Box plots are used to summarise numerical data. They are a way to visualise the five number summary of a data series: the minimum, the first quartile (Q1), the median, the third quartile (Q3) and the maximum.
Box Plots are sometimes called Box and Whisker Plots because they are made up of a box and two whiskers. The box describes the middle part of the data, while the whiskers describe the tails of the data and sometimes tell us about outliers.
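As a rough illustration, the five number summary can be computed with Python's standard library. This is only a sketch: the function name five_number_summary and the sample values are made up for the example, and the "inclusive" quartile method is one of several conventions, so other tools may report slightly different quartiles.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    # quantiles(..., n=4) returns the three cut points Q1, median and Q3.
    # method="inclusive" is an assumption; other quartile conventions
    # can give slightly different Q1 and Q3 values.
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    return data[0], q1, med, q3, data[-1]

# Hypothetical sample data, used only for illustration.
sample = [2, 4, 4, 5, 7, 8, 9, 11, 12]
print(five_number_summary(sample))
```

These five numbers are the values a box plot draws: the ends of the whiskers, the edges of the box and the median.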
What does the box show in a box plot
The box in a Box Plot extends from the first quartile (Q1) to the third quartile (Q3). This means the box contains the middle 50% of the data set. A line inside the box usually marks the median.
The distance between the bottom of the box (Q1) and the top of the box (Q3) is the Inter Quartile Range (IQR). The Inter Quartile Range can be used as a measure of spread of the data set.
A large Inter Quartile Range suggests that the middle half of the data set is quite spread out, while a small Inter Quartile Range suggests that it is clustered around the median.
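To make the comparison concrete, here is a small sketch that computes the Inter Quartile Range for two made-up data sets with the same median. The helper name interquartile_range and both data sets are hypothetical, and the "inclusive" quartile method is an assumption.

```python
from statistics import quantiles

def interquartile_range(values):
    """Return Q3 - Q1, the length of the box in a box plot."""
    # quantiles() sorts the data itself; method="inclusive" is an assumption.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Two hypothetical data sets that share a median of 50.
clustered = [48, 49, 50, 50, 50, 51, 52]
spread_out = [10, 30, 45, 50, 55, 70, 90]

print(interquartile_range(clustered))
print(interquartile_range(spread_out))
```

The second data set gives a much larger value, which would show up as a much longer box.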
What do the whiskers show in a box plot
When the whiskers are drawn to the minimum and maximum values, they show the outer 50% of the data set. The top whisker represents the top 25% of all data. The bottom whisker represents the bottom 25% of all data.
The whiskers can become really long if they are drawn to the minimum and maximum values of a data set and there are outliers. This can make it really hard to read a Box Plot because the boxes get squashed.
How to shorten the whiskers in a Box and Whisker Plot
The whiskers in a Box and Whisker graph can be made shorter by removing outliers, or by drawing the whiskers to the 5th and 95th percentiles instead of the minimum and maximum values.
Drawing whiskers to the 5th and 95th percentiles can shorten them a lot if you have some extreme values outside of these percentiles. An advantage of this approach is that percentiles are a fairly standard way to compare data sets. A downside is that values between the 5th and 95th percentiles can still lie far from the box, so the whiskers can still be really long.
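The percentile approach can be sketched as follows. The function name percentile_whiskers and the example data are invented for illustration, and the "inclusive" interpolation method is an assumption; plotting libraries may compute percentiles slightly differently.

```python
from statistics import quantiles

def percentile_whiskers(values, lower=5, upper=95):
    """Return the values at the lower and upper percentiles (whole numbers 1-99)."""
    # n=100 yields 99 cut points; cuts[p - 1] is the p-th percentile.
    cuts = quantiles(values, n=100, method="inclusive")
    return cuts[lower - 1], cuts[upper - 1]

# Hypothetical data: the numbers 1 to 100 plus one extreme value.
data = list(range(1, 101)) + [1000]

low, high = percentile_whiskers(data)
print("whiskers drawn to min and max:", min(data), max(data))
print("whiskers drawn to 5th and 95th percentiles:", low, high)
```

With this data the single extreme value stretches the top whisker to 1000 when it is drawn to the maximum, while the 95th percentile stays close to the rest of the data.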
Another approach is to limit the whiskers to some multiple of the Inter Quartile Range. For example, any values that are more than Q3 + 1.5 x IQR or less than Q1 - 1.5 x IQR get marked as outliers and are left out of the whiskers. They are often drawn as individual points instead.
This approach means that the ratio of whisker to box size can be controlled, so the box never gets squashed: with a multiple of 1.5, each whisker is at most 1.5 times as long as the box. One disadvantage is that the whiskers will stop at or before Q3 + 1.5 x IQR and Q1 - 1.5 x IQR, which are not standard ways to compare data sets, and this may lead to confusion or misinterpretation.
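A minimal sketch of the 1.5 x IQR rule is shown below. The function name iqr_whiskers, the parameter k and the sample data are hypothetical. The sketch also assumes one common convention, in which each whisker ends at the most extreme data value that is still inside the limits rather than exactly at the limit.

```python
from statistics import quantiles

def iqr_whiskers(values, k=1.5):
    """Return (lower whisker end, upper whisker end, outliers)."""
    data = sorted(values)
    # method="inclusive" is an assumption; other quartile conventions exist.
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_limit = q1 - k * iqr
    high_limit = q3 + k * iqr
    # Values outside the limits are marked as outliers.
    outliers = [x for x in data if x < low_limit or x > high_limit]
    inside = [x for x in data if low_limit <= x <= high_limit]
    # Assumed convention: whiskers end at the most extreme values inside the limits.
    return inside[0], inside[-1], outliers

# Hypothetical data with one extreme value.
data = [2, 4, 4, 5, 7, 8, 9, 11, 40]
print(iqr_whiskers(data))
```

Here Q1 is 4 and Q3 is 9, so the limits are -3.5 and 16.5. The value 40 falls outside the upper limit and is reported as an outlier instead of stretching the top whisker.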