Analyzing data is a unique form of art. It combines the discovery of patterns with a clear explanation of those patterns to the uninitiated. Among other things, a brilliant storyteller of data can trace the origin of a pandemic, locate a wanted fugitive, and determine which items in a massive warehouse need to be replenished without taking a physical inventory. Despite these accomplishments, some data analysts provide misleading or useless information. Statisticians call this noise; in layman’s terms, it’s gibberish. The purpose of this article is to walk through a simple example (summary statistics), highlight some common mistakes that arise from this approach, and then offer some solutions. My goal is to improve your abilities as a storyteller of data.
Summary Statistics
Summary statistics provide the audience with simple yet important metrics about a sample or population. These metrics include the mean (average), median (midpoint, or 50th percentile), minimum and maximum values, standard deviation (dispersion around the mean), and frequency of values. The most important of these is the mean, so I’ll clarify the concept of average. For a particular variable, you add up the values of all participants, then distribute that total uniformly among those participants. Let’s consider the Income variable, reflected in Table 1:
Table 1 (Income ($))




Applying the earlier definition produces Table 2, which shows the total split evenly, with $1,000,037,200 assigned to each of the five participants.
Table 2 (Calculating the Mean of Income ($))





In addition to Income ($), let’s include two other variables, ID and Color in Table 3:
Table 3 (Random Neighborhood)
ID     Income ($)       Color
123    30,000           3
124    40,000           1
125    65,000           4
126    51,000           1
127    5,000,000,000    2
Table 3 represents a neighborhood of five homes with their corresponding incomes and exterior colors. ID is an identifier variable; it distinguishes one row (a record) from another, because records can share the same values yet be distinct (e.g., two men can both be 25 years old and weigh 150 pounds). For the Color variable, 1 = “Red”, 2 = “Blue”, 3 = “Green” and 4 = “Yellow”. The colors are coded numerically because rigorous statistical analysis (e.g., correlation, regression) cannot be performed on nonnumeric values. To clarify, categorical means that the values are mutually exclusive labels (i.e., none of the colors overlap with each other), so the codes themselves carry no order or magnitude.
In addition to the mean, four other summary statistics are shown in Table 4 for the Income variable:
Table 4 (Summary Statistics for Income of Random Neighborhood)
Mean             Median    Minimum    Maximum          Standard Deviation
1,000,037,200    51,000    30,000     5,000,000,000    1,999,981,400
Since the listed Income values are unique (i.e., don’t repeat), the five values each have a frequency of 1.
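The figures in Table 4 can be checked with a few lines of Python. This is only a sketch: it uses the standard library's statistics module, the variable names are mine, and it assumes the standard deviation is the population form (dividing by n), since that is the choice that reproduces 1,999,981,400.

```python
import statistics
from collections import Counter

# Income values from Table 3, in ID order (123 through 127).
incomes = [30_000, 40_000, 65_000, 51_000, 5_000_000_000]

# Mean as defined earlier: pool every value, then share the total equally.
total = sum(incomes)
mean_income = total / len(incomes)

median_income = statistics.median(incomes)
min_income = min(incomes)
max_income = max(incomes)

# Assumption: population standard deviation (divide by n, not n - 1).
std_income = statistics.pstdev(incomes)

# Frequency of each value; every income in this table appears once.
income_counts = Counter(incomes)

print(f"Mean:               {mean_income:,.0f}")
print(f"Median:             {median_income:,}")
print(f"Minimum:            {min_income:,}")
print(f"Maximum:            {max_income:,}")
print(f"Standard deviation: {std_income:,.0f}")
print(f"Frequencies:        {dict(income_counts)}")
```

Swapping statistics.pstdev for statistics.stdev (the sample form) gives a noticeably larger figure, so it is worth stating which one a report uses.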
Analyzing Table 4, we see a huge disparity between the mean (1,000,037,200) and the median (51,000). Conceptually, if extreme values (outliers) were not present, the mean and median would be much closer, if not the same. Two further signs point to outliers:
1) the disparity between the minimum (30,000) and maximum (5,000,000,000), and
2) the standard deviation (1,999,981,400), which is larger than the mean (1,000,037,200).
Before we can conclude that the $5B value is an outlier, we need to rule out a coding error (e.g., the value should be $500K instead of $5B). For example, if the income values originated from surveys, we would check those documents to verify that they match the values in the database. Assuming the $5B amount is legitimate, I would use my summary statistics as a launchpad for a more in-depth analysis of questions about income inequality as well as favorable zoning and tax abatement policies.
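To make the warning signs above concrete, here is a rough sketch in Python. The helper name and the idea of rerunning the numbers with the hypothetical $500K correction are my own illustration, not a formal outlier test.

```python
import statistics

# Income values from Table 3.
incomes = [30_000, 40_000, 65_000, 51_000, 5_000_000_000]


def describe(values):
    """Return the figures compared in the text (population std assumed)."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.pstdev(values),
    }


as_recorded = describe(incomes)

# Hypothetical correction: suppose the $5B entry was a typo for $500K.
corrected_incomes = [500_000 if v == 5_000_000_000 else v for v in incomes]
corrected = describe(corrected_incomes)

# Warning sign: the standard deviation is larger than the mean.
std_exceeds_mean = as_recorded["std"] > as_recorded["mean"]

# Warning sign: the mean is many times larger than the median.
mean_to_median = as_recorded["mean"] / as_recorded["median"]

print(f"As recorded: mean {as_recorded['mean']:,.0f}, median {as_recorded['median']:,}")
print(f"Corrected:   mean {corrected['mean']:,.0f}, median {corrected['median']:,}")
print(f"Std exceeds mean: {std_exceeds_mean}")
print(f"Mean is {mean_to_median:,.0f} times the median")
```

With the hypothetical correction the mean drops to 137,200 while the median stays at 51,000, which shows how strongly a single value steers the mean.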
Common Mistakes
In Table 4, you’ll notice that I didn’t produce metrics for the ID and Color variables. Given my earlier description of the mean and its relationship to the median, minimum, maximum and standard deviation, summary statistics on identifiers and categorical variables offer no valuable insight. To prove my point, Table 5 shows the summary statistics for ID and Color:
Table 5 (Summary Statistics of ID and Color)
Variable    Mean    Median    Minimum    Maximum    Standard Deviation
ID          125     125       123        127        1
Color       2       2         1          4          1
The conclusions from Table 5 would be nonsensical, to say the least. For ID, the mean is 125, which suggests pooling the ID values and handing the value 125 to each of the 5 participants. This makes no sense because, as stated earlier, ID values must be distinct. The minimum and maximum are equally unhelpful, because identifiers are arbitrary labels with no meaningful lower or upper bound. For Color, the mean is 2, which translates to “Blue” being the result of mixing “Red,” “Blue,” “Green,” and “Yellow” together. Despite the ridiculous nature of this approach, some analysts actually do this and believe summary statistics on identifiers and categorical variables offer insight. Hint... they don’t!
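A quick sketch in Python shows why these numbers are hollow. It reproduces Table 5, assuming population standard deviation and rounding to whole numbers (which matches the table), then recodes the colors in a different but equally valid way of my own choosing to show that the "average color" depends entirely on the labels.

```python
import statistics

# ID and Color columns from Table 3.
ids = [123, 124, 125, 126, 127]
colors = [3, 1, 4, 1, 2]  # 1=Red, 2=Blue, 3=Green, 4=Yellow


def summarize(values):
    """Mean, median, min, max and population std; mean and std rounded."""
    return (
        round(statistics.mean(values)),
        statistics.median(values),
        min(values),
        max(values),
        round(statistics.pstdev(values)),
    )


id_summary = summarize(ids)
color_summary = summarize(colors)

# Hypothetical alternative coding: swap the codes for Red and Yellow.
recode = {1: 4, 2: 2, 3: 3, 4: 1}
recoded_colors = [recode[c] for c in colors]

original_mean = statistics.mean(colors)
recoded_mean = statistics.mean(recoded_colors)

print("ID summary:   ", id_summary)
print("Color summary:", color_summary)
print("Mean color code, original coding:", original_mean)
print("Mean color code, swapped coding: ", recoded_mean)
```

The same five houses average 2.2 under one coding and 2.8 under the other, even though nothing about the neighborhood changed.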
Solutions
The problem with a “cookie cutter” approach is that data analysts apply the same solution to variables that do not share the same characteristics. As mentioned in the beginning, data analysis is storytelling. A well-crafted story demonstrates discretion and thoughtfulness in its narrative. For data analysts, this translates to applying the appropriate metric to a given variable so a rational explanation can be presented. There’s nothing thoughtful or discerning about presenting the average of colors or the standard deviation of ID numbers.
Let’s go back to the example that I provided earlier. Although computing the mean, median, minimum, maximum, and standard deviation for identifiers and categorical variables offers no viable information, computing the frequency of values DOES provide insight. It identifies duplicate values and reveals the popularity of values based on the number of occurrences.
1) Duplicate values
For whatever reason, there might be unnecessary replication within the database. Computing the frequency of values allows the user to identify these duplicate records.
2) Popularity based on number of occurrences
Table 3 shows two instances of Red and one instance each of the remaining three colors. As a home developer, I might use this insight to build more red-colored houses.
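Both uses of frequency can be sketched with Python's collections.Counter. The records mirror Table 3, except that I have appended a hypothetical repeated row for ID 124 so the duplicate check has something to find; that extra row is not part of the article's data.

```python
from collections import Counter

COLOR_LABELS = {1: "Red", 2: "Blue", 3: "Green", 4: "Yellow"}

# (ID, Income, Color code) rows from Table 3.
records = [
    (123, 30_000, 3),
    (124, 40_000, 1),
    (125, 65_000, 4),
    (126, 51_000, 1),
    (127, 5_000_000_000, 2),
]

# 1) Duplicate values: count how often each ID occurs.
# Hypothetical: ID 124 is entered a second time to simulate replication.
with_duplicate = records + [(124, 40_000, 1)]
id_counts = Counter(row[0] for row in with_duplicate)
duplicate_ids = [rec_id for rec_id, n in id_counts.items() if n > 1]

# 2) Popularity: count occurrences of each color in the original rows.
color_counts = Counter(COLOR_LABELS[row[2]] for row in records)
most_popular, occurrences = color_counts.most_common(1)[0]

print("Duplicate IDs:", duplicate_ids)
print("Color counts:", dict(color_counts))
print("Most popular color:", most_popular, f"({occurrences} homes)")
```

Decoding the color codes to labels before counting keeps the output readable and avoids any temptation to treat the codes as quantities.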
If performed properly, data analysis can help explain complex trends to the unfamiliar. I hope my walkthrough of summary statistics revealed some silly mistakes and ways to resolve them. By avoiding these pitfalls, you’ll craft a meaningful story for your audience.