bankingciooutlook

Data Analysis in Storytelling: Common Mistakes

Albert Chin, Head of Model Risk Management, Signature Bank

Analyzing data is a unique art form. It combines the discovery of patterns with a clear explanation of those patterns to the uninitiated. Among other things, a brilliant storyteller of data can trace the origin of a pandemic, locate a wanted fugitive, and determine which items in a massive warehouse need to be replenished without taking physical inventory. Despite these accomplishments, some data analysts provide misleading or useless information. Statisticians call this noise, but in layman’s terms, it’s gibberish. The purpose of this article is to walk you through a simple example (summary statistics), highlight some common mistakes that arise from this approach, and then offer some solutions. My goal is to improve your abilities as a storyteller of data.

Summary Statistics

Summary statistics provide the audience with simple yet important metrics about a sample or population. These metrics include the mean (average), median (midpoint or 50th percentile), minimum and maximum values, standard deviation (dispersion from the average), and frequency of values. The most important of these is the mean, so I’ll clarify the concept of average. For a particular variable, you add up the values of all participants, then distribute that total evenly among those participants. Let’s consider the Income variable, reflected in Table 1:

Table 1 (Income ($))

| Income ($) |
|---|
| 30,000 |
| 40,000 |
| 65,000 |
| 51,000 |
| 5,000,000,000 |

 

 

Applying the earlier definition produces Table 2, which shows the $5,000,186,000 total distributed evenly among five participants, for a mean of $1,000,037,200.

Table 2 (Calculating the Mean of Income ($))

| # of Participants | Total | Mean |
|---|---|---|
| 5 | 5,000,186,000 | 1,000,037,200 |
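
The arithmetic behind Table 2 can be sketched in a few lines of Python. This is only an illustration of the "pool, then share equally" definition given above; the variable names are my own and are not part of the original analysis.

```python
# Incomes from Table 1, in dollars.
incomes = [30_000, 40_000, 65_000, 51_000, 5_000_000_000]

total = sum(incomes)          # pool every participant's income
mean = total / len(incomes)   # share the pool equally among participants

print(f"Participants: {len(incomes)}")
print(f"Total: {total:,}")    # Total: 5,000,186,000
print(f"Mean: {mean:,.0f}")   # Mean: 1,000,037,200
```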

 

 

 

In addition to Income ($), let’s include two other variables, ID and Color, in Table 3:

Table 3 (Random Neighborhood)

| ID | Income ($) | Color |
|---|---|---|
| 123 | 30,000 | 3 |
| 124 | 40,000 | 1 |
| 125 | 65,000 | 4 |
| 126 | 51,000 | 1 |
| 127 | 5,000,000,000 | 2 |

 

Table 3 represents a neighborhood of 5 homes with their corresponding incomes and exterior colors. ID is an identifier variable; it distinguishes one row (a record) from another, because records can have the same values yet be distinct (e.g., two men can both be 25 years old and weigh 150 pounds). For the Color variable, 1 = “Red”, 2 = “Blue”, 3 = “Green” and 4 = “Yellow”. Categorical variables are coded numerically because rigorous statistical analysis (e.g., correlation, regression) cannot be performed on non-numeric variables. To clarify, categorical means that the values are mutually exclusive (i.e., none of the colors overlap with each other).

In addition to the mean, four other summary statistics are shown in Table 4 for the Income variable:

Table 4 (Summary Statistics for Income of Random Neighborhood)

| Mean | Median | Minimum | Maximum | Standard Deviation |
|---|---|---|---|---|
| 1,000,037,200 | 51,000 | 30,000 | 5,000,000,000 | 1,999,981,400 |

 

Since the listed Income values are unique (i.e., don’t repeat), the five values each have a frequency of 1.
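
As a rough sketch, the figures in Table 4 can be reproduced with Python's standard `statistics` module. One assumption worth stating: the standard deviation in Table 4 matches the population formula (`statistics.pstdev`), so that is what is used here; the sample formula (`statistics.stdev`) would give a larger value, roughly 2.24 billion.

```python
import statistics

# Incomes from Table 3, in dollars.
incomes = [30_000, 40_000, 65_000, 51_000, 5_000_000_000]

summary = {
    "mean": statistics.mean(incomes),
    "median": statistics.median(incomes),
    "minimum": min(incomes),
    "maximum": max(incomes),
    # Population standard deviation (divides by n, not n - 1).
    "std_dev": statistics.pstdev(incomes),
}

for name, value in summary.items():
    print(f"{name}: {value:,.0f}")
```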

Analyzing Table 4, we see a huge disparity between the mean (1,000,037,200) and the median (51,000). Conceptually, if extreme values (outliers) were not present, the mean and median would be much closer, if not the same. Two further signs suggest that outliers are present:

1) the disparity between the minimum (30,000) and maximum (5,000,000,000), and

2) the standard deviation (1,999,981,400), which is larger than the mean (1,000,037,200).
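
The warning signs listed above can be checked mechanically. The sketch below simply restates them as comparisons; they are informal cues from this example rather than a formal outlier test, and the variable names are illustrative.

```python
import statistics

# Incomes from Table 3, in dollars.
incomes = [30_000, 40_000, 65_000, 51_000, 5_000_000_000]

mean = statistics.mean(incomes)
median = statistics.median(incomes)
std_dev = statistics.pstdev(incomes)  # population formula, as in Table 4

# Mean versus median: a ratio near 1 suggests no extreme values.
mean_to_median = mean / median

# Sign 1: the spread between the largest and smallest value.
max_to_min = max(incomes) / min(incomes)

# Sign 2: the standard deviation exceeds the mean.
std_exceeds_mean = std_dev > mean

print(f"Mean is {mean_to_median:,.0f}x the median")
print(f"Maximum is {max_to_min:,.0f}x the minimum")
print(f"Standard deviation exceeds mean: {std_exceeds_mean}")
```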

Before we can conclude that the $5B value is an outlier, we need to rule out a coding error (e.g., it should be $500K instead of $5B). For example, if income values originated from surveys, we would check those documents to verify that the income values match those in the database. Assuming that the $5B amount is legitimate, I would use my summary statistics analysis as a launchpad for a more in-depth analysis that answers questions about income inequality as well as favorable zoning and tax abatement policies.

Common Mistakes

In Table 4, you’ll notice that I didn’t produce metrics for the ID and Color variables. Based on my earlier description of the mean as well as its relationship to the median, minimum, maximum and standard deviation, summary statistics on identifiers and categorical variables offer no valuable insight. To prove my point, Table 5 shows the summary statistics for ID and Color:

Table 5 (Summary Statistics of ID and Color)

| Variable | Mean | Median | Minimum | Maximum | Standard Deviation |
|---|---|---|---|---|---|
| ID | 125 | 125 | 123 | 127 | 1 |
| Color | 2 | 2 | 1 | 4 | 1 |

 

The conclusions from Table 5 would be nonsensical, to say the least. For ID, the mean is 125, which suggests that this value is distributed evenly among all 5 participants. This makes no sense because, as stated earlier, all ID values are distinct. It also means that reporting minimum and maximum values provides no benefit, because there is no limit to distinct values in either direction. For Color, the mean is 2 (2.2 before rounding), which translates to “Blue” being the final result of mixing “Red,” “Blue,” “Green,” and “Yellow” together. Despite the ridiculous nature of this approach, some analysts actually do this and believe summary statistics on identifiers and categorical variables offer insight. Hint... they don’t!
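
To see how easily software produces these numbers, here is a small sketch that computes the Table 5 metrics for ID and Color. The code runs without complaint, which is exactly the trap: nothing in it knows that the codes are labels rather than quantities. The standard deviation again uses the population formula, an assumption that agrees with the rounded values in Table 5.

```python
import statistics

# ID and Color columns from Table 3, in row order.
ids = [123, 124, 125, 126, 127]
color_codes = [3, 1, 4, 1, 2]
color_names = {1: "Red", 2: "Blue", 3: "Green", 4: "Yellow"}

for label, values in (("ID", ids), ("Color", color_codes)):
    print(
        label,
        statistics.mean(values),
        statistics.median(values),
        min(values),
        max(values),
        round(statistics.pstdev(values), 2),
    )

# The "average color": 2.2 rounds to code 2, which the codebook calls Blue.
average_color = color_names[round(statistics.mean(color_codes))]
print("Average color:", average_color)
```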

Solutions

The problem with a “cookie cutter” approach is that data analysts are attempting to apply the same solution to variables that do not share the same characteristics. As mentioned in the beginning, data analysis is storytelling. A well-crafted story demonstrates discretion and thoughtfulness in its narrative. For data analysts, this translates to applying the appropriate metric to a given variable so a rational explanation can be presented. There’s nothing thoughtful or discerning about presenting the average of colors or the standard deviation of ID numbers.

Let’s go back to the example that I provided earlier. Although computing the mean, median, minimum, maximum, and standard deviation of identifiers and categorical variables offers no useful information, computing the frequency of values DOES provide insight. It identifies duplicate values and reveals the popularity of values based on the number of occurrences.

1) Identifying duplicate values

For whatever reason, there might be unnecessary replication within the database. Running a frequency count allows the user to identify these duplicate records.

2) Popularity based on number of occurrences

Table 3 shows that there are two instances of Red with one instance each for the remaining three colors. As a home developer, I might use this insight to build more red-colored houses.
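
Both uses of frequency can be sketched with `collections.Counter` from the Python standard library. The colors come from Table 3. Table 3 itself contains no duplicate IDs, so the repeated ID below is invented purely for illustration, as are the variable names.

```python
from collections import Counter

# Codebook for the Color variable, as given earlier.
color_names = {1: "Red", 2: "Blue", 3: "Green", 4: "Yellow"}

# Color codes from Table 3, in row order (IDs 123 to 127).
color_codes = [3, 1, 4, 1, 2]

# Hypothetical ID column: Table 3's IDs plus an invented repeat of 125,
# added only to show what a duplicate record looks like in a count.
ids_with_repeat = [123, 124, 125, 125, 126, 127]

# 1) Duplicates: any identifier that occurs more than once.
id_counts = Counter(ids_with_repeat)
duplicate_ids = {value: n for value, n in id_counts.items() if n > 1}
print("Duplicate IDs:", duplicate_ids)  # Duplicate IDs: {125: 2}

# 2) Popularity: count occurrences of each color label.
color_counts = Counter(color_names[code] for code in color_codes)
for color, n in color_counts.most_common():
    print(f"{color}: {n}")
```

Counting labels instead of averaging codes keeps the result interpretable: the output names a color and says how often it appears.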

If performed properly, data analysis can help explain complex trends to the unfamiliar. I hope my walkthrough of summary statistics revealed some silly mistakes and ways to resolve them. By avoiding these pitfalls, you’ll craft a meaningful story for your audience.
