As a customer service team leader or quality manager, you work hard to gather data and review your agents’ performance using your rigorous QA scorecard. You’re not superhuman, so you try your best to evaluate 1–3% of all conversations. This performance data informs employee feedback and impacts decisions around compensation and promotions, so it’s imperative that it is objective and representative.
In the following sections, we’ll share specific examples of how data biases can lead to misleading performance reviews.
You’re preparing for your coaching sessions with your new agents Carla and Karl. As you review the scorecard data, you see the following performance trends for core customer service skills. Based on this data, who is performing better?
It looks like Carla has had a better start. She is showing a significant improvement from Week 1 to Week 2. Karl’s performance looks worse, and you might be reconsidering your hiring decision. During the feedback session, Karl insists that his performance is much better and that an outlier ticket is the reason for the poor score.
Let’s take a more in-depth look at the underlying data. If you aggregate the data across both weeks, Carla demonstrated good customer service skills in 10 of 16 evaluations, or 62.5% of the time, while Karl performed well on 11 of 15 tickets, or 73.3% of the time.
This reversal of trends is an example of Simpson’s Paradox, a phenomenon in probability and statistics in which a trend appears in different groups of data but disappears or reverses when these groups are combined. In this case, it now looks like Karl is the better agent.
There are other factors at play when you sample and identify trends on a small subset of data. At a sampling rate of 1%, ten reviews have to stand in for 1,000 conversations, and small sample sizes lead to higher variability. Also, as Karl protested, there can be sampling bias when the tickets selected for review do not represent the true population. Even random sampling does not guarantee that a small sample is representative of the population.
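To see how such a reversal can happen, here is a minimal sketch in Python. The weekly counts are hypothetical: they were chosen only so that they add up to the totals quoted above (10 of 16 for Carla, 11 of 15 for Karl) and are not the actual weekly scorecard data.

```python
# Hypothetical weekly counts as (good evaluations, total evaluations).
# They are illustrative only: picked so they sum to the article's totals
# (Carla 10/16, Karl 11/15), NOT taken from the real scorecard.
weekly = {
    "Carla": {"Week 1": (6, 12), "Week 2": (4, 4)},
    "Karl": {"Week 1": (1, 3), "Week 2": (10, 12)},
}


def rate(good, total):
    """Share of evaluations in which the agent performed well."""
    return good / total


def overall(agent):
    """Rate across all weeks, computed from the pooled counts."""
    good = sum(g for g, _ in weekly[agent].values())
    total = sum(t for _, t in weekly[agent].values())
    return rate(good, total)


for week in ("Week 1", "Week 2"):
    carla = rate(*weekly["Carla"][week])
    karl = rate(*weekly["Karl"][week])
    print(f"{week}: Carla {carla:.1%} vs Karl {karl:.1%}")

print(f"Overall: Carla {overall('Carla'):.1%} vs Karl {overall('Karl'):.1%}")
```

With these made-up counts, Carla scores higher than Karl in each individual week, yet Karl scores higher once the weeks are pooled. The reversal comes from the uneven number of reviews per week: most of Karl's reviews fall in the week where both agents scored well, while most of Carla's fall in the weaker week.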
The law of large numbers states that as you perform more trials or evaluations, the observed average approaches the expected, or true, value. With trends observed on only 1% of the data, the reported behavior can be very far from actual performance. To get a more accurate assessment, coverage needs to move toward 100% of conversations. Historically, however, the cost and effort of performing QA scale linearly with the number of tickets, so increasing the coverage rate is cost-prohibitive.
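The effect of sample size can be illustrated with a small simulation. This is a rough sketch under simplifying assumptions: the agent's "true" rate of 73% is a hypothetical value, and each reviewed conversation is treated as an independent draw with that fixed probability of being good. The function name and parameters are made up for this example.

```python
import random
import statistics


def estimate_spread(true_rate, sample_size, trials, rng):
    """Standard deviation of the observed rate across repeated samples."""
    estimates = []
    for _ in range(trials):
        # Count how many of the sampled conversations come out "good".
        good = sum(rng.random() < true_rate for _ in range(sample_size))
        estimates.append(good / sample_size)
    return statistics.pstdev(estimates)


rng = random.Random(42)  # fixed seed so the run is repeatable
TRUE_RATE = 0.73  # hypothetical true performance, not a measured value

spreads = {
    n: estimate_spread(TRUE_RATE, n, trials=500, rng=rng)
    for n in (10, 100, 1000)
}

for n, spread in spreads.items():
    print(f"{n:>5} reviews: spread of the estimated rate is about {spread:.1%}")
```

Under these assumptions, the estimate based on ten reviews swings widely from one sample to the next, and the spread shrinks as the number of reviewed conversations grows, which is the behavior the law of large numbers describes.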
With machine learning, powerful models can do the heavy lifting, providing scalable, timely review of tickets and enabling teams to reach 100% coverage. By doing so, AI can eliminate the sampling biases inherent in current QA processes and give agents more accurate, valuable insights.