Sunday, 26 June 2016 17:41

Understanding Bayesian A/B testing (using baseball statistics)

Written by
Rate this item
(1 Vote)

Well, Mike Piazza has a slightly higher career batting average (2127 hits / 6911 at-bats = 0.308) than Hank Aaron (3771 hits / 12364 at-bats = 0.305). But can we say with confidence that his skill is actually higher, or is it possible he just got lucky a bit more often?

In this series of posts about an empirical Bayesian approach to batting statistics, we’ve been estimating batting averages by modeling them as a binomial distribution with a beta prior. But we’ve been looking at a single batter at a time. What if we want to compare two batters, give a probability that one is better than the other, and estimate by how much?

This is a topic rather relevant to my own work and to the data science field, because understanding the difference between two proportions is important in A/B testing. One of the most common examples of A/B testing is comparing clickthrough rates (“out of X impressions, there have been Y clicks”)- which on the surface is similar to our batting average estimation problem (“out of X at-bats, there have been Y hits””).1

Here, we’re going to look at an empirical Bayesian approach to comparing two batters.2 We’ll define the problem in terms of the difference between each batter’s posterior distribution, and look at four mathematical and computational strategies we can use to resolve this question. While we’re focusing on baseball here, remember that similar strategies apply to A/B testing, and indeed to many Bayesian models.



I’ll start with some code you can use to catch up if you want to follow along in R. If you want to understand what the code does, check out the previous posts. (All the code in this post, including for the figures where code isn’t shown, can be found here).


# Grab career batting average of non-pitchers
# (allow players that have pitched <= 3 games, like Ty Cobb)
pitchers <- Pitching %>%
  group_by(playerID) %>%
  summarize(gamesPitched = sum(G)) %>%
  filter(gamesPitched > 3)

career <- Batting %>%
  filter(AB > 0) %>%
  anti_join(pitchers, by = "playerID") %>%
  group_by(playerID) %>%
  summarize(H = sum(H), AB = sum(AB)) %>%
  mutate(average = H / AB)

# Add player names
career <- Master %>%
  tbl_df() %>%
  select(playerID, nameFirst, nameLast) %>%
  unite(name, nameFirst, nameLast, sep = " ") %>%
  inner_join(career, by = "playerID")

# Estimate hyperparameters alpha0 and beta0 for empirical Bayes
career_filtered <- career %>% filter(AB >= 500)
m <- MASS::fitdistr(career_filtered$average, dbeta,
                    start = list(shape1 = 1, shape2 = 10))

alpha0 <- m$estimate[1]
beta0 <- m$estimate[2]

# For each player, update the beta prior based on the evidence
# to get posterior parameters alpha1 and beta1
career_eb <- career %>%
  mutate(eb_estimate = (H + alpha0) / (AB + alpha0 + beta0)) %>%
  mutate(alpha1 = H + alpha0,
         beta1 = AB - H + beta0) %>%

So let’s take a look at the two batters in question:

# while we're at it, save them as separate objects too for later:
aaron <- career_eb %>% filter(name == "Hank Aaron")
piazza <- career_eb %>% filter(name == "Mike Piazza")
two_players <- bind_rows(aaron, piazza)

## Source: local data frame [2 x 8]
##    playerID        name     H    AB average eb_estimate alpha1 beta1
##       (chr)       (chr) (int) (int)   (dbl)       (dbl)  (dbl) (dbl)
## 1 aaronha01  Hank Aaron  3771 12364   0.305       0.304   3847  8812
## 2 piazzmi01 Mike Piazza  2127  6911   0.308       0.306   2203  5003

We see that Piazza has a slightly higher average (), and a higher shrunken empirical bayes estimate (, where and are our priors).

But is Piazza’s true probability of getting a hit higher? Or is the difference due to chance? To answer, let’s consider the actual posterior distributions- the range of plausible values for their “true” batting averages after we’ve taken the evidence (their batting record) into account. Recall that these posterior distributions are modeled as beta distributions with the parameters .


two_players %>%
  inflate(x = seq(.28, .33, .00025)) %>%
  mutate(density = dbeta(x, alpha1, beta1)) %>%
  ggplot(aes(x, density, color = name)) +
  geom_line() +
  labs(x = "Batting average", color = "")


This posterior is a probabilistic representation of our uncertainty in each estimate. Thus, when asking the probability Piazza is better, we’re asking “if I drew a random draw from Piazza’s and a random draw from Aaron’s, what’s the probability Piazza is higher”?

Well, notice that those two distributions overlap a lot! There’s enough uncertainty in each of those estimates that Aaron could easily be better than Piazza.

Let’s throw another player in, retired Yankee Hideki Matsui:


Hideki Matsui is a fine batter (above average for major league baseball), but not up to the level of Aaron and Piazza: notice that his posterior distribution barely overlaps theirs. If we took a random draw from Matsui’s distribution and from Piazza’s, it’s very unlikely Matsui’s would be higher.

Posterior Probability

We may be interested in the probability that Piazza is better than Aaron within our model. We can already tell from the graph that it’s greater than 50%, but probably not much greater. How could we quantify it?

We’d need to know the probability one beta distribution is greater than another. This question is not trivial to answer, and I’m going to illustrate four routes that are common lines of attack in a Bayesian problem:

  • Simulation of posterior draws
  • Numerical integration
  • Closed-form solution
  • Closed-form approximation

Which of these approaches you choose depends on your particular problem, as well as your computational constraints. In many cases an exact closed-form solution may not be known or even exist. In some cases (such as running machine learning in production) you may be heavily constrained for time, while in others (such as drawing conclusions for a scientific paper) you care more about precision.

Simulation of posterior draws

If we don’t want to do any math today (I hear you), we could simply try simulation. We could use each player’s and parameters, draw a million items from each of them using rbeta, and compare the results:

piazza_simulation <- rbeta(1e6, piazza$alpha1, piazza$beta1)
aaron_simulation <- rbeta(1e6, aaron$alpha1, aaron$beta1)

sim <- mean(piazza_simulation > aaron_simulation)
## [1] 0.607

So about 60.7% probability Piazza is better than Aaron! An answer like this is often good enough, depending on your need for precision and the computational efficiency. You could turn up or down the number of draws depending on how much you value speed vs precision.

Notice we didn’t have to do any mathematical derivation or proofs. Even if we had a much more complicated model, the process for simulating from it would still have been pretty straightforward. This is one of the reasons Bayesian simulation approaches like MCMC have become popular: computational power has gotten very cheap, while doing math is as expensive as ever.


These two posteriors each have their own (independent) distribution, and together they form a joint distribution- that is, a density over particular pairs of and . That joint distribution could be imagined as a density cloud:


Here, we’re asking what fraction of the joint probability density lies below that black line, where Piazza’s average is greater than Aaron’s. Notice that more of it lies below than above: that’s confirming the posterior probability that Piazza is better is about 60%.

The way to calculate this quantitatively is numerical integration, which is how Chris Stucchio approaches the problem in this post and this Python script. A simple approach in R would be something like:

d <- .00002
limits <- seq(.29, .33, d)
sum(outer(limits, limits, function(x, y) {
  (x > y) *
    dbeta(x, piazza$alpha1, piazza$beta1) *
    dbeta(y, aaron$alpha1, aaron$beta1) *
    d ^ 2
## [1] 0.606

Like simulation, this is a bit on the “brute force” side. (And unlike simulation, the approach becomes intractable in problems that have many dimensions, as opposed to the two dimensions here).

Closed-form solution

You don’t need to be great at calculus to be a data scientist. But it’s useful to know how to find people that are great at calculus. When it comes to A/B testing, the person to find is often Evan Miller.

This post lays out a closed-form solution Miller derived for the probability a draw from one beta distribution is greater than a draw from another:

(Where is the beta function). If you’d like an intuition behind this formula… well, you’re on your own. But it’s pretty straightforward to implement in R (I’m borrowing notation from this post and calling it ):

h <- function(alpha_a, beta_a,
              alpha_b, beta_b) {
  j <-, round(alpha_b) - 1)
  log_vals <- (lbeta(alpha_a + j, beta_a + beta_b) - log(beta_b + j) -
               lbeta(1 + j, beta_b) - lbeta(alpha_a, beta_a))
  1 - sum(exp(log_vals))

h(piazza$alpha1, piazza$beta1,
  aaron$alpha1, aaron$beta1)
## [1] 0.608

Having an exact solution is pretty handy!3 So why did we even look at simulation/integration approaches? Well, the downsides are:

  • Not every problem has a solution like this. And even if it does, we may not know it. That’s why it’s worth knowing how to run a simulation. (If nothing else, they let us check our math!)
  • This solution is slow for large , and not straightforward to vectorize: notice that term that iterates from 0 to . If we run A/B tests with thousands of clicks, this step is going to constrain us (though it’s still usually faster than simulation or integration).

Closed-form approximation

As this report points out, there’s a much faster approximation we can use. Notice that when and are both fairly large, the beta starts looking a lot like a normal distribution, so much so that it can be closely approximated. In fact, if you draw the normal approximation to the two players we’ve been considering (shown as dashed line), they are visually indistinguishable:


And the probability one normal is greater than another is very easy to calculate- much easier than the beta!

h_approx <- function(alpha_a, beta_a,
                     alpha_b, beta_b) {
  u1 <- alpha_a / (alpha_a + beta_a)
  u2 <- alpha_b / (alpha_b + beta_b)
  var1 <- alpha_a * beta_a / ((alpha_a + beta_a) ^ 2 * (alpha_a + beta_a + 1))
  var2 <- alpha_b * beta_b / ((alpha_b + beta_b) ^ 2 * (alpha_b + beta_b + 1))
  pnorm(0, u2 - u1, sqrt(var1 + var2))

h_approx(piazza$alpha1, piazza$beta1, aaron$alpha1, aaron$beta1)
## [1] 0.607

This calculation is very fast, and (in R terms) it’s vectorizable.

The disadvantage is that for low or low , the normal approximation to the beta is going to fit rather poorly. While the simulation and integration approaches were inexact, this one will be systematically biased: in some cases it will always give too high an answer, and in some cases too low. But when we have priors and , as we do here, our parameters are never going to be low, so we’re safe using it.

Confidence and credible intervals

In classical (frequentist) statistics, you may have seen this kind of “compare two proportions” problem before, perhaps laid out as a “contingency table”:

Player Hits Misses
Hank Aaron 3771 8593
Mike Piazza 2127 4784

One of the most common classical ways to approach these contingency table problems is with Pearson’s chi-squared test, implemented in R as prop.test:

prop.test(two_players$H, two_players$AB)
## 	2-sample test for equality of proportions with continuity
## 	correction
## data:  two_players$H out of two_players$AB
## X-squared = 0.1, df = 1, p-value = 0.7
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.0165  0.0109
## sample estimates:
## prop 1 prop 2 
##  0.305  0.308

We see a non-significant p-value of .70. We won’t talk about p-values here (we talked a little about ways to translate between p-values and posterior probabilities in the last post). But we can agree it would have been strange if the p-value were significant, given that the posterior distributions overlapped so much.

Something else useful that prop.test gives you is a confidence interval for the difference between the two players. We learned in a previous post about credible intervals in terms of each player’s average. Now we’ll use empirical Bayes to compute the credible interval about the difference in these two players.

We could do this with simulation or integration, but let’s use our normal approximation approach (we’ll also compute our posterior probability while we’re at it):

credible_interval_approx <- function(a, b, c, d) {
  u1 <- a / (a + b)
  u2 <- c / (c + d)
  var1 <- a * b / ((a + b) ^ 2 * (a + b + 1))
  var2 <- c * d / ((c + d) ^ 2 * (c + d + 1))
  mu_diff <- u2 - u1
  sd_diff <- sqrt(var1 + var2)
  data_frame(posterior = pnorm(0, mu_diff, sd_diff),
             estimate = mu_diff,
             conf.low = qnorm(.025, mu_diff, sd_diff),
             conf.high = qnorm(.975, mu_diff, sd_diff))

credible_interval_approx(piazza$alpha1, piazza$beta1, aaron$alpha1, aaron$beta1)
## Source: local data frame [1 x 4]
##   posterior estimate conf.low conf.high
##       (dbl)    (dbl)    (dbl)     (dbl)
## 1     0.607 -0.00185  -0.0152    0.0115

It’s not particularly exciting for this Piazza/Aaron comparison (notice it’s very close to the confidence interval we calculated with prop.test). So let’s select 20 random players, and compare each of them to Mike Piazza. We’ll also calculate the confidence interval using prop.test, and compare them.


Notice the same pattern we saw in the credible intervals post. When we don’t have a lot of information about a player, their credible interval ends up smaller than their confidence interval, because we’re able to use the prior to adjust our expectations (Dad Lytle’s batting average may be lower than Mike Piazza’s, but we’re confident it’s not .25 lower). When we do have a lot of information, the credible intervals and confidence intervals converge almost perfectly.4

Thus, we can think of empirical Bayes A/B credible intervals as being a way to “shrink” frequentist confidence intervals, by sharing power across players.

Bayesian FDR control

Suppose we were a manager considering trading Piazza for another player (and suppose we could pick anyone from history). How many players in MLB history are we confident are better than Mike Piazza?

Well, we can compute the posterior probability and credible interval for each of them.

career_eb_vs_piazza <- bind_cols(
  credible_interval_approx(piazza$alpha1, piazza$beta1,
                           career_eb$alpha1, career_eb$beta1)) %>%
  select(name, posterior, conf.low, conf.high)

## Source: local data frame [10,388 x 4]
##                    name posterior conf.low conf.high
##                   (chr)     (dbl)    (dbl)     (dbl)
## 1               Ty Cobb  7.63e-17   0.0441    0.0716
## 2        Rogers Hornsby  2.81e-11   0.0345    0.0640
## 3  Shoeless Joe Jackson  8.42e-08   0.0279    0.0613
## 4           Ross Barnes  3.61e-05   0.0214    0.0633
## 5          Ed Delahanty  7.03e-07   0.0219    0.0518
## 6          Tris Speaker  1.57e-07   0.0225    0.0505
## 7          Ted Williams  1.45e-06   0.0206    0.0504
## 8        Billy Hamilton  6.91e-06   0.0190    0.0503
## 9          Levi Meyerle  4.01e-03   0.0087    0.0580
## 10       Harry Heilmann  7.15e-06   0.0180    0.0476
## ..                  ...       ...      ...       ...

While Piazza is an excellent batter, it looks like some do give him a run for his money. For instance, Ty Cobb has a batting average that’s about .04%-.07% better, with only a miniscule posterior probability that we’re wrong.

In order to get a set of players we’re confident are better (a set of candidates for trading), we can use the same approach to setting a false discovery rate (FDR) that we did in this post: taking the cumulative mean of the posterior error probability to compute the q-value.

career_eb_vs_piazza <- career_eb_vs_piazza %>%
  arrange(posterior) %>%
  mutate(qvalue = cummean(posterior))

Recall that the way we find a set of hypotheses with a given false discovery rate (FDR) is by filtering for a particular q-value:

better <- career_eb_vs_piazza %>%
  filter(qvalue < .05)

## Source: local data frame [65 x 5]
##                    name posterior conf.low conf.high   qvalue
##                   (chr)     (dbl)    (dbl)     (dbl)    (dbl)
## 1               Ty Cobb  7.63e-17   0.0441    0.0716 7.63e-17
## 2        Rogers Hornsby  2.81e-11   0.0345    0.0640 1.41e-11
## 3  Shoeless Joe Jackson  8.42e-08   0.0279    0.0613 2.81e-08
## 4          Tris Speaker  1.57e-07   0.0225    0.0505 6.03e-08
## 5          Ed Delahanty  7.03e-07   0.0219    0.0518 1.89e-07
## 6          Ted Williams  1.45e-06   0.0206    0.0504 3.99e-07
## 7         Willie Keeler  4.62e-06   0.0183    0.0473 1.00e-06
## 8        Billy Hamilton  6.91e-06   0.0190    0.0503 1.74e-06
## 9        Harry Heilmann  7.15e-06   0.0180    0.0476 2.34e-06
## 10           Lou Gehrig  1.43e-05   0.0167    0.0461 3.54e-06
## ..                  ...       ...      ...       ...      ...

This gives us 65 players we can say are better than Piazza, with an FDR of 5%. (That is, we expect we’re wrong on about 5% of this list of players).

What’s Next: Hierarchical modeling

We’re treating all baseball players as making up one homogeneous pool, whether they played in 1916 or 2016, and whether they played for the New York Yankees or the Chicago Cubs. This is mathematically convenient, but it’s ignoring a lot of information about players. One especially important piece of information it’s ignoring is how long someone played. Someone who’s been up to bat 5 or 6 times is generally not as good as someone with a 10-year career. This leads to a substantial bias where empirical Bayes tends to overestimate players with very few at-bats.

In the next post, we’ll talk about Bayesian hierarchical modeling, where rather than every player having the same prior, we allow the prior to depend on other known information. This will get us more accurate and reliable batting average estimates, while also extracting useful insights about factors that influence batters.

Read 4921 times


Latest Comments

K2 Content

  • A synergetic R-Shiny portal for modeling and tracking of COVID-19 data
    A synergetic R-Shiny portal for modeling and tracking of COVID-19 data

    Dr. Mahdi Salehi, an associate member of SDAT and assistant professor of statistics at the University of Neyshabur, introduced a useful online interactive dashboard that visualize and follows confirmed cases of COVID-19 in real-time. The dashboard was publicly made available on 6 April 2020 to illustrate the counts of confirmed cases, deaths, and recoveries of COVID-19 at the level of country or continent. This dashboard is intended as a user-friendly dashboard for researchers as well as the general public to track the COVID-19 pandemic, and is generated from trusted data sources and built-in open-source R software (Shiny in particular); ensuring a high sense of transparency and reproducibility.

    Access the shiny dashboard:

    Written on Friday, 08 January 2021 07:03 in SDAT News Read 3342 times Read more...
  • First Event on Play with Real Data
    First Event on Play with Real Data

    Scientific Data Analysis Team (SDAT) intends to organize the first event on the value of data to provide data holders and data analyzers with an opportunity to extract maximum value from their data. This event is organized by International Statistical Institute (ISI) and SDAT hosted at the Bu-Ali Sina University, Hamedan, Iran. 

    Organizers and the data providers will provide more information about the goals of the initial ideas, team arrangement, competition processes, and the benefits of attending this event on a webinar hosted at the ISI Gotowebianr system. Everyone invites to participate in this webinar for free, but it is needed to register at the webinar system by 30 December 2020. 

    Event Time: 31 December 2020 - 13:30-16:30 Central European Time (CET)

    Register for the webinar: 

    More details about this event: 

    Aims and outputs:

    • Playing with real data by explorative and predictive data analysis techniques 
    • A platform between a limited number of data providers and hundreds to thousands of data scientist Teams
    • Improving creativity and scientific reasoning of data scientist and statisticians 
    • Finding the possible “bugs” with the current data analysis methods and new developments
    • Learn different views about a dataset.


    The best-report awards consist of a cash prize:
    $400 for first place,
    $200 for second place, and
    $100 for third place.

    Important Dates: 

    Event Webinar: 31 December 2020 - 13:30-16:30 Central European Time (CET). 
    Team Arrangement: 01 Jan. 2021 - 07 Jan. 2021
    Competition: 10 Jan. 2021 - 15 Jan. 2021
    First Assessment Result: 25 Jan. 2021
    Selected Teams Webinar: 30 Jan. 2021
    Award Ceremony: 31 Jan. 2021

    Please share this event with your colleagues, students, and data analyzers. 

    Written on Wednesday, 23 December 2020 13:45 in SDAT News Read 3458 times Read more...
  • Development of Neuroimaging Symposium and Advanced fMRI Data Analysis
    Development of Neuroimaging Symposium and Advanced fMRI Data Analysis

    The Developement of Structural and Functional Neuroimaging Symposium hold at the School of Sciences, Shiraz University in April 17 2019.  The Advanced fMRI Data Analysis Workshop also held in April 18-19 2019. For more information please visit: 

    Written on Sunday, 21 April 2019 12:18 in SDAT News Read 3650 times Read more...
  • Releasing Rfssa Package by SDAT Members at CRAN
    Releasing Rfssa Package by SDAT Members at CRAN

    The Rfssa package is available at CRAN. Dr. Hossein Haghbin and Dr. Seyed Morteza Najibi (SDAT Members) have published this package to provide the collections of necessary functions to implement Functional Singular Spectrum Analysis (FSSA) for analysing Functional Time Series (FTS). FSSA is a novel non-parametric method to perform decomposition and reconstruction of FTS. For more information please visit github homepage of package. 

    Written on Sunday, 03 March 2019 21:03 in SDAT News Read 2727 times
  • Data Science Symposium
    Data Science Symposium

    Symposium of Data Science Developement and its job opportunities hold at the Faculty of Science, Shiraz University in Feb 20 2019. For more information please visit: 

    Written on Friday, 01 February 2019 00:13 in SDAT News Read 3981 times Read more...

About Us

SDAT is an abbreviation for Scientific Data Analysis Team. It consists of groups who are specialists in various fields of data sciences including Statistical Analytics, Business Analytics, Big Data Analytics and Health Analytics. 

Get In Touch

Address:  No.15 13th West Street, North Sarrafan, Apt. No. 1 Saadat Abad- Tehran

 Phone: +98-910-199-2800


Login Form