Blog

Thursday, 09 June 2016 00:02

Rootograms, A new way to assess count models

Written by
Rate this item
(1 Vote)
Rootogram

Assessing the fit of a count regression model is not necessarily a straightforward enterprise; often we just look at residuals, which invariably contain patterns of some form due to the discrete nature of the observations, or we plot observed versus fitted values as a scatter plot. Recently, while perusing the latest statistics offerings on ArXiv I came across Kleiber and Zeileis (2016) who propose the rootogram as an improved approach to the assessment of fit of a count regression model. The paper is illustrated using R and the authors’ countreg package (currently on R-Forge only). Here, I thought I’d take a quick look at the rootogram with some simulated species abundance data.

Start by simulating some data to work with. Here I use my coenocliner package, and simulate three data sets, each of which uses the same environmental gradient, but with counts drawn from the following distributions

1. Poisson
2. Negative binomial
3. Zero-inflated negative binomial

To follow along here you’ll need the latest version of coenocliner from CRAN (>= 0.2-2) as a bit of a bug entered into my code when changing between parameterizations of the negative binomial.

Now we can simulate counts for the 100 locations along the gradient for each of the three count models

and combine them into a data frame with the gradient locations

To each of these I’m going to fit a Poisson GLM to show how rootograms can facilitate model evaluation where we know what the underlying model is so we can see what might happen when the wrong model, in this case a Poisson GLM, is fitted to data

In each case, a Poisson GLM was fitted even though we knew that for yNegBin and yZINB that the data generating process was not the Poisson.1

Next, generate rootograms for each of these models. I start by loading the countreg package as well as ggplot2, as I’ll plot the rootograms using the latter rather than base graphics.

Load the countreg package and ggplot. If you don’t have countreg installed, install it from R Forge using install.packages("countreg", repos="http://R-Forge.R-project.org")

Rootograms are calculated using the rootogram() function. You can provide the observed and expected (given the model) counts as arguments to rootogram() or, most usefully for our purposes, a fitted count model object from which the relevant values will be extracted. rootogram() knows about glm, gam, gamlss, hurdle, and zeroinfl objects at the time of writing.

Three different kinds of rootograms are discussed in the paper

1. Standing,
2. Hanging, and
3. Suspended.

Kleiber and Zeileis (2016) recommend hanging or suspended rootograms, for reasons I’ll mention shortly. Which type of rootogram is produced is controlled via argument style. The final option I use below is plot = FALSE, which suppresses plotting of the rootogram as I want to do that later using ggplot.

Generate the three rootograms

and gather them into an object for plotting — notice I’m using the autoplot() method to generate ggplot2 plot objects, and adjusting the limits to make the plots comparable. The resulting figure is shown below the code

Looking first at panel a we see the main features of the rootogram:

• expected counts, given the model, are shown by the thick red line,
• observed counts are shown as bars, which in a hanging rootogram are show hanging from the red line of expected counts,
• on the x-axis we have the count bin, 0 count, 1 count, 2 count, etc,
• on the y-axis we have the square root of the observed or expected count — the square root transformation allows for departures from expectations to be seen even at small frequencies
• A reference line is drawn at a height of 0

Because this is a hanging rootogram, we can think of the rootogram as relating to the fitted counts — if a bar doesn’t reach the zero line then the model over predicts a particular count bin, and if the bar exceeds the zero line it under predicts.

For the Poisson GLM fitted to counts generated from a Poisson distribution (panel a) we see general good agreement between the expected and observed counts, with a small amount of under prediction of some counts between 10–20. For the Poisson GLM fitted to the data generated from a negative binomial distribution (panel b) we see a much poorer fit — the zero count is under predicted whilst some low counts are over predicted, and a large number of count bins are under predicted between 4 and 10 counts. Focusing on the bottom of the bars we see an undulating pattern with runs either above or below the zero reference line, highlighting a general lack of fit in the model.

The fit of the Poisson GLM to data generated using a ZINB also shows considerable model lack of fit; strong under prediction of the zero bin and over prediction of the 1 count bin, with perhaps some general over prediction across most bins.

It is useful to compare rootograms showing the fits for incorrect and correct models side by side. To that end next I fit a negative binomial GLM and a ZINB using the glm.nb() function from package MASS and the zeroinfl() function from package countreg respectively, and create the relevant rootograms

First, we look at the negative binomial data and compare rootograms of the Poisson and negative binomial model fits

The rootogram for the negative binomial GLM fit (panel b) shows much better agreement with the data than that of the Poisson fit (panel a). Departures from expected counts are much smaller and the zero-count bin is much better fitted. Some small deviations from the observed data remain but that is to be expected.

Next we compare rootograms for the fits of the Poisson GLM and ZINB model

The rootogram for the ZINB model (panel b) shows better agreement with the zero-count bin than the Poisson model (panel a), though fits for the remaining count bins are similar to one another in both models. In particular, the ZINB model is still over predicting single counts.

Suspended rootograms are also recommended by Kleiber and Zeileis (2016). These rootograms show the difference between observed and expected counts, with bars hanging from the zero-line rather than the expected count line. Therefore we can think of this rootogram as showing information about the model residuals rather than the fitted values of the hanging rootogram. A suspended rootogram is produced using style = "suspended" and an example, for the ZINB model, is shown below

Standing histograms are not recommended by Kleiber and Zeileis (2016) as they simply show the expected and observed counts and the user then has to compare the height of each bar with the expected curve for each bin. By tying the bars to the expected curve or zero reference line in hanging or suspended rootograms, the assessment of fit is made by comparison of deviations from the reference line rather than bin-by-bin comparison of observed and expected counts. A standing rootogram, for completeness, is shown below for the ZINB model

A neat feature of the countreg package is that rootograms can be combined using the c() or cbind() methods, which makes plotting multiple rootograms much more simple than I showed above. For example, to compare the Poisson and negative binomial model fits to the negative binomial counts one could have used

So, there we go; these are rootograms and they seem like a pretty useful tool for assessing fits of count models. I really recommend having a look at Kleiber and Zeileis (2016) as it contains much more discussion and illustration of the proposed rootograms than I could possibly include here. They also have a nice ecological example of data from an investigation into horseshoe crab mating plus two other examples. Their paper will shortly appear in the journal The American Statistician, although at the time of writing I don’t have citation details for that version of the paper.

References

Kleiber, Christian, and Achim Zeileis. 2016. “Visualizing Count Data Regressions Using Rootograms,” 4~may.

K2 Content

• A synergetic R-Shiny portal for modeling and tracking of COVID-19 data

Dr. Mahdi Salehi, an associate member of SDAT and assistant professor of statistics at the University of Neyshabur, introduced a useful online interactive dashboard that visualize and follows confirmed cases of COVID-19 in real-time. The dashboard was publicly made available on 6 April 2020 to illustrate the counts of confirmed cases, deaths, and recoveries of COVID-19 at the level of country or continent. This dashboard is intended as a user-friendly dashboard for researchers as well as the general public to track the COVID-19 pandemic, and is generated from trusted data sources and built-in open-source R software (Shiny in particular); ensuring a high sense of transparency and reproducibility.

Access the shiny dashboard: https://mahdisalehi.shinyapps.io/Covid19Dashboard/

Written on Friday, 08 January 2021 07:03 in SDAT News Read 1842 times Read more...
• First Event on Play with Real Data

Scientific Data Analysis Team (SDAT) intends to organize the first event on the value of data to provide data holders and data analyzers with an opportunity to extract maximum value from their data. This event is organized by International Statistical Institute (ISI) and SDAT hosted at the Bu-Ali Sina University, Hamedan, Iran.

Organizers and the data providers will provide more information about the goals of the initial ideas, team arrangement, competition processes, and the benefits of attending this event on a webinar hosted at the ISI Gotowebianr system. Everyone invites to participate in this webinar for free, but it is needed to register at the webinar system by 30 December 2020.

Event Time: 31 December 2020 - 13:30-16:30 Central European Time (CET)

Register for the webinar: https://register.gotowebinar.com/register/8913834636664974352

Aims and outputs:

• Playing with real data by explorative and predictive data analysis techniques
• A platform between a limited number of data providers and hundreds to thousands of data scientist Teams
• Improving creativity and scientific reasoning of data scientist and statisticians
• Finding the possible “bugs” with the current data analysis methods and new developments
• Learn different views about a dataset.

AWARD-WINNING:

The best-report awards consist of a cash prize:
$400 for first place,$200 for second place, and
\$100 for third place.

Important Dates:

Event Webinar: 31 December 2020 - 13:30-16:30 Central European Time (CET).
Team Arrangement: 01 Jan. 2021 - 07 Jan. 2021
Competition: 10 Jan. 2021 - 15 Jan. 2021
First Assessment Result: 25 Jan. 2021
Selected Teams Webinar: 30 Jan. 2021
Award Ceremony: 31 Jan. 2021

Written on Wednesday, 23 December 2020 13:45 in SDAT News Read 2138 times Read more...
• Development of Neuroimaging Symposium and Advanced fMRI Data Analysis

The Developement of Structural and Functional Neuroimaging Symposium hold at the School of Sciences, Shiraz University in April 17 2019.  The Advanced fMRI Data Analysis Workshop also held in April 18-19 2019. For more information please visit: http://sdat.ir/dns98

Written on Sunday, 21 April 2019 12:18 in SDAT News Read 2323 times Read more...
• Releasing Rfssa Package by SDAT Members at CRAN

The Rfssa package is available at CRAN. Dr. Hossein Haghbin and Dr. Seyed Morteza Najibi (SDAT Members) have published this package to provide the collections of necessary functions to implement Functional Singular Spectrum Analysis (FSSA) for analysing Functional Time Series (FTS). FSSA is a novel non-parametric method to perform decomposition and reconstruction of FTS. For more information please visit github homepage of package.

Written on Sunday, 03 March 2019 21:03 in SDAT News Read 1805 times
• Data Science Symposium

Symposium of Data Science Developement and its job opportunities hold at the Faculty of Science, Shiraz University in Feb 20 2019. For more information please visit: http://sdat.ir/dss97

Written on Friday, 01 February 2019 00:13 in SDAT News Read 2933 times Read more...

Tags

SDAT is an abbreviation for Scientific Data Analysis Team. It consists of groups who are specialists in various fields of data sciences including Statistical Analytics, Business Analytics, Big Data Analytics and Health Analytics.