Sunday, 26 June 2016 22:11
### Understanding Bayesian A/B testing (using baseball statistics)

Written by Sayed Jamal Mirkamali

Well, Mike Piazza has a slightly higher career batting average (2127 hits / 6911 at-bats = 0.308) than Hank Aaron (3771 hits / 12364 at-bats = 0.305). But can we say with confidence that his skill is *actually* higher, or is it possible he just got lucky a bit more often?

In this series of posts about an empirical Bayesian approach to batting statistics, we’ve been estimating batting averages by modeling them as a binomial distribution with a beta prior. But we’ve been looking at a single batter at a time. What if we want to compare *two* batters, give a probability that one is better than the other, and estimate *by how much*?

This topic is relevant to my own work and to the data science field, because understanding the difference between two proportions is important in **A/B testing**. One of the most common examples of A/B testing is comparing clickthrough rates (“out of X impressions, there have been Y clicks”), which on the surface is similar to our batting average estimation problem (“out of X at-bats, there have been Y hits”).^{1}

Here, we’re going to look at an empirical Bayesian approach to comparing two batters.^{2} We’ll define the problem in terms of the difference between the two batters’ posterior distributions, and look at four mathematical and computational strategies we can use to resolve this question. While we’re focusing on baseball here, remember that similar strategies apply to A/B testing, and indeed to many Bayesian models.
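As a rough illustration of the simulation strategy, here is a minimal Python sketch that draws from each batter's beta posterior and estimates the probability that Piazza's true average exceeds Aaron's. The hit and at-bat totals are the ones quoted above; the prior hyperparameters `alpha0` and `beta0` are placeholder values assumed for this example (in the empirical Bayes approach they would be estimated from league-wide data), and the function names are hypothetical.

```python
import numpy as np

# Career totals quoted in the text.
piazza_hits, piazza_ab = 2127, 6911
aaron_hits, aaron_ab = 3771, 12364

# ASSUMPTION: illustrative Beta prior hyperparameters, not values from the post.
alpha0, beta0 = 100.0, 290.0


def posterior_params(hits, at_bats, a0=alpha0, b0=beta0):
    """Beta posterior after observing `hits` successes in `at_bats` trials."""
    return a0 + hits, b0 + at_bats - hits


def compare_by_simulation(a_params, b_params, n_draws=200_000, seed=0):
    """Monte Carlo estimate of P(A > B) plus the sampled differences A - B."""
    rng = np.random.default_rng(seed)
    a = rng.beta(a_params[0], a_params[1], size=n_draws)
    b = rng.beta(b_params[0], b_params[1], size=n_draws)
    return float(np.mean(a > b)), a - b


piazza_post = posterior_params(piazza_hits, piazza_ab)
aaron_post = posterior_params(aaron_hits, aaron_ab)

p_piazza_better, diff = compare_by_simulation(piazza_post, aaron_post)
ci_low, ci_high = np.percentile(diff, [2.5, 97.5])  # 95% credible interval

print(f"P(Piazza > Aaron) ~ {p_piazza_better:.3f}")
print(f"95% credible interval for the difference: [{ci_low:.4f}, {ci_high:.4f}]")
```

The same code applies to an A/B test by substituting clicks for hits and impressions for at-bats; only the prior would need to be re-estimated.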

Published in
Data Science Tools

Wednesday, 15 June 2016 04:48
### Python - Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA

Written by Sayed Jamal Mirkamali

Most machine learning algorithms have been developed and statistically validated for linearly separable data. Popular examples are linear classifiers like Support Vector Machines (SVMs) or the (standard) Principal Component Analysis (PCA) for dimensionality reduction. However, most real-world data requires nonlinear methods to analyze and discover patterns successfully.

The focus of this article is to briefly introduce the idea of kernel methods and to implement a Gaussian radial basis function (RBF) kernel that is used to perform nonlinear dimensionality reduction via RBF kernel principal component analysis (kPCA).
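To make the idea concrete, the following is a minimal NumPy sketch of RBF kernel PCA, not necessarily the article's own implementation. It builds the kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2), centers it in feature space, and projects the samples onto the leading eigenvectors. The function name `rbf_kernel_pca`, the value of `gamma`, and the two-circle toy data are assumptions made for illustration.

```python
import numpy as np


def rbf_kernel_pca(X, gamma, n_components):
    """Project the rows of X onto the top components of RBF kernel PCA."""
    X = np.asarray(X, dtype=float)

    # Pairwise squared Euclidean distances.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negatives

    # Gaussian RBF kernel matrix.
    K = np.exp(-gamma * sq_dists)

    # Center the kernel matrix (equivalent to centering in feature space).
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(K_centered)
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.maximum(eigvals[order], 0.0)
    top_vecs = eigvecs[:, order]

    # Coordinates of the training samples along each kernel component.
    return top_vecs * np.sqrt(top_vals), top_vals


# Toy data (assumed for illustration): two noisy concentric circles,
# which are not linearly separable in the original 2-D space.
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, size=100)
radii = np.repeat([1.0, 3.0], 50) + rng.normal(0.0, 0.05, size=100)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

X_kpca, top_eigvals = rbf_kernel_pca(X, gamma=1.0, n_components=2)
print(X_kpca.shape)
```

The choice of `gamma` controls the kernel width and usually has to be tuned for the data at hand; the value used here is arbitrary.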

Published in
Data Science Tools

Saturday, 11 June 2016 23:51
### Machine Learning is dead - Long live machine learning!

Written by Sayed Jamal Mirkamali

You may be thinking that this title makes no sense at all. ML, AI, ANN and deep learning have made it into the everyday lexicon, and here I am proclaiming that ML is dead. Well, here is what I mean…

The open sourcing of entire ML frameworks marks the end of a phase of rapid tool development, and thus marks the death of ML as we have known it so far. The next phase will be marked by the ubiquitous integration of these tools into software applications. And that is how ML will live forever: it will become a seamless and inextricable part of our lives.

Published in
Data Science Tools

Thursday, 09 June 2016 04:32
### Rootograms, a new way to assess count models

Written by Sayed Jamal Mirkamali

Assessing the fit of a count regression model is not necessarily a straightforward enterprise; often we just look at residuals, which invariably contain patterns of some form due to the discrete nature of the observations, or we plot observed versus fitted values as a scatter plot. Recently, while perusing the latest statistics offerings on arXiv, I came across Kleiber and Zeileis (2016), who propose the *rootogram* as an improved approach to assessing the fit of a count regression model. The paper is illustrated using R and the authors’ **countreg** package (currently on R-Forge only). Here, I thought I’d take a quick look at the rootogram with some simulated species abundance data.
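The post itself works in R with **countreg**; purely as a language-neutral illustration of the quantities a hanging rootogram displays, here is a small Python sketch. It assumes the simplest possible model, an intercept-only Poisson whose rate is the sample mean, and compares the square roots of observed and expected frequencies for each count. The function name `rootogram_table` and the simulated data are assumptions for this example, not part of the paper or the package.

```python
import math

import numpy as np


def rootogram_table(counts, mu=None):
    """Observed vs. expected frequencies for a Poisson fit, on the sqrt scale.

    ASSUMPTION: an intercept-only Poisson model; `mu` defaults to the sample
    mean (its maximum likelihood estimate) and must be positive.
    """
    counts = np.asarray(counts, dtype=int)
    if mu is None:
        mu = float(counts.mean())

    values = np.arange(counts.max() + 1)
    observed = np.bincount(counts, minlength=values.size)

    # Poisson pmf computed on the log scale for numerical stability.
    log_pmf = [-mu + k * math.log(mu) - math.lgamma(k + 1) for k in values]
    expected = counts.size * np.exp(log_pmf)

    # In a hanging rootogram each bar hangs from sqrt(expected) and has
    # height sqrt(observed); this is where the bottom of the bar ends up.
    # Values far from zero flag counts that the model fits poorly.
    bar_bottom = np.sqrt(expected) - np.sqrt(observed)
    return values, observed, expected, bar_bottom


# Simulated abundance-like counts (assumed data for illustration).
rng = np.random.default_rng(1)
y = rng.poisson(3.0, size=500)

values, observed, expected, bar_bottom = rootogram_table(y)
for k, obs, exp_, gap in zip(values, observed, expected, bar_bottom):
    print(f"count={k:2d}  observed={obs:3d}  expected={exp_:7.2f}  bar bottom={gap:+.2f}")
```

For a well-fitting model the bar bottoms stay close to zero across all counts; systematic departures, for example at zero, suggest problems such as excess zeros or overdispersion.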

Published in
Data Science Tools

The concept is certainly compelling. Having a machine capable of reacting to real-world visual, auditory or other types of data and then responding in an intelligent way has been the stuff of science fiction until very recently. We are now on the verge of this new reality with little general understanding of what artificial intelligence, convolutional neural networks, and deep learning can (and can’t) do, or what it takes to make them work. At the simplest level, much of the current effort around deep learning involves very rapid recognition and classification of objects, whether visual, audible, or some other form of digital data. Using cameras, microphones and other types of sensors, data is fed into a system that contains a multi-level set of filters that provide increasingly detailed levels of differentiation. Think of it like the animal or plant classification charts from your grammar school days: Kingdom, Phylum, Class, Order, Family, Genus, Species.

Published in
Data Science Tools

SDAT is an abbreviation for *Scientific Data Analysis Team*. It consists of groups of specialists in various fields of data science, including Statistical Analytics, Business Analytics, Big Data Analytics and Health Analytics.

Address: No.15 13th West Street, North Sarrafan, Apt. No. 1 Saadat Abad- Tehran

Phone: +98-910-199-2800

Email: info@sdat.ir