Statistical & Financial Consulting by Stanford PhD
STATISTICAL SOFTWARE

Almost all serious statistical analysis is done in one of the following packages: R (S-PLUS, RStudio), Matlab, SAS, SPSS and Stata. I have expertise in each of them, but that does not mean each package is suitable for every type of analysis. In fact, for most advanced areas only two or three packages are suitable, either providing the functionality directly or providing enough tools to implement it easily. For example, the very important area of Markov Chain Monte Carlo is doable in R, Matlab and SAS only, unless you want to rely on convoluted macros written by random users on the web. The table at the end of this page compares the five packages in detail.


R & MATLAB

R and Matlab are by far the richest systems. They contain an impressive collection of libraries, which is growing every day. Even if a specific model is not part of the standard functionality, you can implement it yourself, because R and Matlab are really programming languages with relatively simple syntax. As "languages" they allow you to express any idea; the question is whether you are a good writer. In terms of modern applied statistics tools, R libraries are somewhat richer than those of Matlab. R is also free. On the flip side, Matlab has much better graphics, which you will not be ashamed to put in a paper or a presentation.
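
As a rough illustration of what implementing a model by hand can look like, the sketch below fits a one-predictor ridge regression without an intercept from its closed-form solution. It is written in Python purely as a neutral stand-in (Python is not one of the five packages compared here), and the function name ridge_slope, the penalty lam and the toy data are assumptions made for this example.

```python
def ridge_slope(x, y, lam):
    """Closed-form ridge estimate for y ~ beta * x (no intercept).

    Minimizes sum((y_i - beta * x_i)^2) + lam * beta^2 over beta.
    """
    if lam < 0:
        raise ValueError("penalty must be non-negative")
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


# Toy data (hypothetical): y is exactly 2 * x.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

ols = ridge_slope(x, y, lam=0.0)      # no penalty: ordinary least squares
shrunk = ridge_slope(x, y, lam=14.0)  # the penalty shrinks the slope toward zero
```

With no penalty the estimate reproduces ordinary least squares; increasing lam pulls the slope toward zero, which is the defining behavior of ridge regression.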


SPSS

At the other end of the spectrum is a package like SPSS. SPSS is quite narrow in its capabilities and covers only about half of mainstream statistics. It is of little use for ambitious modeling and estimation procedures of the kind found in kernel smoothing, pattern recognition or signal processing. Nonetheless, SPSS is very popular among practitioners because it requires almost no training. All you have to do is click several buttons and SPSS does all the calculations for you. When you need something standard, SPSS may have it fully implemented. The SPSS output will be quite detailed and visually pleasing. It will contain all the major tests and diagnostic tools associated with the method and will allow you to write an informative statistics section of your empirical analysis. In short, when the method is there, it is faster to run than similar functionality in R or Matlab. So I often use SPSS for standard requests from my clients, like linear regression, ANOVA or principal components analysis. SPSS gives you the ability to program macros, but that feature is quite inflexible.


SAS & STATA

Somewhere in between R, Matlab and SPSS lie SAS and Stata. SAS offers more extensive analytics than Stata. It is composed of dozens of procedures with massive output, often covering more than ten pages. The idea of SAS is not to listen to you that much. It is like an old grandfather whom you approach with a simple question, but instead he tells you the story of his life. Many procedures report three times more than you need to know about the topic, so some time has to be spent picking out the relevant output. SAS procedures are invoked using simple scripts. Stata procedures can be invoked by clicking buttons in the menu or by running simple scripts. In its menus, Stata resembles SPSS. Both SAS and Stata are programming languages, so they allow you to build analytics around standard procedures. Stata is somewhat more flexible than SAS. Still, in terms of programming flexibility, Stata and SAS do not come even close to R or Matlab. Selected strengths of SAS compared to all other packages: large data sets, speed, beautiful graphics, flexibility in formatting the output, time series procedures, counting processes. Selected strengths of Stata compared to all other packages: manipulation of survey data (stratified samples, clustering), robust estimation and tests, longitudinal data methods, multivariate time series.


THE TABLE

The following table compares the standard functionality of the five packages in detail. By "standard" I mean

        1) built-in,
        2) readily available from official or widely known and reliable public websites, or
        3) attainable by relatively straightforward programming around built-in functions.

I use the label "10+ code" if the required programming exceeds 10 lines of code in more than 30% of typical projects.

TYPE OF STATISTICAL ANALYSIS                    R         MATLAB    SAS       STATA     SPSS

Nonparametric Tests                             Yes       Yes       Yes       Yes       Yes
T-test                                          Yes       Yes       Yes       Yes       Yes
ANOVA & MANOVA                                  Yes       Yes       Yes       Yes       Yes
ANCOVA & MANCOVA                                Yes       Yes       Yes       Yes       Yes
Linear Regression                               Yes       Yes       Yes       Yes       Yes
Generalized Least Squares                       Yes       Yes       Yes       Yes       Yes
Ridge Regression                                Yes       Yes       Yes       Limited   Limited
Lasso                                           Yes       Yes       Yes       Limited   -
Generalized Linear Models                       Yes       Yes       Yes       Yes       Yes
Logistic Regression                             Yes       Yes       Yes       Yes       Yes
Mixed Effects Models                            Yes       Yes       Yes       Yes       Yes
Nonlinear Regression                            Yes       Yes       Yes       Limited   Limited
Quantile Regression                             Yes       Yes       Yes       Yes       -
Discriminant Analysis                           Yes       Yes       Yes       Yes       Yes
Nearest Neighbor                                Yes       Yes       Yes       -         Yes
Naive Bayes                                     Yes       Yes       -         -         Limited
Factor & Principal Components Analysis          Yes       Yes       Yes       Yes       Yes
Canonical Correlation Analysis                  Yes       Yes       Yes       Yes       Yes
Copula Models                                   Yes       Yes       Limited   -         -
Path Analysis                                   Yes       Yes       Yes       Yes       Yes
Structural Equation Modeling (Latent Factors)   Yes       10+ code  Yes       Yes       AMOS
Extreme Value Theory                            Yes       Yes       -         -         -
Variance Stabilization                          Yes       Yes       -         -         -
Bayesian Statistics                             Yes       Yes       Limited   Limited   -
Monte Carlo, Classic Methods                    Yes       Yes       Yes       Yes       Limited
Markov Chain Monte Carlo                        10+ code  10+ code  10+ code  -         -
EM Algorithm                                    10+ code  10+ code  10+ code  -         -
Missing Data Imputation                         Yes       Yes       Yes       Yes       Yes
Bootstrap & Jackknife                           Yes       Yes       Yes       Yes       Yes
Outlier Diagnostics                             Yes       Yes       Yes       Yes       Yes
Robust Estimation                               Yes       Yes       Yes       Yes       -
Cross-Validation                                Yes       Yes       Yes       Yes       -
Longitudinal (Panel) Data                       Yes       Yes       Yes       Yes       Limited
Survival Analysis                               Yes       Yes       Yes       Yes       Yes
Propensity Score Matching                       Yes       Yes       Limited   Yes       -
Stratified Samples (Survey Data)                Yes       Yes       Yes       Yes       Yes
Experimental Design                             Yes       Yes       Limited   -         -
Quality Control                                 Yes       Yes       Yes       Yes       Yes
Reliability Theory                              Yes       Yes       Yes       Yes       Yes
Univariate Time Series                          Yes       Yes       Yes       Yes       Limited
Multivariate Time Series                        Yes       Yes       Yes       Yes       -
Stochastic Volatility Models, Discrete Case     Yes       Yes       Yes       Yes       Limited
Stochastic Volatility Models, Continuous Case   Yes       Yes       Limited   Limited   -
Diffusions                                      10+ code  10+ code  -         -         -
Markov Chains                                   10+ code  10+ code  -         -         -
Hidden Markov Models                            Yes       Yes       -         Yes       -
Counting Processes                              Yes       Yes       Yes       -         -
Filtering                                       Yes       Yes       Limited   Limited   -
Instrumental Variables                          Yes       Yes       Yes       Yes       Yes
Simultaneous Equations                          Yes       Yes       Yes       Yes       AMOS
Splines                                         Yes       Yes       Yes       Yes       -
Nonparametric Smoothing Methods                 Yes       Yes       Yes       Yes       -
Spatial Statistics                              Yes       10+ code  Limited   -         Limited
Cluster Analysis                                Yes       Yes       Yes       Yes       Yes
Neural Networks                                 Yes       Yes       Yes       -         Limited
Classification & Regression Trees               Yes       Yes       Yes       -         Limited
Boosting                                        Yes       Yes       Limited   -         -
Random Forests                                  Yes       Yes       Limited   -         -
Support Vector Machines                         Yes       Yes       Yes       -         -
Signal Processing                               Yes       Yes       Limited   -         -
Wavelet Analysis                                Yes       Yes       Yes       -         -
Bagging                                         Yes       Yes       Yes       Yes       -
Meta-analysis                                   Yes       10+ code  Limited   Yes       -
ROC Curves                                      Yes       Yes       Yes       Yes       Yes
Deterministic Optimization                      Yes       Yes       Yes       Limited   -
Stochastic Optimization                         Yes       Yes       Limited   -         -
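
To give a feel for why Markov Chain Monte Carlo carries the "10+ code" label, here is a minimal sketch of a random-walk Metropolis sampler. It is written in Python only as a neutral illustration (Python is not one of the five packages in the table), and the function names, the step size, the seed and the standard normal target are all assumptions chosen for the example rather than anything prescribed by the packages above.

```python
import math
import random


def metropolis(log_density, start, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target.

    log_density: log of the target density, up to an additive constant.
    Returns the list of visited states (a state repeats after a rejection).
    """
    rng = random.Random(seed)
    current = start
    current_lp = log_density(current)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)  # symmetric proposal
        proposal_lp = log_density(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples


def log_std_normal(x):
    """Log density of the standard normal, up to a constant."""
    return -0.5 * x * x


draws = metropolis(log_std_normal, start=0.0, n_samples=20000, step=2.0)
kept = draws[2000:]  # discard an initial burn-in stretch
mean = sum(kept) / len(kept)
var = sum((d - mean) ** 2 for d in kept) / len(kept)
```

Even this bare-bones version runs past ten lines before any tuning, convergence checks or multivariate targets are added, which is the kind of programming effort the label is meant to flag.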

