Category: Software

The popularity of statistics software in the (UK) academic job market

The Popularity of Data Science Software is a great article showing the popularity of various data analysis software from different perspectives. Inspired by it, I want to look at something I am personally interested in: the popularity of statistics software in the (UK) academic job market.

While the article mentioned above discusses popularity in the job market in general, as well as in scholarly articles, it does not look at the academic job market. Another reason to revisit this issue is that data science/analytics/analysis/statistical… jobs are different from one another. An example to illustrate this is Microsoft Excel, which was not included in the article’s job market analysis. Excel is nevertheless widely used: if you search Indeed today for “Excel” together with the search terms used in that article, you will find Excel is the third most popular software in data jobs, just behind Python and SQL. Seriously, I was once turned down for a data analytics job because I did not do well in their Excel test during the interview. However, it seems that jobs relying heavily on Excel are different kinds of jobs from those using Python/SQL/R etc., and an experienced Python user may not want those Excel jobs anyway.

Another example is SAS. There is a kind of job titled “SAS Programmer”. However, having worked and interacted with many statisticians in many UK academic institutions, I have never heard of, nor can I imagine, any researcher/statistician who would consider themselves a “programmer” (be it SAS programmer or R programmer). Do they really want a “programmer” job? Similarly, many economists are good at data and statistics, but I would never call them a “Stata Programmer” (Stata seems very popular among economists). Would Nate Silver (who has a degree in economics from the University of Chicago and used to work as an economic consultant) call himself a “Stata Programmer”? Stata advertises itself as “for researchers; by researchers”. Now, as a researcher myself, I am interested in how popular the major statistics software packages are in the academic job market, including the software most used in scholarly articles (SPSS, R, SAS, Stata) and the top software in the general job market (Python).

To investigate this, I used http://www.jobs.ac.uk, the major academic job advertisement website. As you can see from the domain, it is mainly for the UK, but institutions from other countries do post jobs regularly. The majority of jobs are posted by universities, though companies occasionally post research jobs there too. The website does have a built-in search function, but searching it for R is impossible, so I used Google to search the site. The search terms are the software name plus data-related terms (these are slightly different from the ones in this guide, partly because Google limits the number of words that can be searched, and also because some terms, such as “business analytics”, are irrelevant in academia, so they were removed):

site:jobs.ac.uk software AND ("big data" OR "data analytics" OR "machine learning" OR "statistical analysis" OR "data mining" OR "data science" OR "data scientist" OR "statistical software" OR "predictive analytics" OR "artificial intelligence" OR "predictive modelling" OR "statistical modelling" OR "statistical tools" OR "statistician")

Stata and SPSS are the easiest for obvious reasons, and their names are used as such. For Python, since it is a general-purpose programming language, I only searched for its main machine learning library, “scikit-learn”. Some academic job ads list statistics alongside many other quantitative skills such as mathematics and physics, and Python is often just one of several languages listed alongside Java, C++, etc., so it is not clear how such a job would actually use Python; a search for “scikit-learn” therefore makes more sense. For SAS, the abbreviation could mean many things: I have seen School of Advanced Study, Statistical Advisory Service, Student Administration and Support Services… all abbreviated as SAS. In particular, Statistical Advisory Service may mess up the results a little. Finally, for R, I used the following: (" R " OR " R,"). Note this is far from perfect, since I saw people’s names such as “Firstname R Surname” and email addresses (r.surname@someemail.com) in the search results. Thus it is likely that the number of R jobs has been greatly inflated.
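
For illustration, the full query for R would then look something like the following (a reconstruction from the pieces described above rather than the exact string used):

site:jobs.ac.uk (" R " OR " R,") AND ("big data" OR "data analytics" OR "machine learning" OR "statistical analysis" OR "data mining" OR "data science" OR "data scientist" OR "statistical software" OR "predictive analytics" OR "artificial intelligence" OR "predictive modelling" OR "statistical modelling" OR "statistical tools" OR "statistician")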

The following are the search results as of today:

Software                   Number of jobs
SPSS                       29
R                          262
SAS                        59
Stata                      64
Python (scikit-learn)      5

As discussed above, it is important to differentiate between different kinds of jobs. In particular, terms like “data science” or “artificial intelligence” may imply dealing with (big) data infrastructure itself or with robots, more of a computer science thing than trying to make sense of data. Now let’s redo the above search but without those data science or AI terms, i.e. using:

site:jobs.ac.uk software AND ("data analytics" OR "machine learning" OR "statistical analysis" OR "data mining" OR "statistical software" OR "predictive analytics" OR "predictive modelling" OR "statistical modelling" OR "statistical tools" OR "statistician")

Software                   Number of jobs
SPSS                       29
R                          167
SAS                        41
Stata                      58
Python (scikit-learn)      5

It seems R is the most popular, although, as discussed, its count is probably greatly inflated. Stata is the second most popular statistical software in the (UK) academic job market, followed by SAS, SPSS, and Python (scikit-learn).

The curse of R packages and why R may not be your preferred first language

A recent blog post discussed an interesting position for R as one’s second language (“R Should Be Your Second Language (If It’s Not Already Your First)”). While I very much agree that R can be a great second language, I have my reservations about it being one’s first language.

We all know that it is easier to learn a new language if you already know one. However, R is difficult to learn to start with. This has been discussed at length in various posts such as this and this. Another good read is Mitchell, M. N. (2007) Strategically Using General Purpose Statistics Packages: A Look at Stata, SAS and SPSS, Technical Report Series, which you can find here. Although it mainly discusses Stata, SAS, and SPSS, it has a short section on R.

While it is not news that people complain about R being difficult to learn, one advantage R users often boast about is that it has so many packages. Looking at the CRAN page at the time of writing, it states there are 14174 packages. Even considering that many of them overlap in functionality, and that they are needed largely because base R is pretty limited, this is still an impressive number. While this is definitely an advantage, it also causes problems for learners.

Firstly, users are easily overwhelmed and confused by the large number of packages. For example, receiver operating characteristic (ROC) curve analysis is a common procedure and comes as standard in statistics software such as Stata, SAS, and SPSS, but not in base R. So I searched by typing “ROC…” in RStudio, trying to find a package for this – it returned 11 packages. Which ones do ROC, and which one is the commonly used one? Not all of these packages do ROC analysis, such as the “rococo” package. I tried another one but could not make sense of it. After some searching in Bing/Google, and some trial and error, it turned out that the popular package ‘pROC’ (not in the list above, as it does not start with ‘ROC’) seems best (it resembles the corresponding functions in Stata). If you are learning R as your first language, you will likely spend a lot of time trying to figure out which packages to use, even for a very basic statistical procedure. Of course, if your statistical work is narrowly confined to a limited set of procedures, then I suppose you could do this kind of research once and that may be acceptable.
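
For what it is worth, once you do find pROC, basic use is straightforward. Here is a minimal sketch with simulated data (the variable names and data are made up purely for illustration):

library(pROC)

# Hypothetical binary outcome and a continuous score
set.seed(1)
outcome <- rbinom(200, 1, 0.5)
score   <- outcome + rnorm(200)

roc_obj <- roc(outcome, score)   # build the ROC curve
auc(roc_obj)                     # area under the curve
plot(roc_obj)                    # sensitivity vs specificity plot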

Another problem with these packages is that they do not integrate well. Let’s use linear regression as an example. Almost all statistical packages do linear regression via OLS. But what if I want to estimate the linear model using, say, a Bayesian method (without having to fully specify the data distribution etc. as in Stan or WinBUGS)? In Stata, this is effortless, by simply adding a ‘bayes’ prefix:

regress y x1 x2             // OLS 
bayes: regress y x1 x2      // Bayesian method 

In SAS, this is also fairly simple, by adding a ‘bayes’ statement:

proc genmod data=data;
  model y = x1 x2 / dist=normal;
  bayes;
run;

However, you cannot simply add a bayes option to ‘lm’, the standard linear model function in R (or to ‘glm’, the counterpart of SAS’s genmod). Something like the following does not work.

lm(y ~ x1 + x2, data=data, bayes=T)

The closest thing I found to the Stata or SAS solution is the ‘MCMCregress’ function in the ‘MCMCpack’ package. Of course, it is your responsibility to do the homework and research to find this package and this particular function. At least it is fortunate that MCMCregress has a similar syntax to lm. The cv.glmnet function in the glmnet package, for example, does not accept the usual formula but has to be called as cv.glmnet(x, y, …), where x is the matrix of predictors. And in the example below, the arima function, though part of base R, again has a different specification altogether: it has the form arima(x, order = c(0L, 0L, 0L), … xreg = …), where, bizarrely, x is the dependent variable and the covariates are specified as a matrix after xreg.
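
For reference, here is a minimal sketch of the MCMCpack route, assuming the same hypothetical data frame data with columns y, x1 and x2 as in the lm example above:

library(MCMCpack)

# Bayesian linear regression with the package's default priors;
# the formula interface happens to mirror lm()
fit <- MCMCregress(y ~ x1 + x2, data = data)
summary(fit)   # posterior means, SDs and quantiles of the coefficients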

Another integration problem is that you are vulnerable to statistical mistakes. This time let’s use an ARIMA model as an example. Suppose we want to run this ARIMA(1,1,1) model in Stata:

. webuse wpi1
. arima wpi, arima(1,1,1) noconstant nolog
ARIMA regression
Sample:  1960q2 - 1990q4                        Number of obs     =        123
                                                 Wald chi2(2)      =    1258.91
Log likelihood = -137.2468                      Prob > chi2       =     0.0000
 

             |                 OPG
       D.wpi |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
ARMA         |
          ar |
         L1. |   .9411698   .0296928    31.70   0.000     .8829731    .9993666
             |
          ma |
         L1. |  -.4656191   .0876255    -5.31   0.000    -.6373619   -.2938763
-------------+----------------------------------------------------------------
      /sigma |   .7347324   .0364759    20.14   0.000     .6632411    .8062238
Note: The test of the variance against zero is one sided, and the two-sided
      confidence interval is truncated at zero.

This can easily be done in base R, and we use the ‘lmtest’ package to obtain z-values for significance tests.

>   library(readstata13)
>   data=read.dta13("http://www.stata-press.com/data/r14/wpi1.dta")
>   model=arima(ts(data$wpi),order=c(1,1,1))
>   model 
 Call:
 arima(x = ts(data$wpi), order = c(1, 1, 1))
 Coefficients:
          ar1      ma1
       0.9412  -0.4657
 s.e.  0.0379   0.1046
 sigma^2 estimated as 0.5398:  log likelihood = -137.25,  aic = 280.49
>   library(lmtest)
>   coeftest(model) 
 z test of coefficients:
      Estimate Std. Error z value  Pr(>|z|)    
 ar1  0.941222   0.037872 24.8528 < 2.2e-16 ***
 ma1 -0.465740   0.104620 -4.4517 8.518e-06 ***
 ---
 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that the z-values (and hence the p-values) differ between the two packages. What happened? One clue is that Stata reports an “OPG” standard error. Looking into the documentation of the arima function in both R and Stata, it appears that it is indeed the different method of computing standard errors that causes the discrepancy. The documentation of arima in R says:

The variance matrix of the estimates is found from the Hessian of the log-likelihood, and so may only be a rough guide.

R Documentation

What does “rough guide” mean exactly here? Arguably, all statistical modelling outside fields like image analysis is a rough guide, since there is a lot of uncertainty. Is there any “better guide”? Now look at the Stata documentation on arima:

In one way, the results differ from most of Stata’s estimation commands: the standard error of the coefficients is reported as OPG Std. Err. The default standard errors and covariance matrix for arima estimates are derived from the outer product of gradients (OPG). This is one of three asymptotically equivalent methods of estimating the covariance matrix of the coefficients (only two of which are usually tractable to derive). Discussions and derivations of all three estimates can be found in Davidson and MacKinnon (1993), Greene (2012), and Hamilton (1994). Bollerslev, Engle, and Nelson (1994) suggest that the OPG estimates are more numerically stable in time-series regressions when the likelihood and its derivatives depend on recursive computations, which is certainly the case for the Kalman filter. To date, we have found no numerical instabilities in either estimate of the covariance matrix—subject to the stability and convergence of the overall model.

Most of Stata’s estimation commands provide covariance estimates derived from the Hessian of the likelihood function. These alternate estimates can also be obtained from arima by specifying the vce(oim) option.

Stata documentation

Clearly, not only does the Stata documentation state exactly what the issue is, it includes references if you want to research it further, or to cite in your paper to justify the method used. This is one of the many nice things about Stata – it protects you from doing stupid things, and its documentation tells you the full story behind them. Since all its commands come from a single developer, Stata selects the appropriate method for different models. The ‘lmtest’ package, on the other hand, does not care whether the model you fitted is an ‘lm’ or an ‘arima’. This is not the fault of the package – clearly, if you are an expert in the field (time series in this case), or you read the literature thoroughly (since the R documentation is limited), this would not be a problem. However, no one can be an expert in every statistical topic, however basic or major. One of the professors in medical statistics at a previous job of mine did not know about k-means, for example. Andrew Gelman is the Higgins Professor of Statistics at Columbia University and has published extensively, including many books, but, perhaps astonishingly, he does not know anything about the Generalized Method of Moments (GMM), a very common and Nobel-prize-winning method in econometrics.
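
As an aside, the Hessian-based standard errors shown in the R output above can be pulled straight from the fitted object, since arima() stores its (Hessian-derived) variance matrix as var.coef; a quick check, continuing the example:

sqrt(diag(model$var.coef))   # reproduces the s.e. line printed by arima()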

My response to “When Is It Crucial to Standardize the Variables in a Regression Model?”

I came across this blog post from Minitab the other day on LinkedIn: When Is It Crucial to Standardize the Variables in a Regression Model? To my great surprise, the author states that it is crucial “when your regression model contains polynomial terms or interaction terms” because of “multicollinearity”:

You should standardize the variables when your regression model contains polynomial terms or interaction terms. While these types of terms can provide extremely important information about the relationship between the response and predictor variables, they also produce excessive amounts of multicollinearity.

This is in direct contrast to what others have suggested. For example, in the blog post “When Can You Safely Ignore Multicollinearity?” (written by Paul Allison, Professor of Sociology at the University of Pennsylvania, where he teaches statistics), it is suggested that:

2. The high VIFs are caused by the inclusion of powers or products of other variables. If you specify a regression model with both x and x^2, there’s a good chance that those two variables will be highly correlated. Similarly, if your model has x, z, and xz, both x and z are likely to be highly correlated with their product. This is not something to be concerned about, however, because the p-value for xz is not affected by the multicollinearity. This is easily demonstrated: you can greatly reduce the correlations by “centering” the variables (i.e., subtracting their means) before creating the powers or the products. But the p-value for x^2 or for xz will be exactly the same, regardless of whether or not you center. And all the results for the other variables (including the R^2 but not including the lower-order terms) will be the same in either case. So the multicollinearity has no adverse consequences.

The author of the Minitab blog, however, insists that the coefficient for the lower-order term changes, so if a researcher is interested in this effect then centering should be done. The reason for the change, he claims, is multicollinearity.

Well, it is easy to show that the coefficient for the lower-order term does change after centering. Suppose this is our true model: y = ax^2 + bx + c, and the mean of x is m. After centering x, the fitted model becomes y = a'(x - m)^2 + b'(x - m) + c'. Expanding, this is equivalent to y = a'x^2 + (b' - 2a'm)x + (a'm^2 - b'm + c'). It is then clear that the coefficient for the higher-order term does not change, i.e. a' = a, while matching the lower-order terms gives b = b' - 2a'm, i.e. b' = b + 2am. So yes, the coefficient for the lower-order term does change, even though the models before and after centering are equivalent and b is correct on the original scale of x.

Why is this happening? Well, if we look at the effect of x on y, it becomes clear. Before centering, it is not difficult to show that dy/dx = 2ax + b, so b is the effect of x on y when x = 0. After centering, dy/dx = 2a'(x - m) + b' (with a' = a). Now it is clear that b' is actually the effect when x = m, i.e. when x is at its mean value.
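
A quick numerical check of this algebra in R, with made-up coefficients (a = 2, b = 3, c = 1) and simulated data:

set.seed(42)
x <- rnorm(200, mean = 5)                  # so m is roughly 5
y <- 2 * x^2 + 3 * x + 1 + rnorm(200)      # true model with a = 2, b = 3, c = 1
m <- mean(x)

fit_raw      <- lm(y ~ x + I(x^2))                 # uncentered fit
fit_centered <- lm(y ~ I(x - m) + I((x - m)^2))    # centered fit

coef(fit_raw)        # lower-order coefficient is about b = 3
coef(fit_centered)   # lower-order coefficient is about b + 2am = 3 + 2*2*5 = 23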

The author of the Minitab blog says the coefficient for the lower-order term x, and its p-value, change. Of course they change! He is effectively saying that the effect of x on y when x = 0 is different from the effect of x on y when x is at its mean. Of course it is different! It is a parabola! He is comparing apples with oranges, which has nothing to do with multicollinearity.

Still, one may say that if I am interested in the effect of x on y when x is at its mean, then centering is a good method. This is true, but it is not “crucial” to do so. You can still derive the revised equation after fitting a model without centering the data!
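
Continuing the sketch above, the effect at the mean can be read straight off the uncentered fit:

# effect of x at its mean from the uncentered model: b + 2*a*m
unname(coef(fit_raw)["x"] + 2 * coef(fit_raw)["I(x^2)"] * m)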

It is also not very clear why it is so important to show the effect of x on y when x is at its mean. I understand that the mean can be a better representative of a variable than 0, but a consequence of centering your data is that your results are not directly comparable with others’. Say Researcher 1 and Researcher 2 conduct two separate studies using two separate samples. Unless both samples are extremely large, it is likely that the mean value in Sample 1 differs from that in Sample 2; say the former is 50 and the latter is 60. Now if they center their data and fit the model, the coefficients for the lower-order term x represent the effect of x on y when x = 50 and when x = 60, respectively. Obviously these are not directly comparable. Without centering, the interpretation is the same in both studies: the coefficient for the lower-order term x represents the effect of x on y when x = 0, which is comparable regardless of differences in the means of x across samples. You see, centering does not give you a better model at all; it causes confusion and makes modelling results more difficult to interpret. Also, why are we so obsessed with the mean anyway? If I model age against the chance of getting cancer, I would be less interested in the effect at the mean age, but would want to focus on old age, when most people get cancer.

Minitab seems to be a reasonably popular statistics package, though I have never used it. It is just disappointing that its blog is of such low quality and so misleading.

On the complexity of code in Stata and R

One aspect I really appreciate about Stata is the brevity of its code. For the same task, I generally find Stata code shorter and less complex than code in another popular statistical package, R.

Here is an example. Recently I have been evaluating various features (i.e. independent variables) in a logistic regression model. Say I want to see the effects of all variables whose names start with “ps” on the dependent variable “case” (imagine this is a case/control study). Doing this in Stata is quite simple and straightforward:

logit case ps*

It cannot get simpler than that, right? How do you do the same thing in R? The following are some options I know of:

datanew=data[,c("case",grep("^ps", names(data),value=T))]
model=glm(case~., datanew, family="binomial")

or you can try this:
vars=grep("^ps", names(data), value=T)
formula=paste("case ~", paste(vars,collapse="+"))
model=glm(formula=formula, data, family="binomial")

or this:
vars=c("case",grep("^ps", names(data), value=T))
model=glm(case~., data=subset(data,select=vars),family="binomial")

We can see that in any case you are looking at three or four functions (if you don’t count “c” as a function!) and noticeably more code to achieve this. Unless I am missing something, this extra coding certainly makes the process less pleasant in R.