A recent blog post discussed an interesting position of R as one’s second language (“R Should Be Your Second Language (If It’s Not Already Your First)“). While I very much agree that R can be a great second language, I have my reservation for it being one’s first language.
We all know that it is easier to learn a new language if you already know a language. However R is difficult to learn to start with. This has been greatly discussed in various posts such as this and this. Another good read is Mitchell, M. N. (2007) Strategically using General Purpose Statistics Packages: A Look at Stata , SAS and SPSS, Technical Report Series, which you can found here. Although it mainly discusses Stata, SAS, and SPSS, it has a short section on R.
While it is not news that people complaining about difficult to learn R, one advantage of R that R users often boast is it has so many packages. Looking at CRAN page, at the time of writing, it states there are 14174 packages. Even considering many of them are overlapping in functionality and they are needed really because the base R is pretty limited, this is still an impressive number. While this is definitely an advantage, this also causes problems for learners.
Firstly users are easily overwhelmed and confused by the large number of packages. For example, the receiver operating characteristic (ROC) curve analysis is a common procedure and included as standard in statistic software such as Stata, SAS, and SPSS, but not in base R. So I searched by typing “ROC…” in RStudio trying to find a package to this – it returns 11 packages. Which one does ROC and which one is the commonly used one? Not all these packages does ROC, such as the “rococo” package. I tried another one but could not make sense of it. After some searches in Bing/Google, and some play and trials, it turns out a popular package ‘pROC’ (not in the list above as it does not start with ‘ROC’) seems best (it resembles the similar functions in Stata). If you are learning R as your first language, you will likely spend a lot of time trying to figure out what packages to use even for a very basic statistical procedure. Of course, if your statistical works are very narrow to limited statistical procedures and then I suppose you could do this kind of research once and this may be acceptable.
Another problem of these packages they do not integrate well. Let’s use linear regression as an example. Almost all statistical packages do linear regression via OLS. But what if I want to estimate the linear model using say Bayesian method (without having to fully specify the data distribution etc as in Stan or WinBUGS)? In Stata, this is effortless by simply adding a ‘bayes’ prefix:
regress y x1 x2 // OLS bayes: regress y x1 x2 // Bayesian method
In SAS, this is also fairly simple, by adding a ‘bayes’ statement:
proc genmod data=data; model y = x1 X2 / dist=normal; bayes; run;
However, you could not simply add a bayes option in ‘lm’, the standard linear model function (or glm function as in SAS), in R. Something like the following does not work.
lm(y ~ x1 + x2, data=data, bayes=T)
The closest thing I found to similar solution in Stata or SAS is the ‘MCMCregress’ function in ‘MCMCpack’ pakcage. Of course, it is your responsibility to do the homework and research to find this package and this particular function. At least it is lucky that MCMCregress has similar syntax as to lm. The cv.glmnet in glmnet package for example does not accept the usual formula but has to be specified as cv.glmnet(x, y, …) where x is the matrix of predictors. And in the example below, the arima function, though part of base R, again has different specification altogether: it has the form of arima(x, order = c(0L, 0L, 0L), … xreg = …), where bizarrely x is the dependent variable and covariates were specified as matrix after xreg.
Another integration problem is that you are vulnerable to statistic mistakes. This time we use ARIMA model as an example. Suppose we want to run this ARIMA(1,1,1) model in Stata:
. webuse wpi1 . arima wpi, arima(1,1,1) noconstant nolog ARIMA regression Sample: 1960q2 - 1990q4 Number of obs = 123 Wald chi2(2) = 1258.91 Log likelihood = -137.2468 Prob > chi2 = 0.0000
D.wpi | Coef. Std. Err. z P>|z| [95% Conf. Interval]-------------+---------------------------------------------------------------- ARMA | ar | L1. | .9411698 .0296928 31.70 0.000 .8829731 .9993666 | ma | L1. | -.4656191 .0876255 -5.31 0.000 -.6373619 -.2938763 -------------+---------------------------------------------------------------- /sigma | .7347324 .0364759 20.14 0.000 .6632411 .8062238 Note: The test of the variance against zero is one sided, and the two-sided confidence interval is truncated at zero.
This can be easily done in the base R and we use the ‘lmtest’ package to obtain z-values for significant tests.
> library(readstata13) > data=read.dta13("http://www.stata-press.com/data/r14/wpi1.dta") > model=arima(ts(data$wpi),order=c(1,1,1)) > model Call: arima(x = ts(data$wpi), order = c(1, 1, 1)) Coefficients: ar1 ma1 0.9412 -0.4657 s.e. 0.0379 0.1046 sigma^2 estimated as 0.5398: log likelihood = -137.25, aic = 280.49 > library(lmtest) > coeftest(model) z test of coefficients:
Estimate Std. Error z value Pr(>|z|)ar1 0.941222 0.037872 24.8528 < 2.2e-16 *** ma1 -0.465740 0.104620 -4.4517 8.518e-06 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 ‘.’ 0.1 ‘ ’ 1
Notice that the z-values are different from the two softwares (and so the p-values). What happened? One clue is that Stata reports “OPG” standard error. Now looking into the documentation of arima function in both R and Stata it appears it is indeed the different method for standard error that causes problem. In the documentation of arima in R it says:
The variance matrix of the estimates is found from the Hessian of the log-likelihood, and so may only be a rough guide.R Documentation
What does “rough guide” mean exactly here? Arguably all statistical modelling outside fields like image analysis are rough guides since there are a lot of uncertainties. Are there any “better guide”? Now looking at Stata documentation on ARIMA:
In one way, the results differ from most of Stata’s estimation commands: the standard error of the coefficients is reported as OPG Std. Err. The default standard errors and covariance matrix for arima estimates are derived from the outer product of gradients (OPG). This is one of three asymptotically equivalent methods of estimating the covariance matrix of the coefficients (only two of which are usually tractable to derive). Discussions and derivations of all three estimates can be found in Davidson and MacKinnon (1993), Greene (2012), and Hamilton (1994). Bollerslev, Engle, and Nelson (1994) suggest that the OPG estimates are more numerically stable in time-series regressions when the likelihood and its derivatives depend on recursive computations, which is certainly the case for the Kalman filter. To date, we have found no numerical instabilities in either estimate of the covariance matrix—subject to the stability and convergence of the overall model.
Most of Stata’s estimation commands provide covariance estimates derived from the Hessian of the likelihood function. These alternate estimates can also be obtained from arima by specifying the vce(oim) option.Stata documentation
Clearly not only Stata documentation states exactly what the problem is, it include references if you want to research it further or include in your paper to justify the method used. This is one of many nice things in Stata – it will protect you from doing stupid things, and its documentation would tell you a full story behind it. Since they are developed under one developer, it will select the correct method for different models. The ‘lmtest’ package on the other hand does not care the model you fitted is a ‘lm’ or an ‘arima’. It is not the fault of this package – clearly if you are an expert in this field (time-series in this case) or thoroughly read the literature (since R documentation is limited) this issue would not be a problem. However, no one can be an expert in every statistical topic, however basic or major. One of the professors in medical statistics in my previous jobs does not know about k-means for example. Andrew Gelman is a Higgins Professor of Statistics at Columbia University and published extensively including many books, but maybe astonishingly he does not know anything about Generalized Method of Moments (GMM), a very common and Nobel-prize-winning method in econometrics.