Parallelization in Rcpp via OpenMP

Obtaining good computational performance in R can be a frustrating experience.  As with most interpreted languages, we are taught to eschew loops in favor of vectorized operations, which requires learning the somewhat byzantine suite of functions that make up the “apply” family.  After rewriting your code to use these functions, you may see a 1.5x-2x speedup, which is substantial but can still fall short of what compiled languages offer.

Vectorized code is amenable to parallelization through functions such as “mclapply” in the parallel package, but if you are serious about performance, you will likely be told that “you shouldn’t be programming in R at all” and will instead be directed to outsource your computationally intensive code to a compiled language, using an interface such as “Rcpp.”  Of course, since C++ is compiled, “for” loops are not a problem, so you will have to convert your efficient vectorized code back to a naive nested loop structure.  Upon doing so, however, you will likely notice a 5-10x performance improvement, even over parallelized, vectorized R code.

Rcpp is particularly useful if you have to construct your own algorithms from scratch.  My research involves applying regularization methods to multivariate time series.  Since R’s existing packages for the Lasso and its structured counterparts offer limited multivariate support and are not amenable to my cross-validation procedure, I am usually forced to write my own algorithms.  The example that I will be working with applies a Lasso penalty to a vector autoregression via coordinate descent, similar to the procedure described in Friedman et al. (2010).

The initial performance gains from converting my algorithm to Rcpp can be seen in the following benchmark plot, which compares three variants of my procedure: one naively coded in R, one taking advantage of vectorized operations, and one coded in Rcpp.

[Figure: InitialBenchmarks — runtime comparison of the naive R, vectorized R, and Rcpp implementations]

Parallelizing from within Rcpp, however, is not entirely straightforward.  In my experience, the type of parallelization of most interest to statisticians typically consists of a for loop with independent iterates, a problem that is considered “embarrassingly parallel.”  For this type of problem, the OpenMP framework is ideal, as it is available by default on most Linux distributions and requires little modification to your code.

The documentation regarding the use of OpenMP within R is sparse.  Writing R Extensions contains a brief section on OpenMP, mentioning that R has “some” support for it and detailing how to use it in a package.  Dirk Eddelbuettel’s website contains a toy example using OpenMP with Rcpp via the “inline” package, but that package is not usable if you need to reference more than one function; in that case you must use “sourceCpp.”  There are a few other examples floating around, but they are all very rudimentary.

As a result, I will offer a brief tutorial on how to enable OpenMP support within Rcpp for a non-trivial program.  Specifically, I will parallelize the for loop in my Lasso-VAR implementation that cycles over different penalty parameters.  These iterations are independent, but larger penalties set more variables to zero, requiring fewer iterations and hence less time to solve, so some sort of load balancing should be taken into account.


 

Pre-requisites:

  • A modern Linux OS with OpenMP support 

In order to follow this tutorial, your operating system must support OpenMP.  Per an excerpt from “Writing R Extensions:”

[V]ersion 3.0 (May 2008) is supported by recent versions of the Linux, Windows and Solaris platforms, but portable packages cannot assume that end users have recent versions. The compilers used on OS X 10.6 (‘Snow Leopard’) had partial support for OpenMP 2.5, but this is not enabled in the CRAN build of R. (Apple have discontinued support for those compilers, and their alternative as from OS X 10.9, clang, has no OpenMP support. A project to add it has been announced at http://clang-omp.github.io/, but it is unknown when or even if the Apple builds will incorporate it.)

The performance of OpenMP varies substantially between platforms. Both the Windows and the Apple OS X (where available) implementations have substantial overheads and are only beneficial if quite substantial tasks are run in parallel.

Unfortunately, this makes OpenMP-based programs difficult to justify within a package development framework.  Nonetheless, I have tested my programs on three modern Linux distributions (Mint, Fedora, and Ubuntu) and have realized substantial performance increases.  Throughout the rest of this tutorial, I will assume that you are using a modern Linux distribution with OpenMP 3.0 support.

  • Up-to-date versions of, and some familiarity with, R, Rcpp, and RcppArmadillo
  • Functional code called through “sourceCpp” that contains a “for” loop to parallelize (optional)

 

Preliminaries:

In order for the compiler to recognize OpenMP, we need to include the OpenMP header.  Additionally, Rcpp has a plugin that will automatically add the OpenMP compiler flags.  Add the following to the top of your program.

#include <omp.h>
// [[Rcpp::plugins(openmp)]]

Next, recall that R is single-threaded, which means that any variables passed into your function as native R or Rcpp data structures (e.g. lists, NumericVector, NumericMatrix) will need to be converted to either native C++ objects or (preferably) Armadillo objects, which can safely be used across multiple threads.
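As an illustration, here is a minimal sketch of this approach; the function and argument names (crossprodArma, Z) are hypothetical and not part of my Lasso-VAR code.  Declaring the arguments as Armadillo objects lets RcppArmadillo handle the conversion at the call boundary, so no Rcpp data structures need to be touched later inside a parallel region.

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// Hypothetical example: the data arrive as an arma::mat, so everything in the
// function body is a plain Armadillo/C++ object.
// [[Rcpp::export]]
arma::mat crossprodArma(const arma::mat& Z) {
  return Z.t() * Z;   // pure Armadillo; no Rcpp data structures involved
}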

Then, identify the loop you want to parallelize and add the following lines above it:

omp_set_num_threads(n);   // n = the number of threads to use
#pragma omp parallel for

The first line sets the number of threads to use; this should not exceed the number of cores in your system.  At this point, if your code is simple enough, it might run, so go ahead and give it a try.  If your program appears to be successful, try running it several hundred times with a function like “microbenchmark” to test performance and to make sure no sporadic errors surface.
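To make this concrete, here is a minimal, self-contained sketch of an embarrassingly parallel loop; the function colNorms and its arguments are hypothetical and not taken from my Lasso-VAR code.  It can be compiled with “sourceCpp” and timed from R with “microbenchmark.”

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(openmp)]]
#include <omp.h>

// Hypothetical example: each column norm is computed independently, so the
// iterations can be distributed across threads without any coordination.
// [[Rcpp::export]]
arma::rowvec colNorms(const arma::mat& X, int nthreads) {
  arma::rowvec out(X.n_cols);
  omp_set_num_threads(nthreads);       // should not exceed the number of cores
  #pragma omp parallel for
  for (int j = 0; j < (int) X.n_cols; ++j) {
    out(j) = arma::norm(X.col(j), 2);  // independent iterate: writes only to out(j)
  }
  return out;
}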

Debugging

Debugging parallel programs can be very difficult, as many errors occur only sporadically, when certain unknown conditions are triggered.  I’ve found that two types of errors tend to occur most often.

  • Stack Imbalance/Overflow

As primarily an R programmer, I tend to take Rcpp and Armadillo’s automatic memory management for granted.  OpenMP seems to require a bit more active memory management.  This doesn’t mean that you have to explicitly make calls to “malloc” and “free,” but you should avoid declaring large objects within your parallel loop.  For example, my original lasso kernel function initialized the following variables:

int n = Y1.nrow(), k = Y1.ncol();
arma::mat Y(Y1.begin(), n, k, false);   // reuse R's memory rather than copying
int n1 = Z1.nrow(), k1 = Z1.ncol();
arma::mat Z(Z1.begin(), n1, k1, false);
int n2 = B1.nrow(), k2 = B1.ncol();
arma::mat B(B1.begin(), n2, k2, false);
arma::mat BOLD = B;
arma::mat R = arma::zeros<arma::mat>(n, k);
arma::colvec ZN = as<arma::colvec>(Znorm);
arma::colvec Z2;
arma::colvec G1;
NumericVector G;
arma::uvec m;

First, I have since realized that I can pass Armadillo objects in directly from R, so it is pointless to convert Rcpp objects to their Armadillo counterparts.  If I absolutely need to do so, I should do it outside of the parallel region so that OpenMP can better allocate memory to each thread.  Second, I am storing far more variables than I actually need.

My inner loop is

for(int i = 0; i < k2; ++i)
{
  m = ind(k2, i);
  R = Y - B.cols(m) * Z.rows(m);      // partial residual
  Z2 = trans(Z.row(i));
  G1 = R * Z2;
  G = as<NumericVector>(wrap(G1));    // convert to an Rcpp vector for ST4
  B.col(i) = ST4(G, gam) / ZN[i];     // elementwise soft-threshold update
}

I convert an arma::colvec into a NumericVector only because my original elementwise soft-threshold function ST4 requires a NumericVector input.  Though useful for debugging, there is no need to store all of these intermediate objects.  Instead, I could reorganize the loop as follows:

for(int i = 0; i < k2; ++i)
{
  m = ind(k2, i);
  B.col(i) = ST4((Y - B.cols(m) * Z.rows(m)) * trans(Z.row(i)), gam) / ZN[i];
}

This accomplishes the same task while substantially reducing storage costs.

If you absolutely have to define large data structures within a parallel region and you are getting stack imbalance or overflow errors, you may want to try adjusting the OpenMP stack size via the OMP_STACKSIZE environment variable:

setenv("OMP_STACKSIZE","XM",1);

Replace X with the amount you want to allocate; the default is 4 megabytes.

  • Segmentation Fault/Memory Not Mapped

These types of errors tend to occur for me when I am performing some sort of indexing operation, like accessing non-contiguous elements of a matrix or vector.  There are a few things I have found that help prevent them.  First, as mentioned in the previous section, for memory management purposes, I initialize most variables outside of the parallel region.  This requires that I explicitly designate “private” and “shared” variables.  A shared variable exists in only one memory location, which all threads read from or write to.  An example would be an array in which you store results from independent iterates of the parallel loop.  Shared variables can be useful because they save memory, but they can lead to errors if more than one thread accesses the same location at the same time.

Most variables will end up being private, meaning that a separate copy is created for each thread.  This increases overhead but prevents memory conflicts.  On another note, constant data structures, such as data matrices, should be designated as such (simply add “const” in front of the declaration), as they can be read by all threads without the “shared” designation.
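As a hedged illustration of this shared/private pattern (the function gridDemo and everything in it is hypothetical, not my Lasso-VAR code), consider a loop over a grid of penalty parameters: the results matrix can be shared because each iteration writes to its own column, while the per-iteration scratch vector is private to each thread.

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(openmp)]]
#include <omp.h>

// Hypothetical sketch of the shared/private pattern over a penalty grid.
// [[Rcpp::export]]
arma::mat gridDemo(arma::mat X, arma::vec grid, int nthreads) {
  int ngrid = grid.n_elem;
  arma::mat results(X.n_cols, ngrid);  // shared: each iteration writes its own column
  arma::vec scratch;                   // private: each thread gets its own copy
  int i;
  omp_set_num_threads(nthreads);
  #pragma omp parallel for shared(results, grid, X, ngrid) private(i, scratch) default(none) schedule(auto)
  for (i = 0; i < ngrid; ++i) {
    scratch = arma::sum(X, 0).t() * grid(i);  // stand-in for the real per-penalty work
    results.col(i) = scratch;                 // disjoint writes, so no conflict
  }
  return results;
}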

My final parallel for loop looks like:

#pragma omp parallel for shared(bcube,bcube2,b2,gammgrid) private(i,gam,B1F2,B1,m,nu) default(none) schedule(auto)

The bcube objects are Armadillo cubes (i.e. three-dimensional arrays) that store the initializations for the coefficient matrix as well as the final results; they naturally fall under the shared variable designation.

The “schedule” option addresses load balancing by specifying a scheduling strategy.  Per the textbook Parallel Programming for Multicore and Cluster Systems, OpenMP has the following scheduling options:

  • static, blocksize: assigns blocks of “blocksize” iterations to threads in a round-robin fashion.
  • dynamic, blocksize: distributes blocks of size “blocksize” to threads as soon as a thread finishes its previously assigned block.
  • guided, blocksize: dynamic scheduling with blocks of decreasing size.
  • auto: delegates scheduling to the compiler/runtime system.

The default static schedule does not work particularly well when the workload differs substantially across iterates, as it does in my example.  Dynamic scheduling might be preferred in this case, and one can experiment with the blocksize for further performance gains.  Typically, I find a slight performance increase from the automatic scheduling strategy, which leaves scheduling in the hands of the compiler.
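For reference, switching strategies only requires changing the schedule clause in the pragma above; for example, a dynamic schedule with single-iteration chunks (an illustrative setting, not necessarily the best one) would read:

#pragma omp parallel for shared(bcube,bcube2,b2,gammgrid) private(i,gam,B1F2,B1,m,nu) default(none) schedule(dynamic, 1)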

If you still experience segfault-related errors after trying the above, the next step is to start placing your code in critical regions.  A critical region is a region of code that can only be executed by one thread at a time.  Obviously, you don’t want to have any computationally demanding operations in a critical region, as it must be run in serial, but sometimes it cannot be avoided.  My code contains a small critical region in which I subset non-contiguous columns of the coefficient matrix.  I believe the issue likely stems from my use of an IntegerVector (an Rcpp data structure) to construct the index.  However, since it is a very simple operation, it does not really degrade performance, though I will look into an alternative for future versions.  The syntax for a critical region is:

#pragma omp critical
{

}

Note that the brackets must be on the next line; a #pragma directive occupies the entire line, so the compiler will not accept the bracket if it is on the same line as #pragma.  If the conflicting operation is atomic (a simple addition or subtraction, for example), “#pragma omp atomic” can be used instead, with less of a performance penalty.
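As a hedged illustration (the function countPositiveCols and everything in it is hypothetical, not taken from my Lasso-VAR code), an atomic update of a shared counter might look like this:

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(openmp)]]
#include <omp.h>

// Hypothetical example: count the columns of X with a positive sum.  The shared
// counter is modified by a single addition, so an atomic update suffices and is
// cheaper than a full critical region.
// [[Rcpp::export]]
int countPositiveCols(const arma::mat& X) {
  int count = 0;
  #pragma omp parallel for
  for (int j = 0; j < (int) X.n_cols; ++j) {
    if (arma::accu(X.col(j)) > 0) {
      #pragma omp atomic
      ++count;
    }
  }
  return count;
}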

Finally, after sorting out these issues, the program runs with a considerable performance improvement as seen in the following figure.

[Figure: OpenMPSpeedup — benchmark of the OpenMP-parallelized implementation]

I suspect that the long tails are the result of cache misses from accessing non-contiguous elements of my data matrix, which I will address in a subsequent post on profiling.  The final version of my code will be posted on GitHub in the coming days.