Code Added

Code from my first blog post Parallelization in Rcpp via OpenMP is now available on GitHub.  The serial code should run provided you install the packages Rcpp, RcppArmadillo, and microbenchmark.  Regarding the parallel code, Linux operating systems have OpenMP support by default, so it should run without issues.  I think it will also work on Windows, but as stated in the blog post, it will probably not run on a Mac without installing a new compiler.


Linking R with Intel’s math kernel libraries

March 13, 2015:  I purchased a new computer, so I needed to re-link R with MKL.  I’ve modified the following post with a few corrections.

Matrix operations are usually the most computationally intensive task in a statistical program.  Fortunately, R and most modern programming languages rely on highly optimized matrix routines from external Fortran libraries such as BLAS and LAPACK, which have been continuously refined since the 1970s.

R ships with a default internal BLAS implementation which, per the installation and administration manual, is "well-tested and will be adequate for most uses of R."  However, it does not take advantage of recent advances found in external libraries such as OpenBLAS, ATLAS, and MKL, which have multithreaded support and highly optimized cache utilization.

Using the default BLAS libraries, R is substantially slower than its proprietary counterpart Matlab.  A major reason for this discrepancy is Matlab’s use of Intel’s Math Kernel Library (MKL) for BLAS and LAPACK operations.  By offering parallel support for matrix operations and utilizing Intel architecture-specific optimizations, the MKL results in Matlab dominating R in most benchmark tests (see, for example, this performance comparison).

However, Intel provides both its compilers (for C and C++) and its MKL free to students (provided that they are not compensated for software development), so we can build R from source using Intel’s compilers and link it to Intel’s MKL.  Alternatively, Revolution R offers a free academic license and links to an older version of MKL, but it appears to only have a Linux binary for Red Hat/Fedora.

I found building R from source and linking it with MKL to be a somewhat onerous process, with existing documentation being outdated, confusing, and often misleading, so I will outline my procedure in as much detail as possible, in hopes that it may aid someone in the future.

I’m working from a fresh install of Linux Mint 17, but these instructions should work for any Linux-based OS.  It’s also recommended that you have an Intel processor in order to experience the most gains from the MKL.  The computer that I used for this tutorial is a Samsung Series 9 with an Intel i5 and 4GB of RAM.

Step 1: Download and install the Intel® C++ Studio XE for Linux

The Intel C++ Studio comes with MKL, C and C++ compilers, and a few other useful tools such as VTune for profiling (though VTune will not work for profiling C++ code within R; I will write a post on profiling in the future).

You will also need to make sure that you have gcc and g++ installed for these libraries to function properly.  You will get a warning if they are not available, though you can safely ignore the warning about missing 32-bit libraries if you are on a 64-bit machine.  Most of the required files can be installed via the command:

sudo apt-get install build-essential

Step 2: Pre-requisites for building R from source

Several libraries and other dependencies are needed to build R from source.  First, install a Fortran compiler.  I didn’t have a license for Intel’s Fortran compiler, ifort, so I installed GNU’s gfortran.  Next, I needed to install the X11 windowing system and the readline library, which supposedly improve the aesthetics of R’s terminal display.  Finally, Java is needed for the external “rJava” package, which is built in the default install.  With the proper configure flags, you can probably get around installing some of these libraries, but I didn’t feel like experimenting.  The commands to install these packages are:

sudo apt-get install gfortran
sudo apt-get install libx11-dev
sudo apt-get install xorg-dev
sudo apt-get install libreadline-dev

sudo apt-get install default-jdk

sudo apt-get install libcairo2-dev

If you find that you need another library when compiling something, a good rule of thumb is to run “sudo apt-get install [dependency-name]-dev”, which generally will provide the needed dependencies for compilation.

Step 3: Download R’s source and edit the configuration file

R’s source code is available here.  After downloading and extracting it, open the file config.site and edit the following lines:

CC='icc -std=c99'
CFLAGS='-O3 -ipo -xavx -openmp'
CXX='icpc'
CXXFLAGS='-O3 -ipo -xavx -openmp'

This tells R to use Intel’s compilers with the appropriate flags (-O3 turns on several optimizations, -ipo enables inter-procedural optimization, -xavx enables Intel AVX-specific optimizations, and -openmp enables OpenMP support).

Step 4: Create environment variables

Next, you need to tell R where MKL and Intel’s compilers are located.  Your installation directory contains several shell scripts that set the required environment variables, which you will need to source every time you want to use MKL or Intel’s compilers.  For convenience, you can add the following lines to your ~/.bashrc file so that they are run automatically in each bash session.

source /opt/intel/composer_xe_2013_sp1.1.106/bin/compilervars.sh intel64
source /opt/intel/composer_xe_2013_sp1.1.106/mkl/bin/mklvars.sh intel64
source /opt/intel/composer_xe_2013_sp1.1.106/bin/iccvars.sh intel64
export AR="xiar"
export LD="xild"

Note that your paths may differ depending on the versions of MKL and Intel Composer that you install, but they should be along these lines.  There is an MKL link tool included in the MKL directory which can help you construct the link line if you run into problems.

Finally, switch to root and run the following commands:

source /opt/intel/composer_xe_2013_sp1.1.106/bin/compilervars.sh intel64
source /opt/intel/composer_xe_2013_sp1.1.106/mkl/bin/mklvars.sh intel64
source /opt/intel/composer_xe_2013_sp1.1.106/bin/iccvars.sh intel64
export AR="xiar"
export LD="xild"

export MAIN_LDFLAGS='-openmp'
OMP_LIB_PATH=/opt/intel/composer_xe_2013_sp1.1.106/compiler/lib
MKL_LIB_PATH=/opt/intel/composer_xe_2013_sp1.1.106/mkl/lib/intel64
MKL=" -L${MKL_LIB_PATH} -L${OMP_LIB_PATH} \
-Wl,--start-group \
-lmkl_gf_lp64 \
-lmkl_intel_thread \
-lmkl_core \
-lm \
-Wl,--end-group \
-liomp5 -lpthread"
./configure --with-lapack --with-cairo --with-blas="$MKL" --build="x86_64-linux-gnu" --host="x86_64-linux-gnu" --enable-R-shlib

Previously, I did not include the flags --enable-R-shlib and --with-cairo, which enable shared libraries (required to use external applications such as RStudio) and better graphics, respectively.  To enable shared libraries, I referenced the following Stack Overflow post.  Note that some tutorials use the -lmkl_intel_lp64 flag instead of -lmkl_gf_lp64.  According to Ying H at Intel:

libmkl_gf_lp64  is for LP64 interface library for the GNU Fortran compilers and libmkl_intel_lp64 is for intel compilers.

So if, like me, you don’t have ifort, be sure to use the GNU fortran flag.

I re-ran all of the source commands as root because I had only defined them in my personal ~/.bashrc; they will not be defined for root or sudo.  I could have edited the system-wide /etc/bash.bashrc instead, but I did not know that at the time.

If it worked correctly, you should see something along the lines of

R is now configured for x86_64-unknown-linux-gnu

Source directory: .
Installation directory: /usr/local

C compiler: icc -std=c99 -O3 -xavx -ipo -openmp
Fortran 77 compiler: gfortran -g -O2

C++ compiler: icpc -O3 -ipo -xavx -openmp
C++ 11 compiler: icpc -std=c++11 -O3 -ipo -xavx -openmp
Fortran 90/95 compiler: gfortran -g -O2
Obj-C compiler:

Interfaces supported: X11
External libraries: readline, BLAS(generic), LAPACK(in blas)
Additional capabilities: PNG, NLS
Options enabled: R profiling

Recommended packages: yes

If you don’t see this, check your config.log.  In particular, if you see an error around the following lines:

conftest.c:210: warning: implicit declaration of function 'dgemm_'
configure:29096: $? = 0
configure:29103: result: yes
configure:29620: checking whether double complex BLAS can be used
configure:29691: result: yes

then something went wrong, and R will ignore your MKL link and use its internal BLAS libraries.

I adapted most of the configuration step from Intel’s tutorial, except that I found I had to add an -lm flag to $MKL, or configure wouldn’t recognize the math symbols in MKL.  Several others and I (you’ll find them if you do a Google search) had problems with the use of double complex BLAS, but for me they seemed to go away when I ran the configure step as root.  I have no explanation as to why.

Next, run make and make install while still running as root.  It should take about 10-15 minutes.  Once it finishes, run the command:


R CMD ldd ./bin/exec/R

And you should see:

linux-vdso.so.1 => (0x00007fff723d4000)
libmkl_gf_lp64.so => /opt/intel/composer_xe_2013_sp1.1.106/mkl/lib/intel64/libmkl_gf_lp64.so (0x00007f65cfa50000)
libmkl_intel_thread.so => /opt/intel/composer_xe_2013_sp1.1.106/mkl/lib/intel64/libmkl_intel_thread.so (0x00007f65cea92000)
libmkl_core.so => /opt/intel/composer_xe_2013_sp1.1.106/mkl/lib/intel64/libmkl_core.so (0x00007f65cd3d4000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f65cd0ab000)
libiomp5.so => /opt/intel/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libiomp5.so (0x00007f65ccd90000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f65ccb72000)
libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3 (0x00007f65cc858000)
libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007f65cc61c000)
libreadline.so.6 => /lib/x86_64-linux-gnu/libreadline.so.6 (0x00007f65cc3d6000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f65cc1cd000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f65cbfc9000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f65cbdb3000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f65cb9ec000)
/lib64/ld-linux-x86-64.so.2 (0x00007f65d0196000)
libtinfo.so.5 => /lib/x86_64-linux-gnu/libtinfo.so.5 (0x00007f65cb7c3000)

Complications

I ran into a few annoyances after building R from source.  I typically run R from Emacs via Emacs Speaks Statistics (ESS), and when I first tried to open R from Emacs it would crash.  This is because the MKL and icc environment variables are only set in bash sessions, not when I launch a program from the desktop.  This means that if I want to use ESS, I need to launch Emacs from the terminal.

Additionally, when I tried to install the “Rcpp” package I encountered the error “catastrophic error: cannot open source file bits/c++config.h”, but following the advice of this Stack Overflow post, I was able to resolve it.

Perhaps the most annoying issue was the poor quality of R’s default graphics, which produced plots like this:

[Figure: Rplot — the poor-quality default graphics output]

I posted a question on Stack Overflow but didn’t get much help, so I did some research.  It turns out that I needed to install libcairo2-dev and run ./configure with the option --with-cairo.  Then, edit the .Rprofile in your home directory to include:

setHook(packageEvent("grDevices", "onLoad"),
        function(...) grDevices::X11.options(type='cairo'))
options(device='x11')

After doing so, your graphics should appear as they normally would after installing a package from the repository.

There may be more complications in the future when installing packages that require compiled code, but my initial performance gains appear quite substantial.  I routinely see R running at 400% CPU utilization on some of my computationally intensive procedures.  I will run some benchmarks in the future, but I’ve noticed performance gains roughly similar to those on Revolution R’s blog, with even greater gains from my parallel Rcpp code, which is automatically linked against MKL.

Parallelization in Rcpp via OpenMP

Obtaining good computational performance in R can be a frustrating experience.  As with most interpreted languages, we are taught to eschew loops in favor of vectorized operations, which requires learning the somewhat byzantine suite of functions that comprise the “apply” family.  After rewriting your code to use these functions, you may see a 1.5-2x speedup, which is substantial but can still lag far behind R’s compiled counterparts.

Vectorized code is amenable to parallelization through functions such as “mclapply,” but if you are serious about performance, you will likely be told that “you shouldn’t be programming in R at all” and instead be directed to outsource your computationally intensive code to a compiled language through an interface such as “Rcpp.”  Of course, since C++ is compiled, “for” loops are not a problem, so you will have to convert your efficient vectorized code back to a naive nested loop structure.  Upon doing so, however, you will likely notice a 5-10x performance improvement, even over parallelized vectorized R code.

Rcpp is particularly useful if you have to construct your own algorithms from scratch.  My research involves applying regularization methods to multivariate time series.  Since R’s existing packages for Lasso and its structured counterparts offer limited multivariate support and are not amenable to my cross validation procedure, I am usually forced to write my own algorithms.  The example that I will be working with applies a Lasso penalty to a vector autoregression via coordinate descent, similar to the procedure described in Friedman et al. (2010).
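For readers unfamiliar with coordinate descent, the workhorse of the algorithm is an elementwise soft-thresholding update.  The following is a minimal sketch of that operator in RcppArmadillo; the name softThreshold and its signature are placeholders of my own, not the routine from my actual implementation.

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// Elementwise soft-thresholding: sign(z) * max(|z| - gam, 0).
// Coordinate descent repeatedly applies this operator to partial residuals,
// with gam playing the role of the Lasso penalty parameter.
// [[Rcpp::export]]
arma::vec softThreshold(const arma::vec& z, double gam) {
  return arma::sign(z) % arma::max(arma::abs(z) - gam,
                                   arma::zeros<arma::vec>(z.n_elem));
}

Compiled with Rcpp::sourceCpp(), this mirrors the kind of update applied to each column of the coefficient matrix in the code shown later in this post.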

The initial performance gains from converting my algorithm to Rcpp can be seen in the following benchmark plot, which compares the performance of three variants of my procedure: one naively coded in R, one taking advantage of vectorized operations, and one coded using Rcpp.

[Figure: InitialBenchmarks — benchmark comparison of the naive R, vectorized R, and Rcpp implementations]

Parallelizing from within Rcpp is not entirely straightforward.  In my experience, the type of parallelization of most interest to statisticians is a for loop with independent iterations, a problem that is considered “embarrassingly parallel.”  For this type of parallelization, the OpenMP framework is ideal, as it is included as a default library on most Linux distributions and requires little modification to your code.
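As a minimal, self-contained illustration of what “embarrassingly parallel” means here (a toy loop, not my Lasso-VAR code), a single pragma suffices when every iteration is independent:

#include <omp.h>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  const int n = 1000000;
  std::vector<double> out(n);

  // Each iteration writes only to its own element, so the iterations can be
  // split across threads without any synchronization.
  #pragma omp parallel for
  for (int i = 0; i < n; ++i) {
    out[i] = std::sqrt(static_cast<double>(i));
  }

  std::printf("out[%d] = %f\n", n - 1, out[n - 1]);
  return 0;
}

Compiled with g++ -fopenmp, this will typically use one thread per core, or however many threads OMP_NUM_THREADS specifies.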

The documentation regarding the use of OpenMP within R is sparse.  Writing R Extensions contains a brief section on OpenMP mentioning that R has “some” support for it and detailing how to use it in a package.  Dirk Eddelbuettel’s website contains a toy example using OpenMP with Rcpp via the “inline” package, but that package is not usable if you need to reference more than one function; that requires “sourceCpp.”  There are a few other examples floating around, but they are all very rudimentary.

As a result, I will offer a brief tutorial on how to enable OpenMP support within Rcpp for a non-trivial program.  Specifically, I will parallelize the for loop in my Lasso-VAR implementation that cycles over different penalty parameters.  These iterations are independent, but larger penalties set more variables to zero, requiring fewer iterations and hence less time to solve, so some sort of load balancing should be taken into account.



Pre-requisites:

  • A modern Linux OS with OpenMP support 

In order to follow this tutorial, your operating system must support OpenMP.  Per an excerpt from “Writing R Extensions:”

[V]ersion 3.0 (May 2008) is supported by recent versions of the Linux, Windows and Solaris platforms, but portable packages cannot assume that end users have recent versions. The compilers used on OS X 10.6 (‘Snow Leopard’) had partial support for OpenMP 2.5, but this is not enabled in the CRAN build of R. (Apple have discontinued support for those compilers, and their alternative as from OS X 10.9, clang, has no OpenMP support. A project to add it has been announced at http://clang-omp.github.io/, but it is unknown when or even if the Apple builds will incorporate it.)

The performance of OpenMP varies substantially between platforms. Both the Windows and the Apple OS X (where available) implementations have substantial overheads and are only beneficial if quite substantial tasks are run in parallel.

Unfortunately, this makes OpenMP-based programs difficult to justify in a package development framework.  Nonetheless, I have tested my programs on three modern Linux distributions (Mint, Fedora, and Ubuntu), and I have realized substantial performance increases.  Throughout the rest of this tutorial, I will assume that you are using a modern Linux distribution with OpenMP 3.0 support.

  • The most up-to-date versions of, and some familiarity with, R, Rcpp, and RcppArmadillo
  • Functional code called through “sourceCpp” that contains a “for” loop to parallelize (optional)


Preliminaries:

In order for the compiler to recognize OpenMP, we need to include the OpenMP header.  Additionally, Rcpp has a plugin that will automatically add the OpenMP compiler flags.  Add the following to the top of your program:

#include <omp.h>
// [[Rcpp::plugins(openmp)]]

Next, recall that R is single threaded, which means that any variables passed into your function as native R or Rcpp data structures (e.g., lists, NumericVector, NumericMatrix) will need to be converted to either native C++ objects or (preferably) Armadillo objects, which can be used safely across multiple threads.

Then, identify the loop you want to parallelize and add the following lines above it:

omp_set_num_threads(n);   // n = the number of threads you want to use
#pragma omp parallel for

The first line sets the number of threads to use, which should not exceed the number of cores in your system.  At this point, if your code is simple enough, it might just run, so go ahead and give it a try.  If your program appears to work, run it several hundred times through a function like “microbenchmark” to test performance and flush out any intermittent errors.
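Putting these pieces together, here is a hedged sketch of what a complete sourceCpp-ready file might look like; rowSumsParallel and its contents are illustrative stand-ins, not anything from my Lasso-VAR code.

#include <RcppArmadillo.h>
#include <omp.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(openmp)]]

// Toy example: compute row sums of a matrix in parallel. Each thread writes
// to distinct elements of "out", so no synchronization is required.
// [[Rcpp::export]]
arma::vec rowSumsParallel(const arma::mat& X, int nthreads) {
  omp_set_num_threads(nthreads);
  arma::vec out(X.n_rows);
  #pragma omp parallel for
  for (int i = 0; i < static_cast<int>(X.n_rows); ++i) {
    out(i) = arma::accu(X.row(i));
  }
  return out;
}

Calling Rcpp::sourceCpp() on this file from R exposes rowSumsParallel(), which can then be benchmarked against base R’s rowSums() with microbenchmark.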

Debugging

Debugging parallel programs can be very difficult, as many errors occur sporadically, only when certain unknown conditions are triggered.  I’ve found that two types of errors tend to occur most often.

  • Stack Imbalance/Overflow

As primarily an R programmer, I tend to take Rcpp and Armadillo’s automatic memory management for granted.  OpenMP seems to require a bit more active memory management.  This doesn’t mean that you have to make explicit calls to “malloc” and “free,” but you should avoid declaring large objects within your parallel loop.  For example, my original Lasso kernel function initialized the following variables:

// Wrap the Rcpp matrices Y1, Z1, and B1 as Armadillo matrices (no copy)
int n=Y1.nrow(),k=Y1.ncol();
arma::mat Y(Y1.begin(),n,k,false);
int n1=Z1.nrow(),k1=Z1.ncol();
arma::mat Z(Z1.begin(),n1,k1,false);
int n2=B1.nrow(),k2=B1.ncol();
arma::mat B(B1.begin(),n2,k2,false);
// Working copies and temporaries reused inside the loop
arma::mat BOLD=B;
arma::mat R=zeros<arma::mat>(n,k);
arma::colvec ZN = as<arma::colvec>(Znorm);
arma::colvec Z2;
arma::colvec G1;
NumericVector G;
arma::uvec m;

First, I have since realized that I can pass Armadillo objects directly from R, so it is pointless to convert Rcpp objects to their Armadillo counterparts.  If I absolutely need to do so, I should do it outside of the parallel region so that OpenMP can better allocate memory to each thread.  Second, I am storing far too many variables compared to what I actually need.
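For example (a sketch with made-up names), the manual wrapping above can be dropped entirely by declaring Armadillo types in the function signature and letting RcppArmadillo perform the conversion at the interface:

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// The R matrices arrive already converted to arma::mat, so there is no need
// for the NumericMatrix-to-arma::mat wrapping shown above.
// [[Rcpp::export]]
arma::mat computeResidual(const arma::mat& Y, const arma::mat& B,
                          const arma::mat& Z) {
  // Dimensions must conform: Y is (n x k), B is (n x p), Z is (p x k).
  return Y - B * Z;
}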

My inner loop is

for(int i = 0; i < k2; ++i)
{
  m=ind(k2,i);                    // index set produced by the helper ind()
  R=Y-B.cols(m)*Z.rows(m);        // partial residual using the selected columns
  Z2=trans(Z.row(i));
  G1=R*Z2;
  G=as<NumericVector>(wrap(G1));  // convert for the soft-threshold function ST4
  B.col(i)=ST4(G,gam)/ZN[i];
}

For some reason, I convert an arma::colvec into a NumericVector because my original elementwise soft-threshold function ST4 requires a NumericVector input.  Though useful for debugging purposes, there is no need to store all of these intermediate matrices.  Instead, I could reorganize the loop as follows:

for(int i = 0; i < k2; ++i)
{
  m=ind(k2,i);
  B.col(i)=ST4((Y-B.cols(m)*Z.rows(m))*trans(Z.row(i)),gam)/ZN[i];
}

This accomplishes the same task while substantially reducing storage costs.

If you absolutely have to define large data structures within a parallel region and you are getting stack imbalance or overflow errors, you may want to try adjusting the OpenMP stack size via the environment variable:

setenv("OMP_STACKSIZE","XM",1);

Replace X with the amount you want to allocate; the default is 4 megabytes.

  • Segmentation Fault/Memory Not Mapped

These types of errors tend to occur for me when I am performing some sort of indexing operation, like accessing non-contiguous elements of a matrix or vector.  There are a few things I have found that work to prevent them.  First, as mentioned in the previous section, for memory management purposes, I initialize most variables outside of the parallel region.  This requires that I explicitly designate “private” and “shared” variables.  A shared variable exists in only one memory location, which all threads read from or write to.  An example would be an array in which you store results from the independent iterations of the parallel loop.  Shared variables can be useful because they save memory, but they can lead to errors if more than one thread accesses the same location at the same time.

Most variables will end up being private, meaning that a separate copy is created for each thread.  This results in increased overhead but prevents memory conflicts.  On another note, constant data structures, such as data matrices, should be designated as such (simply add “const” in front of the declaration), as they can then be accessed by all threads without the “shared” designation.

My final parallel for loop looks like:

#pragma omp parallel for shared(bcube,bcube2,b2,gammgrid) private(i,gam,B1F2,B1,m,nu) default(none) schedule(auto)

The bcube objects are Armadillo cubes (i.e., three-dimensional arrays) that store the initializations for the coefficient matrix as well as the final results, so they naturally fall under the shared designation.

The “schedule” clause controls load balancing via a scheduling strategy.  Per the textbook Parallel Programming for Multicore and Cluster Systems, OpenMP has the following scheduling options:

  • Static, blocksize: assigns blocks of blocksize iterations to threads in a round-robin fashion.
  • Dynamic, blocksize: distributes blocks of size blocksize to threads as soon as a thread finishes its previously assigned block.
  • Guided, blocksize: dynamic scheduling of blocks with decreasing sizes.
  • Auto: Delegates scheduling to the compiler/runtime system.

The default option, static, does not work particularly well when the workload differs substantially across iterations, as it does in my example.  Dynamic scheduling might be preferred in this case, and one can play around with the block size to find additional performance gains.  Typically, I find a slight performance increase by setting an automatic scheduling strategy, leaving scheduling in the hands of the compiler.
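To make this concrete, here is a small, self-contained sketch (not my Lasso-VAR loop) of a workload that shrinks across iterations, roughly mimicking what happens as the penalty grows; it combines the default(none), shared, and private clauses with a dynamic schedule:

#include <omp.h>
#include <cmath>
#include <vector>

// Iterations near the end of the loop do far less work, mimicking larger
// penalties that zero out more coefficients and therefore converge faster.
std::vector<double> unevenWork(int n) {
  std::vector<double> out(n);
  int i;
  double acc;
  #pragma omp parallel for shared(out, n) private(i, acc) default(none) schedule(dynamic, 4)
  for (i = 0; i < n; ++i) {
    acc = 0.0;
    for (int j = 0; j < (n - i) * 1000; ++j) {
      acc += std::sin(j * 1e-3);
    }
    out[i] = acc;  // each thread writes to its own slot of the shared vector
  }
  return out;
}

Swapping schedule(dynamic, 4) for schedule(auto) gives the automatic strategy described above.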

If you still experience segfault-related errors after trying the above, the next step is to start placing your code in critical regions.  A critical region is a section of code that can only be executed by one thread at a time.  Obviously, you don’t want any computationally demanding operations in a critical region, as it must run in serial, but sometimes it cannot be avoided.  My code contains a small critical region in which I subset non-contiguous columns of the coefficient matrix.  I believe the issue likely stems from my use of an IntegerVector (an Rcpp data structure) to construct the index.  However, since it is a very simple operation, it does not really degrade performance, though I will look into an alternative for future versions.  The syntax for a critical region is:

#pragma omp critical
{
  // code placed here is executed by only one thread at a time
}

Note that the brackets must be on the next line: a #pragma directive extends to the end of its line, so a bracket placed on the same line is read as part of the directive rather than as the start of the block.  If the conflicting operation is atomic (a simple addition or subtraction), the “#pragma omp atomic” directive can be used instead, with less of a performance penalty.
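As a hedged illustration of the difference (a toy reduction, not code from my procedure), the atomic directive protects a single simple update, whereas a critical region serializes an arbitrary block:

#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
  std::vector<double> x(1000, 0.5);
  double total = 0.0;

  #pragma omp parallel for
  for (int i = 0; i < static_cast<int>(x.size()); ++i) {
    // A single addition to a shared scalar: atomic is sufficient and cheaper
    // than wrapping the same statement in a critical region.
    #pragma omp atomic
    total += x[i];
  }

  std::printf("total = %f\n", total);
  return 0;
}

(In practice a reduction(+:total) clause would be preferable for a sum like this, but the point here is the relative cost of atomic versus critical.)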

Finally, after sorting out these issues, the program runs with a considerable performance improvement as seen in the following figure.

[Figure: OpenMPSpeedup — benchmark showing the speedup from OpenMP parallelization]

I am under the impression that the long tails are the result of cache misses from accessing non-contiguous elements of my data matrix, which I will address in a subsequent post on profiling.  The final version of my code will be posted on GitHub in the coming days.