Creating a virtual cluster in R using Amazon EC2

I was recently asked by my department to evaluate Amazon's Elastic Compute Cloud (EC2) to see if it could serve as a viable alternative to our aging cluster.  There is a great deal of documentation on setting up R with EC2, but it tends to be either far too simplistic or needlessly complicated.  I find myself in need of a middle ground: I am comfortable using Linux from the command line, so I have no need for something like RStudio Server, but my primary purpose in using EC2 is to run "embarrassingly parallel" simulations, so I don't need distributed memory, and setting up StarCluster would be overkill.

This brief tutorial provided me with the information I needed, and I modified it slightly to fit my specific needs.  I will briefly expand upon it and offer a few comments.

Pre-requisites:

I'm going to assume that you are working from Linux; the instructions should be roughly the same on a Mac.  Other than that, you don't really need anything aside from ssh, which is included in a default install of every modern Linux distribution.

Step 1: Create an Account and set up your head node

Go here to create an account; you will need to verify your identity and enter your billing information.  After doing so, you should end up at the EC2 Dashboard, at which point you should click "Launch Instance."  From what I can tell, an instance is basically a virtual machine, with varying amounts of memory (dependent on cost), to which you have root access.  Unlike a traditional cluster, you are not required to go through administrators in order to install programs or edit configuration files, which to me is very advantageous.  Additionally, you don't really need to worry about cluster traffic, as a new virtual environment is created each time you launch your instance.  There were a few times when I was unable to launch my instance because of traffic issues in my region, but I was able to get around this by creating a new instance in another region.  The pricing options are listed here.  I use the "US East" region, as it is the closest to me geographically.

In the first step, you will choose an Amazon Machine Image (AMI).  You have the option of using one that comes pre-configured with R and RStudio, or creating your own.  I chose to create a new AMI from the Ubuntu Server edition, as I am most familiar with Ubuntu.

With regard to which instance type to choose, I would consider a general purpose one.  Unless you are doing something very memory intensive, there is no need to use a "compute optimized" instance, as they are considerably more expensive.  Moreover, one can create a cluster from several micro instances, which cost just over $.01 per hour to run.  For my initial instance, I chose the "m3.xlarge," which will serve as the "head node" for my cluster.  I haven't noticed performance degradation with cheaper instances.

Just set one instance for now and click through the next few prompts, possibly decreasing the EBS storage from 8 GB if you don't think you will need that much space (you are charged about $.10 per gigabyte per month).  Also, edit the security settings to open port 10187, as per the cited instructions, and name this security group something you will remember.  You are going to be asked to create a key pair, which, in lieu of a password, is how you will access your instance.  Choose to create a new key pair, download it, and store it in a safe place.  Then, you will need to edit (or create) the file ~/.ssh/config so that it contains an entry of the following form:

Host <public DNS address of your instance, available from the console>
    # the username is "ubuntu" for Ubuntu AMIs; it varies for other operating systems
    User ubuntu
    # not sure whether both Host and Hostname are necessary
    Hostname <public DNS address again>
    Port 22
    IdentityFile /path/to/key

Now, you will be able to connect simply by entering the following in your terminal:

ssh <public DNS address>

Step 2: ssh into your head node and start installing programs

Once you are in, you have root access, so feel free to install R or any other programs that you may need.  I used the instructions and script provided here and then installed a few packages that are required for my simulations.  Also, make sure to follow his instructions (step 6) regarding ssh keys; otherwise your instances will not be able to communicate with each other.  After doing so, right-click the instance in the console and select "Create Image."  This will reboot the instance and take a few minutes to complete.

Step 3: Install the EC2 command line tools on the head node

Though the console is a nice interface, it can be fairly tedious to use if you are dealing with a large number of instances.  Fortunately, Amazon has a command line tool which can be installed through the package manager following these instructions.  After installing it, you will need to follow the instructions here to configure access.  You can test whether it works by typing ec2-describe-regions; you should see

REGION eu-west-1 ec2.eu-west-1.amazonaws.com
REGION sa-east-1 ec2.sa-east-1.amazonaws.com
REGION us-east-1 ec2.us-east-1.amazonaws.com
REGION ap-northeast-1 ec2.ap-northeast-1.amazonaws.com
REGION us-west-2 ec2.us-west-2.amazonaws.com
REGION us-west-1 ec2.us-west-1.amazonaws.com
REGION ap-southeast-1 ec2.ap-southeast-1.amazonaws.com
REGION ap-southeast-2 ec2.ap-southeast-2.amazonaws.com

The command line tool can be used and manipulated with bash and various other Linux utilities, but an R interface also exists, which I prefer because it limits the need for a complicated bash script.  However, it requires some XML packages, which can be installed following these instructions (note: don't use sudo apt-get install cran-xml, as it installs an out-of-date version of the package).
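If the dependency in question is the XML package (my assumption), installing it from CRAN rather than from the Ubuntu repositories gets the current version, provided the libxml2 development headers are present on the system:

# Install the XML package from CRAN rather than the distribution's
# (potentially outdated) package; requires libxml2-dev on the system.
install.packages("XML")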

Once everything is installed, you can load up R on your head node and run a command of the following form:


cl <- startCluster(ami="<AMI ID>", key="<key pair name>", instance.count=<number of instances>, instance.type="t2.micro", security.groups="<security group ID>", verbose=TRUE)

I had to slightly modify the package to allow for the "security.groups" argument; the resulting script will be posted on my GitHub.

Now, rather than following the instructions and manually copying and pasting every public DNS address, we can extract them with the command


machines <- cl$instances$dnsName

Now, we can follow the rest of his instructions (note: there is only one core per machine, so no need for the localhost command):


library(doSNOW)   # provides registerDoSNOW and attaches snow and foreach

setDefaultClusterOptions(port=10187)

clust <- makeCluster(machines, type = "SOCK")

registerDoSNOW(clust)
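With the workers registered, an embarrassingly parallel simulation can be farmed out with foreach.  As a minimal sketch (sim_once() is a hypothetical stand-in for whatever function runs a single replication of your simulation):

library(foreach)
# Run 1,000 independent replications across the worker instances and
# stack the results by row; sim_once() is a hypothetical placeholder.
results <- foreach(i = 1:1000, .combine = rbind) %dopar% {
  sim_once(seed = i)
}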

Then, when we are done performing our parallel operations, we can stop the cluster by


stopCluster(clust) # terminates the SNOW cluster

terminateCluster(cl) # terminates the ec2 instances

This way, the only instance that you have to worry about turning off manually is your head node.  I would like to be able to run my code as a batch script, as opposed to interactively, but when I form the cluster, I get the warning

The authenticity of host [public DNS] can't be established.
Are you sure you want to continue connecting (yes/no)?

I have to manually answer "yes" to each one.  However, this StackOverflow post tells me that I can disable this warning (for example, by setting StrictHostKeyChecking to no in ~/.ssh/config), so I may try that in the future.

One thing that might be disconcerting is the security issue that comes with leaving port 10187 open.  This seems to be the standard practice in R, but as I start to use this more seriously, I will look for more secure options.

Aside: terminating versus stopping

Though both terminating and stopping your instance will prevent you from incurring charges for the instance itself, terminating also deletes the EBS storage, which runs at about $.10 per GB-month (i.e., per gigabyte stored per month).  Therefore, it is always a good idea to terminate the worker nodes in a cluster when you are finished with them; there is no need to save anything on them, since they can always be recreated from an AMI.  If you take a long time between simulations, it might also be preferable to create an AMI of the head node and terminate it, rather than leaving it stopped.

Final Impressions

Overall, once I was able to get everything set up, the EC2 system is fairly easy to work with, and since I have root access and complete flexibility, it has a lot of potential as a replacement for a traditional cluster for my high-performance computing needs.

 

Code Added

Code from my first blog post, Parallelization in Rcpp via OpenMP, is now available on GitHub.  The serial code should run provided you install the packages Rcpp, RcppArmadillo, and microbenchmark.  Regarding the parallel code, Linux operating systems have OpenMP support by default, so it should run without issues.  I think it will also work on Windows, but as stated in the blog post, it will probably not run on a Mac without installing a new compiler.

Linking R with Intel's Math Kernel Library

March 13, 2015:  I purchased a new computer, so I needed to re-link R with MKL.  I’ve modified the following post with a few corrections.

Matrix operations are usually the most computationally intensive tasks in a statistical program.  Fortunately, R and most modern programming languages rely on highly optimized matrix algorithms through external FORTRAN libraries such as BLAS and LAPACK, which have been continuously refined since the 1970s.

Though R ships with a BLAS implementation that, per the Installation and Administration manual, is "well-tested and will be adequate for most uses of R," its default internal BLAS does not take advantage of recent advances found in external libraries such as OpenBLAS, ATLAS, and MKL, which offer multithreading support and highly optimized cache utilization.

Using the default BLAS libraries, R is still substantially slower than its proprietary counterpart, Matlab.  A major reason for this discrepancy is Matlab's use of Intel's Math Kernel Library for BLAS and LAPACK operations.  By offering parallel support for matrix operations and utilizing Intel architecture-specific optimizations, MKL results in Matlab dominating R in most benchmark tests (see, for example, this performance comparison).

However, Intel provides both its compilers (for C and C++) and its MKL free for students (provided that they are not compensated for software development), so we can build R from source using Intel's compilers and link it against Intel's MKL.  Alternatively, Revolution R offers a free academic license and links to an older version of MKL, but it appears to only have a Linux binary for Red Hat/Fedora.

I found building R from source and linking it with MKL to be a somewhat onerous process, with existing documentation being outdated, confusing, and often misleading, so I will outline my procedure in as much detail as possible, in hopes that it may aid someone in the future.

I’m working from a fresh install of Linux Mint 17, but I am assuming that these instructions should work for any Linux-based OS.  It’s also recommended that you have an Intel chipset in order to experience the most gains from the MKL.  The computer that I used for this tutorial is a Samsung Series 9 with an Intel i5 and 4GB of RAM.

Step 1: Download and install the Intel® C++ Studio XE for Linux

The Intel C++ Studio comes with MKL, a C and C++ compiler, and a few other useful things such as VTune for profiling (though it will not work for profiling C++ code within R; I will write a post on profiling in the future).

You will also need to make sure that you have gcc and g++ installed for these libraries to function properly.  You will get a warning if they are not available, though you can safely ignore the warning about missing 32-bit libraries if you are on a 64-bit machine.  Most of the required files can be installed via the command:

sudo apt-get install build-essential

Step 2: Pre-requisites for building R from source

Several libraries and other dependencies are needed to build R from source.  First, install a Fortran compiler; I didn't have a license for Intel's Fortran compiler, ifort, so I installed GNU's gfortran.  Next, I needed to install the X11 windowing system and the readline library, which supposedly improve the aesthetics of R's terminal display.  Finally, Java is needed for the external "rJava" package, which is built in the default install.  With the proper configure flags, you can probably get around installing some of these libraries, but I didn't feel like experimenting.  The commands to install these packages are:

sudo apt-get install gfortran
sudo apt-get install libx11-dev
sudo apt-get install xorg-dev
sudo apt-get install libreadline-dev

sudo apt-get install default-jdk

sudo apt-get install libcairo2-dev

If you find that you need another library when compiling something, a good rule of thumb is to type "sudo apt-get install [dependency-name]-dev", which generally will provide the needed dependencies for compilation.

Step 3: Download R's source and edit the configuration files

R's source code is available here.  After downloading and extracting it, open the file config.site and edit the following lines:

CC='icc -std=c99'
CFLAGS='-O3 -ipo -xavx -openmp'
CXX='icpc'
CXXFLAGS='-O3 -ipo -xavx -openmp'

This tells R to use Intel's compilers with the appropriate flags (-O3 turns on several optimizations, -ipo enables inter-procedural optimization, -xavx enables Intel-specific AVX instructions, and -openmp enables OpenMP support).

Step 4: Create environment variables

Next, you need to tell R where MKL and Intel's compilers are located.  Your installation directory should contain several scripts that set the necessary environment variables, and these need to be sourced every time you want to use MKL or Intel's compilers.  For convenience, you can add the following lines to your ~/.bashrc file so that they are run automatically in each bash session.

source /opt/intel/composer_xe_2013_sp1.1.106/bin/compilervars.sh intel64
source /opt/intel/composer_xe_2013_sp1.1.106/mkl/bin/mklvars.sh intel64
source /opt/intel/composer_xe_2013_sp1.1.106/bin/iccvars.sh intel64
export AR="xiar"
export LD="xild"

Note that your paths may differ depending on the versions of MKL and Intel Composer that you install, but they should be along these lines.  There is an MKL link tool included in the MKL directory which can help you construct the link line if you run into problems.

Finally, switch to root and run the following commands:

source /opt/intel/composer_xe_2013_sp1.1.106/bin/compilervars.sh intel64
source /opt/intel/composer_xe_2013_sp1.1.106/mkl/bin/mklvars.sh intel64
source /opt/intel/composer_xe_2013_sp1.1.106/bin/iccvars.sh intel64
export AR="xiar"
export LD="xild"

export MAIN_LDFLAGS='-openmp'
OMP_LIB_PATH=/opt/intel/composer_xe_2013_sp1.1.106/compiler/lib
MKL_LIB_PATH=/opt/intel/composer_xe_2013_sp1.1.106/mkl/lib/intel64
MKL=" -L${MKL_LIB_PATH} -L${OMP_LIB_PATH} \
-Wl,--start-group \
-lmkl_gf_lp64 \
-lmkl_intel_thread \
-lmkl_core \
-lm \
-Wl,--end-group \
-liomp5 -lpthread"
./configure --with-lapack --with-cairo --with-blas="$MKL" --build="x86_64-linux-gnu" --host="x86_64-linux-gnu" --enable-R-shlib

Previously, I did not include the flags '--enable-R-shlib' and '--with-cairo', which enable shared libraries (required to use external applications such as RStudio) and better graphics, respectively.  To enable shared libraries, I referenced the following Stack Overflow post.  Note that some tutorials use the -lmkl_intel_lp64 flag instead of -lmkl_gf_lp64.  According to Ying H at Intel

libmkl_gf_lp64  is for LP64 interface library for the GNU Fortran compilers and libmkl_intel_lp64 is for intel compilers.

So if, like me, you don’t have ifort, be sure to use the GNU fortran flag.

I re-ran all of the source commands as root because I had only defined them in my personal ~/.bashrc; they will not be defined for root or sudo.  I could have edited the system-wide bashrc (/etc/bash.bashrc on Ubuntu-based systems) instead, but I did not know that at the time.

If it worked correctly, you should see something along the lines of

R is now configured for x86_64-unknown-linux-gnu

Source directory: .
Installation directory: /usr/local

C compiler: icc -std=c99 -O3 -xavx -ipo -openmp
Fortran 77 compiler: gfortran -g -O2

C++ compiler: icpc -O3 -ipo -xavx -openmp
C++ 11 compiler: icpc -std=c++11 -O3 -ipo -xavx -openmp
Fortran 90/95 compiler: gfortran -g -O2
Obj-C compiler:

Interfaces supported: X11
External libraries: readline, BLAS(generic), LAPACK(in blas)
Additional capabilities: PNG, NLS
Options enabled: R profiling

Recommended packages: yes

If you don't see this, check your config.log, particularly the lines where configure tests the BLAS.  On a successful run they look like the following:

conftest.c:210: warning: implicit declaration of function 'dgemm_'
configure:29096: $? = 0
configure:29103: result: yes
configure:29620: checking whether double complex BLAS can be used
configure:29691: result: yes

If instead these checks fail, something went wrong and R will silently ignore your MKL link and use its internal BLAS libraries.

I adapted most of the configuration step from Intel's tutorial, except that I found I had to add an -lm flag to $MKL; without it, the math symbols in MKL were not recognized.  Several others and I (you'll find them with a Google search) had problems with the double complex BLAS check, but for me they seemed to go away when I ran the configure step as root.  I have no explanation as to why.

Next, run make and make install while still root.  It should take about 10-15 minutes.  Once it finishes, run the command:


R CMD ldd ./bin/exec/R

And you should see:

linux-vdso.so.1 => (0x00007fff723d4000)
libmkl_gf_lp64.so => /opt/intel/composer_xe_2013_sp1.1.106/mkl/lib/intel64/libmkl_gf_lp64.so (0x00007f65cfa50000)
libmkl_intel_thread.so => /opt/intel/composer_xe_2013_sp1.1.106/mkl/lib/intel64/libmkl_intel_thread.so (0x00007f65cea92000)
libmkl_core.so => /opt/intel/composer_xe_2013_sp1.1.106/mkl/lib/intel64/libmkl_core.so (0x00007f65cd3d4000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f65cd0ab000)
libiomp5.so => /opt/intel/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libiomp5.so (0x00007f65ccd90000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f65ccb72000)
libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3 (0x00007f65cc858000)
libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007f65cc61c000)
libreadline.so.6 => /lib/x86_64-linux-gnu/libreadline.so.6 (0x00007f65cc3d6000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f65cc1cd000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f65cbfc9000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f65cbdb3000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f65cb9ec000)
/lib64/ld-linux-x86-64.so.2 (0x00007f65d0196000)
libtinfo.so.5 => /lib/x86_64-linux-gnu/libtinfo.so.5 (0x00007f65cb7c3000)

Complications

I realized a few annoyances after building R from source.  I typically run R from Emacs via Emacs Speaks Statistics, and when I first tried to open R from Emacs, it would crash.  This is because the MKL and icc environment variables are only set in bash sessions, not when I launch a program from the desktop.  This means that if I want to use Emacs Speaks Statistics, I need to launch Emacs from the terminal.

Additionally, when I tried to install the "Rcpp" package, I encountered the error catastrophic error: cannot open source file "bits/c++config.h", but following the advice of this Stack Overflow post, I was able to resolve it.

Perhaps the most annoying issue was the poor quality of R’s default graphics, which produced plots like this:

[Rplot: example of the poor-quality default X11 graphics]

I posted a question on StackOverflow but didn't get much help, so I did some research.  It turns out that I needed to install libcairo2 and run ./configure with the option --with-cairo.  Then, edit the .Rprofile in your home directory to include:

setHook(packageEvent("grDevices", "onLoad"),
        function(...) grDevices::X11.options(type = "cairo"))
options(device = "x11")

After doing so, your graphics should appear as they normally would had you installed R from the repositories.

There may be more complications in the future when installing packages that require compiled code, but my initial performance gains appear quite substantial.  I routinely see R running at 400% CPU utilization during some of my computationally intensive procedures.  I will run some benchmarks in the future, but I've noticed performance gains roughly similar to those in Revolution R's blog, with even greater gains for my parallel code in Rcpp, which is automatically linked to MKL.
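As a rough sanity check of whichever BLAS ends up linked, timing a large matrix multiply before and after the switch gives a quick sense of the gains (the dimensions here are arbitrary):

# Time a large matrix multiply; with a multithreaded BLAS such as MKL this
# should typically be several times faster than with R's reference BLAS.
n <- 2000
A <- matrix(rnorm(n * n), n, n)
B <- matrix(rnorm(n * n), n, n)
system.time(A %*% B)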

Parallelization in Rcpp via OpenMP

Obtaining good computational performance in R can be a frustrating experience.  As with most interpreted languages, we are taught to eschew loops in favor of vectorized operations, which requires learning the somewhat byzantine suite of functions that comprise the "apply" family.  After rewriting your code to utilize these procedures, you may experience a 1.5-2x speedup, which is substantial but can still be lacking compared to R's compiled counterparts.

Vectorized code is amenable to parallelization through functions such as mclapply (in the parallel package), but if you are serious about performance, you will likely be told that "you shouldn't be programming in R at all" and will instead be directed to outsource your computationally intensive code to a compiled language through an interface such as Rcpp.  Of course, since C++ is compiled, "for" loops are not a problem, so you will have to convert your efficient vectorized code back to a naive nested loop structure.  Upon doing so, however, you will likely notice a 5-10x performance improvement, even over parallelized, vectorized R code.

Rcpp is particularly useful if you have to construct your own algorithms from scratch.  My research involves applying regularization methods to multivariate time series.  Since R’s existing packages for Lasso and its structured counterparts offer limited multivariate support and are not amenable to my cross validation procedure, I am usually forced to write my own algorithms.  The example that I will be working with applies a Lasso penalty to a vector autoregression via coordinate descent, similar to the procedure described in Friedman et al. (2010).

The initial performance gains from converting my algorithm to Rcpp can be seen in the following benchmark plot, which compares the performance of three variants of my procedure: one naively coded in R, one taking advantage of vectorized operations, and one coded using Rcpp.

[InitialBenchmarks: benchmark of the naive R, vectorized R, and Rcpp implementations]

Parallelizing from within Rcpp is not incredibly straightforward.  In my experience, however, the type of parallelization of most interest to statisticians typically consists of a for loop with independent iterates, a class of problems considered "embarrassingly parallel."  For this type of parallelization, the OpenMP framework is ideal, as it is included as a default library on most Linux distributions and requires little modification to your code.

The documentation regarding the use of OpenMP within R is sparse.  Writing R Extensions contains a brief section on OpenMP, mentioning that R has "some" support for it and detailing how to use it in a package.  Dirk Eddelbuettel's website contains a toy example using OpenMP with Rcpp via the "inline" package, but that approach is not usable if you need to reference more than one function; that requires the use of "sourceCpp."  There are a few other examples floating around, but they are all very rudimentary.

As a result, I will offer a brief tutorial on how to enable OpenMP support within Rcpp for a non-trivial program.  Specifically, I will parallelize the for loop in my Lasso-VAR implementation which cycles over different penalty parameters.  These iterations are independent, but larger penalties set more variables to zero and therefore require fewer iterations to solve, so some sort of load balancing needs to be taken into account.


 

Pre-requisites:

  • A modern Linux OS with OpenMP support 

In order to follow this tutorial, your operating system must support OpenMP.  Per an excerpt from “Writing R Extensions:”

[V]ersion 3.0 (May 2008) is supported by recent versions of the Linux, Windows and Solaris platforms, but portable packages cannot assume that end users have recent versions. The compilers used on OS X 10.6 (‘Snow Leopard’) had partial support for OpenMP 2.5, but this is not enabled in the CRAN build of R. (Apple have discontinued support for those compilers, and their alternative as from OS X 10.9, clang, has no OpenMP support. A project to add it has been announced at http://clang-omp.github.io/, but it is unknown when or even if the Apple builds will incorporate it.)

The performance of OpenMP varies substantially between platforms. Both the Windows and the Apple OS X (where available) implementations have substantial overheads and are only beneficial if quite substantial tasks are run in parallel.

Unfortunately, this makes OpenMP-based programs difficult to justify in a package development framework.  Nonetheless, I have tested my programs on three modern Linux distributions (Mint, Fedora, and Ubuntu), and I have realized substantial performance increases.  Throughout the rest of this tutorial, I will assume that you are using a modern Linux distribution with OpenMP 3.0 support.

  • The most up-to-date versions of, and some familiarity with, R, Rcpp, and RcppArmadillo
  • Functional code called through "sourceCpp" that contains a "for" loop you want to parallelize (optional)

 

Preliminaries:

In order for the compiler to recognize OpenMP, we need to include the OpenMP header.  Additionally, Rcpp has a plugin that will automatically add the OpenMP compiler flags.  Add the following to the top of your program:

#include <omp.h>
// [[Rcpp::plugins(openmp)]]

Next, recall that R is single threaded.  Any variables passed into your function as native R or Rcpp data structures (e.g., lists, NumericVector, NumericMatrix) will need to be converted to either native C++ objects or (preferably) Armadillo objects, which are better suited to multithreaded code.

Then, identify the loop you want to parallelize and add the following lines above it:

omp_set_num_threads(ncores);   // ncores is the number of threads to use
#pragma omp parallel for

The first line selects the number of threads to use; this should not exceed the number of cores in your system.  At this point, if your code is simple enough, it might run, so go ahead and give it a try.  If your program appears to be successful, try running it several hundred times through a function like "microbenchmark" to test performance and to make sure no latent errors surface.
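To make these pieces concrete, here is a minimal, self-contained sketch (not my Lasso-VAR code): an embarrassingly parallel column sum compiled from R with sourceCpp.  The function name parColSums and the thread count are arbitrary; in practice the C++ would live in its own .cpp file.

library(Rcpp)

sourceCpp(code = '
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(openmp)]]
#include <RcppArmadillo.h>
#include <omp.h>

// [[Rcpp::export]]
arma::vec parColSums(const arma::mat& X, int nthreads) {
  omp_set_num_threads(nthreads);
  arma::vec out(X.n_cols);
  // The iterations are independent: each thread sums its own set of columns.
  #pragma omp parallel for
  for (int j = 0; j < (int) X.n_cols; ++j) {
    out(j) = arma::accu(X.col(j));
  }
  return out;
}
')

X <- matrix(rnorm(1e6), ncol = 100)
all.equal(as.numeric(parColSums(X, 2)), colSums(X))   # should be TRUE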

Debugging

Debugging parallel programs can be very difficult, as many errors occur sporadically; only when certain unknown conditions are triggered.  I’ve found that two types of errors tend to occur most often.

  • Stack Imbalance/Overflow

As primarily an R programmer, I tend to take Rcpp and Armadillo's automatic memory management for granted.  OpenMP seems to require somewhat more active memory management.  This doesn't mean that you have to make explicit calls to "malloc" and "free," but you should avoid declaring large objects within your parallel loop.  For example, my original Lasso kernel function initialized the following variables:

int n=Y1.nrow(),k=Y1.ncol();
arma::mat Y(Y1.begin(),n,k,false);
int n1=Z1.nrow(),k1=Z1.ncol();
arma::mat Z(Z1.begin(),n1,k1,false);
int n2=B1.nrow(),k2=B1.ncol();
arma::mat B(B1.begin(),n2,k2,false);
arma::mat BOLD=B;
arma::mat R=zeros<arma::mat>(n,k);
arma::colvec ZN = as<arma::colvec>(Znorm);
arma::colvec Z2;
arma::colvec G1;
NumericVector G;
arma::uvec m;

First, I have since realized that I can pass Armadillo objects directly from R, so it is pointless to convert Rcpp objects to their Armadillo counterparts.  If I absolutely need to do so, I should do it outside of the parallel region so that OpenMP can better allocate memory to each thread.  Second, I am storing far too many variables compared to what I actually need.

My inner loop is

for(int i = 0; i < k2; ++i)
{
m=ind(k2,i);
R=Y-B.cols(m)*Z.rows(m);
Z2=trans(Z.row(i));
G1=R*Z2;
G=as<NumericVector>(wrap(G1));
B.col(i)=ST4(G,gam)/ZN[i];
}

For some reason, I convert an arma colvec into a NumericVector because my original elementwise soft-thresholding function ST4 requires a NumericVector input.  Though useful for debugging purposes, there is no need to store all of these intermediate objects.  Instead, I could reorganize the loop as follows:

for(int j = 0; j < k2; ++j)
{
m=ind(k2,j);
B.col(j)=ST4((Y-B.cols(m)*Z.rows(m))*trans(Z.row(j)),gam)/ZN[j];
}

This accomplishes the same task while substantially reducing storage costs.

If you absolutely have to define large data structures within a parallel region and you are getting stack imbalance or overflow errors, you may want to try adjusting the OpenMP stack size via the OMP_STACKSIZE environment variable:

setenv("OMP_STACKSIZE","XM",1);

Replace X with the amount you want to allocate; the default is 4 megabytes.
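If you would rather not hard-code that call, the same variable can, in principle, be set from R before the compiled code ever spawns a parallel region; whether the OpenMP runtime picks it up at that point is implementation-dependent, so treat this as an untested sketch.

# Sketch: request a 16 MB per-thread stack before any parallel region runs.
Sys.setenv(OMP_STACKSIZE = "16M")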

  • Segmentation Fault/Memory Not Mapped

These types of errors tend to occur for me when I am performing some sort of indexing operation, like accessing non-contiguous elements of a matrix or vector.  There are a few things that I have found work to prevent them.  First, as mentioned in the previous section, for memory management purposes I initialize most variables outside of the parallel region.  This requires that I explicitly designate "private" and "shared" variables.  A shared variable exists in only one memory location, which all threads read from or write to.  An example would be an array in which you store the results of the independent iterates of the parallel loop.  Shared variables can be useful because they save memory, but they can lead to errors if more than one thread accesses the same location at the same time.

Most variables will end up being private, meaning that a separate copy is created for each thread.  This results in increased overhead but prevents memory conflicts.  On another note, constant data structures, such as data matrices, should be designated as such (simply add "const" in front of the variable declaration), as they can then be accessed by all threads without the "shared" designation.

My final parallel for loop looks like:

#pragma omp parallel for shared(bcube,bcube2,b2,gammgrid) private(i,gam,B1F2,B1,m,nu) default(none) schedule(auto)

The bcube objects are armadillo cubes (i.e. three-dimensional arrays) which store the initializations for the coefficient matrix as well as the final result, which naturally fit under the shared variable designation.

The "schedule" option controls load balancing via a scheduling strategy.  Per the textbook Parallel Programming for Multicore and Cluster Systems, OpenMP has the following scheduling options:

  • static, blocksize: assigns blocks of blocksize iterations to threads in a round-robin fashion.
  • dynamic, blocksize: distributes blocks of blocksize iterations to threads as soon as a thread finishes its previously assigned block.
  • guided, blocksize: dynamic scheduling of blocks with decreasing size.
  • auto: delegates scheduling to the compiler/runtime system.

The default option, static, does not work particularly well when the workload differs substantially across iterates, as it does in my example.  Dynamic scheduling might be preferred in this case, and one can play around with the block size to eke out further gains.  Typically, I find a slight performance increase by setting the auto scheduling strategy, leaving scheduling in the hands of the compiler and runtime system.

If you still experience segfault-related errors after trying the above, the next step is to start placing your code in critical regions.  A critical region is a region of code that can only be executed by one thread at a time.  Obviously, you don't want to put any computationally demanding operations in a critical region, as it must run in serial, but sometimes it cannot be avoided.  My code contains a small critical region in which I subset non-contiguous columns of the coefficient matrix.  I believe that the issue likely stems from my use of an IntegerVector (an Rcpp data structure) to construct the index.  However, since it is a very simple operation, it does not really degrade performance, though I will look into an alternative for future versions.  The syntax for a critical region is:

#pragma omp critical
{
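// code placed here is executed by only one thread at a time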

}

Note that the brackets must be on the next line: a #pragma directive extends only to the end of its own line, so the compiler will not recognize a bracket placed on the same line as the #pragma.  If the conflicting operation is atomic (simple addition or subtraction, for example), the "#pragma omp atomic" directive can be used instead, with less of a performance penalty.

Finally, after sorting out these issues, the program runs with a considerable performance improvement as seen in the following figure.

[OpenMPSpeedup: performance improvement of the OpenMP-parallelized Rcpp code]

I am under the impression that the long tails are the result of cache misses from accessing non-contiguous elements of my data matrix, which I will address in a subsequent post on profiling.  The final version of my code will be posted on GitHub in the coming days.