Creating a virtual cluster in R using Amazon ec2

I was recently asked by my department to evaluate Amazon’s elastic compute cloud (ec2) to see if it could serve as a viable alternative to our aging cluster.  There is a great deal of documentation on setting up R with ec2, but it tends to be either far too simplistic or needlessly complicated.  I find myself in need of a middle ground.  I am comfortable using Linux from the command line, so I have no need for something like RStudio Server.  At the same time, my primary purpose in using ec2 is to run “embarrassingly parallel” simulations, which don’t require distributed memory or anything of that sort, so setting up StarCluster would be overkill.

This brief tutorial provided me with the information I needed, which I slightly modified to fit my specific needs.  I will briefly expand upon it and offer a few comments.

Pre-requisites:

I’m going to assume that you are working from Linux; I expect the instructions to be roughly the same on a Mac.  Other than this, you don’t really need anything aside from ssh, which should be included in a default install of every modern Linux distribution.

Step 1: Create an Account and set up your head node

Go here to create an account; you will need to verify your identity and enter your billing information.  After doing so, you should end up at the ec2 Dashboard, at which point you should click “Launch Instance.”  From what I can tell, an instance is basically a virtual machine with varying amounts of memory (dependent on cost) on which you have root access.  Unlike a traditional cluster, you are not required to go through administrators in order to install programs or edit configuration files, which to me is very advantageous.  Additionally, you don’t really need to worry about cluster traffic, as a new virtual environment is created each time you launch your instance.  There were a few times when I was unable to launch my instance because of traffic issues in my region, but I was able to get around this by creating a new instance in another region.  The pricing options are listed here.  I use the “US East” region, as it is the closest to me geographically.

In your first step, you will choose an Amazon Machine Image (AMI).  You have the option of using one that is already pre-configured with R and RStudio, or creating your own.  I chose to create a new AMI with Ubuntu server edition, as I am most familiar with the Ubuntu framework.

With regard to what instance type to choose, I would consider a general purpose one.  Unless you are doing something very CPU- or memory-intensive, there is no need to pay for a specialized type such as a “compute optimized” instance, as they are considerably more expensive.  Moreover, one can create a cluster from several micro-instances, which cost just over $.01 per hour to run.  For an initial instance, I chose the “m3.xlarge,” which will serve as a “head node” for my cluster.  I haven’t noticed performance degrading with cheaper instances.

Just set one instance for now and click through the next few prompts, possibly decreasing the EBS storage from 8 GB if you don’t think you will need that much space (you are charged about $.10 per gigabyte per month).  Also, edit the security settings to open port 10187, as per the cited instructions.  Name this security group something you will remember.  You are going to be asked to create a key pair, which, in lieu of a password, is how you will access your instance.  Choose to create a new key pair, download it and store it in a safe place.  Then, you will need to edit (or create) the file “~/.ssh/config” so that it contains an entry of the following form:

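# Host is the name you type after "ssh"; use the public DNS address of your instance (available from the console)
Host public-DNS-address-of-your-instance
    # The username is ubuntu on an Ubuntu machine; it will vary for other OSs
    User ubuntu
    # The public DNS address again; since Host is already the real address, this line is probably redundant
    HostName public-DNS-address-of-your-instance
    Port 22
    IdentityFile /path/to/key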

Now, you will be able to connect simply by entering in your terminal

ssh (public dns address)

Step 2: ssh into your head node and start installing programs

Once you are in, you have root access, so feel free to install R or any other programs that you may need.  I used the instructions and script provided here and then installed a few packages that are required for my simulations.  Also, make sure to follow his instructions (step 6) regarding ssh keys; otherwise your instances will not be able to communicate with each other.  After doing so, I right-clicked the instance in the console and selected “create image.”  This reboots the instance and takes a few minutes to complete.

Step 3: Install the ec2 command line tools on the head node 

Though the console is a nice interface, it can be fairly tedious to use if we are dealing with a large number of instances.  Fortunately, Amazon has a command line tool, which can be installed through the package manager following these instructions.  After installing it, you will need to follow the instructions here to configure access.  You can test whether it works by typing ec2-describe-regions, and you should see

REGION eu-west-1 ec2.eu-west-1.amazonaws.com
REGION sa-east-1 ec2.sa-east-1.amazonaws.com
REGION us-east-1 ec2.us-east-1.amazonaws.com
REGION ap-northeast-1 ec2.ap-northeast-1.amazonaws.com
REGION us-west-2 ec2.us-west-2.amazonaws.com
REGION us-west-1 ec2.us-west-1.amazonaws.com
REGION ap-southeast-1 ec2.ap-southeast-1.amazonaws.com
REGION ap-southeast-2 ec2.ap-southeast-2.amazonaws.com

The command line tool can be used and manipulated with bash and various other Linux utilities, but an R interface also exists, which I prefer because it limits the need for a complicated bash script.  However, it requires some XML packages, which can be installed following these instructions (note: don’t use sudo apt-get install r-cran-xml, as it installs an out-of-date version of the package).

Once everything is installed, you can load up R on your head node and run a command such as


n.instances <- 4  # number of worker instances; 4 is only an example value
cl <- startCluster(ami="amiID",key="KeyName",instance.count=n.instances,instance.type="t2.micro",security.groups="security group ID",verbose=TRUE)

I had to slightly modify the package to allow for the “security.groups” argument; the resulting script will be posted on my github.

Now, rather than following the instructions and manually copying and pasting every public dns address, we can extract them with the command


machines <- cl$instances$dnsName

Now, we can follow the rest of his instructions (note: only one core per machine, so no need for the localhost command):


library(doSNOW) # also attaches snow and foreach

setDefaultClusterOptions(port=10187)

clust <- makeCluster(machines, type = "SOCK")

registerDoSNOW(clust)
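
With the backend registered, an embarrassingly parallel simulation can be handed to the workers with foreach.  What follows is only a minimal sketch, not the simulation code behind this post: one.replicate and its arguments are hypothetical stand-ins for a real simulation function.

```r
library(foreach)  # attached automatically when doSNOW is loaded

# Hypothetical stand-in for one simulation replicate:
# the mean of n standard normal draws, seeded for reproducibility.
one.replicate <- function(seed, n = 1000) {
  set.seed(seed)
  mean(rnorm(n))
}

# Each replicate is independent of the others, so foreach can send them
# to the workers in any order; .combine = c collects the results into
# a single numeric vector.
results <- foreach(seed = 1:100, .combine = c) %dopar% one.replicate(seed)

summary(results)
```

If no parallel backend has been registered, %dopar% falls back to running sequentially with a warning, which makes it easy to try the code on a single machine before paying for a cluster.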

Then, when we are done performing our parallel operations, we can stop the cluster by


stopCluster(clust) # terminates the SNOW cluster

terminateCluster(cl) # terminates the ec2 instances
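
If the code is ever run non-interactively, an error partway through would skip the two lines above and leave instances running (and billing).  One possible safeguard is to register the cleanup with on.exit so that it runs even when the work fails.  This is a sketch only; run.with.cleanup and its arguments are hypothetical names, and the demonstration uses a fake cluster so that it can be tried without ec2.

```r
# Hypothetical helper: acquire a resource, run the work, always clean up.
# start.fn, work.fn and stop.fn are functions supplied by the caller.
run.with.cleanup <- function(start.fn, work.fn, stop.fn) {
  resource <- start.fn()
  on.exit(stop.fn(resource), add = TRUE)  # runs even if work.fn() errors
  work.fn(resource)
}

# Demonstration with a fake cluster whose work step fails.
cleanup.log <- character(0)
result <- tryCatch(
  run.with.cleanup(
    start.fn = function() "fake-cluster",
    work.fn  = function(cl) stop("simulation failed"),
    stop.fn  = function(cl) cleanup.log <<- c(cleanup.log, paste("stopped", cl))
  ),
  error = function(e) conditionMessage(e)
)

print(result)
print(cleanup.log)
```

With the real objects, start.fn would wrap the makeCluster call and stop.fn would be stopCluster.  Passing terminateCluster in the same way would cover the ec2 instances as well.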

This way, the only instance that you have to worry about turning off manually is your head node.  I would like to be able to run my code as a batch script, as opposed to interactively, but when I form the cluster, I get the warning

The authenticity of host [public DNS] can’t be established.
Are you sure you want to continue connecting (yes/no)?

I have to manually answer “yes” to each one.  However, this StackOverflow post tells me that I can disable this warning, so I may try that in the future.
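
One way to try this from within R, rather than by editing the ssh configuration, is to pass the ssh option through the command that snow uses to reach the workers.  This is an untested sketch: it assumes that snow’s rshcmd cluster option is pasted as-is into the launch command, so that extra ssh flags can ride along, and that machines is the vector of DNS names extracted earlier.  Keep in mind that StrictHostKeyChecking=no switches off ssh’s check against connecting to an impostor host, so it is a trade-off rather than a fix.

```r
# Build the remote shell command that snow would use to start each worker.
# StrictHostKeyChecking=no accepts unknown host keys without prompting.
ssh.cmd <- paste("ssh", "-o", "StrictHostKeyChecking=no")

# Only attempt the connection when a makeCluster function is available
# (for example from snow) and the worker addresses are known.
if (exists("machines") && exists("makeCluster")) {
  clust <- makeCluster(machines, type = "SOCK", rshcmd = ssh.cmd)
}
```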

One thing that might be disconcerting is the security risk that comes with leaving port 10187 open.  This seems to be standard practice in R, but as I start to use this more seriously, I will look for more secure options.

Aside: terminating versus stopping

Though both terminating and stopping your instance will prevent you from incurring charges for the instance, terminating also deletes the EBS storage, which runs at about $.10 per gb-month (though I’m still not sure what that means).  Therefore, it is always a good idea to terminate worker nodes on a cluster when we are finished with them, as we don’t really need to save anything on them since we can always recreate them using an AMI.  If you take a long time between simulations, it might also be preferable to create an AMI of the head node and terminate it instead of leaving it stopped.
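
For what it is worth, a gb-month is one gigabyte of provisioned storage kept for one month, prorated for shorter periods.  The arithmetic can be sketched as follows; ebs.cost is a hypothetical helper, the $.10 rate is the approximate figure quoted above, and the count of twenty workers is only an example.

```r
# Cost of keeping EBS storage around, assuming simple proration:
# one GB-month = one gigabyte provisioned for one month.
ebs.cost <- function(size.gb, months, rate.per.gb.month = 0.10) {
  size.gb * months * rate.per.gb.month
}

ebs.cost(size.gb = 8, months = 1)       # one stopped 8 GB instance for a month
ebs.cost(size.gb = 8 * 20, months = 1)  # twenty stopped 8 GB workers for a month
```

At that rate a single stopped 8 GB instance costs about $0.80 per month, while twenty stopped workers cost about $16 per month, which is why terminating the workers is the safer habit.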

Final Impressions

Overall, once I got everything set up, the ec2 system was fairly easy to work with, and since I have root access and complete flexibility, it has a lot of potential as a replacement for a traditional cluster for my high performance computing needs.