This post is not about parallel computing. It’s about splitting long jobs across several machines and using Google Cloud Platform and Dropbox to do it cheaply and effectively. We researchers often run simulations that go on for several hours. Many times it’s about running the same job several times but with different parameters for every run. Assuming these are independent jobs, it makes sense to run them in parallel on several machines to save time. The problem here for most people is that they don’t have dozens of powerful machines at their disposal. Google Cloud Platform to the rescue! (Nothing against AWS, but I used up their free credit many years ago before I needed the compute capabilities.)
Currently, Google Cloud Platform has a promotion that gives you $300 credit to use up within 60 days of signing up. This is a substantial amount to use up in two months – I roughly estimate that it could run five dual-core instances with 8G of RAM and 50G hard drives non-stop for two months and still have about $35 of unused credit at the end. Say what? Now if you didn’t need the instances running 24×7, you could possibly have many more worker machines running in parallel with 4/8 cores and dollops of RAM and SSD storage and still be using the free credit all this time. (Note 1: Google gives discounts on sustained use, just like AWS does, so running a machine for half the time does not exactly halve the bill. Note 2: The free trial puts a limit of 8 cores per zone, so each machine is limited to a maximum of 8 cores. Note 3: There are a total of 5 zones available, so you cannot have more than 40 cores running at a time, which is still 10 quad-core machines if you will.) With a paid account, you could go even higher.
In my case I needed to run a simulation with five different sets of parameters, and each run took about two hours to finish. I could either run them all on my machine and wait ten hours to get the results, or I could spin up four cloud instances on the side and have everything ready in two. You get the advantage. If there was no free credit, would I still do it? Absolutely! Getting results quickly and moving on to the next thing is priceless!
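The arithmetic behind that trade-off is simple enough to sketch in a few lines of shell. This is only an illustration: the function name wall_hours is made up here, and it assumes every job takes the same time and each machine runs one job at a time.

```shell
# Rough wall-clock estimate for independent, equal-length jobs.
# Usage: wall_hours <number of jobs> <number of machines> <hours per job>
wall_hours() {
  njobs=$1
  machines=$2
  hours=$3
  # Each machine runs one job at a time, so we need ceil(njobs / machines) rounds.
  rounds=$(( (njobs + machines - 1) / machines ))
  echo $(( rounds * hours ))
}

wall_hours 5 1 2   # one workstation only: prints 10
wall_hours 5 5 2   # workstation plus four cloud instances: prints 2
```

With three machines the same five jobs would need two rounds, so four hours. The gain comes from having enough machines to finish in a single round.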
You do need a credit card to sign up for the free trial, which is a bummer if you are a student. I hope they remove this requirement for students. I have also not compared prices to see if Google is offering cheaper compute resources than Amazon. There are probably other cheaper options available for non-compute uses like hosting small websites.
Dropbox for syncing code and results
I have some Linux experience, so setting up multiple instances was not difficult for me. It might not be easy for others so I’ll talk about how I did things and the tricks I learned. My goal here is to save you time. If your simulations only run on Windows, I’m not sure how helpful this post will be to you – sorry! Where does Dropbox fit in? You guessed it. To synchronize the code and results. Most people have done enough referrals to bump up their quota from the standard 2G that Dropbox offers. For others, there is Google Drive with 15G of free space (BUT NO OFFICIAL LINUX CLIENT! ARE YOU KIDDING ME!?), so if you can get one of the third party Linux clients working satisfactorily, good for you, but I will only stick with Dropbox for this post. I have a paid Dropbox account with 1TB of storage so it wasn’t an issue for me. I can’t thank Dropbox enough for the excellent work they have done.
cat /proc/cpuinfo tells me that the instance I’m looking at has four Xeon CPU cores @ 2.30GHz. These are supposed to be dedicated to my instance. Even though my simulations only drive about two cores to 100% utilization, they run noticeably slower on the cloud instance than on my local workstation with a quad-core (8-thread) i7-4770 CPU @ 3.40GHz. The workstation can even run a second simulation inside a VM with no major slowdowns. But certainly none of this matters unless I have benchmarks to go with my claims. I will try to post some here later.
If you don’t have a Google account (Gmail), you will need to create one. You can then sign up for the free trial or a paid account. Google Cloud Platform is an umbrella under which they provide many services – for hosting apps, running databases, networking, and machine learning, to name a few. We are looking at the Compute Engine component. You can open your Google Cloud Platform console by going to https://console.cloud.google.com/. You must create a new Project which will organize all your compute instances, app engines, databases, etc. related to a project together. Our project will only have compute engine instances, but you must create one to get started.
This is what your console looks like (with a project already open):
You will be automatically prompted to create a new project when you first sign in. Pick a name for your project:
Wait a few seconds till things are ready. You will see a spinning circle till then which will turn into a notification once it is done.
Once created, your new project’s dashboard will look like this:
Creating your first instance
Click on the expand menu button and click on Compute Engine:
You can then click on the Create instance button under VM instances.
Here you pick a name for the instance (which is also its hostname), a zone (there are limits on how many cores you can be running at the same time in a zone), number of cores and memory, and a boot disk. The default machine types can be customized to change the number of cores and amount of RAM to suit your needs.
For the boot disk, you can choose from the available Linux distros, choose between SSD and HDD, and set a size for the disk. You can also choose a snapshot of a disk you saved earlier.
I’ll go with Ubuntu 16.04 LTS and a 20GB standard boot disk. The documentation says that larger disks will have better performance, so if you have I/O intensive tasks, you may want to go for a larger SSD type disk.
As soon as you click on the create button, the machine performs its first boot (to auto-configure the hostname, networking, and other management features) and you start getting billed for it. You can always shut it down when not using it but there is a minimum 10 minute limit for billing.
Setting up SSH keys
You can set up per-instance SSH keys or project-wide SSH keys. I prefer the latter, as once set up, they are automatically installed on all instances in the project. All user accounts corresponding to the installed keys are also created automatically. To set up your SSH key, click on the Metadata section and select the SSH tab. Copy the contents of your public key and paste it into the box (if you don’t have one, you can search the web for how to create an SSH key pair). Your username will be automatically picked up from the key – which is usually the username you have set up on your local machine. Click the save button and you are all set. Click on the VM Instances button to go back.
You can now access your machine by simply running ssh followed by the instance’s external IP address, which is listed on the VM instances page.
Here is a connection session:
~$ ssh 22.214.171.124
The authenticity of host '126.96.36.199 (188.8.131.52)' can't be established.
ECDSA key fingerprint is SHA256:s5d4f54sd5f4s5
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '184.108.40.206' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-38-generic x86_64)
 * Documentation: https://help.ubuntu.com
 * Management: https://landscape.canonical.com
 * Support: https://ubuntu.com/advantage
Get cloud support with Ubuntu Advantage Cloud Guest:
http://www.ubuntu.com/business/services/cloud
0 packages can be updated.
0 updates are security updates.
The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
rohit@machine1:~$
You can see the username and hostname on the machine are correctly set up.
Install your software
You should now install all the software you will be running on the machines. Remember that this is your template machine which will be duplicated to create your other instances, so try to keep things clean and complete. If you are running free or open source software, it won’t have complicated licensing mechanisms and the clones will work perfectly fine without the need for re-installing again and again. Even proprietary licensed software installations, like MATLAB, can be duplicated this way but may need re-activation on each instance depending on the license type.
For my personal setup, I installed the C++ compiler, Octave, and R.
sudo apt-get install build-essential
sudo apt-get install octave octave-image octave-signal
sudo apt-get install r-base
You can also launch R and install any R packages you need.
Although I have only used tools that can be installed and run without an X server, you can install one and use it headless/remote using VNC.
Follow the “Dropbox Headless Install via command line (64-bit)” instructions at https://www.dropbox.com/install-linux. Once you have run the commands on that page, you will be asked to authenticate by visiting a URL. After you do, the Dropbox daemon will start running and syncing your files.
You can kill the sync process for now by pressing Ctrl-C. We will first download the Dropbox command line tool to monitor Dropbox and also set up the daemon to autostart. We will also set up exclude folders to prevent personal or unnecessary files from syncing to the nodes.
Fetch the Dropbox command line tool
$ mkdir ~/bin
$ cd ~/bin
~/bin$ wget --content-disposition https://www.dropbox.com/download?dl=packages/dropbox.py
--2016-10-01 03:17:34-- https://www.dropbox.com/download?dl=packages/dropbox.py
Resolving www.dropbox.com (www.dropbox.com)... 220.127.116.11
Connecting to www.dropbox.com (www.dropbox.com)|18.104.22.168|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://linux.dropbox.com/packages/dropbox.py [following]
--2016-10-01 03:17:35-- https://linux.dropbox.com/packages/dropbox.py
Resolving linux.dropbox.com (linux.dropbox.com)... 22.214.171.124, 126.96.36.199, 188.8.131.52, ...
Connecting to linux.dropbox.com (linux.dropbox.com)|184.108.40.206|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 116583 (114K) [application/octet-stream]
Saving to: ‘dropbox.py’
dropbox.py 100%[===================>] 113.85K --.-KB/s in 0.02s
2016-10-01 03:17:35 (5.24 MB/s) - ‘dropbox.py’ saved [116583/116583]
~/bin$ chmod +x dropbox.py
~/bin$ cd
$ ~/bin/dropbox.py
Dropbox command-line interface
commands:
Note: use dropbox help <command> to view usage for a specific command.
 status       get current status of the dropboxd
 throttle     set bandwidth limits for Dropbox
 help         provide help
 puburl       get public url of a file in your dropbox's public folder
 stop         stop dropboxd
 running      return whether dropbox is running
 start        start dropboxd
 filestatus   get current sync status of one or more files
 ls           list directory contents with current sync status
 autostart    automatically start dropbox at login
 exclude      ignores/excludes a directory from syncing
 lansync      enables or disables LAN sync
 sharelink    get a shared link for a file in your dropbox
 proxy        set proxy settings for Dropbox
$ ~/bin/dropbox.py status
Dropbox isn't running!
Currently, the Dropbox daemon is not running, but we will make it start in the background on boot. Unlike other services, the Dropbox daemon neither needs to nor should be run as root. Thus it is best set up to launch from your personal crontab.
$ crontab -e
no crontab for rohit - using an empty one
Select an editor. To change later, run 'select-editor'.
1. /bin/ed
2. /bin/nano <---- easiest
3. /usr/bin/vim.basic
4. /usr/bin/vim.tiny
Choose 1-4 : 3
Add the following line to the end of the file:
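The exact line is not shown in this copy of the post. A plausible entry is sketched below, under two assumptions: the daemon lives at ~/.dropbox-dist/dropboxd (the path used later when re-linking a clone), and your cron understands @reboot, as the stock Ubuntu cron does. The helper name new_crontab is made up for this example. It only prints what the crontab would look like; nothing is installed unless you uncomment the last line.

```shell
# Assumed crontab entry: start the Dropbox daemon as your own user at boot.
# Single quotes keep $HOME literal here; cron's shell expands it at run time.
cron_line='@reboot $HOME/.dropbox-dist/dropboxd'

# Print the current crontab (if any) followed by the new entry,
# dropping an identical existing line so it is never duplicated.
new_crontab() {
  { crontab -l 2>/dev/null | grep -vxF "$cron_line"; echo "$cron_line"; }
}

new_crontab                  # preview only
# new_crontab | crontab -    # uncomment to install without opening an editor
```

If you prefer the interactive route described here, typing the same @reboot line into the editor has the same effect.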
Save and exit the editor. Now reboot the machine with:
$ sudo reboot
Wait about a minute to allow the machine to come back up, then ssh into it again. You can now check the Dropbox status:
~$ ~/bin/dropbox.py status
Starting...
And once the sync completes, you should see
~$ ~/bin/dropbox.py status
Up to date.
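If you want a script to hold off until the sync has finished, one option is to poll the status command. This is a minimal sketch, not part of the original setup: the function name wait_for_sync and the 10 second default interval are arbitrary choices, and it assumes the status text starts with "Up to date" as in the output above.

```shell
# Block until the Dropbox status command reports "Up to date".
# $1 = status command (defaults to the CLI tool downloaded above)
# $2 = seconds to sleep between checks (default 10)
wait_for_sync() {
  status_cmd=${1:-"$HOME/bin/dropbox.py status"}
  interval=${2:-10}
  # $status_cmd is deliberately unquoted so "dropbox.py status" splits into words.
  until $status_cmd 2>/dev/null | grep -q '^Up to date'; do
    sleep "$interval"
  done
}

# Example (commented out so nothing blocks when this file is sourced):
# wait_for_sync && echo "Dropbox is in sync, safe to start the simulation"
```

Matching only the start of the line keeps the check working whether or not the message ends with a period.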
Exclude personal/unnecessary files
You can exclude folders that contain personal files or ones which are too big to fit on your instances with:
~$ cd Dropbox
~/Dropbox$ ~/bin/dropbox.py exclude add "excluded folder 1" "excluded folder 2" "excluded file 1"
You can add as many folders as you want, but be aware that it is a slow process. Also, you will need to re-do the authentication and exclude steps on each instance.
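Since the exclude step has to be repeated on every instance, it can help to keep the folder names in a text file and loop over it. The sketch below is one way to do that. The function name exclude_from_list and the idea of a list file are assumptions added for illustration; only the exclude add subcommand comes from the tool itself.

```shell
# Run "dropbox.py exclude add" once per line of a list file.
# $1 = text file with one folder (or file) name per line
# $2 = Dropbox CLI to call (defaults to the tool downloaded earlier)
exclude_from_list() {
  list_file=$1
  cli=${2:-"$HOME/bin/dropbox.py"}
  # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
  while IFS= read -r folder || [ -n "$folder" ]; do
    [ -n "$folder" ] || continue      # skip blank lines
    "$cli" exclude add "$folder"
  done < "$list_file"
}

# Example, run from inside ~/Dropbox so relative names resolve:
# cd ~/Dropbox && exclude_from_list ~/exclude-list.txt
```

Quoting "$folder" keeps names with spaces intact, which matters for entries like "excluded folder 1".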
Snapshotting and duplicating your machine
Now we get to the real time-saving part – creating other instances by duplicating this one. On the VM-instances page, you may notice a “Clone” button. But it doesn’t clone the way we expect it to. The clone is only similar in configuration – CPU cores, memory, disk size and distro – but your data is not duplicated! To do an effective clone, we will have to:
- “Unconfigure” the machine and turn it off.
- Take a snapshot of the disk.
- Create new instances using the snapshot.
- Re-authenticate Dropbox if needed.
Prepping for the snapshot
Snapshots that have to be deployed to multiple machines usually have things like hostnames, ssh keys, network config etc. erased so that they can be given unique values on each machine. The beauty of Google Cloud Platform is that it handles all these things transparently. The only “unconfiguration” you need to do is of Dropbox. If you blindly clone a Dropbox installation, Dropbox will think that all the clones are the same machine and you will get erratic syncing (the behavior at the time of this writing). Thus each machine should be independently authenticated – which is fast and painless. The data that is already present in the Dropbox folder is not re-downloaded as Dropbox is smart enough not to do that.
First stop the Dropbox daemon and then delete the “~/.dropbox” hidden folder:
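$ ~/bin/dropbox.py stop
Dropbox daemon stopped.
~$ rm -rf ~/.dropbox
Note that the path must not be quoted: inside quotes the shell does not expand ~, so the command would look for a folder literally named “~” and leave the real one untouched.
Shut down the instance with
sudo poweroff
and we are ready to take the snapshot.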
Taking the snapshot
Click on Snapshots in the sidebar, then click on Create Snapshot.
Give the snapshot a name and a description, and for the source disk, select the disk of the machine you just created. Click create.
Note that keeping snapshots is not free. You are charged per GB of snapshot storage, but the prices are insanely low.
Create a new instance using the snapshot
Go back to the VM instances page. Click the Create Instance button.
For the boot disk, click on Change, and this time visit the Snapshots tab. Select the snapshot you made and finish creating the instance.
Wait for the instance to boot up, and you should now have a clone with your data, installed programs, and correctly configured hostname and SSH keys. You can now simply SSH to the new instance.
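The same snapshot-and-clone steps can also be scripted with the gcloud command line tool instead of the web console. The sketch below is an assumption-heavy illustration and not what was done for this post: the zone, snapshot name, machine type, and instance names are all made up, and it assumes the template's boot disk has the same name as the instance (machine1). By default it only prints the commands, so you can check them against the gcloud documentation before running anything.

```shell
# Dry-run sketch: prints the gcloud commands instead of executing them.
# Set DRY_RUN=0 to run them (needs an installed, authenticated gcloud CLI).
DRY_RUN=${DRY_RUN:-1}
ZONE=us-central1-a        # hypothetical zone
TEMPLATE_DISK=machine1    # boot disk of the template instance (assumed name)
SNAPSHOT=sim-template     # hypothetical snapshot name

run() {
  if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# $1 = number of clones to create (named machine2, machine3, ...)
clone_from_snapshot() {
  run gcloud compute disks snapshot "$TEMPLATE_DISK" \
      --snapshot-names "$SNAPSHOT" --zone "$ZONE"
  i=2
  while [ "$i" -le $(( $1 + 1 )) ]; do
    name="machine$i"
    # Restore the snapshot into a new disk, then boot a new instance from it.
    run gcloud compute disks create "$name" \
        --source-snapshot "$SNAPSHOT" --zone "$ZONE"
    run gcloud compute instances create "$name" \
        --zone "$ZONE" --machine-type n1-standard-4 \
        --disk "name=$name,boot=yes,auto-delete=yes"
    i=$(( i + 1 ))
  done
}

clone_from_snapshot 2
```

Each clone still needs its own Dropbox authentication afterwards, exactly as with instances created from the console.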
Once we are logged in, we would want to register this new machine with Dropbox and enable sync. First we kill the Dropbox daemon running in the background. We then launch it manually to get the auth URL.
~$ pkill dropbox
~$ ~/.dropbox-dist/dropboxd
This computer isn't linked to any Dropbox account...
Please visit https://www.dropbox.com/cli_link_nonce?nonce=lkj8u5h5jee8t to link this device.
Once complete, you can kill the daemon with Ctrl-C and reboot. You can then go ahead and exclude the folders you wish to from this instance.
You can repeat this process to create as many clones as you may need. You can then start/stop those instances, wait for the Dropboxes to sync up, and start your simulations. Your simulation results that are written to Dropbox are also accessible from anywhere, even when the compute nodes are powered down.
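One convenient way to give each clone its own piece of work is to derive the parameter set from the hostname, so the same synced script can be launched unchanged on every machine. The sketch below assumes instances named machine1, machine2, and so on; the Dropbox folder layout and the function name param_index are hypothetical. It only prints what it would run.

```shell
# Map a hostname such as "machine3" to parameter set 3.
# Falls back to 1 when the name has no trailing number.
param_index() {
  idx=${1##*[!0-9]}     # drop everything up to the last non-digit character
  echo "${idx:-1}"
}

host=$(uname -n)
n=$(param_index "$host")

# Hypothetical layout inside the synced folder.
params="$HOME/Dropbox/sim/params/set$n.txt"
outdir="$HOME/Dropbox/sim/results/$host"

echo "on $host: would run parameter set $n from $params, writing to $outdir"
```

Writing each machine's results to a folder named after its host keeps the clones from overwriting each other's files once Dropbox syncs them back.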
Since Google Cloud Platform recycles its public IP addresses, it is very likely that your new instance will get the same IP address you got for a different instance earlier. When you connect, SSH will then warn you that the remote host identification has changed. Usually, this is a sign of the server being hijacked, but not in this case. The resolution is given in the error message itself.
~$ ssh 220.127.116.11
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:hjdshf8745j4h53k4jh5I.
Please contact your system administrator.
Add correct host key in /home/rohit/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /home/rohit/.ssh/known_hosts:70
remove with:
ssh-keygen -f "/home/rohit/.ssh/known_hosts" -R 18.104.22.168
ECDSA host key for 22.214.171.124 has changed and you have requested strict checking.
Host key verification failed.
~$ ssh-keygen -f "/home/rohit/.ssh/known_hosts" -R 126.96.36.199
# Host 188.8.131.52 found: line 70
/home/rohit/.ssh/known_hosts updated.
Original contents retained as /home/rohit/.ssh/known_hosts.old
~$ ssh 184.108.40.206
The authenticity of host '220.127.116.11 (18.104.22.168)' can't be established.
ECDSA key fingerprint is SHA256:hj3h45h35jh4hkI.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '22.214.171.124' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-38-generic x86_64)
Broken SSH connections
You will often find your SSH connections getting broken or timed out, especially when you are on WiFi, or if your machine running the SSH client goes to sleep. To keep the sessions active and to keep your simulations running even in the midst of disconnections, use a program like screen. I have also heard good things about tmux, but have not used it yet.
The first thing to do after launching an instance and logging on to it is to run screen.
~$ screen
Screen version 4.03.01 (GNU) 28-Jun-15
...
[Press Space for next page; Return to end.]
You will be taken to a regular shell where you can do your tasks. Now, if for some reason your SSH connection is broken, create a new SSH session to the instance and resume the screen session.
~$ screen -r
Using screen takes away the scrolling features from the console, but you can press Ctrl-A followed by Esc in screen to enter a scrollable mode.
Checking your quotas
During the free trial, you can only run 8 cores in a particular zone (of which there are five: east, west, central, Europe, and Asia). You can check your usage by clicking the Quotas link in the sidebar.
If you or the school/employer you work for has security policies for the storage, transmission, or use of data, code, and anything else, please be sure to follow them. It is not okay to store data on unencrypted disks or in cloud services if your workplace has rules against doing that.
Remember to apply all important security updates, especially ones related to SSL and SSH. Do not open any ports in the cloud platform firewall unless you know what you are doing. Disable password authentication for SSH and install an intrusion prevention tool like fail2ban.
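To check the password authentication setting without opening the file by hand, something like the following sketch can be used. The function name password_auth_disabled is made up for this example, and it assumes the usual Ubuntu location /etc/ssh/sshd_config. It only looks at the main file, so settings inside Match blocks or included files are not considered.

```shell
# Succeeds (exit status 0) only if sshd_config explicitly disables passwords.
# $1 = path to sshd_config (defaults to the usual Ubuntu location)
password_auth_disabled() {
  conf=${1:-/etc/ssh/sshd_config}
  # sshd uses the first value it finds for a keyword, so stop at the first
  # uncommented PasswordAuthentication line; keywords are case-insensitive.
  value=$(awk 'tolower($1) == "passwordauthentication" { print tolower($2); exit }' \
          "$conf" 2>/dev/null)
  [ "$value" = "no" ]
}

if password_auth_disabled; then
  echo "password logins are disabled"
else
  echo "password logins are not explicitly disabled; check sshd_config"
fi
```

A missing or commented-out setting is reported as not disabled, since OpenSSH's own default is to allow passwords.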
I hope this guide helps you get a head start with using Google Cloud Platform to advance your research. With Dropbox, your most up-to-date code will be available across all your machines, and your results will be too. By cleverly making and using snapshots, you can scale up to dozens of worker machines without breaking a sweat. Here you can see multiple instances running, but not really doing anything CPU intensive.
Remember to power off instances when you won’t be using them for long periods. Happy Cloud Computing!
If you liked this post, please “like” it on LinkedIn: www.linkedin.com/hp/update/6187867846433964032