
Fixing Keras Hangups in Jupyter Notebooks

Keras is a wonderful Python library for high-level implementation of deep learning networks. It provides a neat, customizable interface for designing intricate sequential and recurrent neural networks, along with fine-grained control over the training algorithm. For its backend, it transparently allows the use of either Theano or TensorFlow, which seamlessly abstracts the CPU and GPU implementations of the complicated algorithms and data flows. Modern compiler design and expression templates have really come a long way!

I love programming in Jupyter notebooks because they leave a reproducible record of my work and provide a very convenient interface for running code on a headless work machine or in the AWS/Google cloud. Jupyter notebooks are also used heavily in machine learning courses taught online and in classrooms, because they help the instructor abstract away the ugly details of environment setup into VMs or cloud images.

One frequently encountered problem with training Keras models in Jupyter notebooks is getting a ‘WebSocket ping timeout’ error. My understanding of the issue is that the training progress bar updates overwhelm the Jupyter client-server connection and communication freezes. This is an often-referenced issue, and some of the solutions involve redirecting stdout to a file to relieve the stress of the progress bar updates, but those prevent you from watching the training progress and important messages right in the notebook. One elegant solution I like is disabling the default text progress bar in Keras and using the keras_tqdm progress bar instead. tqdm is a neat, modern-looking progress bar with Jupyter notebook support, so it doesn’t time out the connection with constant updates. The author has put together a really convenient Keras callback class that draws and updates the progress bars in the notebook. I have successfully used it to fix my timeout issues when training Keras models.
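To make the fix concrete, here is a minimal sketch of what this looks like in a notebook cell. The toy data and model are placeholders just to have something to train; the essential parts are passing verbose=0 to fit() to silence the default progress bar and supplying keras_tqdm’s TQDMNotebookCallback as a callback (argument names assume the Keras 2 API):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras_tqdm import TQDMNotebookCallback

# Toy data and model, just to have something to train
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=(1000, 1))

model = Sequential([Dense(32, activation='relu', input_dim=20),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')

model.fit(X, y,
          epochs=5,
          batch_size=32,
          verbose=0,                            # disable the default text progress bar
          callbacks=[TQDMNotebookCallback()])   # draw tqdm progress bars in the notebook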

One little improvement that I contributed is a bug fix to support training with dynamic batch sizes. Keras provides two ‘fit’ functions for training: a fit() function for fixed-size batches where all data is loaded into a large numpy array, and a fit_generator() function where batches are generated on the fly by a generator. Batch generators have several advantages, the most important one being that datasets too large to fit into memory can be processed by reading them from disk in chunks. Another significant advantage is being able to ‘generate’ data – for example, by applying transforms and crops to images, by adding noise, or by raw synthesis. The generators do not need to abide by a fixed batch size; they can yield batches with a different number of items every time (although many generators do have a fixed yield size). This breaks the progress counting mechanism in the keras_tqdm code. I have detailed my process in the bug report and the pull request. Until the pull request is merged, you can use my fork. Follow these install instructions (a minimal sketch of such a variable-batch generator follows them):

git clone https://github.com/rohitrawat/keras-tqdm.git
cd keras-tqdm
python setup.py install
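
To see why a fixed-batch-size assumption breaks the progress count, here is a minimal sketch of a generator that yields variable-size batches and is trained with fit_generator(). The data and model are toy placeholders, and the argument names assume the Keras 2 API:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras_tqdm import TQDMNotebookCallback

X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=(1000, 1))

model = Sequential([Dense(32, activation='relu', input_dim=20),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')

def variable_batch_generator(X, y, max_batch_size=64):
    # Yield batches whose size changes from one yield to the next
    n = X.shape[0]
    while True:                     # Keras generators are expected to loop forever
        i = 0
        while i < n:
            size = np.random.randint(1, max_batch_size + 1)
            yield X[i:i + size], y[i:i + size]
            i += size

model.fit_generator(variable_batch_generator(X, y),
                    steps_per_epoch=30,   # Keras 2 argument; Keras 1 used samples_per_epoch
                    epochs=5,
                    verbose=0,
                    callbacks=[TQDMNotebookCallback()])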

I hope this post helps people who are running into the WebSocket ping error, or who, in general, are unable to run keras_tqdm with fit_generator.

Automatically shutting down Google cloud platform instance

One of the benefits of using Google Cloud Platform or AWS for computational tasks is that you can get resources on demand and not pay for round-the-clock usage. But even with intermittent use, you may be surprised by the bill at the end of the month. It feels worse when you remember that you left an instance running for a couple of days after your simulation had stopped. The cores were idling, but you still have to pay for them.

One approach is to launch your simulation from a script that shuts down the machine after your program finishes execution. You can throw in a delay in between to allow your results to sync through Dropbox (which is pretty fast anyway <3 ). The default cloud images do not ask for sudo passwords, so you don’t need to run the script as root, nor does executing “sudo poweroff” from a script require any special tricks.
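If you would rather wrap this in a small Python launcher than a shell one-liner, a minimal sketch could look like the following. The simulation command and the sync delay are placeholders to adjust for your setup, and it assumes passwordless sudo as noted above:

import subprocess
import time

# Run the simulation and wait for it to finish (placeholder command; requires Python 3.5+)
subprocess.run(["./run_simulation.sh"])

# Give Dropbox a few minutes to sync the results
time.sleep(300)

# Default cloud images allow passwordless sudo, so this works from a script
subprocess.run(["sudo", "poweroff"])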

But this is not the most convenient solution. What if you could launch your program normally, and have another script watch the CPU utilization in the background? If you are sure your program keeps the CPU loaded above a certain level (which it should), you can set a threshold, and if the usage stays below that level for a sustained period, shut the machine down.

Normally, doing this would require computing a moving average of CPU usage over an interval of time – a lot of math if you are aiming for a simple shell script. Fortunately, the Unix ‘uptime‘ command does the averaging for you and reports the 1 minute, 5 minute, and 15 minute load averages. There can be no better indicator that your program has finished executing than the 15 minute load average being close to zero. To be sure, you can watch this number while your program executes and confirm that your program is properly and consistently utilizing the CPU.

Here is the output of the uptime command when some of the cores have been busy:

$ uptime
11:35:03 up 1 day, 1:20, 7 users, load average: 3.08, 3.87, 3.80

And this when it has been relatively idle:

$ uptime
11:36:23 up 11:55,  4 users,  load average: 0.43, 0.26, 0.28

The 15 minute load average is the last number in the output. The idle load varies with the kind of background processes running on the computer, but there is a clear margin between the idle and busy values.

Here is a short script that compares the 15 minute load average to a set threshold every minute, and shuts the machine down if the load stays below the threshold for 10 consecutive minutes.

#!/bin/bash

# Power off once the 15 minute load average has stayed below the
# threshold for 10 consecutive minutes.
threshold=0.4

count=0
while true
do

  # The 15 minute load average is the third number after "load average:"
  load=$(uptime | sed -e 's/.*load average: //g' | awk '{ print $3 }')
  res=$(echo $load'<'$threshold | bc -l)
  if (( $res ))
  then
    echo "Idling.."
    ((count+=1))
  else
    # Load went back up, so restart the idle count
    count=0
  fi
  echo "Idle minutes count = $count"

  if (( count >= 10 ))
  then
    echo Shutting down
    # wait a little bit more before actually pulling the plug
    sleep 300
    sudo poweroff
  fi

  sleep 60

done

Several things could be improved about this script, but for now, this is something you can just take and use without worrying about customizing it or changing your workflow. Just don’t set the threshold so low that the idle state becomes hard to detect, or so high that the machine appears idle even while your program is running. 0.4 works great for me when I know my program will push at least one core to its limits during execution (around 0.8 – 0.9 minimum load).

Other approaches that come to mind are looking for your program in the list of processes, or watching for a file produced by your program at completion. But measuring the CPU utilization is the most generic and reliable method there is.

Using Google Cloud Platform for parallelizing simulations

This post is not about parallel computing. It’s about splitting long jobs across several machines and using Google Cloud Platform and Dropbox to do it cheaply and effectively. We researchers often run simulations that go on for several hours. Many times it’s the same job run several times, but with different parameters for every run. Assuming these are independent jobs, it makes sense to run them in parallel on several machines to save time. The problem for most people is that they don’t have dozens of powerful machines at their disposal. Google Cloud Platform to the rescue! (Nothing against AWS, but I used up their free credit many years ago, before I needed the compute capabilities.)

Cost

Currently, Google Cloud Platform has a promotion that gives you $300 of credit to use within 60 days of signing up. This is a substantial amount to use up in two months – I roughly estimate that it could run five dual-core instances with 8 GB of RAM and 50 GB hard drives non-stop for two months and still leave about $35 of unused credit at the end. Say what? Now if you didn’t need the instances running 24×7, you could have many more worker machines running in parallel with 4/8 cores and dollops of RAM and SSD storage, and still be within the free credit the whole time. (Note 1: Google gives discounts on sustained use, just like AWS does, so running a machine for half the time does not exactly halve the bill. Note 2: The free trial puts a limit of 8 cores per zone, so the machines running in any one zone are limited to a total of 8 cores. Note 3: There are five zones in total, so you cannot exceed 40 cores running at a time – which is still 10 quad-core machines, if you will.) With a paid account, you could go even higher.

In my case, I needed to run a simulation with five different sets of parameters, and each run took about two hours to finish. I could either run them all on my machine and wait ten hours for the results, or spin up four cloud instances on the side and have everything ready in two. You see the advantage. If there were no free credit, would I still do it? Absolutely! Getting results quickly and moving on to the next thing is priceless!

You do need a credit card to sign up for the free trial, which is a bummer if you are a student. I hope they remove this requirement for students. I have also not compared prices to see whether Google offers cheaper compute resources than Amazon. There are probably cheaper options available for non-compute uses like hosting small websites.

Dropbox for syncing code and results

I have some Linux experience, so setting up multiple instances was not difficult for me. It might not be easy for others, so I’ll talk about how I did things and the tricks I learned. My goal here is to save you time. If your simulations only run on Windows, I’m not sure how helpful this post will be to you – sorry! Where does Dropbox fit in? You guessed it: to synchronize the code and results. Most people have done enough referrals to bump up their quota from the standard 2 GB that Dropbox offers. For others, there is Google Drive with 15 GB of free space (BUT NO OFFICIAL LINUX CLIENT! ARE YOU KIDDING ME!?), so if you can get one of the third-party Linux clients working satisfactorily, good for you, but I will stick with Dropbox for this post. I have a paid Dropbox account with 1 TB of storage, so it wasn’t an issue for me. I can’t thank Dropbox enough for the excellent work they have done.

Performance

cat /proc/cpuinfo tells me that the instance I’m looking at has 4 Xeon cores @ 2.30 GHz, which are supposed to be dedicated to my instance. Even though my simulations only take about two cores to 100% utilization, they run noticeably slower on the cloud instance than on my local workstation with an 8-core i7-4770 CPU @ 3.40 GHz. The workstation can even run a second simulation inside a VM with no major slowdown. But of course, none of this matters unless I have benchmarks to go with my claims. I will try to post some here later.

Getting started

If you don’t have a Google account (Gmail), you will need to create one. You can then sign up for the free trial or a paid account. Google Cloud Platform is an umbrella under which Google provides many services – hosting apps, running databases, networking, and machine learning, to name a few. We are looking at the Compute Engine component. You can open your Google Cloud Platform console by going to https://console.cloud.google.com/. You must create a new Project, which groups together all the compute instances, app engines, databases, etc. related to one piece of work. Our project will only have Compute Engine instances, but you must create a project to get started.

This is what your console looks like (with a project already open):

Dashboard

You will be automatically prompted to create a new project when you first sign in. Pick a name for your project:

Create project

Wait a few seconds till things are ready. You will see a spinning circle till then which will turn into a notification once it is done.

Wait

Once created, your new project’s dashboard will look like this:

Empty project dashboard

Creating your first instance

Click on the expand menu button and click on Compute Engine:

Create instance 1

You can then click on the Create instance button under VM instances.

Create instance 2

Here you pick a name for the instance (which is also its hostname), a zone (there are limits on how many cores you can run at the same time in a zone), the number of cores and amount of memory, and a boot disk. The default machine types can be customized to change the number of cores and amount of RAM to suit your needs.

Create instance - provide name, choose hardware specs

For the boot disk, you can choose from the available Linux distros, choose between SSD and HDD, and set a size for the disk. You can also choose a snapshot of a disk you saved earlier.

Select distribution for boot disk

I’ll go with Ubuntu 16.04 LTS and a 20 GB standard boot disk. The documentation says that larger disks have better performance, so if you have I/O-intensive tasks, you may want to go for a larger SSD-type disk.

As soon as you click the Create button, the machine performs its first boot (to auto-configure the hostname, networking, and other management features) and you start getting billed for it. You can always shut it down when you are not using it, but there is a 10-minute minimum charge for billing.

Running instance with public IP

Once the machine is running, you will see its public IP address. You can SSH into it directly from your web browser by clicking the SSH button, but many people, myself included, don’t like the feel of a JavaScript console and prefer the real thing. Also, when using the web console, you are automatically signed in to an account named after your Google username, which may not be what you want. To use an SSH client to connect via the public IP, you will need to set up your SSH keys.

Setting up SSH keys

You can set up per-instance SSH keys or project-wide SSH keys. I prefer the latter, as once set up, they are automatically installed on all instances in the project. User accounts corresponding to the installed keys are also created automatically. To set up your SSH key, click on the Metadata section and select the SSH Keys tab. Copy the contents of your public key and paste it into the box (if you don’t have one, you can generate a key pair with ssh-keygen, or search the web for instructions). Your username is picked up automatically from the key – usually the username you have set up on your local machine. Click the Save button and you are all set. Click on the VM instances link to go back.

Project-wide SSH keys

You can now access your machine by simply running:

ssh public_ip

Here is a connection session:

~$ ssh 104.137.152.171

The authenticity of host '104.137.152.171 (104.137.152.171)' can't be established.
ECDSA key fingerprint is SHA256:s5d4f54sd5f4s5
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '104.137.152.171' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-38-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

  Get cloud support with Ubuntu Advantage Cloud Guest:
    http://www.ubuntu.com/business/services/cloud

0 packages can be updated.
0 updates are security updates.



The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

rohit@machine1:~$

You can see the username and hostname on the machine are correctly set up.

Install your software

You should now install all the software you will be running on the machines. Remember that this is your template machine, which will be duplicated to create your other instances, so try to keep things clean and complete. If you are running free or open source software, it won’t have complicated licensing mechanisms, and the clones will work perfectly fine without needing to be reinstalled again and again. Even proprietary licensed software installations, like MATLAB, can be duplicated this way, but may need re-activation on each instance depending on the license type.

For my personal setup, I installed the C++ compiler, Octave, and R.

sudo apt-get install build-essential 
sudo apt-get install octave octave-image octave-signal 
sudo apt-get install r-base

You can also launch R and install any R packages you need.

Although I have only used tools that can be installed and run without an X server, you can install one and use it headless/remotely over VNC.

Install Dropbox

Follow “Dropbox Headless Install via command line (64-bit)” instructions at https://www.dropbox.com/install-linux. Once you have run the commands on that page, you will be asked to authenticate by visiting a URL. The Dropbox daemon will start running and syncing your files.

You can kill the sync process for now by pressing Ctrl-C. We will first download the Dropbox command line tool to monitor Dropbox and also set up the daemon to autostart. We will also set up exclude folders to prevent personal or unnecessary files from syncing to the nodes.

Fetch the Dropbox command line tool

$ mkdir ~/bin
$ cd ~/bin
/bin$ wget --content-disposition https://www.dropbox.com/download?dl=packages/dropbox.py
--2016-10-01 03:17:34--  https://www.dropbox.com/download?dl=packages/dropbox.py
Resolving www.dropbox.com (www.dropbox.com)... 162.125.4.1
Connecting to www.dropbox.com (www.dropbox.com)|162.125.4.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://linux.dropbox.com/packages/dropbox.py [following]
--2016-10-01 03:17:35--  https://linux.dropbox.com/packages/dropbox.py
Resolving linux.dropbox.com (linux.dropbox.com)... 52.84.63.11, 52.84.63.249, 52.84.63.76, ...
Connecting to linux.dropbox.com (linux.dropbox.com)|52.84.63.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 116583 (114K) [application/octet-stream]
Saving to: ‘dropbox.py’

dropbox.py          100%[===================>] 113.85K  --.-KB/s    in 0.02s   

2016-10-01 03:17:35 (5.24 MB/s) - ‘dropbox.py’ saved [116583/116583]

/bin$ chmod +x dropbox.py 
/bin$ cd
$ ~/bin/dropbox.py 
Dropbox command-line interface

commands:

Note: use dropbox help <command> to view usage for a specific command.

 status       get current status of the dropboxd
 throttle     set bandwidth limits for Dropbox
 help         provide help
 puburl       get public url of a file in your dropbox's public folder
 stop         stop dropboxd
 running      return whether dropbox is running
 start        start dropboxd
 filestatus   get current sync status of one or more files
 ls           list directory contents with current sync status
 autostart    automatically start dropbox at login
 exclude      ignores/excludes a directory from syncing
 lansync      enables or disables LAN sync
 sharelink    get a shared link for a file in your dropbox
 proxy        set proxy settings for Dropbox

$ ~/bin/dropbox.py status
Dropbox isn't running!

Dropbox autostart

Currently, the Dropbox daemon is not running, but we will make it start on boot in the background. Unlike other services, the Dropbox daemon need not (and should not) run as root, so it is best launched from your personal crontab.

$ crontab -e
no crontab for rohit - using an empty one

Select an editor.  To change later, run 'select-editor'.
  1. /bin/ed
  2. /bin/nano        <---- easiest
  3. /usr/bin/vim.basic
  4. /usr/bin/vim.tiny

Choose 1-4 [2]: 3

Add the following line to the end of the file:

@reboot $HOME/.dropbox-dist/dropboxd

Save and exit the editor. Now reboot the machine with:

$ sudo reboot

Wait about a minute to allow the machine to come back up, then ssh into it again. You can now check the Dropbox status:

~$ ~/bin/dropbox.py status
Starting...

And once the sync completes, you should see

~$ ~/bin/dropbox.py status
Up to date.

Exclude personal/unnecessary files

You can exclude folders that contain personal files, or ones which are too big to fit on your instances, by running:

~$ cd Dropbox
~/Dropbox$ ~/bin/dropbox.py exclude add "excluded folder 1" "excluded folder 2" "excluded file 1"

You can add as many folders as you want, but be aware that it is a slow process. Also, you will need to re-do the authentication and exclude steps on each instance.

Snapshotting and duplicating your machine

Now we get to the real time-saving part – creating other instances by duplicating this one. On the VM instances page, you may notice a “Clone” button, but it doesn’t clone the way we expect it to. The clone is only similar in configuration – CPU cores, memory, disk size, and distro – but your data is not duplicated! To do an effective clone, we will have to:

  1. “Unconfigure” the machine and turn it off.
  2. Take a snapshot of the disk.
  3. Create new instances using the snapshot.
  4. Re-authenticate Dropbox if needed.

Prepping for the snapshot

Snapshots that have to be deployed to multiple machines usually have things like hostnames, SSH keys, network config, etc. erased so that they can be given unique values on each machine. The beauty of Google Cloud Platform is that it handles all of these transparently. The only “unconfiguration” you need to do is of Dropbox. If you blindly clone a Dropbox installation, Dropbox will think that all the clones are the same machine and you will get erratic syncing (the behavior at the time of this writing). Thus, each machine should be independently authenticated – which is fast and painless. The data already present in the Dropbox folder is not re-downloaded; Dropbox is smart enough not to do that.

First stop the Dropbox daemon and then delete the “~/.dropbox” hidden folder:

$ ~/bin/dropbox.py stop
Dropbox daemon stopped.
~$ rm -rf ~/.dropbox

Shut down the instance with sudo poweroff and we are ready to take the snapshot.

Taking the snapshot

Click on Snapshots in the sidebar, then click on Create Snapshot.

Create snaphot

Give the snapshot a name and a description, and for the source disk, select the disk of the machine you just created. Click Create.

Create snaphot - details and source disk

Note that keeping snapshots is not free. You are charged per GB of snapshot storage, but the prices are insanely low.

Create a new instance using the snapshot

Go back to the VM instances page. Click the Create Instance button.

New instance from snapshot

New instance from snapshot - configure

For the boot disk, click on Change, and this time visit the Snapshots tab. Select the snapshot you made and finish creating the instance.

Use snapshot as boot disk

Wait for the instance to boot up, and you should now have a clone with your data, installed programs, and correctly configured hostname and SSH keys. You can now simply SSH to the new instance.

Re-authenticate Dropbox

Once logged in, we want to register this new machine with Dropbox and enable sync. First we kill the Dropbox daemon running in the background, then launch it manually to get the auth URL.

~$ pkill dropbox
~$ ~/.dropbox-dist/dropboxd
This computer isn't linked to any Dropbox account...
Please visit https://www.dropbox.com/cli_link_nonce?nonce=lkj8u5h5jee8t to link this device.

Once linking is complete, you can kill the daemon with Ctrl-C and reboot. You can then go ahead and exclude whichever folders you wish from this instance.

You can repeat this process to create as many clones as you may need. You can then start/stop those instances, wait for the Dropboxes to sync up, and start your simulations. Your simulation results that are written to Dropbox are also accessible from anywhere, even when the compute nodes are powered down.

Misc. Problems

SSH Warnings

Since Google Cloud Platform recycles its public IP addresses, it is very likely that a new instance will get the same IP address that one of your earlier instances had. SSH will then warn you that the remote host identification has changed. Usually this is a sign of a server being hijacked, but not in this case. The resolution is given in the error message itself.

~$ ssh 104.197.12.171

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:hjdshf8745j4h53k4jh5I.
Please contact your system administrator.
Add correct host key in /home/rohit/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /home/rohit/.ssh/known_hosts:70
  remove with:
  ssh-keygen -f "/home/rohit/.ssh/known_hosts" -R 104.197.12.171
ECDSA host key for 104.197.152.171 has changed and you have requested strict checking.
Host key verification failed.
~$ ssh-keygen -f "/home/rohit/.ssh/known_hosts" -R 104.197.12.171
# Host 104.197.12.171 found: line 70
/home/rohit/.ssh/known_hosts updated.
Original contents retained as /home/rohit/.ssh/known_hosts.old
~$ ssh 104.197.12.171
The authenticity of host '104.197.12.171 (104.197.12.171)' can't be established.
ECDSA key fingerprint is SHA256:hj3h45h35jh4hkI.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '104.197.152.171' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-38-generic x86_64)

Broken SSH connections

You will often find your SSH connections getting broken or timing out, especially when you are on WiFi or if the machine running the SSH client goes to sleep. To keep sessions active and to keep your simulations running through disconnections, use a program like screen. I have also heard good things about tmux, but have not used it yet.

The first thing to do after launching an instance and logging on to it is to run screen.

~$ screen

Screen version 4.03.01 (GNU) 28-Jun-15
...
                  [Press Space for next page; Return to end.]

You will be taken to a regular shell where you can do your tasks. Now, if for some reason your SSH connection is broken, create a new SSH session to the instance and resume the screen session.

~$ screen -r

Using screen takes away the console’s normal scrolling, but you can press Ctrl-A, Esc in screen to enter a scrollable copy mode.

Checking your quotas

During the free trial, you can only run 8 cores in a particular zone (of which there are five: east, west, central, Europe, and Asia). You can check your usage by clicking the Quotas link in the sidebar.

Quotas page

Security

If you or the school/employer you work for has security policies for the storage, transmission, or use of data, code, and anything else, please be sure to follow them. It is not okay to store data on unencrypted disks or in cloud services if your workplace has rules against doing so.

Remember to apply all important security updates, especially ones related to SSL and SSH. Do not open any ports in the cloud platform firewall unless you know what you are doing. Disable password authentication for SSH and install an intrusion prevention tool like fail2ban.

Conclusion

I hope this guide helps you get a head start with using Google Cloud Platform to advance your research. With Dropbox, your most up-to-date code will be available across all your machines, and so will your results. By cleverly making and using snapshots, you can scale up to dozens of worker machines without breaking a sweat. Here you can see multiple instances running, though not doing anything CPU-intensive.

Cloud console with multiple instances

Remember to power off instances when you won’t be using them for long periods. Happy Cloud Computing!

If you liked this post, please “like” it on LinkedIn: www.linkedin.com/hp/update/6187867846433964032

Thank you!