Ray in the Google cloud – part 1: cloud infrastructure

In a previous blog post, I showed you how to set up Ray locally, more specifically under Windows. But the real beauty of Ray lies in its ability to move swiftly from a local environment, where your development speed is high, to a cloud environment, where you have access to much more compute power. In this blog post, I will show you how to set up Ray in the Google cloud. You don’t need to have a local installation first. If you prefer, you can start in the cloud right away.

Overall approach: security first, powered by Google cloud mechanisms

When using Ray locally, we don’t have to worry about security too much. This is different in the cloud. The design of the network setup in particular must start from security considerations.

Ray needs a lot of open ports for communication between the cluster machines, and it internally uses a Redis instance whose ports need to be shielded from unauthorized access. This remains true even when Redis is protected by a password. Fortunately, the Google cloud offers some special mechanisms that help us build a secure networking environment for Ray. Our approach is the following:

We configure a virtual private cloud (VPC) network that is exclusively used by Ray.
We use only internal IP addresses, i.e., none of the virtual machines (VMs) we’re using will have an external IP address. Consequently, none of these machines will be reachable from the internet.
To be able to communicate with the cluster, we use Google’s Identity-Aware Proxy (IAP).
To view the Ray dashboard, we use the web preview that is integrated into the Google Cloud shell.

Setting up the network for our Ray cluster

In this step, we set up the VPC for our cluster. You can use either cloud console, cloud shell or Terraform to do that. If you'd like to use cloud shell, the repo contains a file with the commands needed. We need to set up a VPC network, add a subnetwork in the region in which we would like to deploy our cluster (I chose europe-west1 for this, but you may have a different preference) and add some firewall rules. Before using the script, you will have to change the project ID (placeholder “<your-project-id>”), and you may also want to choose another region. If you like, you can also change the names of the network and subnetwork. The latter is advisable if you change the region, as the region name is currently part of the subnetwork name.
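
For orientation, here is a minimal sketch of what the network part of such a cloud shell script could look like. It is not the file from the repo: the names ray-network and ray-subnet-europe-west1 and the range 10.0.0.0/24 are assumptions chosen for illustration, and only the region and the project placeholder come from the text above. The block only defines a function, so nothing is created until you call create_ray_network.

```shell
# Assumed values: replace the project ID; names and IP range are illustrative only.
PROJECT_ID="<your-project-id>"
REGION="europe-west1"
NETWORK="ray-network"            # hypothetical network name
SUBNET="ray-subnet-${REGION}"    # hypothetical name; the region is part of it
SUBNET_RANGE="10.0.0.0/24"       # hypothetical internal IP range

create_ray_network() {
  # Custom mode: no subnetworks are created automatically.
  gcloud compute networks create "$NETWORK" \
    --project="$PROJECT_ID" \
    --subnet-mode=custom || return 1

  # One subnetwork in the region where the cluster will run.
  gcloud compute networks subnets create "$SUBNET" \
    --project="$PROJECT_ID" \
    --network="$NETWORK" \
    --region="$REGION" \
    --range="$SUBNET_RANGE"
}
```

Because the region is baked into the subnetwork name through the REGION variable, changing the region in one place keeps the two consistent.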

The firewall rules open all ports for TCP and UDP communication within the cluster (i.e., within the VPC). Additionally, we add a firewall rule that allows SSH communication with Google's Identity-Aware Proxy (IAP). The IAP uses the IP range 35.235.240.0/20. With this firewall rule, you can connect to any machine within the VPC just by clicking "SSH" next to the VM's entry in the cloud console. It will also enable us to connect Google’s web preview to the Ray dashboard on the cluster through an IAP tunnel.
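
The two rules described above could be sketched as follows. Again, this is an illustration under assumptions rather than the script from the repo: the network name, the rule names and the internal range 10.0.0.0/24 are hypothetical, and the sketch assumes that "SSH" means TCP port 22. Only the IAP range 35.235.240.0/20 is taken from the text.

```shell
# Assumed values: replace the project ID; names and internal range are illustrative only.
PROJECT_ID="<your-project-id>"
NETWORK="ray-network"          # hypothetical network name
SUBNET_RANGE="10.0.0.0/24"     # hypothetical internal range of the Ray subnetwork
IAP_RANGE="35.235.240.0/20"    # IAP source range as stated in the text

create_ray_firewall_rules() {
  # Rule 1: all TCP and UDP ports, but only for traffic coming from inside the subnetwork.
  gcloud compute firewall-rules create "${NETWORK}-allow-internal" \
    --project="$PROJECT_ID" \
    --network="$NETWORK" \
    --direction=INGRESS \
    --allow=tcp,udp \
    --source-ranges="$SUBNET_RANGE" || return 1

  # Rule 2: SSH (assumed to be TCP port 22) only from the IAP address range.
  gcloud compute firewall-rules create "${NETWORK}-allow-iap-ssh" \
    --project="$PROJECT_ID" \
    --network="$NETWORK" \
    --direction=INGRESS \
    --allow=tcp:22 \
    --source-ranges="$IAP_RANGE"
}
```

Restricting the source ranges is what keeps the wide-open first rule safe: traffic from outside the subnetwork matches neither rule unless it arrives through IAP on the SSH port.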

Finally, add a Cloud NAT gateway to the network (in the same region where your subnetwork is located) so that the machines can download software from the internet even though they have no external IP addresses. The overall setup is shown in the figure on the left.
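
If you prefer the command line for this step too, the following sketch shows one way it might be done. Cloud NAT is configured on a Cloud Router, so the router is created first. The names ray-router and ray-nat and the network name are assumptions for illustration, and it is worth checking the flags against the current gcloud reference before relying on them.

```shell
# Assumed values: replace the project ID; all resource names are illustrative only.
PROJECT_ID="<your-project-id>"
REGION="europe-west1"
NETWORK="ray-network"    # hypothetical network name
ROUTER="ray-router"      # hypothetical Cloud Router name
NAT="ray-nat"            # hypothetical Cloud NAT name

create_ray_nat() {
  # The router must live in the same region as the subnetwork.
  gcloud compute routers create "$ROUTER" \
    --project="$PROJECT_ID" \
    --network="$NETWORK" \
    --region="$REGION" || return 1

  # NAT for all subnetwork ranges, with automatically allocated external addresses.
  gcloud compute routers nats create "$NAT" \
    --project="$PROJECT_ID" \
    --router="$ROUTER" \
    --region="$REGION" \
    --auto-allocate-nat-external-ips \
    --nat-all-subnet-ip-ranges
}
```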

Using a client VM

A client machine is a VM that does not belong to the cluster (although it’s in the same subnetwork) and is only used to launch the cluster, issue jobs, and then take down the cluster again. You could do this from your local machine, but there is some preparation involved that is slightly more convenient in the cloud: you need to install Ray (easy), but you also have to make sure you install it in an environment with the same Python version as the one used in the cluster. And "the same" really means the same: even the micro version needs to be identical. That's the number after the second dot, so if your cluster runs, say, Python 3.7.12, your client also needs 3.7.12.

In the cloud, you can synchronize Python versions between cluster and client very conveniently: you just use the same VM image for the cluster machines and the client. If you use the cloud console to set up your VM, you find this image when you look for "Deep learning on Linux" in the selection for Operating Systems. There are several versions of this one, but we just leave the default unchanged. It's a bit of overkill to install this large image on the client, but it makes sure you always have the same Python version on the client and the cluster.

For the client machine, something small like n1-standard-1 is usually enough. Make sure you set the network interface to the network and subnetwork you just created and select “None” for external IP. Under “Identity and API access” select “Allow full access to all Cloud APIs”. This is needed because we want to allow our VM to launch other VMs.

If you want to, you can also launch the VM via cloud shell using this script. Just remember to replace at least the “<your-project-id>” placeholder (it occurs twice!) before using it, and possibly the region, if you changed it above.
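
As a rough illustration of the console settings described above (not the script referred to in the text), the client VM could be created from cloud shell along these lines. The VM name, zone, network and subnetwork names are hypothetical. The image family is deliberately left as a placeholder because the text does not name it, and the image project deeplearning-platform-release is my assumption about where the "Deep learning on Linux" images are published. The --no-address flag corresponds to selecting “None” for external IP, and --scopes=cloud-platform to “Allow full access to all Cloud APIs”.

```shell
# Assumed values: replace the project ID and image family; names and zone are illustrative only.
PROJECT_ID="<your-project-id>"
ZONE="europe-west1-b"                            # hypothetical zone inside europe-west1
NETWORK="ray-network"                            # hypothetical network name
SUBNET="ray-subnet-europe-west1"                 # hypothetical subnetwork name
CLIENT_NAME="ray-client"                         # hypothetical VM name
IMAGE_FAMILY="<image-family>"                    # not named in the text; use the cluster's image
IMAGE_PROJECT="deeplearning-platform-release"    # assumption, see the paragraph above

create_ray_client() {
  # Small machine, internal IP only, allowed to call all Cloud APIs.
  gcloud compute instances create "$CLIENT_NAME" \
    --project="$PROJECT_ID" \
    --zone="$ZONE" \
    --machine-type=n1-standard-1 \
    --network="$NETWORK" \
    --subnet="$SUBNET" \
    --no-address \
    --scopes=cloud-platform \
    --image-family="$IMAGE_FAMILY" \
    --image-project="$IMAGE_PROJECT"
}

ssh_to_ray_client() {
  # The VM has no external IP, so the SSH connection is tunnelled through IAP.
  gcloud compute ssh "$CLIENT_NAME" \
    --project="$PROJECT_ID" \
    --zone="$ZONE" \
    --tunnel-through-iap
}
```

Calling create_ray_client once and ssh_to_ray_client afterwards gives a command-line counterpart to the SSH button in the cloud console.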

Install Ray on the client VM

To install Ray, log into the client machine via SSH, using the SSH button in the cloud console next to the VM itself. First make sure you have a current version of pip by issuing

pip install --upgrade pip

Then install Ray itself with

pip install 'ray[default]'

At this point, pip might warn you about a dependency conflict involving grpcio, but you can safely ignore that.
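
To confirm that the installation worked and that the client's Python version matches the cluster down to the micro version, a small check like the following may help. It is only a sketch: the value 3.7.12 is the example version from the text, not necessarily what your image ships, and the function name is made up for illustration.

```shell
# Python version expected on the cluster, including the micro version.
# 3.7.12 is just the example from the text; adjust it to your cluster image.
CLUSTER_PYTHON="3.7.12"

check_client_setup() {
  # "python --version" prints e.g. "Python 3.7.12"; keep only the version number.
  client_python="$(python --version 2>&1 | awk '{print $2}')"
  if [ "$client_python" != "$CLUSTER_PYTHON" ]; then
    echo "Python mismatch: client $client_python, cluster $CLUSTER_PYTHON" >&2
    return 1
  fi
  # Print the installed Ray version; this fails if Ray cannot be imported.
  python -c 'import ray; print(ray.__version__)'
}
```

Running check_client_setup on the client VM prints the Ray version on success and a mismatch message otherwise.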

What’s next?
We’ve now set up the cloud infrastructure we need. Please join me again in part 2, where we will adapt the cluster definition to our needs and take the new cluster for a ride.

Read part 2