
As data scientists, we regularly train different machine-learning models in the cloud. In parts 1 and 2 of this blog series, you can read about how to set up ML pipelines and what their advantages are. Here you will find out how to structure your model training with the help of Python packages. Although each model has its own specific purpose, some code snippets inevitably get copied from one project to another. In my case, this is often code for reading data from a database or for a pre-processing step. Python packages collect frequently used functions in one place and are therefore ideal for avoiding this kind of code duplication, which also makes the code easier to maintain and test.

In this blog article, we will see how a Python package can be utilized in GCP and integrated into a Vertex AI training job.

Creating a Python package

Before being able to make our Python code available as a package, we need to make sure that our Python module meets the requirements for this:

  1. The module should contain at least one of the files setup.py, setup.cfg or pyproject.toml. These can be used individually or in combination to define how the Python package will be installed later. Prerequisites such as a Python version >= 3.9 can be specified this way, for example (a minimal setup.py sketch follows the folder structure below).
  2. The code should have a folder structure as shown in the following snippet: a main folder containing all configuration files, and a subfolder holding the actual Python code.
Structure of the package:
├── setup.py        # or setup.cfg or pyproject.toml
└── my_package
    ├── __init__.py
    └── example.py
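
To make this concrete, here is a minimal setup.py sketch. The package name, version and dependency list are placeholders and should be adapted to your own module:

# setup.py -- minimal sketch with placeholder metadata
from setuptools import setup, find_packages

setup(
    name="my-package",
    version="0.1.0",
    packages=find_packages(),    # picks up the my_package subfolder
    python_requires=">=3.9",     # the prerequisite mentioned above
    install_requires=[],         # runtime dependencies of the package
)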

After ensuring that these requirements have been met, we can create package distributions from our module:

cd my-package
python3 -m pip install --upgrade build
python3 -m build
ls ./dist

This code installs Python's native build tool and uses it to create the package distributions. The result is a wheel (.whl file) as well as a source distribution (.tar.gz archive) in the dist folder.
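
Assuming the package name and version from the setup.py sketch above, the dist folder will contain something like:

dist/
├── my_package-0.1.0-py3-none-any.whl
└── my_package-0.1.0.tar.gz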

Setting up Google's Artifact Registry

Google's Artifact Registry offers a complete solution for container images and language packages. We will use it to version and manage our Python packages. A repository can easily be created via the UI or with gcloud, for example with the command sketched below.
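
For the gcloud variant, a command along the following lines creates a Python repository; the repository name and location are assumptions, so adapt them to your project:

gcloud artifacts repositories create my-repository \
    --repository-format=python \
    --location=europe-west3 \
    --description="Private Python packages"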

Integrating Google's Artifact Registry

Before being able to load our package into the registry, we have to make a few more preparations:

  1. First of all, a .pypirc file is needed. This file contains the settings for uploading packages to private registries. Here we list our newly created repository and specify its URL (a command for looking up the URL follows after this list).
    # ~/.pypirc
    [distutils]
    index-servers =
        my-repository

    [my-repository]
    repository = <PYTHON-REGISTRY-URL>
  2. Now we need authorization to access Google's Artifact Registry. This is done via Python's keyring service. For this we also need Google's keyring library, which allows us to use our GCP credentials for the login. After logging in to gcloud and installing the library, we no longer have to worry about access rights.
    gcloud auth login
    python3 -m pip install keyrings.google-artifactregistry-auth
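
If you are unsure what to put in as <PYTHON-REGISTRY-URL>, gcloud can print the matching .pypirc and pip settings for a repository. A sketch, assuming the repository name and location used above:

gcloud artifacts print-settings python \
    --repository=my-repository \
    --location=europe-west3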

Uploading the package to Google's artifact registry

Our distributions have been built, our registry is ready, and we are authorized for access. Now we can upload our package to the registry. This is done with Twine, Python's standard upload tool. After installing Twine, we can upload the package via the command line.


python3 -m pip install twine
twine upload --repository-url <PYTHON-REGISTRY-URL> ./dist/*
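
Because the repository is already registered in our .pypirc, an equivalent call is to reference it by name instead of by URL; this assumes the [my-repository] entry created above:

twine upload --repository my-repository ./dist/*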

Done! Our package is in the cloud. From now on, it is accessible to all our services.

Using the private package in a Docker container

Now we can use the package directly in a new service. The easiest way to do this is in container-based solutions such as Vertex AI training jobs with custom containers.

For this purpose, we list the package in the service's requirements.txt. Here we just have to make sure to specify our registry's URL as an additional index. This tells tools like pip where to look for the listed dependencies.

Important! The URL requires the "/simple" suffix. This tells dependency management tools (pip) how to communicate with the server. For more details, refer to PEP 503.

# requirements.txt
--extra-index-url <PYTHON-REGISTRY-URL>/simple/

my-package
...
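
Because the registry versions our packages, we can also pin the private dependency to a specific release instead of always pulling the latest one; the version number here is only an assumption:

my-package==0.1.0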

In the Docker build process, it is then necessary to install Google's keyring library again. This allows pip inside the image build to authenticate against the registry (provided GCP credentials are available in the build environment).

# Dockerfile
FROM python:3.9-slim

WORKDIR /app
# requirements.txt has to be copied into the image as well,
# otherwise the install step below cannot find it
COPY ./requirements.txt .
COPY ./train.py .

# the keyring library lets pip authenticate against the private registry
RUN pip install keyrings.google-artifactregistry-auth
RUN pip install -r requirements.txt

CMD python ./train.py

Finished! The image can be built and pushed.
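
As a rough sketch of this last step: the image is pushed to a Docker-format repository in Artifact Registry and then referenced in a Vertex AI custom training job. The region, project, repository and job names below are placeholders:

# allow docker to push to Artifact Registry, then build and push the training image
gcloud auth configure-docker europe-west3-docker.pkg.dev
docker build -t europe-west3-docker.pkg.dev/<PROJECT-ID>/my-docker-repo/trainer:latest .
docker push europe-west3-docker.pkg.dev/<PROJECT-ID>/my-docker-repo/trainer:latest

# launch a Vertex AI custom training job with the custom container
gcloud ai custom-jobs create \
    --region=europe-west3 \
    --display-name=my-training-job \
    --worker-pool-spec=machine-type=n1-standard-4,replica-count=1,container-image-uri=europe-west3-docker.pkg.dev/<PROJECT-ID>/my-docker-repo/trainer:latest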

Conclusion

We have just seen how to make Python packages available with Google's Artifact Registry and how they can be used from Vertex AI. Which functions do you often copy from one project to another? A perfect starting point for cleaning up is to put that code into a package and make it reusable in your future projects.

Do you need help setting up your registries? Do not hesitate to contact us, whether it's about Python or Docker containers.

Your Contact
Laurenz Reitsam
Consultant
Laurenz is a data scientist with a keen interest in DevOps and infrastructure as well as machine learning and data analytics. It is his firm belief that a model is only a good model if it succeeds in making its way into production.
#Pythonist #GCP #DataScience