Deploying a private package repository on Google App Engine
The introduction of pip for managing the dependencies of Python projects greatly reduced the headache previously associated with this task. Thanks to an apt-get style of searching, installing and removing packages, the requirements.txt file, and integration with setuptools, the tool integrates quite effortlessly into most projects. To find and download packages, pip contacts PyPI (the Python Package Index), a global public repository hosted by the Python Software Foundation. This works great for open source projects but is not an option for closed source projects.
Fortunately, pip supports other sources. One option is to specify dependency links that point directly to tarballs in your source code repository, if available. This option has drawbacks, however: it used to be deprecated, it makes version management a little odd because you specify the version manually in the dependency URL, and because your source code repository is often private, exposing it to cloud services to install dependencies can introduce additional authentication and networking complexity. Fortunately, pip can also install from a private package index that you set up yourself.
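As a rough illustration of pointing pip at another index, the sketch below builds the install command from Python. The `--index-url` and `--extra-index-url` flags are standard pip options; the helper name `pip_install_command`, the package name `yourpackage` and the index address are placeholders chosen for this example only.

```python
import sys


def pip_install_command(package, index_url, keep_pypi=True):
    """Build a pip command line that consults a private package index.

    Hypothetical helper for illustration; it only assembles the arguments.
    """
    # --extra-index-url consults the private index in addition to PyPI,
    # while --index-url replaces PyPI with the private index entirely.
    flag = "--extra-index-url" if keep_pypi else "--index-url"
    return [sys.executable, "-m", "pip", "install", flag, index_url, package]


# Placeholder package name and index address (assumptions, not real endpoints).
command = pip_install_command("yourpackage", "https://pypi.yourcompany.com/")
print(" ".join(command[1:]))

# To actually run the installation:
#   import subprocess
#   subprocess.run(command, check=True)
```

Keeping PyPI as a fallback is convenient when private packages depend on public ones; passing `keep_pypi=False` restricts pip to the private index only.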
Private package index hosting
Several tools are available for this setup, such as gemfury and simplepypi. These solutions, however, require you to make a dedicated host available: this can be your own computer or an internal development server, but that makes it difficult to use your private packages in cloud projects. In cloud environments you can remedy this by deploying an instance, but this isn’t the most cost-effective path. A cheaper example is s3pypi, which uses the static hosting features of S3 buckets instead of an EC2 instance and is several orders of magnitude cheaper. Internally we developed a tool named GAEPyPI with a similar purpose, but for Google Cloud environments, as this is where many of the systems developed by ML2Grow reside. Today we are open-sourcing this project, as ML2Grow strongly believes in the merits of open-source software: it has been a fundamental prerequisite for the incredible growth in data science and machine learning technology of recent years.
Like s3pypi, our solution stores uploaded packages on Google Cloud Storage (GCS), which is, again, extremely cheap. However, GAEPyPI stores them privately and exposes the packages to the web through a small application intended to be deployed on Google App Engine (GAE). This choice comes with no additional cost (unless your private package index gets really popular and generates large amounts of traffic) and offers HTTP authentication (supported by pip!) as a layer to keep your packages private. On top of that, Google App Engine recently introduced painless support for SSL: even without a certificate of your own, you can easily add this layer of security. This means all you need to get started is:
- A project on Google Cloud
- A subdomain (e.g., pypi.yourcompany.com)
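Since pip accepts HTTP basic authentication credentials embedded in the index URL, a minimal sketch of building such a URL could look as follows. The helper name `with_credentials`, the account `deploy` and its password are made up for this example, and the exact URL path served by a GAEPyPI deployment is an assumption: check the readme for the real one.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def with_credentials(index_url, username, password):
    """Return index_url with HTTP basic auth credentials embedded.

    Hypothetical helper for illustration. Both values are percent-encoded
    so that characters such as '@' or '/' cannot break the URL.
    """
    parts = urlsplit(index_url)
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"
    netloc = f"{userinfo}@{parts.netloc}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


# Placeholder subdomain and credentials (assumptions for this sketch).
url = with_credentials(
    "https://pypi.yourcompany.com/", "deploy", "s3cret/pass@word"
)
print(url)
# prints: https://deploy:s3cret%2Fpass%40word@pypi.yourcompany.com/
```

The resulting URL can be passed to pip through `--index-url` or `--extra-index-url`. Because it contains a password, it is best kept out of version control, for example by reading the credentials from environment variables at install time.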
To deploy, clone the repository and follow the instructions in the readme. You will be deploying your code in no time, using the same commands you are used to from PyPI.