Distributing your Python package

Packaging

Packaging means creating a package that can be installed by pip.

There are many ways to create an installable package, and many ways to distribute it.

We will show how to create a package using hatchling, and how to distribute it on GitHub, PyPI and a private PyPI server.

Benefits of packaging

  • Distribute your package to others
  • Install your package with pip
  • Specify dependencies
  • Reproducibility
  • Specify version
  • Release vs. development versions

Packaging workflow

  1. Create a pyproject.toml in the root folder of the project
  2. Build a package (e.g. myproject-0.1.0-py3-none-any.whl)
  3. Upload the package to location, where others can find it

pyproject.toml

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "my_library"
version = "0.0.1"
dependencies = [
    "numpy"
]

authors = [
  { name="First Last", email="initials@dhigroup.com" },
]
description = "Useful library"
readme = "README.md"
requires-python = ">=3.9"
classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "Development Status :: 2 - Pre-Alpha",
    "Operating System :: OS Independent",
    "Intended Audience :: Science/Research",
    "Topic :: Scientific/Engineering",
]

[project.optional-dependencies]
dev = ["pytest", "ruff", "mypy", "mkdocs", "mkdocstrings[python]", "mkdocs-material"]
test= ["pytest"]

[project.urls]
"Homepage" = "https://github.com/DHI/my_library"
"Bug Tracker" = "https://github.com/DHI/my_library/issues"

Versioning

Versioning your package is important for reproducibility and to avoid breaking changes.

. . .

Semantic versioning use three numbers {major}.{minor}.{patch}, e.g. 1.1.0

  • A new major version indicates breaking changes
  • A new minor version indicates new features, without breaking changes
  • A new patch version indicates a small change, e.g. a bug fix
  • Each of the numbers can be higher than 9, e.g. 1.0.0 is more recent than 0.24.12

Version 1.0

  • A version number of 1.0 indicates that the package is ready for production
  • The API is stable, and breaking changes will only be introduced in new major versions
  • The package is well tested, and the documentation is complete
  • Start with version 0.1.0 and increase the version number as you add features

Breaking changes

What is a breaking change?

  • Removing a function
  • Changing the name of a function
  • Changing the signature of a function (arguments, types, return value)

. . .

Try to avoid breaking changes, if possible, but if you do, increase the major version number!

Installing specific versions

  • pip install my_library will install the latest version
  • pip install my_library==1.0.0 will install version 1.0.0
  • pip install my_library>=1.0.0 will install version 1.0.0 or higher

Pre-release versions

  • Versions that are not ready for production
  • Indicated by a suffix, e.g. 1.0.0rc1
  • Will not be installed by default
  • Can be installed with pip install my_library==1.0.0rc1
  • Listed on PyPI, but not on the search page

License

  • A license is a legal agreement between you and others who use your package
  • If you do not specify a license, others cannot use your package legally
  • The license is specified in the pyproject.toml file
  • Read more about licenses on https://choosealicense.com/
  • Check if your package is compatible with the license of the dependencies

Dependencies

Application

A program that is run by a user

  • command line tool
  • script
  • web application

Pin versions to ensure reproducibility, e.g. numpy==1.11.0

Library

A program that is used by another program

  • Python package
  • Low level library (C, Fortran, Rust, โ€ฆ)

Make the requirements as loose as possible, e.g. numpy>=1.11.0

Make the requirements loose, to avoid conflicts with other packages.

pyproject.toml

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "my_library"
version = "0.0.1"
dependencies = [
    "numpy"
]

authors = [
  { name="First Last", email="initials@dhigroup.com" },
]
description = "Useful library"
readme = "README.md"
requires-python = ">=3.9"
classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "Development Status :: 2 - Pre-Alpha",
    "Operating System :: OS Independent",
    "Intended Audience :: Science/Research",
    "Topic :: Scientific/Engineering",
]

[project.optional-dependencies]
dev = ["pytest", "ruff", "mypy", "mkdocs", "mkdocstrings[python]", "mkdocs-material"]
test= ["pytest"]

[project.urls]
"Homepage" = "https://github.com/DHI/my_library"
"Bug Tracker" = "https://github.com/DHI/my_library/issues"
  • Mandatory dependencies are specified in the dependencies section.
  • Optional dependencies are specified in the optional-dependencies section.

Classifiers

classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "Development Status :: 2 - Pre-Alpha",
    "Operating System :: OS Independent",
    "Intended Audience :: Science/Research",
    "Topic :: Scientific/Engineering",
]
  • Classifiers are used to categorize your package
  • Less relevant for internal packages
  • Operating system (Windows, Linux, MacOS)
  • Development status (Alpha, Beta, Production/Stable)

Packaging non-Python files

  • Including non-Python files can be useful for e.g. machine learning models.
  • If you use hatchling, you can include non-Python files in your package.
  • hatchling uses .gitignore to determine which files to include.

GitHub secrets

  • Store sensitive information, e.g. passwords, in your repository.
  • Secrets are encrypted, and only visible to you and GitHub Actions.
  • Add secrets in the repository settings.

To use secrets as environment variables in GitHub Actions, add them to the env section of the workflow:

env:
  USERNAME: ${{ secrets.USERNAME }}
  PASSWORD: ${{ secrets.PASSWORD }}

GitHub Actions

  • Modern publishing uses Trusted Publishers.
  • Avoids managing secrets in the repository.
.github/workflows/python_publish.yml
name: Publish Python Package
on:
  release:
    types: [created]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.10'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install build
    - name: Build package
      run: python -m build
    - name: Publish package distributions to PyPI
      uses: pypa/gh-action-pypi-publish@release/v1

Publishing to PyPI and TestPyPI

  • PyPI is the Python Package Index, where you can publish your package.
  • TestPyPI is a test version of PyPI for trying out publish workflows.
  • Only publish to PyPI when you are ready to release a new version.

. . .

Example:

.github/workflows/python_publish.yml
    - name: Build package
      run: python -m build
    - name: Publish package distributions to PyPI
      uses: pypa/gh-action-pypi-publish@release/v1
      with:
          repository-url: https://test.pypi.org/legacy/

Private PyPI server

  • Private packages can be be hosted on e.g. Azure Arfifacts or Posit Package Manager.
  • These servers behaves like PyPI, and can be used with pip
  • Access policies can be used to control who can install packages.

. . .

Example:

$ pip install --extra-index-url https://pkgs.dev.azure.com/dhigroup/_packaging/pond/pypi/simple/ sampling
Looking in indexes: https://pypi.org/simple, https://pkgs.dev.azure.com/dhigroup/_packaging/pond/pypi/simple/
...
Successfully installed sampling-0.0.1

Installing a development version

  • Install latest dev version, e.g. pip install https://github.com/DHI/mikeio/archive/main.zip
  • Install from fix-interp branch, e.g. pip install https://github.com/DHI/mikeio/archive/fix-interp.zip

Recap

  1. Git, Pull Requests, and code reviews ๐Ÿ›
  2. Python functions, classes, and modules ๐Ÿ›
  3. Testing and auto-formatting ๐Ÿ›
  4. Dependencies and GitHub actions ๐Ÿ›
  5. Object oriented design in Python ๐Ÿ›
  6. Documentation ๐Ÿ›
  7. Distributing your package ๐Ÿ›

Git, Pull Requests, and code reviews

Github flow

  1. Create a branch
  2. Make changes
  3. Create a pull request
  4. Review
  5. Merge

Github best practices

  • Commit often
  • Use descriptive commit messages
  • Keep pull requests small and focused
  • Use โ€œissuesโ€ to track work
  • Review code regularly

Python functions, classes, and modules

Functions as black boxes

flowchart LR
    A(Input A) --> F["Black box"]
    B(Input B) -->  F
    F --> O(Output)

    style F fill:#000,color:#fff,stroke:#333,stroke-width:4px

  • A function is a black box that takes some input and produces some output.
  • The input and output can be anything, including other functions.
  • As long as the input and output are the same, the function body can be modified.

Naming conventions - general

  • Use lowercase characters
  • Separate words with underscores
model_name = "NorthSeaModel"
n_epochs = 100

def my_function():
    pass

Constants

  • Use all uppercase characters
GRAVITY = 9.81

AVOGADRO_CONSTANT = 6.02214076e23

SECONDS_IN_A_DAY = 86400

N_LEGS_PER_ANIMAL = {
    "human": 2,
    "dog": 4,
    "spider": 8,
}

Classes

  • Use CamelCase for the name of the class
  • Use lowercase characters for the name of the methods
  • Separate words with underscores
class RandomClassifier:

    def fit(self, X, y):
        self.classes_ = np.unique(y)

    def predict(self, X):
        return np.random.choice(self.classes_, size=len(X))

    def fit_predict(self, X, y):
        self.fit(X, y)
        return self.predict(X)

Dataclasses

import datetime
from dataclasses import dataclass


@dataclass
class Interval:
    start: date
    end: date

>>> dr1 = Interval(start=datetime.date(2020, 1, 1), end=datetime.date(2020, 1, 31))
>>> dr1
Interval(start=datetime.date(2020, 1, 1), end=datetime.date(2020, 1, 31))
>>> dr2 = Interval(start=datetime.date(2020, 1, 1), end=datetime.date(2020, 1, 31))
>>> dr1 == dr2
True

Types, abstraction, and refactoring

Pythonic

If you want your code to be Pythonic, you have to be familiar with these types and their methods.

Dundermethods:

  • __getitem__
  • __setitem__
  • __len__
  • __contains__
  • โ€ฆ

class JavaLikeToolbox:
    
    def getToolByName(self, name: str) -> Tool:
        for tool in self.tools:
            if tool.name == name:
                return tool

    def numberOfTools(self) -> int:
        return len(self.tools)

>>> tb = JavaLikeToolbox([Hammer(), Screwdriver()])
>>> tb.getToolByName("hammer")
Hammer()
>>> tb.numberOfTools()
2

class Toolbox:

    def __getitem__(self, name: str) -> Tool:
        return self._tools[name]
    
    def __len__(self) -> int:
        return len(self.tools)

>>> tb = Toolbox([Hammer(), Screwdriver()])
>>> tb["hammer"]
Hammer()
>>> len(tb)
2

Duck typing

  • โ€œIf it walks like a duck and quacks like a duck, itโ€™s a duckโ€
  • From the perspective of the caller, it doesnโ€™t matter if it is a rubber duck or a real duck.
  • The type of the object is not important, as long as it has the right methods.

An example is a Scikit learn transformers

  • fit
  • transform
  • fit_transform

If you want to make a transformer compatible with sklearn, you have to implement these methods.


class MyTransformer:
    def fit(self, X, y=None):
        # do something
        return self

    def transform(self, X):
        # do something
        return X

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

Testing and auto-formatting

Unit testing

Definition โ€œUnitโ€
  • A small, fundamental piece of code.
  • Executed in isolation with appropriate inputs.
  • A function is typically considered a โ€œunitโ€
  • Lines of code within functions are smaller (canโ€™t be isolated)
  • Classes are considered bigger (but can be treated as units)

A good unit test

  • Fully automated
  • Has full control over all the pieces running (โ€œfakeโ€ external dependencies)
  • Can be run in any order
  • Runs in memory (no DB or file access, for example)
  • Consistently returns the same result (no random numbers)
  • Runs fast
  • Tests a single logical concept in the system
  • Readable
  • Maintainable
  • Trustworthy

Thank you!