Python Semantic Release’s Version Bumping Algorithm

Below is a technical description of the algorithm which Python Semantic Release uses to calculate a new version for a project.

Assumptions

  • At runtime, we are in a Git repository with HEAD referring to a commit on some branch of the repository (i.e. not in detached HEAD state).

  • We know in advance whether we want to produce a prerelease or not (based on the configuration and command-line flags).

  • We can parse the tags of the repository into semantic versions, as we are given the format that those Git tags should follow via configuration, but cannot cherry-pick only tags that apply to commits on specific branches. We must parse all tags in order to ensure we have parsed any that might apply to commits in this branch’s history.

  • If we can identify a commit as a merge-base between our HEAD commit and one or more tags, then that merge-base should be unique.

  • We know ahead of time what prerelease_token to use for prereleases - e.g. rc.

  • We know ahead of time whether major changes introduced by commits should cause the new version to remain on 0.y.z if the project is already on a 0. version - see major_on_zero.

Implementation

  1. Parse all the Git tags of the repository into semantic versions, and sort in descending (most recent first) order according to semver precedence. Ignore any tags which do not correspond to valid semantic vesrions according to tag_format.

  2. Find the merge-base of HEAD and the latest tag according to the sort above. Call this commit M. If there are no tags in the repo’s history, we set M=HEAD.

  3. Find the latest non-prerelease version whose tag references a commit that is an ancestor of M. We do this via a breadth-first search through the commit lineage, starting against M, and for each tag checking if the tag corresponds to that commit. We break from the search when we find such a tag. If no such tag is found, see 4a). Else, suppose that tag corresponds to a commit L - goto 4b).

    1. If no commit corresponding to the last non-prerelease version is found, the entire history of the repository is considered. We parse every commit that is an ancestor of HEAD to determine the type of change introduced - either major, minor, patch, prerelease_revision or no_release. We store this levels in a set as we only require the distinct types of change that were introduced.

    2. However, if we found a commit L which is the commit against which the last non-prerelease was tagged, then we parse only the commits from HEAD as far back as L, to understand what changes have been introduced since the previous non-prerelease. We store these levels - either major, minor, patch, prerelease_revision, or no_release, in a set, as we only require the distinct types of change that were introduced.

    3. We look for tags that correspond to each commit during this process, to identify the latest pre-release that was made within HEAD’s ancestry.

  4. If there have been no changes since the last non-prerelease, or all commits since that release result in a no_release type according to the commit parser, then we terminate the algorithm.

  5. If we have not exited by this point, we know the following information:

  • The latest version, by semver precedence, within the whole repository. Call this LV. This might not be within the ancestry of HEAD.

  • The latest version, prerelease or non-prerelease, within the whole repository. Call this LVH. This might not be within the ancestry of HEAD. This may be the same as LV.

  • The latest non-prerelease version within the ancestry of HEAD. Call this LVHF. This may be the same as LVH.

  • The most significant type of change introduced by the commits since the previous full release. Call this level

  • Whether or not we wish to produce a prerelease from this version increment. Call this a boolean flag, prerelease. (Assumption)

  • Whether or not to increment the major digit if a major change is introduced against an existing 0. version. Call this major_on_zero, a boolean flag. (Assumption)

Using this information, the new version is decided according to the following criteria:

  1. If LV has a major digit of 0, major_on_zero is False and level is major, reduce level to minor.

  2. If prerelease=True, then

    1. Diff LV with LVHF, to understand if the major, minor or patch digits have changed. For example, diffing 1.2.1 and 1.2.0 is a patch diff, while diffing 2.1.1 and 1.17.2 is a major diff. Call this DIFF

    2. If DIFF is less semantically significant than level, for example if DIFF=patch and level=minor, then

      1. Increment the digit of LVF corresponding to level, for example the minor digit if level=minor, setting all less significant digits to zero.

      2. Add prerelease_token as a suffix result of 1., together with a prerelease revision number of 1. Return this new version and terminate the algorithm.

      Thus if DIFF=patch, level=minor, prerelease=True, prerelease_token="rc", and LVF=1.1.1, then the version returned by the algorithm is 1.2.0-rc.1.

    3. If DIFF is semantically less significant than or equally significant to level, then this means that the significance of change introduced by level is already reflected in a prerelease version that has been created since the last full release. For example, if LVHF=1.1.1, LV=1.2.0-rc.1 and level=minor.

      In this case we:

      1. If the prerelease token of LV is different from prerelease_token, take the major, minor and patch digits of LV and construct a prerelease version using our given prerelease_token and a prerelease revision of 1. We then return this version and terminate the algorithm.

        For example, if LV=1.2.0-rc.1 and prerelease_token=alpha, we return 1.2.0-alpha.1.

      2. If the prerelease token of LV is the same as prerelease_token, we increment the revision number of LV, return this version, and

        terminate the algorithm. For example, if LV=1.2.0-rc.1 and prerelease_token=rc, we return 1.2.0-rc.2.

  3. If prerelease=False, then

  1. If LV is not a prerelease, then we increment the digit of LV corresponding to level, for example the minor digit if level=minor, setting all less significant digits to zero. We return the result of this and terminate the algorithm.

  2. If LV is a prerelease, then:

  1. Diff LV with LVHF, to understand if the major, minor or patch digits have changed. Call this DIFF

  2. If DIFF is less semantically significant than level, then

  1. Increment the digit of LV corresponding to level, for example the minor digit if level=minor, setting all less significant digits to zero.

  2. Remove the prerelease token and revision number from the result of i., (“Finalize” the result of i.) return the result and terminate the algorithm.

    For example, if LV=1.2.2-alpha.1 and level=minor, we return 1.3.0.

  1. If DIFF is semantically less significant than or equally significant to level, then we finalize LV, return the result and terminate the algorithm.

Complexity

Space:

A list of parsed tags takes O(number of tags) in space. Parsing each commit during the breadth-first search between merge-base and the latest tag in the ancestry of HEAD takes at worst O(number of commits) in space to track visited commits. Therefore worst-case space complexity will be linear in the number of commits in the repo, unless the number of tags significantly exceeds the number of commits (in which case it will be linear in the number of tags).

Time:

Assuming using regular expression parsing of each tag is a constant-time operation, then the following steps contribute to the time complexity of the algorithm:

  • Parsing each tag - O(number of tags)

  • Sorting tags by semver precedence - O(number of tags * log(number of tags))

  • Finding the merge-base of HEAD and the latest release tag - O(number of commits) (worst case)

  • Parsing each commit and checking each tag against each commit - O(number of commits) + O(number of tags * number of commits) (worst case)

Overall, assuming that the number of tags is less than or equal to the number of commits in the repository, this would lead to a worst-case time complexity that’s quadratic in the number of commits in the repo.