Welcome to tszip’s documentation!

Introduction

This is the documentation for tszip, a command line interface and Python API for compressing tskit tree sequence files used by msprime, SLiM, fwdpy11 and tsinfer. Tszip achieves much better compression than is possible using generic compression utilities by building on the zarr and numcodecs packages.

The command line interface follows the design of gzip closely, so it should be immediately familiar. Here we compress a large tree sequence representing 1000 Genomes chromosome 20 using tszip and decompress it using tsunzip:

$ ls -lh
total 297M
-rw-r--r-- 1 jk jk 297M May 10 14:49 1kg_chr20.trees
$ tszip 1kg_chr20.trees
$ ls -lh
total 46M
-rw-r--r-- 1 jk jk 46M May 10 14:51 1kg_chr20.trees.tsz
$ tsunzip 1kg_chr20.trees.tsz
$ ls -lh
total 297M
-rw-r--r-- 1 jk jk 297M May 10 14:52 1kg_chr20.trees
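
As a rough check on the listing above, the compression ratio can be read off the sizes that ls reports. The short sketch below only does that arithmetic; the variable names are illustrative, and because ls -lh prints rounded sizes the result is approximate.

```python
# Sizes as printed by `ls -lh` in the listing above (rounded).
original_size = 297    # 1kg_chr20.trees
compressed_size = 46   # 1kg_chr20.trees.tsz

ratio = original_size / compressed_size        # how many times smaller
saving = 1 - compressed_size / original_size   # fraction of space saved

print(f"compression ratio: {ratio:.1f}x")
print(f"space saved: {saving:.0%}")
```

For this file the compressed version is roughly 6.5 times smaller, a saving of about 85%.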

Installation

Tszip can be installed using the standard pip and conda distribution methods.

For pip users, run:

$ python3 -m pip install tszip

For conda users, tszip is distributed using the conda-forge channel. To install, either ensure that conda-forge is listed in your default channels and run:

$ conda install tszip

or explicitly install from conda-forge:

$ conda install -c conda-forge tszip
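
After installing, it can be useful to confirm which version (if any) is visible to the Python interpreter you intend to use. The following is a minimal sketch using only the standard library; installed_version is a hypothetical helper name, not part of tszip.

```python
from importlib import metadata


def installed_version(package="tszip"):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("tszip is not installed in this environment")
else:
    print(f"tszip {version} is installed")
```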

Python API

This page provides detailed documentation for the tszip Python API.

Usage example

Tszip can be used directly in Python to provide seamless compression and decompression of tree sequence files. Here, we run an msprime simulation and write the output to a .trees.tsz file:

import msprime
import tszip

# Simulate a small tree sequence with 10 samples.
ts = msprime.simulate(10, random_seed=1)
# Write it to disk in compressed form.
tszip.compress(ts, "simulation.trees.tsz")

# Later, we load the same tree sequence from the compressed file.
ts = tszip.decompress("simulation.trees.tsz")

Note

For very small simulations like this example, the tszip file may be larger than the original uncompressed file.

API

tszip.compress(ts, destination, variants_only=False)

Compresses the specified tree sequence and writes it to the specified path or file-like object. By default, fully lossless compression is used so that tree sequences are identical before and after compression. By specifying the variants_only option, a lossy compression can be used, which discards any information that is not needed to represent the variants (which are stored losslessly).

Parameters:
  • ts (tskit.TreeSequence) – The input tree sequence.
  • destination (str) – The string, pathlib.Path or file-like object we should write the compressed file to.
  • variants_only (bool) – If True, discard all information not necessary to represent the variants in the input file.

tszip.decompress(path)

Decompresses the tszip compressed file and returns a tskit tree sequence instance.

Parameters:
  • path (str) – The location of the tszip compressed file to load.
Return type: tskit.TreeSequence
Returns: A tskit.TreeSequence instance corresponding to the specified file.

Command line interface

Tszip is intended to be used primarily as a command line interface. The interface is modelled directly on gzip, so it should be immediately familiar to many people. Installing tszip also installs the tszip and tsunzip programs, but depending on your setup, these may not be on your PATH. A slightly less convenient (but reliable) way of running tszip is the following:

$ python3 -m tszip

Online help is available using the --help option.

The tsunzip program is an alias for tszip -d.
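
The gzip-style naming seen in the introduction (the suffix is appended on compression and removed on decompression) can be sketched in a few lines of Python. This is an illustration of the convention only, not tszip's actual implementation: output_name is a hypothetical helper, and raising an error for a file that lacks the suffix is an assumption.

```python
def output_name(path, suffix=".tsz", decompress=False):
    """Hypothetical helper: derive the output file name, gzip-style."""
    if not decompress:
        # Compression appends the suffix to the input name.
        return path + suffix
    if not path.endswith(suffix):
        # Assumption: refuse names that lack the expected suffix.
        raise ValueError(f"{path!r} does not end with {suffix!r}")
    # Decompression strips the suffix again.
    return path[: len(path) - len(suffix)]


print(output_name("1kg_chr20.trees"))
print(output_name("1kg_chr20.trees.tsz", decompress=True))
```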

tszip

Compress/decompress tskit trees files.

usage: tszip [-h] [-V] [-v] [--variants-only] [-S SUFFIX] [-k] [-f] [-c]
             [-d | -l]
             files [files ...]

Positional Arguments

files The files to compress/decompress.

Named Arguments

-V, --version

Show program’s version number and exit

-v, --verbosity

Increase the verbosity

Default: 0

--variants-only

Lossy compression; throws out information not needed to represent variants

Default: False

-S, --suffix

Use suffix SUFFIX on compressed files

Default: “.tsz”

-k, --keep

Keep (don’t delete) input files

Default: False

-f, --force

Force overwrite of output file

Default: False

-c, --stdout

Write to stdout

Default: False

-d, --decompress

Decompress

Default: False

-l, --list

List contents of the file

Default: False
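
When driving tszip from a script, the options above can be assembled into an argument list for the python3 -m tszip form of invocation. The sketch below is one way to do this; tszip_command is a hypothetical helper (not part of tszip), and it covers only a subset of the documented options.

```python
import shlex
import sys


def tszip_command(files, decompress=False, keep=False, force=False,
                  variants_only=False, suffix=None):
    """Hypothetical helper: build an argv list for `python -m tszip`."""
    argv = [sys.executable, "-m", "tszip"]
    if decompress:
        argv.append("--decompress")
    if keep:
        argv.append("--keep")
    if force:
        argv.append("--force")
    if variants_only:
        argv.append("--variants-only")
    if suffix is not None:
        argv.extend(["--suffix", suffix])
    argv.extend(files)
    return argv


cmd = tszip_command(["1kg_chr20.trees"], keep=True)
print(shlex.join(cmd))
# To actually run the command (requires tszip to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Building a list rather than a single string avoids shell quoting problems when file names contain spaces.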

Development

If you would like to add some features to tszip, please read the following. If you think there is anything missing, please open an issue or pull request on GitHub!

Workflow

An overview of the development workflow is given in the tskit documentation. Note in particular the sections on running tests.

Changelog

[0.2.2] - 2022-02-22

  • Support compressing to stdout (aabiddanda, #53, #64)

[0.2.1] - 2022-01-26

  • Fix for time_units in tskit 0.4.0 (benjeffery, #54, #55)
  • Add support for reference sequence (benjeffery, #59)

[0.2.0] - 2021-11-08

  • Support decompressing to stdout (aabiddanda, #44)
  • Add support for new columns in tskit (benjeffery, #39, #42)

[0.1.0] - 2019-05-10

Initial version
