Welcome to tszip’s documentation!
Introduction
This is the documentation for tszip, a command line interface and Python API for compressing tskit tree sequence files used by msprime, SLiM, fwdpy11 and tsinfer. Tszip achieves much better compression than is possible using generic compression utilities by building on the zarr and numcodecs packages.
The command line interface follows the design of gzip closely, so it should be immediately familiar. Here we compress a large tree sequence representing 1000 Genomes chromosome 20 using tszip and decompress it using tsunzip:
$ ls -lh
total 297M
-rw-r--r-- 1 jk jk 297M May 10 14:49 1kg_chr20.trees
$ tszip 1kg_chr20.trees
$ ls -lh
total 46M
-rw-r--r-- 1 jk jk 46M May 10 14:51 1kg_chr20.trees.tsz
$ tsunzip 1kg_chr20.trees.tsz
$ ls -lh
total 297M
-rw-r--r-- 1 jk jk 297M May 10 14:52 1kg_chr20.trees
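The sizes in the listing above can be turned into a compression ratio with a few lines of arithmetic. This is only an illustrative sketch: the helper name compression_stats is hypothetical and not part of tszip, and the inputs are the rounded sizes reported by ls -lh, so the result is approximate.

```python
def compression_stats(original_size, compressed_size):
    """Return (ratio, fractional saving) for a pair of file sizes.

    Both sizes must be in the same unit. This helper is hypothetical and
    is not part of the tszip API.
    """
    ratio = original_size / compressed_size
    saving = 1 - compressed_size / original_size
    return ratio, saving


# Rounded sizes (in MiB) taken from the listing: 297M before, 46M after.
ratio, saving = compression_stats(297, 46)
print(f"{ratio:.1f}x smaller, {saving:.0%} saved")
# prints "6.5x smaller, 85% saved"
```

The ratio depends heavily on the input; the figures here describe only this one file.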
Installation
Tszip can be installed using the standard pip and conda distribution methods.
For pip users, run:
$ python3 -m pip install tszip
For conda users, tszip is distributed using the conda-forge channel. To install, either ensure that conda-forge is listed in your default channels and run:
$ conda install tszip
or explicitly install from conda-forge:
$ conda install -c conda-forge tszip
Python API
This page provides detailed documentation for the tszip Python API.
Usage example
Tszip can be used directly in Python to provide seamless compression and decompression of tree sequence files. Here, we run an msprime simulation and write the output to a .trees.tsz file:
import msprime
import tszip
ts = msprime.simulate(10, random_seed=1)
tszip.compress(ts, "simulation.trees.tsz")
# Later, we load the same tree sequence from the compressed file.
ts = tszip.decompress("simulation.trees.tsz")
Note
For very small simulations like this example, the tszip file may be larger than the original uncompressed file.
API
tszip.compress(ts, destination, variants_only=False)
Compresses the specified tree sequence and writes it to the specified path or file-like object. By default, fully lossless compression is used so that tree sequences are identical before and after compression. By specifying the variants_only option, a lossy compression can be used, which discards any information that is not needed to represent the variants (which are stored losslessly).
Parameters:
- ts (tskit.TreeSequence) – The input tree sequence.
- destination (str) – The string, pathlib.Path or file-like object we should write the compressed file to.
- variants_only (bool) – If True, discard all information not necessary to represent the variants in the input file.
tszip.decompress(path)
Decompresses the tszip compressed file and returns a tskit tree sequence instance.
Parameters: path (str) – The location of the tszip compressed file to load.
Return type: tskit.TreeSequence
Returns: A tskit.TreeSequence instance corresponding to the specified file.
Command line interface
Tszip is intended to be used primarily as a command line interface. The interface for tszip is modelled directly on gzip, and so it should be immediately familiar and useful to many people.
Tszip automatically installs the tszip and tsunzip programs, but depending on your setup, these may not be on your PATH. A slightly less convenient (but reliable) method of running tszip is the following:
$ python3 -m tszip
Online help is available using the --help option.
The tsunzip program is an alias for tszip -d.
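When calling tszip from a Python script, the module form can be used so that the command does not depend on PATH. The sketch below only builds the argument list; the helper name tszip_command and the example file names are hypothetical, and using sys.executable in place of python3 is an assumption that the current interpreter is the one with tszip installed.

```python
import sys


def tszip_command(files, decompress=False):
    """Build an argument list that runs tszip via the Python interpreter.

    Hypothetical helper for illustration; it is not part of tszip.
    """
    command = [sys.executable, "-m", "tszip"]
    if decompress:
        # tsunzip is described as an alias for "tszip -d".
        command.append("-d")
    command.extend(str(f) for f in files)
    return command


compress_cmd = tszip_command(["example.trees"])
decompress_cmd = tszip_command(["example.trees.tsz"], decompress=True)
print(compress_cmd[1:])
print(decompress_cmd[1:])

# To actually run a command (requires tszip and an existing input file):
#   import subprocess
#   subprocess.run(compress_cmd, check=True)
```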
tszip
Compress/decompress tskit trees files.
usage: tszip [-h] [-V] [-v] [--variants-only] [-S SUFFIX] [-k] [-f] [-c]
             [-d | -l]
             files [files ...]
Positional Arguments
files | The files to compress/decompress. |
Named Arguments
-V, --version | Show program’s version number and exit |
-v, --verbosity | Increase the verbosity Default: 0 |
--variants-only | Lossy compression; throws out information not needed to represent variants Default: False |
-S, --suffix | Use suffix SUFFIX on compressed files Default: “.tsz” |
-k, --keep | Keep (don’t delete) input files Default: False |
-f, --force | Force overwrite of output file Default: False |
-c, --stdout | Write to stdout Default: False |
-d, --decompress | Decompress Default: False |
-l, --list | List contents of the file Default: False |
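The file naming seen in the introductory listing (1kg_chr20.trees becomes 1kg_chr20.trees.tsz and back) follows the gzip convention of appending and stripping the suffix. The sketch below illustrates that convention only; the function names are hypothetical, and treating a missing suffix as an error is an assumption rather than a statement about how tszip itself behaves.

```python
from pathlib import Path


def compressed_name(path, suffix=".tsz"):
    """Name of the compressed file: the input name with the suffix appended."""
    return Path(str(path) + suffix)


def decompressed_name(path, suffix=".tsz"):
    """Name of the decompressed file: the input name with the suffix removed."""
    name = str(path)
    if not name.endswith(suffix):
        # Assumption: refuse inputs that do not carry the expected suffix.
        raise ValueError(f"{name!r} does not end with {suffix!r}")
    return Path(name[: -len(suffix)])


print(compressed_name("1kg_chr20.trees"))
print(decompressed_name("1kg_chr20.trees.tsz"))
```

A different suffix, as selected with -S, would simply be passed as the suffix argument in this sketch.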
Development
If you would like to add some features to tszip, please read the following. If you think there is anything missing, please open an issue or pull request on GitHub!
Workflow
An overview of the development workflow is given in the tskit docs. Note in particular the sections there on running tests.
Changelog
[0.2.2] - 2022-02-22
- Support compressing to stdout (aabiddanda, #53, #64)
[0.2.1] - 2022-01-26
- Fix for time_units in tskit 0.4.0 (benjeffery, #54, #55)
- Add support for reference sequence (benjeffery, #59)
[0.2.0] - 2021-11-08
- Support decompressing to stdout (aabiddanda, #44)
- Add support for new columns in tskit (benjeffery, #39, #42)
[0.1.0] - 2019-05-10
Initial version