Skip to content

Commit 179963c

Browse files
authored
Documentation/Consistency Review (#3329)
* Check table sequence length is finite * Clarify data model and validity documentation * Document VCF export behaviour and limits * Clarify file-format compatibility and text output * Clarify site statistics handling of missing data * Support metadata decoding in parse_edges * Fix Tree and TreeSequence docstrings * Review comments
1 parent 1e2ccb6 commit 179963c

File tree

9 files changed

+215
-68
lines changed

9 files changed

+215
-68
lines changed

c/tests/test_trees.c

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6575,6 +6575,14 @@ test_empty_tree_kc(void)
65756575
ret = tsk_treeseq_init(&ts, &tables, TSK_TS_INIT_BUILD_INDEXES);
65766576
CU_ASSERT_EQUAL_FATAL(ret, TSK_ERR_BAD_SEQUENCE_LENGTH);
65776577
tsk_treeseq_free(&ts);
6578+
tables.sequence_length = NAN;
6579+
ret = tsk_treeseq_init(&ts, &tables, TSK_TS_INIT_BUILD_INDEXES);
6580+
CU_ASSERT_EQUAL_FATAL(ret, TSK_ERR_BAD_SEQUENCE_LENGTH);
6581+
tsk_treeseq_free(&ts);
6582+
tables.sequence_length = INFINITY;
6583+
ret = tsk_treeseq_init(&ts, &tables, TSK_TS_INIT_BUILD_INDEXES);
6584+
CU_ASSERT_EQUAL_FATAL(ret, TSK_ERR_BAD_SEQUENCE_LENGTH);
6585+
tsk_treeseq_free(&ts);
65786586
tables.sequence_length = 1.0;
65796587
ret = tsk_treeseq_init(&ts, &tables, TSK_TS_INIT_BUILD_INDEXES);
65806588
CU_ASSERT_EQUAL_FATAL(ret, 0);

c/tskit/tables.c

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -11078,7 +11078,7 @@ tsk_table_collection_check_integrity(
1107811078
| TSK_CHECK_MIGRATION_ORDERING | TSK_CHECK_INDEXES;
1107911079
}
1108011080

11081-
if (self->sequence_length <= 0) {
11081+
if (!tsk_isfinite(self->sequence_length) || self->sequence_length <= 0) {
1108211082
ret = tsk_trace_error(TSK_ERR_BAD_SEQUENCE_LENGTH);
1108311083
goto out;
1108411084
}

docs/data-model.md

Lines changed: 47 additions & 24 deletions
Original file line numberDiff line numberDiff line change
@@ -356,14 +356,10 @@ required for a valid set of populations.
356356

357357
#### Provenance Table
358358

359-
:::{todo}
360-
Document the provenance table.
361-
:::
362-
363359
| Column | Type | Description |
364360
| :-------- | ----- | ----------------------------------------------------------------------: |
365361
| timestamp | char | Timestamp in [ISO-8601](https://en.wikipedia.org/wiki/ISO_8601) format. |
366-
| record | char | Provenance record. |
362+
| record | char | Provenance record as JSON. |
367363

368364

369365
(sec_metadata_definition)=
@@ -374,10 +370,16 @@ Each table (excluding provenance) has a metadata column for storing and passing
374370
information that tskit does not use or interpret. See {ref}`sec_metadata` for details.
375371
The metadata columns are {ref}`binary columns <sec_tables_api_binary_columns>`.
376372

377-
When using the {ref}`sec_text_file_format`, to ensure that metadata can be safely
378-
interchanged, each row is [base 64 encoded](https://en.wikipedia.org/wiki/Base64).
379-
Thus, binary information can be safely printed and exchanged, but may not be
380-
human readable.
373+
When using the {ref}`sec_text_file_format`, metadata values are written as opaque
374+
text. By default, :meth:`TreeSequence.dump_text` will base64-encode metadata values
375+
that are stored as raw bytes (when ``base64_metadata=True``) so that binary data can
376+
be safely printed and exchanged; in this case :func:`tskit.load_text` will base64-decode
377+
the corresponding text fields back to bytes. When metadata has already been decoded
378+
to a structured Python object (for example via a metadata schema), the textual
379+
representation written by :meth:`TreeSequence.dump_text` is the ``repr`` of that
380+
object, and :func:`tskit.load_text` does not attempt to reconstruct the original
381+
structured value from this representation. For reliable metadata round-tripping,
382+
prefer the native binary tree sequence file format over the text formats.
381383

382384
The tree sequence itself also has metadata stored as a byte array.
383385

@@ -399,6 +401,10 @@ error message. Some more complex requirements may not be detectable at load-time
399401
and errors may not occur until certain operations are attempted.
400402
These are documented below.
401403

404+
At the tree-sequence level, we require that the coordinate space has a finite,
405+
strictly positive length; that is, the `sequence_length` attribute must be a
406+
finite value greater than zero.
407+
402408
The Python API also provides tools that can transform a collection of
403409
tables into a valid collection of tables, so long as they are logically
404410
consistent, see {ref}`sec_tables_api_creating_valid_tree_sequence`.
@@ -410,7 +416,8 @@ consistent, see {ref}`sec_tables_api_creating_valid_tree_sequence`.
410416

411417
Individuals are a basic type in a tree sequence and are not defined with
412418
respect to any other tables. Individuals can have a reference to their parent
413-
individuals, if present these references must be valid or null (-1).
419+
individuals, if present these references must be valid or null (-1). An
420+
individual cannot list itself as its own parent.
414421

415422
A valid tree sequence does not require individuals to be sorted in any
416423
particular order, and sorting a set of tables using {meth}`TableCollection.sort`
@@ -424,6 +431,7 @@ using {meth}`TableCollection.sort_individuals`.
424431
Given a valid set of individuals and populations, the requirements for
425432
each node are:
426433

434+
- `time` must be a finite (non-NaN, non-infinite) value;
427435
- `population` must either be null (-1) or refer to a valid population ID;
428436
- `individual` must either be null (-1) or refer to a valid individual ID.
429437

@@ -443,7 +451,7 @@ has no effect on nodes.
443451
Given a valid set of nodes and a sequence length {math}`L`, the simple
444452
requirements for each edge are:
445453

446-
- We must have {math}`0 \leq` `left` {math}`<` `right` {math}`\leq L`;
454+
- We must have finite coordinates with {math}`0 \leq` `left` {math}`<` `right` {math}`\leq L`;
447455
- `parent` and `child` must be valid node IDs;
448456
- `time[parent]` > `time[child]`;
449457
- edges must be unique (i.e., no duplicate edges are allowed).
@@ -480,7 +488,7 @@ properties are fulfilled.
480488
Given a valid set of nodes and a sequence length {math}`L`, the simple
481489
requirements for a valid set of sites are:
482490

483-
- We must have {math}`0 \leq` `position` {math}`< L`;
491+
- We must have a finite coordinate with {math}`0 \leq` `position` {math}`< L`;
484492
- `position` values must be unique.
485493

486494
For simplicity and algorithmic efficiency, sites must also:
@@ -546,19 +554,33 @@ will always fail. Use `tskit.is_unknown_time` to detect unknown values.
546554

547555
#### Migration requirements
548556

549-
Given a valid set of nodes and edges, the requirements for a value set of
557+
Given a valid set of nodes and edges, the requirements for a valid set of
550558
migrations are:
551559

552-
- `left` and `right` must lie within the tree sequence coordinate space (i.e.,
553-
from 0 to `sequence_length`).
554-
- `time` must be strictly between the time of its `node` and the time of any
555-
ancestral node from which that node inherits on the segment `[left, right)`.
556-
- The `population` of any such ancestor matching `source`, if another
557-
`migration` does not intervene.
560+
- `left` and `right` must be finite values that lie within the tree sequence
561+
coordinate space (i.e., from 0 to `sequence_length`), with {math}`0 \leq`
562+
`left` {math}`<` `right` {math}`\leq L`;
563+
- `node` must be a valid node ID;
564+
- if population references are checked, `source` and `dest` must be valid
565+
population IDs;
566+
- `time` must be a finite value.
558567

559-
To enable efficient processing, migrations must also be:
568+
To enable efficient processing, migrations must also be sorted by
569+
nondecreasing `time` value.
560570

561-
- Sorted by nondecreasing `time` value.
571+
Conceptually, a migration records that a segment of ancestry for the given
572+
`node` moves between populations along the tree. In typical demographic
573+
models we expect:
574+
575+
- `time` to lie strictly between the time of the migrating `node` and the time
576+
of any ancestral node from which that node inherits on the segment
577+
`[left, right)`;
578+
- the `population` of any such ancestor to match the `source` population,
579+
until another `migration` intervenes.
580+
581+
These conceptual relationships are not currently validated. It is
582+
the responsibility of code that creates migrations to satisfy them where
583+
required.
562584

563585
Note in particular that there is no requirement that adjacent migration records
564586
should be "squashed". That is, we can have two records `m1` and `m2`
@@ -582,8 +604,10 @@ There are no requirements on a population table.
582604
The `timestamp` column of a provenance table should be in
583605
[ISO-8601](https://en.wikipedia.org/wiki/ISO_8601) format.
584606

585-
The `record` should be valid JSON with structure defined in the Provenance
586-
Schema section (TODO).
607+
The `record` column stores a JSON document describing how and where the tree sequence
608+
was produced. For tree sequences generated by tskit and related tools, this JSON is
609+
expected to conform to the :ref:`provenance schema <sec_provenance_schema>` described
610+
in {ref}`sec_provenance`.
587611

588612

589613
(sec_table_indexes)=
@@ -1148,4 +1172,3 @@ you won't see those parts of the tree sequence that are unrelated to the samples
11481172
If you need to get those, too, you could either
11491173
work with the {meth}`TreeSequence.edge_diffs` directly,
11501174
or iterate over all nodes (instead of over {meth}`Tree.nodes`).
1151-

docs/export.md

Lines changed: 69 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -319,6 +319,18 @@ the individual IDs are 2 and 3. See the
319319
these default labels.
320320
:::
321321

322+
If some individuals have no associated nodes, they are omitted from the
323+
VCF output. By default, only nodes that are marked as samples contribute
324+
to the VCF genotypes; to include non-sample nodes as well (e.g., internal
325+
nodes that have been marked as individuals), set
326+
``include_non_sample_nodes=True`` when calling :meth:`TreeSequence.write_vcf`.
327+
328+
:::{note}
329+
At present, :meth:`TreeSequence.write_vcf` only supports sites with up to
330+
9 distinct alleles; attempting to write a site with more than 9 alleles
331+
will result in a :class:`ValueError`.
332+
:::
333+
322334
(sec_export_vcf_individual_names)=
323335

324336
### Individual names
@@ -369,31 +381,67 @@ id and individual id, and two underscores will throw an error.
369381

370382
### Modifying coordinates
371383

372-
Tskit supports continuous genome coordinates, but VCF only supports discrete
373-
genome positions. Thus, when we convert a tree sequence that has sites
374-
at continuous positions we must discretise the coordinates in some way.
375-
376-
The ``position_transform`` argument provides a way to flexibly translate
377-
the genomic location of sites in tskit to the appropriate value in VCF.
378-
There are two fundamental differences in the way that tskit and VCF define
379-
genomic coordinates. The first is that tskit uses floating point values
380-
to encode positions, whereas VCF uses integers. Thus, if the tree sequence
381-
contains positions at non-integral locations there is an information loss
382-
incurred by translating to VCF. By default, we round the site positions
383-
to the nearest integer, such that there may be several sites with the
384-
same integer position in the output. The second difference between VCF
385-
and tskit is that VCF is defined to be a 1-based coordinate system, whereas
386-
tskit uses 0-based. However, how coordinates are transformed depends
387-
on the VCF parser, and so we do **not** account for this change in
388-
coordinate system by default.
384+
Tree sequence site positions can be floating point values, whereas VCF
385+
requires positive integers. The ``position_transform`` argument
386+
controls how tskit maps coordinates into VCF. Translating non-integer
387+
positions necessarily loses precision; by default we round to the nearest
388+
integer, so multiple sites may share the same output position.
389+
Furthermore, tskit's coordinate system starts at zero,
390+
whereas the VCF standard requires positions to be positive,
391+
and so a site at position 0 is not valid in the VCF standard.
392+
Because VCF parsers differ, we do **not** do anything to account for this.
393+
394+
The simplest resolution of this discrepancy in convention between tskit and VCF
395+
positions is deal with any site at position 0 as a special case (for instance,
396+
by discarding or ignoring it).
397+
A different interpretation of this difference between tskit's position
398+
and VCF's POS field
399+
is that they are different coordinate systems: tskit coordinates are
400+
"distance to the right of the left end of the chromosome",
401+
while VCF coordinates are "which number site, counting from the left end
402+
of the chromosome and starting at one".
403+
Under this interpretation, the solution is to supply an explicit
404+
``position_transform`` that adds 1 to the coordinate when outputting
405+
to VCF (or, using the ``"legacy"`` option described below). However, this can
406+
easily lead to off-by-one errors converting between the coordinate systems,
407+
so should only be used if you really are using 0-based coordinates for your
408+
tree sequence.
409+
410+
:::{warning}
411+
Most VCF tools cannot deal with a POS value of 0. If your tree
412+
sequence contains a site with position 0, this will likely cause an error.
413+
:::
414+
415+
Internally, the coordinates used in the VCF output are obtained by applying
416+
the ``position_transform`` function to the array of site positions (and, for
417+
the contig length, to the tree sequence :attr:`.TreeSequence.sequence_length`).
418+
This function must return a one-dimensional array of the same length as its
419+
input; otherwise a :class:`ValueError` is raised. In addition to accepting a
420+
callable, tskit also supports the string value ``"legacy"`` here, which
421+
selects the pre-0.2.0 behaviour used by the original VCF exporter:
422+
positions are rounded to the nearest integer, starting at 1, and are forced
423+
to be strictly increasing by incrementing ties.
424+
425+
The VCF specification does not allow positions to be 0. By default, if any
426+
transformed position is 0, :meth:`TreeSequence.write_vcf` will raise a
427+
:class:`ValueError`. If you wish to retain these records you can either:
428+
429+
- set ``allow_position_zero=True`` to write such sites anyway;
430+
- mask the offending sites using the ``site_mask`` argument; or
431+
- choose a ``position_transform`` that maps 0 to a valid positive position.
432+
433+
For example, to shift all coordinates by 1, we could define:
434+
435+
```{code-cell}
436+
def one_based_positions(positions):
437+
return [int(round(x)) + 1 for x in positions]
438+
439+
ts.write_vcf(sys.stdout, position_transform=one_based_positions)
440+
```
389441

390442
:::{note}
391443
The msprime 0.x legacy API simulates using continuous coordinates. It may
392444
be simpler to update your code to use the msprime 1.0 API (which uses
393445
discrete coordinates by default) than to work out how to transform
394446
coordinates in a way that is suitable for your application.
395447
:::
396-
397-
:::{todo}
398-
Provide some examples of using position transform.
399-
:::

docs/file-formats.md

Lines changed: 16 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -43,9 +43,13 @@ e.g. ``python -m kastore ls file.trees``.
4343

4444
### Legacy Versions
4545

46-
Tree sequence files written by older versions of tskit are not readable by
47-
newer versions of tskit. For tskit releases<0.6.2, `tskit upgrade`
48-
will convert older tree sequence files to the latest version.
46+
Tree sequence files are versioned. This version of tskit can read
47+
``.trees`` files produced by earlier releases that use the same *major*
48+
file format version (see ``format/version`` in a kastore listing).
49+
Files written using the pre-kastore HDF5 format (for example, by
50+
msprime < 0.6.0 or tskit < 0.6.2) cannot be read directly. To convert
51+
such legacy files, use the ``tskit upgrade`` command from an older
52+
tskit version (< 0.6.2) to produce a modern ``.trees`` file.
4953

5054

5155
(sec_text_file_format)=
@@ -234,8 +238,14 @@ ts.dump_text(sites=sys.stdout)
234238
The mutation text format must contain the columns `site`,
235239
`node` and `derived_state`. The `time`, `parent` and `metadata` columns
236240
may also be optionally present (but `parent` must be specified if
237-
more than one mutation occurs at the same site). If `time` is absent
238-
`UNKNOWN_TIME` will be used to fill the column. See the
241+
more than one mutation occurs at the same site). If the `time` column
242+
is absent, the mutation times in the resulting tree sequence are set to
243+
{data}`tskit.UNKNOWN_TIME`, which is a numeric value that behaves like NaN.
244+
Unknown mutation times written out by
245+
{meth}`TreeSequence.dump_text` are represented in the text file by the
246+
literal string ``\"unknown\"`` in the `time` column, and
247+
{func}`tskit.load_text` treats this string as `UNKNOWN_TIME` on input.
248+
See the
239249
{ref}`mutation table definitions <sec_mutation_table_definition>`
240250
for details on these columns.
241251

@@ -280,5 +290,4 @@ ts.dump_text(populations=sys.stdout)
280290
```
281291

282292
The `metadata` contains base64-encoded data (in this case, the strings
283-
`pop1` and `pop1`).
284-
293+
`pop1` and `pop2`).

docs/stats.md

Lines changed: 8 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -35,11 +35,14 @@ Please see the {ref}`tutorial <tutorials:sec_tutorial_stats>` for examples of th
3535
statistics API in use.
3636

3737
:::{warning}
38-
{ref}`sec_data_model_missing_data` is not currently
39-
handled correctly by site statistics defined here, as we always
40-
assign missing data to be equal to the ancestral state. Later
41-
versions will add this behaviour as an option and will account
42-
for the presence of missing data by default.
38+
Site statistics defined here currently treat :ref:`missing data<sec_data_model_missing_data>`
39+
in the same way as in earlier versions of tskit: when computing site-based
40+
statistics, isolated samples without mutations directly above them are treated
41+
as carrying the ancestral allele rather than as missing. Future versions of
42+
tskit may expose options to treat missing data differently in statistics; for
43+
now, if you need explicit control over how missing data is handled you should
44+
use the low-level genotype/variant APIs (for example with
45+
``isolated_as_missing=True``) together with your own summary logic.
4346
:::
4447

4548
(sec_stats_available)=

python/tests/test_metadata.py

Lines changed: 29 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -217,6 +217,35 @@ def test_nodes(self):
217217
for a, b in zip(expected, n):
218218
assert a.encode("utf8") == b.metadata
219219

220+
@pytest.mark.parametrize(
221+
"base64_metadata,metadata_text,expected",
222+
[(True, "YWJj", b"abc"), (False, "plain", b"plain")],
223+
)
224+
def test_edges_metadata(self, base64_metadata, metadata_text, expected):
225+
edges = io.StringIO(
226+
f"""\
227+
left right parent child metadata
228+
0.0 1.0 2 0,1 {metadata_text}
229+
"""
230+
)
231+
table = tskit.parse_edges(
232+
edges, strict=False, encoding="utf8", base64_metadata=base64_metadata
233+
)
234+
assert len(table) == 2
235+
for row in table:
236+
assert row.metadata == expected
237+
238+
def test_edges_without_metadata_column(self):
239+
edges = io.StringIO(
240+
"""\
241+
left right parent child
242+
0.0 1.0 2 3
243+
"""
244+
)
245+
table = tskit.parse_edges(edges, strict=False, encoding="utf8")
246+
assert len(table) == 1
247+
assert table[0].metadata == b""
248+
220249
def test_sites(self):
221250
sites = io.StringIO(
222251
"""\

python/tests/test_tables.py

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2907,6 +2907,12 @@ def test_round_trip(self):
29072907
b = tables.tree_sequence()
29082908
assert a.tables == b.tables
29092909

2910+
@pytest.mark.parametrize("sequence_length", [np.inf, np.nan])
2911+
def test_nonfinite_sequence_length(self, sequence_length):
2912+
tables = tskit.TableCollection(sequence_length=sequence_length)
2913+
with pytest.raises(tskit.LibraryError, match="TSK_ERR_BAD_SEQUENCE_LENGTH"):
2914+
tables.tree_sequence()
2915+
29102916

29112917
class TestMutationTimeErrors:
29122918
def test_younger_than_node_below(self):

0 commit comments

Comments
 (0)