@@ -356,14 +356,10 @@ required for a valid set of populations.
356356
357357#### Provenance Table
358358
359- :::{todo}
360- Document the provenance table.
361- :::
362-
363359| Column | Type | Description |
364360| :-------- | ----- | ----------------------------------------------------------------------: |
365361| timestamp | char | Timestamp in [ISO-8601](https://en.wikipedia.org/wiki/ISO_8601) format. |
366- | record | char | Provenance record. |
362+ | record | char | Provenance record as JSON. |
367363
368364
369365(sec_metadata_definition)=
@@ -374,10 +370,16 @@ Each table (excluding provenance) has a metadata column for storing and passing
374370information that tskit does not use or interpret. See {ref}` sec_metadata ` for details.
375371The metadata columns are {ref}` binary columns <sec_tables_api_binary_columns> ` .
376372
377- When using the {ref}` sec_text_file_format ` , to ensure that metadata can be safely
378- interchanged, each row is [base 64 encoded](https://en.wikipedia.org/wiki/Base64).
379- Thus, binary information can be safely printed and exchanged, but may not be
380- human readable.
373+ When using the {ref}`sec_text_file_format`, metadata values are written as opaque
374+ text. By default (`base64_metadata=True`), {meth}`TreeSequence.dump_text`
375+ base64-encodes metadata values that are stored as raw bytes so that binary data can
376+ be safely printed and exchanged; in this case {func}`tskit.load_text` base64-decodes
377+ the corresponding text fields back to bytes. When metadata has already been decoded
378+ to a structured Python object (for example via a metadata schema), the textual
379+ representation written by {meth}`TreeSequence.dump_text` is the `repr` of that
380+ object, and {func}`tskit.load_text` does not attempt to reconstruct the original
381+ structured value from this representation. For reliable metadata round-tripping,
382+ prefer the native binary tree sequence file format over the text formats.
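
The base64 round trip described above can be illustrated with the Python standard library alone. This is a minimal sketch, not tskit's implementation: `encode_metadata_field` and `decode_metadata_field` are hypothetical helper names introduced here for illustration.

```python
import base64


def encode_metadata_field(raw: bytes) -> str:
    """Hypothetical helper: turn raw metadata bytes into printable ASCII text."""
    return base64.b64encode(raw).decode("ascii")


def decode_metadata_field(text: str) -> bytes:
    """Hypothetical helper: recover the original bytes from the text field."""
    return base64.b64decode(text)


# Bytes containing tabs and newlines would break a tab-delimited text file;
# the base64 form contains neither, so it is safe to write as one field.
raw = b"\x00\x01binary\tmetadata\n"
text = encode_metadata_field(raw)
restored = decode_metadata_field(text)
```

Here `restored` equals `raw`, whereas the `repr` of a structured object is only a display string and carries no such guarantee.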
381383
382384The tree sequence itself also has metadata stored as a byte array.
383385
@@ -399,6 +401,10 @@ error message. Some more complex requirements may not be detectable at load-time
399401and errors may not occur until certain operations are attempted.
400402These are documented below.
401403
404+ At the tree-sequence level, we require that the coordinate space has a finite,
405+ strictly positive length; that is, the ` sequence_length ` attribute must be a
406+ finite value greater than zero.
407+
402408The Python API also provides tools that can transform a collection of
403409tables into a valid collection of tables, so long as they are logically
404410consistent, see {ref}` sec_tables_api_creating_valid_tree_sequence ` .
@@ -410,7 +416,8 @@ consistent, see {ref}`sec_tables_api_creating_valid_tree_sequence`.
410416
411417Individuals are a basic type in a tree sequence and are not defined with
412418respect to any other tables. Individuals can have a reference to their parent
413- individuals, if present these references must be valid or null (-1).
419+ individuals; if present, these references must be valid or null (-1). An
420+ individual cannot list itself as its own parent.
414421
415422A valid tree sequence does not require individuals to be sorted in any
416423particular order, and sorting a set of tables using {meth}` TableCollection.sort `
@@ -424,6 +431,7 @@ using {meth}`TableCollection.sort_individuals`.
424431Given a valid set of individuals and populations, the requirements for
425432each node are:
426433
434+ - ` time ` must be a finite (non-NaN, non-infinite) value;
427435- ` population ` must either be null (-1) or refer to a valid population ID;
428436- ` individual ` must either be null (-1) or refer to a valid individual ID.
429437
@@ -443,7 +451,7 @@ has no effect on nodes.
443451Given a valid set of nodes and a sequence length {math}` L ` , the simple
444452requirements for each edge are:
445453
446- - We must have {math}` 0 \leq ` ` left ` {math}` < ` ` right ` {math}` \leq L ` ;
454+ - We must have finite coordinates with {math}` 0 \leq ` ` left ` {math}` < ` ` right ` {math}` \leq L ` ;
447455- ` parent ` and ` child ` must be valid node IDs;
448456- ` time[parent] ` > ` time[child] ` ;
449457- edges must be unique (i.e., no duplicate edges are allowed).
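
The simple edge requirements listed above, together with the finite, positive `sequence_length` requirement, can be expressed as a small standalone check. This is a sketch under stated assumptions: `check_edges` is a hypothetical function, edges are plain `(left, right, parent, child)` tuples, and node times are a plain list. It is not how tskit itself validates tables.

```python
import math


def check_edges(edges, node_times, sequence_length):
    """Return a list of problem descriptions (empty if all checks pass).

    Hypothetical helper: edges is a list of (left, right, parent, child)
    tuples and node_times[u] is the time of node u.
    """
    problems = []
    if not (math.isfinite(sequence_length) and sequence_length > 0):
        problems.append("sequence_length must be finite and greater than zero")
    num_nodes = len(node_times)
    seen = set()
    for j, edge in enumerate(edges):
        left, right, parent, child = edge
        if not (math.isfinite(left) and math.isfinite(right)):
            problems.append(f"edge {j}: coordinates must be finite")
        elif not (0 <= left < right <= sequence_length):
            problems.append(f"edge {j}: need 0 <= left < right <= L")
        if not (0 <= parent < num_nodes and 0 <= child < num_nodes):
            problems.append(f"edge {j}: parent and child must be valid node IDs")
        elif not node_times[parent] > node_times[child]:
            problems.append(f"edge {j}: time[parent] must exceed time[child]")
        if edge in seen:
            problems.append(f"edge {j}: duplicate edge")
        seen.add(edge)
    return problems
```

For example, `check_edges([(0.0, 10.0, 1, 0)], [0.0, 1.0], 10.0)` returns an empty list, while an edge with an infinite `right` coordinate is reported.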
@@ -480,7 +488,7 @@ properties are fulfilled.
480488Given a valid set of nodes and a sequence length {math}` L ` , the simple
481489requirements for a valid set of sites are:
482490
483- - We must have {math}` 0 \leq ` ` position ` {math}` < L ` ;
491+ - We must have a finite coordinate with {math}` 0 \leq ` ` position ` {math}` < L ` ;
484492- ` position ` values must be unique.
485493
486494For simplicity and algorithmic efficiency, sites must also:
@@ -546,19 +554,33 @@ will always fail. Use `tskit.is_unknown_time` to detect unknown values.
546554
547555#### Migration requirements
548556
549- Given a valid set of nodes and edges, the requirements for a value set of
557+ Given a valid set of nodes and edges, the requirements for a valid set of
550558migrations are:
551559
552- - ` left ` and ` right ` must lie within the tree sequence coordinate space (i.e.,
553- from 0 to ` sequence_length ` ).
554- - ` time ` must be strictly between the time of its ` node ` and the time of any
555- ancestral node from which that node inherits on the segment ` [left, right) ` .
556- - The ` population ` of any such ancestor matching ` source ` , if another
557- ` migration ` does not intervene.
560+ - `left` and `right` must be finite values within the tree sequence coordinate
561+ space: {math}`0 \leq` `left` {math}`<` `right` {math}`\leq L`, where
562+ {math}`L` is the `sequence_length`;
563+ - ` node ` must be a valid node ID;
564+ - if population references are checked, ` source ` and ` dest ` must be valid
565+ population IDs;
566+ - ` time ` must be a finite value.
558567
559- To enable efficient processing, migrations must also be:
568+ To enable efficient processing, migrations must also be sorted by
569+ nondecreasing ` time ` value.
560570
561- - Sorted by nondecreasing ` time ` value.
571+ Conceptually, a migration records that a segment of ancestry for the given
572+ ` node ` moves between populations along the tree. In typical demographic
573+ models we expect:
574+
575+ - ` time ` to lie strictly between the time of the migrating ` node ` and the time
576+ of any ancestral node from which that node inherits on the segment
577+ ` [left, right) ` ;
578+ - the ` population ` of any such ancestor to match the ` source ` population,
579+ until another ` migration ` intervenes.
580+
581+ These conceptual relationships are not currently validated. It is
582+ the responsibility of code that creates migrations to satisfy them where
583+ required.
562584
563585Note in particular that there is no requirement that adjacent migration records
564586should be "squashed". That is, we can have two records ` m1 ` and ` m2 `
@@ -582,8 +604,10 @@ There are no requirements on a population table.
582604The ` timestamp ` column of a provenance table should be in
583605[ISO-8601](https://en.wikipedia.org/wiki/ISO_8601) format.
584606
585- The ` record ` should be valid JSON with structure defined in the Provenance
586- Schema section (TODO).
607+ The `record` column stores a JSON document describing how and where the tree sequence
608+ was produced. For tree sequences generated by tskit and related tools, this JSON is
609+ expected to conform to the {ref}`provenance schema <sec_provenance_schema>` described
610+ in {ref}`sec_provenance`.
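
As a rough illustration of the two provenance columns, the sketch below builds an ISO-8601 timestamp and a JSON `record` string using only the standard library. The helper names and the example record contents are hypothetical. The fields shown are placeholders and are not claimed to satisfy the provenance schema.

```python
import json
from datetime import datetime, timezone


def make_provenance_row(record):
    """Hypothetical helper: return (timestamp, record) as two strings."""
    timestamp = datetime.now(timezone.utc).isoformat()
    return timestamp, json.dumps(record, sort_keys=True)


def looks_like_valid_row(timestamp, record):
    """Check that timestamp parses as ISO-8601 and record parses as JSON.

    Note: datetime.fromisoformat accepts only a subset of ISO-8601 on
    older Python versions, so this is a conservative check.
    """
    try:
        datetime.fromisoformat(timestamp)
        json.loads(record)
    except ValueError:
        return False
    return True


# Placeholder contents for illustration only.
example = {"parameters": {"command": "simulate"}, "note": "illustrative"}
timestamp, record = make_provenance_row(example)
```

A check like `looks_like_valid_row` only confirms that both fields are syntactically well formed; validating the record against the provenance schema is a separate step.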
587611
588612
589613(sec_table_indexes)=
@@ -1148,4 +1172,3 @@ you won't see those parts of the tree sequence that are unrelated to the samples
11481172If you need to get those, too, you could either
11491173work with the {meth}` TreeSequence.edge_diffs ` directly,
11501174or iterate over all nodes (instead of over {meth}` Tree.nodes ` ).
1151-