Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
203 changes: 187 additions & 16 deletions docs/ibd.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,16 +20,12 @@ kernelspec:
# Identity by descent

The {meth}`.TreeSequence.ibd_segments` method allows us to compute
segments of identity by descent.
segments of identity by descent along a tree sequence.

:::{note}
This documentation page is preliminary
:::

:::{todo}
Relate the concept of identity by descent to the MRCA spans in the tree sequence.
:::

## Examples

Let's take a simple tree sequence to illustrate the {meth}`.TreeSequence.ibd_segments`
Expand Down Expand Up @@ -77,7 +73,26 @@ coordinate ``[left, right)`` and ancestral node ``a`` iff the most
recent common ancestor of the segment ``[left, right)`` in nodes ``u``
and ``v`` is ``a``, and the segment has been inherited along the same
genealogical path (ie. it has not been broken by recombination). The
segments returned are the longest possible ones.
definition of a "genealogical path" used here is
the sequence of edges, rather than nodes.
So, for instance, if ``u`` inherits a segment ``[x, z)`` from ``a``,
but that inheritance is represented by two edges,
one spanning ``[x, y)`` and the other spanning ``[y, z)``,
then this represents two genealogical paths,
and any IBD segments would be split at ``y``.
In other words, the method assumes that the end
of an edge represents a recombination,
an assumption that may not reflect how the tree sequence
is used -- see below for more discussion.

This definition is purely genealogical: it depends only on the tree
sequence topology and node times, and does not inspect allelic
states or mutations. In particular, if we compute the MRCA of ``(u, v)``
in each tree along the sequence, then (up to the additional refinement
by genealogical path) the IBD segments are those
that share the same ancestor and paths to that
ancestor. Intervals in which ``u`` and ``v`` lie in different roots
have no MRCA and therefore do not contribute IBD segments.

Consider the IBD segments that we get from our example tree sequence:

Expand Down Expand Up @@ -109,7 +124,11 @@ By default this class only stores the high-level summaries of the
IBD segments discovered. As we can see in this example, we have a
total of six segments and
the total span (i.e., the sum lengths of the genomic intervals spanned
by IBD segments) is 30.
by IBD segments) is 30. In this default mode the object does not
store information about individual sample pairs, and methods that
inspect per-pair information (such as indexing with ``[(a, b)]`` or
iterating over the mapping) will raise an
``IdentityPairsNotStoredError``.

If required, we can get more detailed information about particular
segment pairs and the actual segments using the ``store_pairs``
Expand Down Expand Up @@ -148,8 +167,12 @@ segments = ts.ibd_segments(store_pairs=True, store_segments=True)
segments[(0, 1)]
```

The {class}`.IdentitySegmentList` behaves like a Python list,
where each element is an instance of {class}`.IdentitySegment`.
When ``store_segments=True``, the {class}`.IdentitySegmentList` behaves
like a Python list, where each element is an instance of
{class}`.IdentitySegment`. When only ``store_pairs=True`` is specified,
the number of segments and their total span are still available, but
attempting to iterate over the list or access the per-segment arrays
will raise an ``IdentitySegmentsNotStoredError``.

:::{warning}
The order of segments in an {class}`.IdentitySegmentList`
Expand All @@ -168,19 +191,24 @@ By default we get the IBD segments between all pairs of
{ref}`sample<sec_data_model_definitions_sample>` nodes.

#### IBD within a sample set

We can reduce this to pairs within a specific set using the
``within`` argument:


```{eval-rst}
.. todo:: More detail and better examples here.
```

```{code-cell}
segments = ts.ibd_segments(within=[0, 2], store_pairs=True)
print(list(segments.keys()))
```

Here we have restricted attention to the samples with node IDs 0 and 2,
so only the pair ``(0, 2)`` appears in the result. In general:

- ``within`` should be a one-dimensional array-like of node IDs
(typically sample nodes). All unordered pairs from this set are
considered.
- If ``within`` is omitted (the default), all nodes flagged as samples
in the node table are used.

#### IBD between sample sets

We can also compute IBD **between** sample sets:
Expand All @@ -190,6 +218,17 @@ segments = ts.ibd_segments(between=[[0,1], [2]], store_pairs=True)
print(list(segments.keys()))
```

In this example we have two sample sets, ``[0, 1]`` and ``[2]``, so the
identity segments are computed only for pairs in which one sample lies
in the first set and the other lies in the second. More generally:

- ``between`` should be a list of non-overlapping lists of node IDs.
- All pairs ``(u, v)`` are considered such that ``u`` and ``v`` belong
to different sample sets.

The ``within`` and ``between`` arguments are mutually exclusive: passing
both at the same time raises a :class:`ValueError`.

:::{seealso}
See the {meth}`.TreeSequence.ibd_segments` documentation for
more details.
Expand All @@ -200,6 +239,138 @@ more details.
The ``max_time`` and ``min_span`` arguments allow us to constrain the
segments that we consider.

```{eval-rst}
.. todo:: Add examples for these arguments.
The ``max_time`` argument specifies an upper bound on the time of the
common ancestor node: only IBD segments whose MRCA node has a time
no greater than ``max_time`` are returned.

The ``min_span`` argument filters by genomic length: only segments with
span strictly greater than ``min_span`` are included.

For example, working with ``ts2`` as the following tree sequence:

```{code-cell}
:tags: [hide-input]

import io

nodes = io.StringIO(
"""\
id is_sample time
0 1 0
1 1 0
2 0 1
3 0 3
"""
)
edges = io.StringIO(
"""\
left right parent child
0 4 2 0,1
4 10 3 0,1
"""
)
ts2 = tskit.load_text(nodes=nodes, edges=edges, strict=False)
SVG(ts2.draw_svg())
```

There are two segments:
```{code-cell}
segments = ts2.ibd_segments(store_segments=True)
print("all segments:", list(segments.values())[0])
```
... but only the left-hand one is more recent than 2 time units ago:
```{code-cell}
segments_recent = ts2.ibd_segments(max_time=2, store_segments=True)
print("max_time=1.2:", list(segments_recent.values())[0])
```
... and only the right-hand one is longer than 5 units.
```{code-cell}

segments_long = ts2.ibd_segments(min_span=5, store_segments=True)
print("min_span=0.5:", list(segments_long.values())[0])
```

So: the full result contains two IBD segments for the single sample
pair, one inherited via ancestor 2 over ``[0, 4)`` and one via
ancestor 3 over ``[4, 10)``. The ``max_time`` constraint removes the
segment inherited from the older ancestor (time 3), while the
``min_span`` constraint keeps only the longer of the two segments.

### More on the "pathwise" definition of IBD segments

We said above that the definition of IBD used by
{meth}`.TreeSequence.ibd_segments` says that a given segment
must be inherited from the MRCA along a single genealogical path,
and that "genealogical paths" are defined *edgewise*.
This can lead to surprising consequences.

Returning to our example above:
```{code-cell}
:tags: [hide-input]

SVG(ts.draw_svg())
```
there are two IBD segments between ``1`` and ``2``:
```{code-cell}
segments = ts.ibd_segments(within=[1, 2], store_pairs=True)
for pair, value in segments.items():
print(pair, "::", value)
```
This might be surprising, because the MRCA of ``1`` and ``2``
is node ``4`` over the entire sequence.
In fact, some definitions of IBD segments
would have this as a single segment,
because the MRCA does not change,
even if there are distinct genealogical paths.

The reason this is split into two segments
is because the path from ``4`` to ``2`` changes:
on the left-hand segment ``[0, 2)``, the node ``2``
inherits from node ``4``
via node ``3``, while on the right-hand segment ``[2, 10)``
it inherits from node ``4`` directly.
The tree sequence doesn't say directly whether node ``2``
also inherits from node ``3`` on the right-hand segment,
so whether or not this should be one IBD segment or two
depends on our interpretation
of what's stored in the tree sequence.
As discussed in
[Fritze et al](https://doi.org/10.1093/genetics/iyaf198),
most tree sequence simulators (at time of writing)
will produce this tree sequence even if node ``2``
does in fact inherit from ``3`` over the entire sequence.
Using {meth}`.TreeSequence.extend_haplotypes` will
"put the unary nodes back":
```{code-cell}
ets = ts.extend_haplotypes()
SVG(ets.draw_svg())
```
and once this is done, there is only a single IBD segment:
```{code-cell}
segments = ets.ibd_segments(within=[1, 2], store_pairs=True)
for pair, value in segments.items():
print(pair, "::", value)
```
So, extending haplotypes may produce IBD segments
more in line with theory, if the desired definition if IBD
is the "pathwise" definition.
However, this will also probably introduce erroneous
portions of IBD segments,
so caution is needed.
Another approach would be to merge adjacent segments of IBD
that have the same MRCA.

Summarizing this section --
there is a confusing array of possible definitions
of what it means to be "an IBD segment";
and these may be extracted from a tree sequence
in subtly different ways.
How much of a problem is this?
The answer depends on the precise situation,
but it seems likely that in practice,
differences due to definition are small
relative to errors due to tree sequence inference.
Indeed, empirical haplotype-matching methods
for identifying IBD segments can differ substantially
depending on the values of various hyperparameters.
More work is needed to develop a complete picture.
19 changes: 11 additions & 8 deletions python/tskit/tables.py
Original file line number Diff line number Diff line change
Expand Up @@ -2737,8 +2737,8 @@ def assert_equals(self, other, *, ignore_timestamps=False):
@dataclasses.dataclass(eq=True, order=True)
class IdentitySegment:
"""
A single segment of identity spanning a genomic interval for a
a specific ancestor node.
A single segment of identity by descent spanning a genomic interval
for a specific ancestor node.
"""

left: float
Expand All @@ -2758,7 +2758,7 @@ def span(self) -> float:

class IdentitySegmentList(collections.abc.Iterable, collections.abc.Sized):
"""
A summary of identity segments for some pair of samples in a
A summary of identity-by-descent segments for some pair of samples in a
:class:`.IdentitySegments` result. If the ``store_segments`` argument
has been specified to :meth:`.TreeSequence.ibd_segments`, this class
can be treated as a sequence of :class:`.IdentitySegment` objects.
Expand All @@ -2769,7 +2769,9 @@ class IdentitySegmentList(collections.abc.Iterable, collections.abc.Sized):

If ``store_segments`` is False, only the overall summary values
such as :attr:`.IdentitySegmentList.total_span` and ``len()`` are
available.
available. Attempting to iterate over the list or access per-segment
arrays (``left``, ``right``, or ``node``) in this case will raise an
``IdentitySegmentsNotStoredError``.

.. warning:: The order of segments within an IdentitySegmentList is
arbitrary and may change in the future
Expand Down Expand Up @@ -2833,7 +2835,7 @@ def node(self):
class IdentitySegments(collections.abc.Mapping):
"""
A class summarising and optionally storing the segments of identity
by state returned by :meth:`.TreeSequence.ibd_segments`. See the
by descent returned by :meth:`.TreeSequence.ibd_segments`. See the
:ref:`sec_identity` for more information and examples.

Along with the documented methods and attributes, the class supports
Expand All @@ -2845,9 +2847,10 @@ class IdentitySegments(collections.abc.Mapping):
for a given instance of this class are determined by the
``store_pairs`` and ``store_segments`` arguments provided to
:meth:`.TreeSequence.ibd_segments`. For example, attempting
to access per-sample pair information if ``store_pairs``
is False will result in a (hopefully informative) error being
raised.
to access per-sample pair information (such as indexing with
``[(a, b)]``, iterating over the mapping, or accessing
:attr:`.IdentitySegments.pairs`) if ``store_pairs`` is False will
result in an ``IdentityPairsNotStoredError`` being raised.

.. warning:: This class should not be instantiated directly.
"""
Expand Down
4 changes: 2 additions & 2 deletions python/tskit/trees.py
Original file line number Diff line number Diff line change
Expand Up @@ -10593,7 +10593,7 @@ def ibd_segments(
represented by the tree sequence.

:param list within: A list of node IDs defining set of nodes that
we finding IBD segments for. If not specified, this defaults to
we find IBD segments for. If not specified, this defaults to
all samples in the tree sequence.
:param list[list] between: A list of lists of sample node IDs. Given
two sample sets A and B, only IBD segments will be returned such
Expand All @@ -10608,7 +10608,7 @@ def ibd_segments(
segment) is greater than this value will be included. (Default=0)
:param bool store_pairs: If True store information separately for each
pair of samples ``(a, b)`` that are found to be IBD. Otherwise
store summary information about all sample apirs. (Default=False)
store summary information about all sample pairs. (Default=False)
:param bool store_segments: If True store each IBD segment
``(left, right, c)`` and associate it with the corresponding
sample pair ``(a, b)``. If True, implies ``store_pairs``.
Expand Down