Skip to content

Commit 09541c4

Browse files
committed
edits
1 parent 1cd3753 commit 09541c4

File tree

1 file changed

+116
-27
lines changed

1 file changed

+116
-27
lines changed

docs/ibd.md

Lines changed: 116 additions & 27 deletions
Original file line numberDiff line numberDiff line change
@@ -73,18 +73,24 @@ coordinate ``[left, right)`` and ancestral node ``a`` iff the most
7373
recent common ancestor of the segment ``[left, right)`` in nodes ``u``
7474
and ``v`` is ``a``, and the segment has been inherited along the same
7575
genealogical path (ie. it has not been broken by recombination). The
76-
segments returned are the longest possible ones: for a fixed pair
77-
``(u, v)`` we follow the ancestral paths from ``u`` and ``v`` up the
78-
trees and merge together adjacent genomic intervals whenever both the
79-
MRCA ``a`` and the full ancestral paths from ``u`` and ``v`` to ``a``
80-
are identical.
76+
definition of a "genealogical path" used here is
77+
the sequence of edges, rather than nodes.
78+
So, for instance, if ``u`` inherits a segment ``[x, z)`` from ``a``,
79+
but that inheritance is represented by two edges,
80+
one spanning ``[x, y)`` and the other spanning ``[y, z)``,
81+
then this represents two genealogical paths,
82+
and any IBD segments would be split at ``y``.
83+
In other words, the method assumes that the end
84+
of an edge represents a recombination,
85+
an assumption that may not reflect how the tree sequence
86+
is used -- see below for more discussion.
8187

8288
This definition is purely genealogical: it depends only on the tree
8389
sequence topology and node times, and does not inspect allelic
8490
states or mutations. In particular, if we compute the MRCA of ``(u, v)``
8591
in each tree along the sequence, then (up to the additional refinement
86-
by genealogical path) the IBD segments are obtained by merging together
87-
adjacent MRCA intervals that share the same ancestor and paths to that
92+
by genealogical path) the IBD segments are those
93+
that share the same ancestor and paths to that
8894
ancestor. Intervals in which ``u`` and ``v`` lie in different roots
8995
have no MRCA and therefore do not contribute IBD segments.
9096

@@ -202,8 +208,6 @@ so only the pair ``(0, 2)`` appears in the result. In general:
202208
considered.
203209
- If ``within`` is omitted (the default), all nodes flagged as samples
204210
in the node table are used.
205-
- Passing an empty list, e.g. ``within=[]``, is allowed and simply
206-
yields a result with zero pairs and zero segments.
207211

208212
#### IBD between sample sets
209213

@@ -221,8 +225,6 @@ in the first set and the other lies in the second. More generally:
221225
- ``between`` should be a list of non-overlapping lists of node IDs.
222226
- All pairs ``(u, v)`` are considered such that ``u`` and ``v`` belong
223227
to different sample sets.
224-
- Empty sample sets are permitted (e.g., ``between=[[0, 1], []]``) and
225-
simply do not contribute any pairs.
226228

227229
The ``within`` and ``between`` arguments are mutually exclusive: passing
228230
both at the same time raises a :class:`ValueError`.
@@ -239,17 +241,16 @@ segments that we consider.
239241

240242
The ``max_time`` argument specifies an upper bound on the time of the
241243
common ancestor node: only IBD segments whose MRCA node has a time
242-
no greater than ``max_time`` are returned. The time is measured in
243-
the same units as the node times in the tree sequence (e.g., generations).
244+
no greater than ``max_time`` are returned.
244245

245246
The ``min_span`` argument filters by genomic length: only segments with
246-
span strictly greater than ``min_span`` are included. This threshold is
247-
measured in the same units as the ``sequence_length`` (for example,
248-
base pairs).
247+
span strictly greater than ``min_span`` are included.
249248

250-
For example:
249+
For example, working with ``ts2`` as the following tree sequence:
251250

252251
```{code-cell}
252+
:tags: [hide-input]
253+
253254
import io
254255
255256
nodes = io.StringIO(
@@ -258,30 +259,118 @@ nodes = io.StringIO(
258259
0 1 0
259260
1 1 0
260261
2 0 1
261-
3 0 1.5
262+
3 0 3
262263
"""
263264
)
264265
edges = io.StringIO(
265266
"""\
266267
left right parent child
267-
0.0 0.4 2 0,1
268-
0.4 1.0 3 0,1
268+
0 4 2 0,1
269+
4 10 3 0,1
269270
"""
270271
)
271272
ts2 = tskit.load_text(nodes=nodes, edges=edges, strict=False)
273+
SVG(ts2.draw_svg())
274+
```
272275

276+
There are two segments:
277+
```{code-cell}
273278
segments = ts2.ibd_segments(store_segments=True)
274279
print("all segments:", list(segments.values())[0])
275-
276-
segments_recent = ts2.ibd_segments(max_time=1.2, store_segments=True)
280+
```
281+
... but only the left-hand one is more recent than 2 time units ago:
282+
```{code-cell}
283+
segments_recent = ts2.ibd_segments(max_time=2, store_segments=True)
277284
print("max_time=1.2:", list(segments_recent.values())[0])
285+
```
286+
... and only the right-hand one is longer than 5 units.
287+
```{code-cell}
278288
279-
segments_long = ts2.ibd_segments(min_span=0.5, store_segments=True)
289+
segments_long = ts2.ibd_segments(min_span=5, store_segments=True)
280290
print("min_span=0.5:", list(segments_long.values())[0])
281291
```
282292

283-
Here the full result contains two IBD segments for the single sample
284-
pair, one inherited via ancestor 2 over ``[0.0, 0.4)`` and one via
285-
ancestor 3 over ``[0.4, 1.0)``. The ``max_time`` constraint removes the
286-
segment inherited from the older ancestor (time 1.5), while the
293+
So: the full result contains two IBD segments for the single sample
294+
pair, one inherited via ancestor 2 over ``[0, 4)`` and one via
295+
ancestor 3 over ``[4, 10)``. The ``max_time`` constraint removes the
296+
segment inherited from the older ancestor (time 3), while the
287297
``min_span`` constraint keeps only the longer of the two segments.
298+
299+
### More on the "pathwise" definition of IBD segments
300+
301+
We said above that the definition of IBD used by
302+
{meth}`.TreeSequence.ibd_segments` says that a given segment
303+
must be inherited from the MRCA along a single genealogical path,
304+
and that "genealogical paths" are defined *edgewise*.
305+
This can lead to surprising consequences.
306+
307+
Returning to our example above:
308+
```{code-cell}
309+
:tags: [hide-input]
310+
311+
SVG(ts.draw_svg())
312+
```
313+
there are two IBD segments between ``1`` and ``2``:
314+
```{code-cell}
315+
segments = ts.ibd_segments(within=[1, 2], store_pairs=True)
316+
for pair, value in segments.items():
317+
print(pair, "::", value)
318+
```
319+
This might be surprising, because the MRCA of ``1`` and ``2``
320+
is node ``4`` over the entire sequence.
321+
In fact, some definitions of IBD segments
322+
would have this as a single segment,
323+
because the MRCA does not change,
324+
even if there are distinct genealogical paths.
325+
326+
The reason this is split into two segments
327+
is because the path from ``4`` to ``2`` changes:
328+
on the left-hand segment ``[0, 2)``, the node ``2``
329+
inherits from node ``4``
330+
via node ``3``, while on the right-hand segment ``[2, 10)``
331+
it inherits from node ``4`` directly.
332+
The tree sequence doesn't say directly whether node ``2``
333+
also inherits from node ``3`` on the right-hand segment,
334+
so whether or not this should be one IBD segment or two
335+
depends on our interpretation
336+
of what's stored in the tree sequence.
337+
As discussed in
338+
[Fritze et al](https://doi.org/10.1093/genetics/iyaf198),
339+
most tree sequence simulators (at time of writing)
340+
will produce this tree sequence even if node ``2``
341+
does in fact inherit from ``3`` over the entire sequence.
342+
Using {meth}`.TreeSequence.extend_haplotypes` will
343+
"put the unary nodes back":
344+
```{code-cell}
345+
ets = ts.extend_haplotypes()
346+
SVG(ets.draw_svg())
347+
```
348+
and once this is done, there is only a single IBD segment:
349+
```{code-cell}
350+
segments = ets.ibd_segments(within=[1, 2], store_pairs=True)
351+
for pair, value in segments.items():
352+
print(pair, "::", value)
353+
```
354+
So, extending haplotypes may produce IBD segments
355+
more in line with theory, if the desired definition if IBD
356+
is the "pathwise" definition.
357+
However, this will also probably introduce erroneous
358+
portions of IBD segments,
359+
so caution is needed.
360+
Another approach would be to merge adjacent segments of IBD
361+
that have the same MRCA.
362+
363+
Summarizing this section --
364+
there is a confusing array of possible definitions
365+
of what it means to be "an IBD segment";
366+
and these may be extracted from a tree sequence
367+
in subtly different ways.
368+
How much of a problem is this?
369+
The answer depends on the precise situation,
370+
but it seems likely that in practice,
371+
differences due to definition are small
372+
relative to errors due to tree sequence inference.
373+
Indeed, empirical haplotype-matching methods
374+
for identifying IBD segments can differ substantially
375+
depending on the values of various hyperparameters.
376+
More work is needed to develop a complete picture.

0 commit comments

Comments
 (0)