@@ -73,18 +73,24 @@ coordinate ``[left, right)`` and ancestral node ``a`` iff the most
7373recent common ancestor of the segment `` [left, right) `` in nodes `` u ``
7474and `` v `` is `` a `` , and the segment has been inherited along the same
7575genealogical path (ie. it has not been broken by recombination). The
76- segments returned are the longest possible ones: for a fixed pair
77- `` (u, v) `` we follow the ancestral paths from `` u `` and `` v `` up the
78- trees and merge together adjacent genomic intervals whenever both the
79- MRCA `` a `` and the full ancestral paths from `` u `` and `` v `` to `` a ``
80- are identical.
76+ definition of a "genealogical path" used here is
77+ the sequence of edges, rather than nodes.
78+ So, for instance, if `` u `` inherits a segment `` [x, z) `` from `` a `` ,
79+ but that inheritance is represented by two edges,
80+ one spanning `` [x, y) `` and the other spanning `` [y, z) `` ,
81+ then this represents two genealogical paths,
82+ and any IBD segments would be split at `` y `` .
83+ In other words, the method assumes that the end
84+ of an edge represents a recombination,
85+ an assumption that may not reflect how the tree sequence
86+ is used -- see below for more discussion.
8187
8288This definition is purely genealogical: it depends only on the tree
8389sequence topology and node times, and does not inspect allelic
8490states or mutations. In particular, if we compute the MRCA of `` (u, v) ``
8591in each tree along the sequence, then (up to the additional refinement
86- by genealogical path) the IBD segments are obtained by merging together
87- adjacent MRCA intervals that share the same ancestor and paths to that
92+ by genealogical path) the IBD segments are those
93+ that share the same ancestor and paths to that
8894ancestor. Intervals in which `` u `` and `` v `` lie in different roots
8995have no MRCA and therefore do not contribute IBD segments.
9096
@@ -202,8 +208,6 @@ so only the pair ``(0, 2)`` appears in the result. In general:
202208 considered.
203209- If `` within `` is omitted (the default), all nodes flagged as samples
204210 in the node table are used.
205- - Passing an empty list, e.g. `` within=[] `` , is allowed and simply
206- yields a result with zero pairs and zero segments.
207211
208212#### IBD between sample sets
209213
@@ -221,8 +225,6 @@ in the first set and the other lies in the second. More generally:
221225- `` between `` should be a list of non-overlapping lists of node IDs.
222226- All pairs `` (u, v) `` are considered such that `` u `` and `` v `` belong
223227 to different sample sets.
224- - Empty sample sets are permitted (e.g., `` between=[[0, 1], []] `` ) and
225- simply do not contribute any pairs.
226228
227229The `` within `` and `` between `` arguments are mutually exclusive: passing
228230both at the same time raises a :class:` ValueError ` .
@@ -239,17 +241,16 @@ segments that we consider.
239241
240242The `` max_time `` argument specifies an upper bound on the time of the
241243common ancestor node: only IBD segments whose MRCA node has a time
242- no greater than `` max_time `` are returned. The time is measured in
243- the same units as the node times in the tree sequence (e.g., generations).
244+ no greater than `` max_time `` are returned.
244245
245246The `` min_span `` argument filters by genomic length: only segments with
246- span strictly greater than `` min_span `` are included. This threshold is
247- measured in the same units as the `` sequence_length `` (for example,
248- base pairs).
247+ span strictly greater than `` min_span `` are included.
249248
250- For example:
249+ For example, working with `` ts2 `` as the following tree sequence :
251250
252251``` {code-cell}
252+ :tags: [hide-input]
253+
253254import io
254255
255256nodes = io.StringIO(
@@ -258,30 +259,118 @@ nodes = io.StringIO(
258259 0 1 0
259260 1 1 0
260261 2 0 1
261- 3 0 1.5
262+ 3 0 3
262263 """
263264)
264265edges = io.StringIO(
265266 """\
266267 left right parent child
267- 0.0 0. 4 2 0,1
268- 0. 4 1.0 3 0,1
268+ 0 4 2 0,1
269+ 4 10 3 0,1
269270 """
270271)
271272ts2 = tskit.load_text(nodes=nodes, edges=edges, strict=False)
273+ SVG(ts2.draw_svg())
274+ ```
272275
276+ There are two segments:
277+ ``` {code-cell}
273278segments = ts2.ibd_segments(store_segments=True)
274279print("all segments:", list(segments.values())[0])
275-
276- segments_recent = ts2.ibd_segments(max_time=1.2, store_segments=True)
280+ ```
281+ ... but only the left-hand one is more recent than 2 time units ago:
282+ ``` {code-cell}
283+ segments_recent = ts2.ibd_segments(max_time=2, store_segments=True)
277284print("max_time=1.2:", list(segments_recent.values())[0])
285+ ```
286+ ... and only the right-hand one is longer than 5 units.
287+ ``` {code-cell}
278288
279- segments_long = ts2.ibd_segments(min_span=0. 5, store_segments=True)
289+ segments_long = ts2.ibd_segments(min_span=5, store_segments=True)
280290print("min_span=0.5:", list(segments_long.values())[0])
281291```
282292
283- Here the full result contains two IBD segments for the single sample
284- pair, one inherited via ancestor 2 over `` [0.0, 0. 4) `` and one via
285- ancestor 3 over `` [0. 4, 1.0 ) `` . The `` max_time `` constraint removes the
286- segment inherited from the older ancestor (time 1.5 ), while the
293+ So: the full result contains two IBD segments for the single sample
294+ pair, one inherited via ancestor 2 over `` [0, 4) `` and one via
295+ ancestor 3 over `` [4, 10 ) `` . The `` max_time `` constraint removes the
296+ segment inherited from the older ancestor (time 3 ), while the
287297`` min_span `` constraint keeps only the longer of the two segments.
298+
299+ ### More on the "pathwise" definition of IBD segments
300+
301+ We said above that the definition of IBD used by
302+ {meth}` .TreeSequence.ibd_segments ` says that a given segment
303+ must be inherited from the MRCA along a single genealogical path,
304+ and that "genealogical paths" are defined * edgewise* .
305+ This can lead to surprising consequences.
306+
307+ Returning to our example above:
308+ ``` {code-cell}
309+ :tags: [hide-input]
310+
311+ SVG(ts.draw_svg())
312+ ```
313+ there are two IBD segments between `` 1 `` and `` 2 `` :
314+ ``` {code-cell}
315+ segments = ts.ibd_segments(within=[1, 2], store_pairs=True)
316+ for pair, value in segments.items():
317+ print(pair, "::", value)
318+ ```
319+ This might be surprising, because the MRCA of `` 1 `` and `` 2 ``
320+ is node `` 4 `` over the entire sequence.
321+ In fact, some definitions of IBD segments
322+ would have this as a single segment,
323+ because the MRCA does not change,
324+ even if there are distinct genealogical paths.
325+
326+ The reason this is split into two segments
327+ is because the path from `` 4 `` to `` 2 `` changes:
328+ on the left-hand segment `` [0, 2) `` , the node `` 2 ``
329+ inherits from node `` 4 ``
330+ via node `` 3 `` , while on the right-hand segment `` [2, 10) ``
331+ it inherits from node `` 4 `` directly.
332+ The tree sequence doesn't say directly whether node `` 2 ``
333+ also inherits from node `` 3 `` on the right-hand segment,
334+ so whether or not this should be one IBD segment or two
335+ depends on our interpretation
336+ of what's stored in the tree sequence.
337+ As discussed in
338+ [ Fritze et al] ( https://doi.org/10.1093/genetics/iyaf198 ) ,
339+ most tree sequence simulators (at time of writing)
340+ will produce this tree sequence even if node `` 2 ``
341+ does in fact inherit from `` 3 `` over the entire sequence.
342+ Using {meth}` .TreeSequence.extend_haplotypes ` will
343+ "put the unary nodes back":
344+ ``` {code-cell}
345+ ets = ts.extend_haplotypes()
346+ SVG(ets.draw_svg())
347+ ```
348+ and once this is done, there is only a single IBD segment:
349+ ``` {code-cell}
350+ segments = ets.ibd_segments(within=[1, 2], store_pairs=True)
351+ for pair, value in segments.items():
352+ print(pair, "::", value)
353+ ```
354+ So, extending haplotypes may produce IBD segments
355+ more in line with theory, if the desired definition if IBD
356+ is the "pathwise" definition.
357+ However, this will also probably introduce erroneous
358+ portions of IBD segments,
359+ so caution is needed.
360+ Another approach would be to merge adjacent segments of IBD
361+ that have the same MRCA.
362+
363+ Summarizing this section --
364+ there is a confusing array of possible definitions
365+ of what it means to be "an IBD segment";
366+ and these may be extracted from a tree sequence
367+ in subtly different ways.
368+ How much of a problem is this?
369+ The answer depends on the precise situation,
370+ but it seems likely that in practice,
371+ differences due to definition are small
372+ relative to errors due to tree sequence inference.
373+ Indeed, empirical haplotype-matching methods
374+ for identifying IBD segments can differ substantially
375+ depending on the values of various hyperparameters.
376+ More work is needed to develop a complete picture.
0 commit comments