fix: rejecting inf as value #471

proost · 2026-01-07T10:34:18Z

I'm currently writing a Go implementation of t-digest.

While referring to the C++ implementation, I found that there is no guard against +inf and -inf.

Additional question: the t-digest has no validation logic for overflow or underflow in its calculations. Is this intended behavior?

AlexanderSaydakov · 2026-01-08T04:08:58Z

Regarding overflow and underflow, I believe I followed the reference implementation as closely as practical. If you see potential problems, I would suggest asking @tdunning first.

tdunning · 2026-01-08T04:32:16Z

@proost Let's talk. Feel free to ping me directly on my apache email address. I would love to work with you on a Go implementation and there are important simplifications in Go relative to the Java implementation because you can easily have an array of structs.

Regarding your question, a guard for ±inf is more subtle than it looks. If you insert a single infinite value, it will have a centroid with one data point in it. That's fine. If you insert another on the same side, you will start to get centroids with infinite value which is actually numerically correct. You will also have effects on the interpolation code, but that is also likely fairly reasonable.

How do you think an empirical distribution with infinite values should be handled?

What do you think the mean of such a distribution should be?

proost · 2026-01-08T05:12:33Z

@tdunning
Thank you for the thoughtful response. I now understand that whether to allow infinities really depends on what semantics we want the empirical distribution to have.

In my use case, I was mainly trying to protect the t-digest from values that come from numeric overflow or underflow, not from intentional ±∞ in the data. Your explanation helped clarify the distinction for me.
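
To make the overflow case concrete, here is a minimal C++ sketch of how finite inputs can turn into +inf before they ever reach a sketch. `mean_of_squares` and `is_safe_input` are hypothetical names invented for this illustration; neither is part of datasketches-cpp. Underflow, by contrast, produces zero or a subnormal value rather than an infinity, so it would pass a finiteness check.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical upstream computation (not part of datasketches-cpp):
// mean of squares, a typical place where finite inputs can overflow.
inline double mean_of_squares(const std::vector<double>& xs) {
  assert(!xs.empty());
  double sum = 0.0;
  for (double x : xs) sum += x * x;  // x * x becomes +inf once |x| > ~1.34e154
  return sum / static_cast<double>(xs.size());
}

// Hypothetical caller-side guard: true only for finite values, so overflow
// artifacts (+inf, -inf) and NaN can be dropped before they reach a sketch.
inline bool is_safe_input(double x) {
  return std::isfinite(x);
}
```

With these helpers, `mean_of_squares({1e200, 1.0})` is infinite even though both inputs are finite, and `is_safe_input` rejects it. A guard like this lives on the caller's side, which keeps the question of what the sketch itself should do with ±∞ separate.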

proost · 2026-01-08T05:14:53Z

@AlexanderSaydakov

There is one thing I wanted to clarify regarding this PR.

In the current implementation, ±∞ values can flow into the t-digest and be treated as data points. With this PR, ±∞ values passed to update() are silently ignored, and ±∞ passed to query methods such as get_rank() result in an exception. That means this change alters the behavior for users who might already be passing infinity values (intentionally or not), which effectively makes it a breaking change.

If supporting ±∞ as valid data was intentional in the original design, then I agree that changing this behavior is not appropriate. However, if it wasn’t an intentional design choice and infinity handling was simply unspecified, then I think explicitly rejecting/ignoring infinities and documenting the behavior would improve robustness and make the policy clearer for users.

I wanted to check your view on this before moving forward, since it affects compatibility and library semantics.

tdunning · 2026-01-08T07:31:15Z

If you have a function that adds up an array of values, what is the correct behavior if one of the values is +∞ and the others are normal numbers? I think the answer should be +∞.

What if you pass in +∞ and -∞ along with normal values? I think the answer should be NaN.

This is the behavior that any function like sum should follow. This includes mean and, in a more nuanced way, t-digest.

What is the median of [0, 0, 1, +∞, +∞]? I think it should be 1, not 0. Julia agrees:

```
julia> median([0, 0, 1, +Inf, +Inf])
1.0
```

If the infinities are ignored, we get the wrong answer.
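
The same median example can be reproduced in C++. This is only a sketch on an exact, fully stored sample: `median` and `median_ignoring_inf` are hypothetical helpers written for this illustration, and a t-digest estimates quantiles from centroids rather than sorting raw values.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Exact median of a small sample (sorts a copy). Illustrative only.
inline double median(std::vector<double> v) {
  assert(!v.empty());
  std::sort(v.begin(), v.end());  // +inf and -inf order correctly against finite values
  const std::size_t n = v.size();
  return n % 2 == 1 ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

// The same median after silently dropping infinite values, which is what a
// filter on update would amount to for this sample.
// Assumes at least one finite value remains.
inline double median_ignoring_inf(std::vector<double> v) {
  v.erase(std::remove_if(v.begin(), v.end(),
                         [](double x) { return std::isinf(x); }),
          v.end());
  return median(std::move(v));
}
```

For the sample [0, 0, 1, +∞, +∞], `median` returns 1 while `median_ignoring_inf` returns 0, which is the discrepancy described above.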

proost · 2026-01-08T10:44:46Z

@tdunning

Thanks for the detailed explanation and examples.

Since this PR goes against the core model of the data structure, I'm going to close it. Thanks for taking the time to explain the reasoning; it helped clarify the right direction here!

tisonkun · 2026-01-12T02:56:23Z

Allowing +Inf and -Inf in the TDigest structure would cause internal NaN values during compression or merging and potentially break internal assumptions.

See apache/datasketches-java#702 and apache/datasketches-rust#23 (comment).

> This is the behavior that any function like sum should follow. This includes mean and, in a more nuanced way, t-digest. What is the median of [0, 0, 1, +∞, +∞]? I think it should be 1, not 0. Julia agrees: julia> median([0, 0, 1, +Inf, +Inf]) 1.0 If the infinities are ignored, we get the wrong answer.

I agree with this to some extent. But since t-digest sketches can be freely merged and compressed, this may cause the issues described above.

tdunning · 2026-01-12T03:07:16Z

Arithmetic with infinity in IEEE floating point *can* produce NaN, but only when opposing values are involved. Sorting works correctly because infinities compare with normal numbers. Since the centroids are always sorted, merging will only ever combine infinities of like sign. This means that infinity will creep slowly like a contagion into the centroids in the digest. If there are lots of infinite values in a sample, we may get a slight over-estimate of quantiles because an entire centroid will go to infinity if a single infinite value is introduced, but no single centroid is that large in a reasonably configured digest.

```
julia> 3 + +Inf
Inf

julia> +Inf + +Inf
Inf

julia> +Inf + -Inf
NaN

julia> -Inf + -Inf
-Inf

julia> 3 + -Inf
-Inf
```
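
The arithmetic above can be applied to a weighted centroid merge in a short C++ sketch. `Centroid`, `merge`, `merge_incremental`, and `is_poisoned` are hypothetical names for illustration and are not taken from datasketches-cpp; the real centroid type and update formula may differ. One detail worth noting: the result depends on how the weighted mean is written. The sum form keeps like-signed infinities infinite, while an incremental form evaluates inf - inf and yields NaN even for like signs. Which form any given implementation uses is not established in this thread.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical centroid type for illustration only.
struct Centroid {
  double mean;
  double weight;
};

// Weighted-mean merge in sum form, assuming positive weights.
inline Centroid merge(const Centroid& a, const Centroid& b) {
  assert(a.weight > 0 && b.weight > 0);
  const double w = a.weight + b.weight;
  return {(a.mean * a.weight + b.mean * b.weight) / w, w};
}

// Incremental form of the same update. Equivalent for finite inputs, but
// with two like-signed infinities it evaluates inf - inf, which is NaN.
inline Centroid merge_incremental(const Centroid& a, const Centroid& b) {
  assert(a.weight > 0 && b.weight > 0);
  const double w = a.weight + b.weight;
  return {a.mean + (b.mean - a.mean) * b.weight / w, w};
}

// True if a centroid's mean has become NaN.
inline bool is_poisoned(const Centroid& c) {
  return std::isnan(c.mean);
}
```

In sum form, merging a finite centroid with a +∞ centroid gives +∞, two +∞ centroids stay +∞, and only +∞ with -∞ gives NaN, matching the Julia session. In incremental form, two +∞ centroids already give NaN.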

tdigest/include/tdigest_impl.hpp

tisonkun · 2026-01-12T03:34:33Z

> Arithmetic with infinity in IEEE floating point can produce NaN, but only when opposing values are involved. Sorting works correctly because infinities compare with normal numbers. Since the centroids are always sorted, merging will only ever combine infinities of like sign. This means that infinity will creep slowly like a contagion into the centroids in the digest. If there are lots of infinite values in a sample, we may get a slight over-estimate of quantiles because an entire centroid will go to infinity if a single infinite value is introduced, but no single centroid is that large in a reasonably configured digest.

For general inputs, we should never see NaN, Inf, or -Inf, or at least not many of them.

We're here to discuss edge cases. So the trade-off is between assuming that sketches may contain ±inf but that they never combine to produce NaN, and simply filtering out all ±inf.

If we allow ±inf and assume they will never combine because of the intermediate finite numbers, the question is how we program against opposing values. Shall we mark the sketch as broken, so that all subsequent operations fail? Or crash the program? (NaN cannot be compared with other values, unless we define a total order in some way.)

If we declare that this won't happen, then it is effectively undefined behavior.
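
To illustrate the ordering concern, here is a minimal C++ sketch of one possible total order that places NaN after every other value. This is an assumption made for illustration, not a policy proposed or adopted by the library; `less_nan_last` and `sort_nan_last` are hypothetical names. The plain `<` operator returns false for every comparison involving NaN, so it does not give `std::sort` a valid ordering when NaN is present.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// One possible total order (an assumption, not the library's policy):
// all non-NaN values in ascending order, then every NaN at the end.
inline bool less_nan_last(double a, double b) {
  if (std::isnan(a)) return false;  // NaN is never less than anything
  if (std::isnan(b)) return true;   // any non-NaN value sorts before NaN
  return a < b;
}

// Sorts in place using the comparator above.
inline void sort_nan_last(std::vector<double>& v) {
  std::sort(v.begin(), v.end(), less_nan_last);
  assert(std::is_sorted(v.begin(), v.end(), less_nan_last));
}
```

Sorting {NaN, 1, -inf, +inf, 0} with `sort_nan_last` gives -inf, 0, 1, +inf, NaN. Defining such an order only makes sorting well behaved; it does not answer what a quantile query should return once a centroid mean is NaN.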

tdunning · 2026-01-12T03:43:24Z

Your key statement is that this never (practically speaking) happens. That means that filtering is not a breaking change (practically speaking), nor does it produce different results for any reasonable input. Filtering out invalid inputs makes it easier to reason about the code and so it can actually improve things. Pretending that we have covered all the bases in the presence of infinite values is probably not a realistic thing to do. You have convinced me.

proost · 2026-01-12T15:45:33Z

I have reopened this PR.

Following the Rust code, I added the validation that was missing on deserialization. Thanks to @tisonkun.

I also added comments describing what input is expected and valid.

tdigest/include/tdigest_impl.hpp

tisonkun · 2026-01-12T23:32:45Z

tdigest/include/tdigest_impl.hpp

+  for (const auto& c: centroids) {
+    check_not_nan(c.get_mean(), "centroid mean");
+    check_not_infinite(c.get_mean(), "centroid mean");
+    weight += c.get_weight();


nit: check weight is not zero

Thank you. bded7aa
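
For readers without the full diff, the validation loop in this review thread can be sketched as a standalone C++ function. Only the names `check_not_nan` and `check_not_infinite` and their call shape come from the snippet above; their bodies, the exception type, the `Centroid` struct, and `validate_centroids` are assumptions made for this illustration and may not match the merged code.

```cpp
#include <cmath>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins: the names mirror the diff, but these bodies and
// the exception type are assumptions, not the code merged in the PR.
inline void check_not_nan(double value, const char* name) {
  if (std::isnan(value)) {
    throw std::invalid_argument(std::string(name) + " must not be NaN");
  }
}

inline void check_not_infinite(double value, const char* name) {
  if (std::isinf(value)) {
    throw std::invalid_argument(std::string(name) + " must not be infinite");
  }
}

// Hypothetical minimal centroid for this sketch.
struct Centroid {
  double mean;
  std::uint64_t weight;
};

// Validates deserialized centroids and returns their total weight.
// Includes the zero-weight check suggested in the review nit.
inline std::uint64_t validate_centroids(const std::vector<Centroid>& centroids) {
  std::uint64_t total = 0;
  for (const auto& c : centroids) {
    check_not_nan(c.mean, "centroid mean");
    check_not_infinite(c.mean, "centroid mean");
    if (c.weight == 0) {
      throw std::invalid_argument("centroid weight must not be zero");
    }
    total += c.weight;
  }
  return total;
}
```

Validating on deserialization matters because serialized bytes bypass any checks performed in update(), so a sketch written by an older version or another language binding could otherwise reintroduce ±∞ or NaN.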

coveralls · 2026-01-12T23:35:17Z

Pull Request Test Coverage Report for Build 20928226794

Details

62 of 62 (100.0%) changed or added relevant lines in 1 file are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.004%) to 98.793%

Totals
Change from base Build 20832031911:	0.004%
Covered Lines:	17031
Relevant Lines:	17239

💛 - Coveralls

fix: rejecting inf as value

59e5f36

AlexanderSaydakov approved these changes Jan 8, 2026

proost mentioned this pull request Jan 8, 2026

[FEATURE] t-digest for go apache/datasketches-go#96

Open

proost closed this Jan 8, 2026

proost deleted the fix-tdigest-inf-params branch January 8, 2026 10:49

tisonkun reviewed Jan 12, 2026

tdigest/include/tdigest_impl.hpp (outdated)

proost restored the fix-tdigest-inf-params branch January 12, 2026 14:56

proost reopened this Jan 12, 2026

proost added 2 commits January 13, 2026 00:00

Merge branch 'fix-tdigest-inf-params' of https://github.com/proost/da…

c097fc8

…tasketches-cpp into fix-tdigest-inf-params

fix: check invalid inputs on deserialization

588fd73

proost requested a review from tisonkun January 12, 2026 15:47

proost added 3 commits January 13, 2026 01:04

perf: remove ostringstream

b8489fd

style: follow local convention

c680a81

fix: add missing dependency

99d06bf

tisonkun reviewed Jan 12, 2026

tdigest/include/tdigest_impl.hpp (outdated)

tisonkun reviewed Jan 12, 2026

proost added 2 commits January 13, 2026 15:20

fix: allow inf for get_rank

662aef3

fix: check weight is zero

bded7aa

doc: update throw NaN for get_rank

1979834

proost requested a review from tisonkun January 13, 2026 07:22
