Skip to content

[FEAT] join with inequality #795

@r2evans

Description

@r2evans

What aspect of the package is your request related to?

join is great but one thing I do very frequently is join on an inequality (range or one-sided). (I could not find this in existing issues, my apologies if it is a repeat.)

dplyr supports join_by(between(val, x1, x2)) for "perfect" conciseness; I believe it is implemented as a list of tilde-expressions (that's what running join_by(x >= a, x <= b) outside of left_join(.) suggests). data.table supports on = .(val >= x1, val <= x2) or on = c("val >= x1", "val <= x2") for more control (e.g., > instead of >=) but less-terse.

If you are not familiar, neither dplyr nor data.table consider any operation within their joining expressions, so join_by(between(x, a-1, a+2)) does not work (nor is it in my request). I suspect without verification that join_by(between(..)) may not be referencing dplyr::between directly.

Describe the solution you'd like

One of the following:

  • string-expressions similar to data.table, such as on = c("val >= x1", "val <= x2")
  • list of expressions, either captured with substitute() or using rlang's tilde-expressions, such as on = list(val >= x1, val <= x2) or on = list(~ val >= x1, ~ val <= x2)

Having lists of expressions (either form) would support later expansion into fbetween() and similar. Regardless, if the first implementation of this were to be just strings of expressions, it would be simple enough on future enhancement to run something like if (is.character(on)) on <- convert_to_expressions(on) to support backward compatibility of the first implementation.

Additional context

dplyr has additional "helpers" within join_by() including between(), within() (I dislike this collision), overlaps(), and closest(). I say this not that I'm requesting all of this on the first attempt, but it would be useful to consider your goal-architecture so that your first stable release can be extended at a future time if need be.

Examples

x1 <- data.frame(x=1:10)
x2 <- data.frame(a=c(2,9), b=c(4, 20), z=T)
join(x1, x2, on=c(x="a"))
# left join: x1[x] 2/10 (20%) <1:1st> x2[a] 2/2 (100%)
#     x  b    z
# 1   1 NA   NA
# 2   2  4 TRUE
# 3   3 NA   NA
# 4   4 NA   NA
# 5   5 NA   NA
# 6   6 NA   NA
# 7   7 NA   NA
# 8   8 NA   NA
# 9   9 20 TRUE
# 10 10 NA   NA
join(x1, x2, on=c("x >= a", "x <= b")) # SUGGESTION
# Error in ckmatch(xon, xnam, "Unknown x columns:") : 
#   Unknown x columns: x >= a, x <= b
left_join(x1, x2, join_by(x >= a, x <= b)) # dplyr
#     x  a  b    z
# 1   1 NA NA   NA
# 2   2  2  4 TRUE
# 3   3  2  4 TRUE
# 4   4  2  4 TRUE
# 5   5 NA NA   NA
# 6   6 NA NA   NA
# 7   7 NA NA   NA
# 8   8 NA NA   NA
# 9   9  9 20 TRUE
# 10 10  9 20 TRUE

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions