
Make gush handle workflows with 1000s of jobs #55

@Saicheg


Hello!

First of all, thank you for this great library. We've been using it for a couple of our projects and it's been really great.

The issue I'm facing right now is that a workflow with 1000s of jobs is dramatically slow because of gush. Here is an example:

class LinkFlow < Gush::Workflow
  def configure(batch_id, max_index)
    map_jobs = []

    # enqueue 100,000 independent Map jobs
    100_000.times do
      map_jobs << run(Map)
    end

    # Reduce can only start once every Map job has finished
    run Reduce, after: map_jobs
  end
end

Playing around with gush, I found out that after each job completes, gush has to visit all of its dependent jobs (Reduce in our case) and try to enqueue them. But to decide whether such a job can be enqueued, gush needs to check that all of its dependencies (the Map jobs in our case) are finished.

https://github.com/chaps-io/gush/blob/master/lib/gush/job.rb#L90
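
To illustrate the pattern, here is a rough sketch of that readiness check (not gush's actual code; apart from Gush::Client#find_job, the accessor names are illustrative). For every Map job that finishes, Reduce's readiness is re-evaluated, and that re-evaluation loads all 100,000 incoming Map jobs from Redis one by one:

# Rough sketch of the readiness check described above (not gush's actual code).
def parents_finished?(client, workflow_id, job)
  # Reduce may only start once *all* of its incoming jobs are finished,
  # so each of the 100,000 Map jobs is looked up again here.
  job.incoming.all? do |incoming_name|
    client.find_job(workflow_id, incoming_name).finished?
  end
end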

The problem with this code is that for every Map job that finishes, it will call Gush::Client#find_job once for each of Reduce's dependencies.

This produces a massive number of SCAN operations and dramatically decreases performance because of this line of code:

https://github.com/chaps-io/gush/blob/master/lib/gush/client.rb#L119
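
For context, the lookup roughly has this shape (the key pattern below is an assumption, not gush's exact one): each serialized job lives under its own Redis key, so finding one job means scanning the keyspace for a matching pattern.

require "redis"

# Sketch of a SCAN-based lookup (key pattern is illustrative). Each job is its
# own Redis key, so finding one job walks the keyspace; with 100,000 Map jobs
# and one such lookup per dependency, the overall work grows roughly as O(N^2).
def find_job_by_scan(redis, workflow_id, job_name)
  found = nil
  redis.scan_each(match: "gush.jobs.#{workflow_id}.#{job_name}*") do |key|
    found = redis.get(key)
    break
  end
  found
end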

I am not sure what the best solution is here. I've tried to solve the problem by changing the way gush stores serialized jobs: instead of storing jobs individually, the idea is to store them in one hash per workflow/job type. I already have my own implementation, but I still have to run some benchmarks against it:

rubyroidlabs@4ee1b15
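
For reference, a minimal sketch of the idea (key names and the "finished" field are assumptions for illustration, not the code in the linked commit): all serialized jobs of a workflow go into one Redis hash keyed by job id, so a single job is fetched with HGET and the whole set can be read with one HVALS instead of repeated SCANs.

require "redis"
require "json"

# Minimal sketch of hash-based job storage (key name and payload are assumptions).
def store_job(redis, workflow_id, job_id, job_hash)
  redis.hset("gush.jobs.#{workflow_id}", job_id, job_hash.to_json)
end

def find_job(redis, workflow_id, job_id)
  json = redis.hget("gush.jobs.#{workflow_id}", job_id)   # O(1) lookup, no SCAN
  json && JSON.parse(json)
end

# Checking whether all jobs are finished reads the hash once instead of
# issuing one SCAN per dependency.
def all_finished?(redis, workflow_id)
  redis.hvals("gush.jobs.#{workflow_id}").all? { |json| JSON.parse(json)["finished"] }
end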

@pokonski what do you think here?
