Hello!
First of all, thank you for this great library. We've been using it for a couple of our projects and it's been really great.
The issue I'm facing right now is that a workflow with thousands of jobs is dramatically slow because of gush's own overhead. Here's an example:
```ruby
class LinkFlow < Gush::Workflow
  def configure(batch_id, max_index)
    map_jobs = []

    100_000.times do
      map_jobs << run(Map)
    end

    run Reduce, after: map_jobs
  end
end
```
Playing around with gush, I found out that after each job completes, gush has to visit all of its dependent jobs (Reduce in our case) and try to enqueue them. But to decide whether a dependent job can be enqueued, gush needs to check that all of its parent jobs (the Map jobs in our case) have finished:
https://github.com/chaps-io/gush/blob/master/lib/gush/job.rb#L90
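To make the pattern concrete, here's a simplified sketch of that check (not the exact gush code; `outgoing`, `parents`, `succeeded?` and the client calls are stand-ins for whatever the real implementation uses):

```ruby
# Simplified sketch: when a job finishes, each of its dependents is
# re-checked, and checking a dependent means loading ALL of that
# dependent's parents from Redis again.
def enqueue_outgoing_jobs(finished_job)
  finished_job.outgoing.each do |dependent_id|
    dependent = client.find_job(workflow_id, dependent_id)

    # For Reduce this loop touches all 100,000 Map parents,
    # and it runs once for every Map job that finishes.
    ready = dependent.parents.all? do |parent_id|
      client.find_job(workflow_id, parent_id).succeeded?
    end

    client.enqueue_job(workflow_id, dependent) if ready
  end
end
```

So with N Map jobs, a single readiness check costs N `find_job` calls, and the check itself runs N times, which is O(N²) lookups overall.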
The problem with this code is that every time a Map job finishes, gush calls Gush::Client#find_job for each of the Reduce job's parents. This produces a massive number of SCAN operations and dramatically degrades performance because of this line:
https://github.com/chaps-io/gush/blob/master/lib/gush/client.rb#L119
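That lookup boils down to something like this (again a rough sketch; the real key pattern and matching logic differ a bit):

```ruby
# Sketch of the SCAN-based lookup: SCAN walks the keyspace with a
# cursor, so finding ONE job costs time proportional to the total
# number of job keys in Redis, instead of a constant-time lookup.
def find_job(workflow_id, job_id)
  key = redis.scan_each(match: "gush.jobs.#{workflow_id}.#{job_id}*").first
  key && Gush::Job.from_hash(Gush::JSON.decode(redis.get(key)))
end
```

With 100,000 Map keys stored, every single `find_job` call cursors through all of them, and the readiness loop above multiplies that by another factor of 100,000.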
I am not sure what the best solution is here. I've tried to address it by changing the way gush stores serialized jobs: instead of storing each job under its own key, the idea is to store them in one hash per workflow/job type. I already have my own implementation, but I still have to run some benchmarks on it.
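Roughly the shape of the change (hypothetical simplified method bodies, not my actual patch):

```ruby
# Sketch of the proposed storage: one Redis hash per workflow/job class,
# with the job id as the hash field. A lookup becomes a single O(1)
# HGET instead of a keyspace SCAN.
def persist_job(workflow_id, job)
  redis.hset("gush.jobs.#{workflow_id}.#{job.class}", job.id, job.to_json)
end

def find_job(workflow_id, job_class, job_id)
  data = redis.hget("gush.jobs.#{workflow_id}.#{job_class}", job_id)
  data && Gush::Job.from_hash(Gush::JSON.decode(data))
end
```

This also shrinks the top-level keyspace from one key per job to one key per workflow/job class, which should help any remaining SCANs as well.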
@pokonski what do you think here?