Rails: "find_in_batches" vs. "in_batches", an in-depth comparison

Rails, like many other frameworks, has a lot of magic (and surprises too). Last month, while doing some performance tuning around ActiveRecord's #find_in_batches and #in_batches, I found an interesting difference between these two similar methods that is worth sharing.

But first, a quick intro for those who haven't heard of either method.

Here's the API doc for find_in_batches:

Yields each batch of records that was found by the find options as an array.

And for in_batches:

Yields ActiveRecord::Relation objects to work with a batch of records.

Both descriptions are easy to understand, but they are also very similar, and the difference is easy to miss unless we look at the details. In short, find_in_batches yields each batch as an array of already-loaded records, while in_batches yields an ActiveRecord::Relation object for each batch.

So the following code:

Post.find_in_batches do |group|
  group.each { |post| puts post.title }
end

will send only one query per batch to the database, retrieving all of the posts' data for that batch:

SELECT "posts".* FROM "posts" WHERE ...

However:

Post.in_batches do |group|
  group.each { |post| puts post.title }
end

will send two queries per batch to the database. The first query gets the posts' ids for the batch:

SELECT "posts"."id" FROM "posts" WHERE ...

And the second query gets all of the posts' data for the batch:

SELECT "posts".* FROM "posts" WHERE ...
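
The difference in query counts can be illustrated with a toy model in plain Ruby. This is only a sketch and does not use Active Record: ToyTable, toy_find_in_batches and toy_in_batches are hypothetical names, and every method call on the table is logged as one "query". It also assumes positive integer ids and fetches the rows eagerly before yielding, which is enough here because the block iterates over them anyway.

```ruby
# Toy stand-in for a database table: each method call is logged as one "query".
class ToyTable
  attr_reader :log

  def initialize(ids)
    @rows = ids.sort.map { |id| { id: id, title: "Post #{id}" } }
    @log = []
  end

  # Models: SELECT * ... WHERE id > after ORDER BY id LIMIT limit
  def select_rows(after:, limit:)
    @log << :select_rows
    @rows.select { |row| row[:id] > after }.first(limit)
  end

  # Models: SELECT id ... WHERE id > after ORDER BY id LIMIT limit
  def select_ids(after:, limit:)
    @log << :select_ids
    @rows.select { |row| row[:id] > after }.first(limit).map { |row| row[:id] }
  end

  # Models: SELECT * ... WHERE id IN (ids)
  def select_rows_by_ids(ids)
    @log << :select_rows
    @rows.select { |row| ids.include?(row[:id]) }
  end
end

# find_in_batches style: a single row query per batch.
def toy_find_in_batches(table, size)
  last_id = 0
  loop do
    rows = table.select_rows(after: last_id, limit: size)
    break if rows.empty?
    yield rows
    break if rows.size < size
    last_id = rows.last[:id]
  end
end

# in_batches style (with a block that iterates): an id query, then a row query.
def toy_in_batches(table, size)
  last_id = 0
  loop do
    ids = table.select_ids(after: last_id, limit: size)
    break if ids.empty?
    yield table.select_rows_by_ids(ids)
    break if ids.size < size
    last_id = ids.last
  end
end

loaded = ToyTable.new([1, 2, 3, 4, 5])
toy_find_in_batches(loaded, 2) { |group| group.each { |post| post[:title] } }

lazy = ToyTable.new([1, 2, 3, 4, 5])
toy_in_batches(lazy, 2) { |group| group.each { |post| post[:title] } }

puts loaded.log.size # 3 batches, 3 queries
puts lazy.log.size   # 3 batches, 6 queries
```

With five rows and a batch size of two, both helpers yield three batches, but the second one logs twice as many queries.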

More details:

If you look into the Rails source code for those two methods, you will see that find_in_batches actually calls in_batches with load: true passed as an argument, whereas the default value of load in in_batches is false.

And if you look further into in_batches, the part that uses the value of load looks like this:

        if load
          records = batch_relation.records
          ids = records.map(&:id)
          yielded_relation = where(primary_key => ids)
          yielded_relation.load_records(records)
        else
          ids = batch_relation.pluck(primary_key)
          yielded_relation = where(primary_key => ids)
        end
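
One way to read this branch: with load set to false, the yielded relation only knows the ids of its batch, so the rows are fetched only if the block actually asks for them. The sketch below models that idea in plain Ruby. It is not the Rails implementation: ToyRelation and toy_yielded_relation are hypothetical names, and the "SQL" strings are just log entries, not real queries.

```ruby
# Toy relation: it knows which ids it covers and "queries" for rows only on demand.
class ToyRelation
  def initialize(ids, queries, preloaded: nil)
    @ids = ids
    @queries = queries # shared log of the SQL this toy pretends to run
    @records = preloaded
  end

  # Rows are fetched on first access unless they were preloaded.
  def records
    @records ||= begin
      @queries << "SELECT * WHERE id IN (#{@ids.join(', ')})"
      @ids.map { |id| { id: id } }
    end
  end

  def each(&block)
    records.each(&block)
  end

  # A bulk update needs only the ids, so it never touches the rows.
  def update_all(changes)
    @queries << "UPDATE SET #{changes} WHERE id IN (#{@ids.join(', ')})"
    @ids.size
  end
end

# Mirrors the shape of the load branch: preload rows, or pluck ids only.
def toy_yielded_relation(ids, queries, load:)
  if load
    queries << "SELECT * (batch)"
    ToyRelation.new(ids, queries, preloaded: ids.map { |id| { id: id } })
  else
    queries << "SELECT id (batch)"
    ToyRelation.new(ids, queries)
  end
end

bulk = []
toy_yielded_relation([1, 2], bulk, load: false).update_all("published = 1")

iterated = []
toy_yielded_relation([1, 2], iterated, load: false).each { |post| post[:id] }

preloaded = []
toy_yielded_relation([1, 2], preloaded, load: true).each { |post| post[:id] }

puts bulk.inspect
puts iterated.inspect
puts preloaded.inspect
```

In this model, iterating with load: false costs an extra row query, iterating with load: true costs none, and a bulk operation such as update_all never loads the rows at all. That suggests in_batches fits batch-wide operations on the relation, while find_in_batches fits code that walks over every record.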

I hope this post makes things clear for anyone trying to understand the differences between find_in_batches and in_batches. Knowing these differences will help developers use Rails' Active Record more efficiently.