Leveraging bulk inserts instead of CreateMany() in AdonisJS

Adonis is great. It, like Laravel, provides a lot of nice functionality that make life easy as a developer. The ORM is pretty pleasant to work with, however it's also very easy to ignore performance and ship something suboptimal without thinking about it.

Which, of course, is exactly what I did.

I'll try to explain without giving away too much inside information about the application I work on full time. In one section of our application we allow users to search for businesses, to generate an audience comprised of many (up to 10,000) locations based on whatever criteria; business name, type of business, etc.

Once a user has made, and narrowed their selections, we generate polygonal data associated with each business' physical location. I'm not going to delve into what else happens as that's well outside the scope of this post. But suffice it to say, this action results in the creation of many records in the database. Three distinct types of record in fact:

First, there is the audience record itself, this is general metadata for the user's consumption within the application, a name that is shown within lists, etc.

Next, there are two different geometry-based records. One that, again is lots of metadata, the other that contains the actual polygon and so forth.

There are reasons it's split out the way it is... in the simplest of terms it's because the record with the actual polygon can be reused and in a previous iteration of the application that data was duplicated, in several instances many times over.. so normalization was one of the specific design goals when I rearchitected the data structure.

Anyway.. so those records all need to be created, and are all inter-related.. the geometric records belongTo the audience, and belongTo the polygon. So there's a bit of hoop jumping that has to occur to generate things in the correct order.

Previously I was filling the polygon records with the bare minimum content so that I could get an id to populate into the associated geometric records, and then a process further down the line was verifying the lat/lng of the address to see if a better match could be found, then generating the polygon and updating those records – that process is no longer involved, so I can populate polygons immediately, which is nice.

But in either case, what I was winding up with was, essentially, an array of ids to populate into the geometric records as the relation id when I created those, and I was doing so making use of Adonis' createMany() method. Which is specifically intended for that purpose. Of course most people probably aren't creating upwards of ten thousand model instances at a time... and given that these are individual queries, as is stated in the docs:

The createMany method makes n number of queries instead of doing a bulk insert, where n is the number of rows.

Not really ideal for my use case. But it's been working alright, just a little slower than ideal.

Enter: Bulk inserts

To be fair, I was already using bulk inserts in the aforementioned bare-minimum insert to get polygon record ids.. a pretty clever query too in fact;

const polygon_id_result = await Database.raw(`
  INSERT INTO polygons (created_at)
    SELECT NOW()
    FROM generate_series(1,?) i
  RETURNING id`, [hits])
const polygon_ids = polygon_id_result.rows.map(r => r.id)

hits here is the number of records. Postgres has that interesting generate_series function that creates an array.. so the query basically just says "for these n array items, generate empty records containing just a created_at timestamp"

Then I would do my createMany thusly:

await Geometric.createMany(pois.map(poi => {
  return {
    name: poi.name,
    // other fields are not relevant... 
    audience_id: audience.id,
    polygon_id: polygon_ids.shift(),
  }
}))

See, that takes the array of polygon_ids and chucks them in, in order, to the appropriate records in the createMany() method. Works great.. but generates like 10k requests.. Yuck.

Instead now I'm doing bulk inserts for both parts.. making use of lodash's chunk and flatMap though with node 11 or newer flatMap is available natively now...

Basically I build up an array of items as I did before, then chunk that into groups of 500, which is an arbitrary selection, but keeps the overall query sizes reasonable as I'm inserting 5 and 8 columns respectively for each of the two tables' queries.

So now, instead of the bare-minimum bulk insert, I do a bulk insert with actual polygonal data, grab those ids, and feed them in to a second bulk insert into the geometric model's table. I'm not showing all the fields here, but you get the idea:

// Break up our array of items into chunks of 500
const chunks = _.chunk(pois, 500)

// Iterate through each chunk of 500 items
chunks.map(async (chunk) => {
  const polygons = await Database.raw(`
    INSERT into polygons (geojson, square_meters, metadata, created_at, updated_at)
    VALUES ${Array(chunk.length).fill('(?,?,?,?,?)').join(',')}
    RETURNING id
  `,
  _.flatMap(chunk, (item) => {
    item.geojson,
    item.square_meters,
    JSON.stringify({ address: item.address, point: item.point }),
    created,
    created, // Twice because updated_at == created_at on creation.
  }))

  // Extract record ids from the query result
  const polygon_ids = polygons.rows.map(r => r.id)

  // Time to replace that poorly performing createMany() with some bulk 
  // insertiony goodness...
  await Database.raw(`
    INSERT INTO geometrics (name, audience_id, polygon_id)
    VALUES ${Array(chunk.length).fill('(?,?,?)').join(',')}
  `, _.flatMap(chunk, (item) => {
    return [
      item.name,
      audience.id,
      polygon_ids.shift(),
    ]
  }))
})

Ok.. that's a little involved, lets walk through it. First we split the full array of items into chunks of 500, then iterate through each chunk and insert polygons, returning the created ids.. then take those created ids, and insert values into the DB for the geometric models associating the appropriate polygon id with each.

The interesting bits here, to me anyway, are the Array().fill() which takes the number of items in the current chunk, and creates an array with that many instances of the value passed into fill(), in this case, the placeholders for each record being inserted. That array is then join()ed with commas to generate the string of placeholders for each row to be inserted.

The resulting queries end up looking something like this:

INSERT INTO foobar (foo, bar, baz) 
VALUES (?,?,?),
       (?,?,?),
       (?,?,?),
       (?,?,?),
       (?,?,?)

Extra linefeeds added for clarity, of course

Then the flatMap takes the array of items, and for each returns an array of the data for each of the relevant placeholders – which then gets flatenned (hence the name), so that it effectively goes from something like this:

[
  { color: 'blue', flavor: 'raspberry', id: 1 },
  { color: 'red', flavor: 'strawberry', id: 3 },
  { color: 'green', flavor: 'apple', id: 7 },
]

To this:

['blue','raspberry',1,'red','strawberry',3,'green','apple',7]

Which is the format that has to be passed in as replacements for the placeholders when running those queries.

At the end of the day, this change goes from insertion/creation of 1 audience, 10,000 polygon records, and 10,000 geometric records taking a full minute or longer, to completing in about 6 seconds.

Additionally, it's far fewer queries to the database (41 if my math is correct, vs 10,002 or so previously), which means far fewer network calls that could potentially fail, and far less likely that the request will simply timeout before completing at all.

41 still sounds like a lot of course, and it is.. but it's not even the same sport, let alone in the same ballpark as 10k queries.

I anticipate utilizing this pattern more often, and while working out how to manage the population of the placeholders, and mapping the data for the replacements presented a bit of a challenge until I thought to use Array fill and flatMap, it's one of the more interesting solutions I've found lately.

To make a long story short; if you're inserting more than maybe a dozen or so records, you almost certainly do not want to use createMany() ... whip yourself up a nice performant bulk insertion instead, and avoid the problems poor performance cause before you create them in the first place.

Given the similarity between Adonis' Lucid ORM and Laravel's Eloquent, I suspect virtually the exact same solution would also apply to Laravel.. with the obvious PHP syntax changes of course.