
Data Migration Woes #1: How I improved an import by 5x with one extra query

I don't know offhand if there will be more posts, but it seems likely, so we'll just go ahead and call this installment #1.

I'm in the process of rearchitecting the data layer of an application we currently have in production, one that's been cobbled together over several years by a variety of people, some of questionable competency... It's the classic story – I've inherited a mess.

We've limped it along for a good year and a half since I joined the team, already doing one round of major refactoring and completely reimplementing the front-end. But sadly, the backend is just not cutting the mustard.

I guess it would be more appropriate to call it the server-side component of our UI application, because the backend api is a whole other ball of wax that I am not particularly involved with.

Anyway, someone for whatever reason decided to build this application on Meteor, which I guess made sense one way or another when it was a small application with few moving parts and a lot fewer active users. We are constantly battling random stupidities that have little other explanation than "Meteor is shitting the bed."

Mongo connections will periodically fail, and then simply refuse to reconnect. Lots of miscellaneous stupid like that. Mongo is another source of problems.

Let's be clear: our data is inherently relational. There is no reason not to treat it that way, and there is no real need for a schema-less document store either. Mongo has its uses; our app is just not the right place for it.

SO... I made the decision to move it to postgres. I was always a MySQL guy (well, MariaDB more recently) and never really understood the appeal of postgres; it struck me as much less approachable and more involved to understand. It is – but that's because it can do so much more than MySQL, and incredibly fast, too. Just from my little bit of fiddling with it over the past several days, I can say I'm definitely a convert. I'm all aboard the postgres train!

Anyway, let's get on with today's challenge, shall we?

I wrote a migration script for one of our primary pieces of data today. I tested it on about 97k records, which happened to be in our staging/testing database, and it pretty quickly bombed out – for good reason: the new postgres schema has a unique index in place, because we've had issues before with people naming things identically and then not being able to track down which thing was actually being referred to. Sure enough: duplicate key error.
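For context, the constraint I ran into is just a composite unique index. I haven't shown our actual schema, but it presumably looks something like this (table and column names borrowed from the queries later in this post; the exact DDL is an assumption):

```sql
-- Hypothetical shape of the constraint behind the duplicate key error:
-- names must be unique within each parent (foreign_id).
CREATE UNIQUE INDEX thing_name_foreign_id_idx
  ON thing (foreign_id, name);
```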

"Alright," I say, "let's build something to append an auto-incremented value to the end of these duplicate names, and job done."

So after a bit of googling to see if something already existed within postgres to do this sort of thing (it very well may, but I sure didn't find anything), I came up with this bad boy:


// `query` and `values` are the INSERT statement and its parameters, built
// earlier in the enclosing migration loop; values[0] is the name column.
let result = await Pg.query('SELECT id from thing where name=$1 and foreign_id=$2', [item.name, fid]);

if (result.rows.length) {
  let dupCount = 0;

  // oooo, a WHILE loop! Fancy.
  while (result.rows.length) {
    dupCount++;

    result = await Pg.query('SELECT id from thing where name=$1 and foreign_id=$2', [`${item.name} (${dupCount})`, fid]);
  }
  values[0] = `${item.name} (${dupCount})`;
}

await Pg.query(query, values);

That all happens within a larger for loop that's iterating through each record being migrated.
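To see why this approach gets painful, here's the same probing logic distilled into a pure function, with an in-memory Set standing in for the duplicate-check query (a sketch only – the real script hits postgres on every single iteration, which is exactly the problem):

```javascript
// In-memory version of the dedup loop above: `existing` stands in for
// the SELECT-by-name query. One membership check per candidate suffix,
// so a name with N duplicates costs N+1 probes.
function uniqueName(name, existing) {
  if (!existing.has(name)) return name;

  let dupCount = 0;
  let candidate = name;
  while (existing.has(candidate)) {
    dupCount++;
    candidate = `${name} (${dupCount})`;
  }
  return candidate;
}

const taken = new Set(['Widget', 'Widget (1)', 'Widget (2)']);
console.log(uniqueName('Widget', taken)); // "Widget (3)" -- after probing (1) and (2) first
console.log(uniqueName('Gadget', taken)); // "Gadget" -- no collision, no probing
```

With a Set the probes are cheap; when each probe is a round-trip to the database, thousands of duplicates mean thousands of queries per record.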

So... it works. Yay! BUT there's one small problem...

You see, dear reader, some people apparently think it's a good idea to name not just tens, or hundreds, but THOUSANDS of things identically. So that means every time a duplicate is encountered, it loops through potentially thousands of extra queries until it finds a unique name. Since I actually care about my time, and I'm likely to be running this many more times as I continue fleshing out all the migration logic, this just won't do.

"This just won't do," I say to myself, "there must be a way we can shorten the amount of time spent finding a unique title, and I don't want to resort to appending a hash of some sort... hmmmm..."

Simple. Query for the most recently inserted item, matching the naming pattern, with the longest string.  

Ok, so really just finding the last inserted by id would be sufficient (since id is unique, the char_length tiebreaker never even kicks in), but an extra sort option won't hurt anything...

So I added the following within my while loop, just before its query:

// Since we could potentially have a _lot_ of duplicates, let's short-circuit
// this loop by grabbing the current largest dupCount in the database.
// (Note: assumes item.name contains no LIKE wildcards such as % or _.)
if (dupCount > 1) {
  const largest = await Pg.query('SELECT name from thing where foreign_id=$1 and name like $2 ORDER BY id desc, char_length(name) desc limit 1', [fid, `${item.name} (%)`]);
  const matches = largest.rows[0].name.match(/\((\d+)\)$/);
  if (Array.isArray(matches) && matches[1]) {
    dupCount = +matches[1] + 1;
  }
}

So now, instead of looping hundreds or thousands of times, if we have more than one duplication we grab the largest dupCount value, increment it by one, and continue on with life.
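Here's the same in-memory sketch with the short-circuit folded in, so you can see the jump in one place (again, the Set stands in for the database, and the linear scan stands in for the "grab the largest dupCount" query – this is an illustration, not the migration code itself):

```javascript
// Short-circuited version: instead of probing "(1)", "(2)", ... one at a
// time, find the largest existing "(N)" suffix and jump straight past it.
function uniqueNameFast(name, existing) {
  if (!existing.has(name)) return name;

  let dupCount = 1;
  if (existing.has(`${name} (1)`)) {
    // Stand-in for the extra query: find the current largest suffix.
    let largest = 0;
    for (const n of existing) {
      const m = n.startsWith(`${name} (`) && n.match(/\((\d+)\)$/);
      if (m && +m[1] > largest) largest = +m[1];
    }
    dupCount = largest + 1;
  }
  // One final availability check, mirroring the loop's last query.
  let candidate = `${name} (${dupCount})`;
  while (existing.has(candidate)) {
    dupCount++;
    candidate = `${name} (${dupCount})`;
  }
  return candidate;
}

const taken = new Set(['Widget', 'Widget (1)', 'Widget (2)', 'Widget (3)']);
console.log(uniqueNameFast('Widget', taken)); // "Widget (4)" -- one scan, no repeated probing
```

The win is that the per-duplicate cost no longer grows with how many duplicates already exist: it's one extra query plus a confirmation check, regardless of N.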

The results speak for themselves...

Before:

Migrated: 96,905 thing records in 1801.796s

And after:

Migrated: 96,905 thing records in 407.912s

That's 30 minutes down to less than 7.

That's a worthwhile optimization!
