Database migrations in a complex branching system
In our current development workflow we have introduced database migrations (using Ruckusing) to keep our developers' db schema's in sync. It works great, is pretty straightforward in use but now we have switched to git as VCS we are facing the next problem in our database versioning system.
When checking out a branch that has been in development for some time it might happen that the database schema has diverged alot from the schema in the branch i'm coming from. This causes database conflicts in some cases. Logically it seems that we need to run migrations depending on the branch we were on previously but that can get complex really fast and will run into problems with some people for sure. And as far as i know there isn't a db migration system that is branch-aware??
Added complexity comes when switching to a feature branch we might need to run some migrations up while other down ... technically this seems impossible using our current dbmigration scripts, are there any sane alternatives? Are there any preferred ways of working with database migrations in a very active and branched development system?
I think the whole idea of incremental migrations is pretty rotten, really. In a complex environment like yours, it really doesn't work. You could make it work for simple branch patterns, but for anything complicated, it will be a nightmare.
The system i'm working with now takes a different approach: we have no ability to make incremental migrations, but only to rebuild the database from a baseline. During initial development, that baseline was an empty database, and during maintenance, it's a copy of the live database (restored from a dump). We just have a pile of SQL and XML scripts that we apply to the baseline to get a current system (migrations, essentially, but not designed to be run incrementally). Updating or switching branches is then very simple: nuke the database, load a dump to establish the baseline, run the scripts.
This process is not as quick as just running a few migrations, but it's quick enough. It takes long enough that you can go and get a cup of coffee, but not long enough to get lunch.
The huge advantage is that starting by nuking the database means the process is completely history-independent, so it doesn't need to know or care about crossing branches, going back in time, or anything else.
When you take a release live, you obviously do things slightly differently: you don't nuke the database or load a dump, because the system is already at the baseline (the baseline is defined as the state of the live system!). You just run the scripts. And after that, make a fresh dump to be used as a new baseline for development.
I wouldn't really agree with incremental migrations being rotten. Having a set of homegrown scripts would, in my opinion, be a worse approach than having a real tool for such a job will make tracking those changes easier. I've had to deal with a similar situation myself, before, so hopefully I can share some of the insights.
To my experience RDBMS-schemas and branches don't mix very well. Depending on your branching the schemas should probably be at least somewhat similar, in which case the migrations should not differ too much. Or I might just have misunderstand the full extent of the problem. If you're e.g. trying to keep customer specific code on a branch, then maybe you should consider a way to modularize it instead. We did something like this, having rules that stated that customer specific schema changes, and code could only ever depend on the common code base, not the other way around. We also set the precedence between module changesets based on module and date, so we for most parts knew the order in which the changes were to be applied. YMMV, of course, but it's hard to give specifics, without knowing your current setup.
At my old company we successfully used a tool called Liquibase, which sounds similar to what you're using. Basically it is a tool for taking a DB schema, and all the data from one known state to another known state. The same changeset is applied only once, since liquibase maintains a changelog, with checksums. The changelogs are written in a specific XML format. I can strongly recommend to try it out, if you need alternatives.
Anyway, the way we handled customer code and branches, was to have a specific DB/schema for a given branch. This way you could have the schema and data from the branching point, and only migrate the diff to the current situation. We did not undo changes, even if liquibase in theory could support this, as we felt it was way too cumbersome and error prone. Given that liquibase keeps it's own state, the migration was always as easy as taking the current state on a given branch, and apply all. Only new changesets were applied, leaving the schema in a good state.
We used mercurial, which is distributed, like git, so the setup was quite similar. We also had developer specific local DBs on the dev laptops, and a number of environments, both for different customers and phases (development, integration, production), so the model was put to a real test, and it worked surprisingly well. We had some conflicts in the changesets, but we were mostly able to resolve those soon after the problem was introduced. The local development envs were really the hardest part, since during development some schema changes might have been introduced, which were not always compatible with later changesets, but the structured nature of the changes, and having a known state to revert to lead to very few real problems.
There are a few caveats with this approach:
- All and any changes to the schema must be implemented in the changesets. The biggest cause of confusion was always someone just fiddling around a bit.
- The first point also applies, even if you're using a tool that modifies the schema, e.g. a ORM-tool like Hibernate. You need to be pretty intimate with this tool to understand the changes it makes and requires.
- All users must buy into this, and be educated to follow the rules. Check 1.
- There comes a point when migrating lots of changesets starts taking too much time. At this time you will need to create a new baseline, which can be a bit tricky, especially with a lot of branches. It's good to plan ahead for this as well, and at least know of all existing DB-branches.
- You need to plan ahead a bit with the branches, to know whether they're going to migrate back to master at some point. Naive merging might not work well for schema changes.
- For very long lived branches and separated datasets this model might not be strong enough
The point is, however, that the more structure and control you have over the database, the easier migrations will be. Therefore tools like Liquibase could be a really valuable asset to help you track those changes. This applies to more complex models even to a greater extent, than to simple ones, so please at least don't consider dumping all tools you already have in place. And take some time to explore other alternative tools.
Some structure and control is better than none, or even worse, thinking you are in control with a big bunch of manual scripts.
I'm in a similar situation where I work on a live website and several development branches in which I need to change the database schema.
I solved it by writing a post-checkout and a post-merge hook that can be used nicely with git. I store all my migrations in the form of SQL files in a separate directory and commit them alongside the changed PHP code. Each time I perform a
git will automatically call the appropriate up- and down-migrations. See my implementation on Github.
As a special request (for those of you who don't want to follow the github link) some more explanation:
Consider the following scenario. You have two branches:
- master - which contains the website that's currently online
- feature - which contains an unfinished new feature
For the new feature to work properly, it needs to change the database schema. The workflow is as follows:
When, in you feature branch, you change your code which needs a change of the database schema, you also commit two new SQL files in the migrations directory, say:
- 20151120130200-extra-field-up.sql (containing all the SQL queries to migrate upwards)
- 20151120130200-extra-field-down.sql (containg all the SQL queries to migrate downwards)
- When you now perform a checkout to master, the post-receive git hook will:
- find all *-down.sql scripts in the commits from <new HEAD>..<old HEAD>
- execute those scripts with the local database
- find all *-up.sql scripts in the commits from <old HEAD>..<new HEAD>
- execute those scripts with the local database
- When you merge your feature branch into master, the post-merge hook will:
- find all *-up.sql scripts in the commits from master..feature
- execute those scripts with the local database
Simply copy the post-checkout and/or post-merge file to the .git/hooks directory of your own git repository. You can edit the config section of those files. See the files themselves for an explanation.
The naming of the migration SQL files is crucial. They should end with up.sql or down.sql. The rest of the name is completely up to you. However if you have a single commit with multiple up-migrations and/or multiple down-migrations the order in which they are performed depends on the lexicographical order. Migration files that are within different commits, will always be called in the same (reverse) order as the commits.
It's not a requirement that you have both an up-migration and a down-upgrade, nor is it a requirement that up- and down-migrations are named similar.
An approach I'm thinking of testing in our current project is to create a branch 'migrations' and all (and only) migrations are committed to this branch. Developers must merge from this branch into their current branch before creating a migration so that their migration is always created on top of the latest migration. All projects merge from this branch, so that every branch has a concept of a linear migration history. This gives every branch the ability to move back and forth between database versions. When switching to a branch that depends on a different version of the database, the developer applies whichever migration is appropriate.
The annoyance (besides the extra work and diligence of committing migrations to the special branch) is remembering which migration corresponds to a particular branch. I guess one way of doing this is instead of committing migrations directly onto the migrations branch, commit the migration (and only the migration) onto the current branch and then cherry pick that commit onto the migrations branch. Then you can just look at the last time the current branch cherry-picked onto the migrations branch and know that that diff contains the necessary migration. I think that would be possible. Also, the developer might create a migration just to see what changes would be necessary, and then try to infer which migration would be appropriate to use.
Sorry for the vague suggestion; if we do end up trying this approach I'll edit this suggestion with more concrete recommendations.
This is something I've been working on lately. For me the problem haven't been that database schema has diverged per se, but rather, git can't merge them together. Feature branches which touch the database schema are always scary.
The solution I've been thinking on is that instead of having linear migrations, have migrations that depend on other migrations. You get a nice dependency graph of your migrations which is easy enough to linearize (topological sort). Just keep track of the named migrations in your database and in the correct order execute the updates that aren't already updated.
For example, addCustomerSalt depends on initialSchema, and separateAddress depends on person.
The one problem this does not solve is that if branch A depends on update Z which has been created in branch B, but maybe in that case, you should rebase to a common ancestor?