MySQL Disaster Recovery 

Whether you have done something silly, had a hardware failure, or suffered a security breach, you have had a disaster and need to recover your data to a point as close to the latest data as possible.

The first thing to understand is that there is no 'Undo', not in MySQL, not in any other database server. It's simply not possible to 'reverse' a query once it has been committed to the database; it's committed, that's it, it's there and it's staying.

The second thing to understand is that, depending on your backup regime and the amount of data, the restoration process can take a VERY long time. I recently had to restore a client's database from a month-old full backup plus around 1.2 GB of logged transactions since then; the 'playback' of the logs took, get this, 9 hours, and that is for just 1.2 GB of data. Playing back logs is a very slow process, especially for transactional tables (InnoDB).

Backup Regime

There are effectively two types of backup. The first is a full backup, made either by hot-copying the database files or by using mysqldump to produce SQL that recreates the database as it was at a certain point. The second is the binary logs, which record (provided you have them turned on) every query sent to the server that changes data, including DDL (Data Definition Language) statements such as CREATE TABLE, and permission statements such as GRANT. The logs can be treated as incremental backups.
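
For example, binary logging is turned on with the log-bin option in my.cnf; the path below is only an illustration (and, as discussed below, pointing it at a different disk from the data is a good idea):

  [mysqld]
  # write binary logs using this base name, ideally on a separate disk
  log-bin = /mnt/logdisk/mysql/mysql-bin

You can then confirm from the command line that logging is active:

  mysql -e "SHOW VARIABLES LIKE 'log_bin'"
  mysql -e "SHOW BINARY LOGS"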

Now, the ideal situation is to have to restore a minimal amount from the logs; that's a slow process, very slow. However, the logs potentially allow you to restore right up to the point of failure (or, if you made a silly mistake, even past that point). So the idea is that we need frequent full backups, plus copies of the binary logs written since the last full backup.

I suggest something like a full backup every week (either a hot copy or mysqldump), then every day, or even every hour if your data is that important, flush the logs and move the flushed logs into the archive of the current full backup. This way, even if your database server goes up in flames, you will only lose one day's (or hour's) data. So in the typical worst case you have to restore a full backup plus a week's worth of logs, and will lose one day's data. Of course, if the live logs survive the disaster then you need not lose anything at all, which is a good reason to configure MySQL to store the binary logs on a different server (or at least a different disk) from the database itself.
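
To make the timing concrete, the schedule might be driven by cron along these lines; the times and script names are invented for the example (a sketch of the weekly script appears after the process list below):

  # backup user's crontab -- times and script names are examples only
  # weekly full backup, early Sunday morning
  30 4 * * 0    /usr/local/bin/mysql-full-backup.sh
  # flush and archive the binary logs every other morning
  30 4 * * 1-6  /usr/local/bin/mysql-incremental-backup.sh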

It's quite important when making the full backup that the logs are flushed immediately before the backup and stored with the previous full backup. This is because, currently (to my knowledge), MySQL has no way of knowing, when replaying the binary logs, which statements still need to be executed and which are already committed. So if we know our logs are synced to our full backup schedule, our job is made a lot easier.

The process, then, is basically this (a rough shell sketch of the weekly and daily steps follows the list)...

  1. Every Week
    1. Note the current log
    2. For each database (including 'mysql' database itself)
      1. lock tables (read lock only)
      2. flush tables (this forces unwritten data out to disk)
      3. flush the logs
      4. use mysqldump to create the new full backup
      5. unlock tables
      6. COPY all logs except the current one (newly created in step 3) into the previous full backup directory for the database; each database backup needs the logs up until the flush logs for that database. If you wish, you can run the logs through mysqlbinlog to extract, in text format, only the statements for that database (e.g. where the databases on the server are administered by different people who should not be able to see data in other databases)
    3. remove the noted log (step 1) and any older ones
  2. Every day ("Incremental Backup")
    1. flush logs
    2. for each of the databases
      1. find the last log that was copied into the archive, then copy all newer logs, except the very last one (which was created in step 2.1), into the archive directory (again, through mysqlbinlog -d <databasename> if you wish)
      2. remove all the logs prior to the latest one (created in step 2.1); they have now all been backed up into each of the database directories
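
Here is a rough shell sketch of the weekly step. It is a sketch only, under a few assumptions: the database names, paths and mysql-bin log naming are invented, credentials are assumed to come from ~/.my.cnf, and mysqldump's --lock-tables and --flush-logs options are used to collapse the lock/flush/dump/unlock sub-steps into a single command:

  #!/bin/sh
  # Weekly full backup -- a sketch, not a drop-in script.
  BACKUP_ROOT=/backup
  LOGDIR=/var/log/mysql                  # wherever the binary logs live
  WEEK=$(date +%Y_week%V)

  for DB in somedatabase mysql; do
      PREV=$(ls -d "$BACKUP_ROOT/$DB"/*/ 2>/dev/null | sort | tail -1)
      DEST="$BACKUP_ROOT/$DB/$WEEK"
      mkdir -p "$DEST"

      # lock, flush tables, flush logs, dump, unlock: --lock-tables takes the
      # read locks and --flush-logs rotates the binary log right before the
      # dump, so the new log lines up exactly with the dump
      mysqldump --lock-tables --flush-logs "$DB" > "$DEST/full.sql"

      # every log except the newest belongs with the PREVIOUS week's backup,
      # filtered down to this database's statements
      for LOG in $(ls "$LOGDIR"/mysql-bin.[0-9]* | sort | sed '$d'); do
          [ -n "$PREV" ] && mysqlbinlog -d "$DB" "$LOG" > "$PREV/$(basename "$LOG").sql"
      done
  done

  # once every database has its copy, the old logs can go, e.g.
  #   mysql -e "PURGE MASTER LOGS TO 'mysql-bin.000123'"

The daily step is the same idea in miniature: mysqladmin flush-logs, then copy everything except the newly created log into each database's current week directory and remove the logs that are now archived.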

You probably want to keep 2 weeks of full backups at a minimum; 4 weeks is better.

When we're done, our backup folder (which of course you access through FTP/NFS/SFTP/... on another server, or at least a different drive) for a complete week of database 'somedatabase' will look something like

/backup/somedatabase/week1/full.sql
/backup/somedatabase/week1/log_0101
/backup/somedatabase/week1/log_0102
/backup/somedatabase/week1/log_0103
/backup/somedatabase/week1/log_0104
/backup/somedatabase/week1/log_0105
/backup/somedatabase/week1/log_0106
/backup/somedatabase/week1/log_0107

So you can see that we have all the information needed there to recreate the database if the unthinkable happens (except grant statements, which you'll need to extract from the 'mysql' database tables).
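
If you include the mysql database itself in the rotation (as the weekly loop above does), the grants can be recovered from it after a restore, or you can record them from a running server; the account name here is just an example:

  # list the accounts held in the (restored) mysql database
  mysql -e "SELECT User, Host FROM mysql.user"
  # then record the privileges for each one
  mysql -e "SHOW GRANTS FOR 'someuser'@'localhost'"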

The features of this regime are that we only lock the tables for one database at a time, rather than locking all databases while we export everything; each database has a distinct backup; we can recover from a problem quickly, having to replay at most one week of logs; and each backup is in plain-text SQL, so it's easy to fix problems if necessary.

Disaster Recovery

Uh oh, the worst happened, huh? Or at least you want to test your backups. The plan is a simple one: restore the database that needs restoring using the full backup plus the statements from the logs that pertain directly to the database in question (remember that raw binary logs contain the statements for ALL the databases on the server). Note that it is not really possible to restore a single table; you must restore all or nothing. Work is going on at MySQL to improve the binary log to allow the extraction of per-table statements, which will allow a single-table restore. The steps are below (a condensed shell sketch of steps 5 to 7 follows the list).

  1. If the problem is actual datafile corruption, first try a REPAIR TABLE <tablename> on each of the tables; nine times out of ten this will fix the problem and you won't need to do a restore.
  2. If you want to restore right up to the current point in time (if, for example, a mistake was made but it doesn't invalidate writes made since the mistake) and there are any logs available that are not in the current backup (and are newer than the ones in there), copy them into the backup directory now (after a flush tables, flush logs).
  3. Make sure nothing is running (e.g. the website) which will try to modify the database while you are restoring. If you can, it's a good plan to, say, temporarily change the port that mysqld is running on so that people simply can't connect to the server while you're working.
  4. if you can, shut down the server and take a copy of the files in the data directory, then restart the server
    this is a 'just in case' measure; really it's just to satisfy that burning need to back up before you do the next (scary) step
  5. drop the database, yes, we must kill that which we wish to save
  6. Take your last full backup SQL file and feed it to mysql (tip: if your tables are InnoDB, you might want to do SET FOREIGN_KEY_CHECKS=0 before the import and SET FOREIGN_KEY_CHECKS=1 after it; this will speed up the process).
  7. For each of the incremental (log file) backups in order
    1. run the log through mysqlbinlog -d <databasename> to extract the sql for the specific database, you may have done this already in the backup process above
    2. If the problem was a silly mistake or a security breach etc. then you will want to edit the SQL to remove the bad queries first; hopefully the log is not too big (each should only contain about a day's queries at most). You will have to be careful that doing this doesn't mess up later queries; it should be OK, but it does pay to have a good close look so you can make sure.
    3. feed the (possibly edited) sql from mysqlbinlog into mysql
  8. Have a look through the restored data, if anything is wrong then fix it.
  9. Make sure the grants are all in order, if you have access to the backup of the 'mysql' database then you can have a look in there to determine who should be able to do what in your database.
  10. You probably want to do a full backup now while you still have everything 'offline'.
  11. Let everybody back in.
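
Here is the condensed shell sketch of steps 5 to 7 promised above. The same caveats apply: names and paths are examples, credentials come from ~/.my.cnf, and the archived logs are assumed to have already been filtered to per-database SQL; if you archived raw binary logs, pipe them through mysqlbinlog -d <databasename> instead of feeding them to mysql directly:

  #!/bin/sh
  # Restore sketch -- example names, not a drop-in script.
  DB=somedatabase
  BACKUP=/backup/$DB/week1              # the backup being restored from

  # step 5: drop and recreate the empty database
  mysql -e "DROP DATABASE IF EXISTS $DB; CREATE DATABASE $DB"

  # step 6: reload the full dump, with foreign key checks off for speed
  ( echo "SET FOREIGN_KEY_CHECKS=0;"
    cat "$BACKUP/full.sql"
    echo "SET FOREIGN_KEY_CHECKS=1;" ) | mysql "$DB"

  # step 7: replay each archived log in order (edit them first if you need
  # to cut out the query that caused the disaster)
  for LOG in "$BACKUP"/log_*; do
      mysql "$DB" < "$LOG"
  done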

Double Disaster Recovery

What could be worse than having a MySQL disaster? Well, having your backup corrupted and having a disaster, of course. But all is not lost; even in this case it may be possible to restore just as much data, though it could take a little longer.

Because each of our weekly backups holds (in separate files) a full backup SQL dump, plus extra files with (effectively) more SQL to bring it up to the current state, the data represented by any given backup is equivalent to the immediately previous backup plus the log files held in the given backup's directory.

  • Backup{N} == Backup{N - 1} + LOGS_FROM(Backup{N})

So if, when you come to restore, you find that your latest full backup SQL file is corrupt, you can substitute the previous full backup SQL file plus the previous set of logs to arrive at the same data.

A corruption in one of the logs, though, is harder to work around. You have two choices: either simply restore only up to the corrupted log, or restore up to the corrupted portion, resolve the missing section of data manually, and then continue restoring from just past the corruption onwards. That said, you'd have to be a pretty unlucky soul to have a corrupted log and a database failure at the same time.
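
For the second option, mysqlbinlog can extract just a slice of a raw binary log with --stop-position and --start-position; the positions and file name below are invented, and you would find the real ones by examining the log with mysqlbinlog itself:

  # replay everything up to just before the corrupt region...
  mysqlbinlog --stop-position=4213 mysql-bin.000104 | mysql somedatabase

  # ...then, once the gap has been resolved by hand, carry on from a
  # position safely past the corruption
  mysqlbinlog --start-position=5600 mysql-bin.000104 | mysql somedatabase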

Conclusion

When a disaster happens, all you want to do is get the database restored, as completely as possible, as quickly as possible. Doing this absolutely requires that you have all the data you need redundantly and independently backed up. The backups should be as resistant to error as possible (many small files rather than one huge file) and as manageable as possible, while impacting the operation of the database (in downtime) as little as possible.

The binary log functions of MySQL provide an adequate means of incrementally backing up data with very, very little overhead (a second to flush the logs once a day), and careful use of mysqldump and FLUSH LOGS can limit the downtime of the database to once a week, for hopefully not too long (it depends on how much data you have).

People may balk at the idea of the database being down for any length of time for 'maintenance', but the slight inconvenience of an hour per week at off-peak time is a vastly superior option to the heartache and hours upon hours of downtime spent trying to restore from a month-old backup.