28 January 2022

Disaster Recovery for WordPress, and specifically with an EC2 Linux Instance

A DR Backup is used in emergencies where a standard backup fails to restore a system.

A system could use RAID disks to increase reliability, but that does not necessarily guarantee a restoration. We do not use RAID due to cost.

On Amazon, one can create a “snapshot” as a very efficient and fast way to recover from a point in time. However, if the snapshot copies an already corrupted service, there is a problem. It is possible to restore an entire system from a snapshot within 10 minutes. That is fantastic.
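For those who prefer the command line to the AWS console, a snapshot can also be requested with the AWS CLI. This is only a sketch, assuming the AWS CLI is installed and configured with permission to create snapshots. The function names, the DRY_RUN switch and the volume ID in the usage line are made up for illustration.

```shell
# With DRY_RUN=1 the command is printed instead of executed (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: snapshot_volume VOLUME_ID
# Requests an EBS snapshot of the given volume with a dated description.
snapshot_volume() {
    volume_id=$1
    stamp=$(date +%Y-%m-%d)
    run aws ec2 create-snapshot \
        --volume-id "$volume_id" \
        --description "DR snapshot $stamp"
}
```

Usage would be something like "DRY_RUN=1; snapshot_volume vol-0123456789abcdef0" to review the command first, then the same call with DRY_RUN=0 to run it. The volume ID shown is a placeholder.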

Companies design their own disaster recovery backups. This suggests there is no single complete method, only an approach that reduces the risk of loss.

As individuals, we may also design a good backup. This differs from a regular day-to-day or standard backup.

As far as I know, a reliable backup for WordPress itself is the plugin called All-in-One WP Migration. This is a superb tool. There are others, which my colleagues also use. But let us suppose such backups also fail.

Before we create a backup, if using Linux, we go to the root directory via our Shell login as root user and verify a full recursive listing does not freeze:

cd /
ls -laR

If the listing freezes, we may not have a good backup. We could try an Amazon snapshot, then restore to a new volume and see if that fixes the issue. If the “freeze” is occurring on WordPress files, it will likely mean a permanent failure, but you would have to use your own skills to examine further to see if those files can be removed without impact. For instance, a plugin can be deleted and removed from the database. There are cases, I think, where data is corrupted as it is written to the database, and the database has no way of knowing that the data is corrupt. This gives you an idea of why there can be recovery problems we had not expected.
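Rather than watching the screen to see whether the listing freezes, the check can be given a time limit. The following is a minimal sketch, assuming the GNU coreutils "timeout" command is available (it normally is on Linux). The function name check_listing and the 600 second default are my own choices. Note that a listing of / may also report errors for harmless reasons, such as entries under /proc vanishing mid-listing, and that a process stuck waiting on a failing disk may not respond to the timeout at all.

```shell
# Hypothetical helper: check_listing DIR [SECONDS]
# Runs a full recursive listing of DIR and gives up after SECONDS (default 600).
check_listing() {
    dir=$1
    limit=${2:-600}
    # The listing itself is discarded; only the exit status matters here.
    if timeout "$limit" ls -laR "$dir" > /dev/null; then
        echo "OK: listing of $dir completed"
    else
        status=$?
        # GNU timeout exits with status 124 when the time limit was reached.
        if [ "$status" -eq 124 ]; then
            echo "TIMEOUT: listing of $dir did not finish in ${limit}s"
        else
            echo "ERROR: ls exited with status $status for $dir"
        fi
        return 1
    fi
}
```

For example, "check_listing /var/www/html 120" checks only the website files with a two minute limit.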

If a website has become “messy” over time and there is a failure we cannot pinpoint after extensive work, there may be all sorts of “stuff” in the database hanging around or some hard disk corruption. After all, hardware has historically been a bad enemy. We could build a fresh site, right from scratch, and simply use WordPress and plugin exports and imports, then fix the menus and widgets by comparing to a development or localhost copy. This does work and is not too hard, but you need some skill and effort to do so. Obviously, if you have a live site at https://mydomain.com, you need a development system on which to rebuild a fresh copy of https://mydomain.com. You don’t have to delete the old system, only disable it so you can go back to it if you forgot something, then wait until the new build is good and tested before deleting the old. You cannot run both systems on the same https://mydomain.com at the same time.

We approach the backup in several areas and decide for ourselves how much work to put into each:

(1) The hard disk operating system and files – a “snapshot” for everything under the Linux root directory

(2) A Unix tar file for the website (e.g. /var/www/html) – for instance:

cd /var/www/html
tar cvf backup.tar ./* ./.??*

Check the tar file can list the contents:

tar tvf backup.tar

Then you would download the tar file to your PC or wherever you wish, such as into the cloud.
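A variation on the tar step, offered as a sketch rather than the one right way: write the archive outside the website directory, so it is not left sitting where a browser could fetch it, and record a checksum to compare after the download. It assumes the "sha256sum" command from GNU coreutils. The function name backup_site and the paths in the usage line are only examples.

```shell
# Hypothetical helper: backup_site SRC_DIR ARCHIVE
# ARCHIVE should live outside SRC_DIR.
backup_site() {
    src=$1
    archive=$2
    # -C changes into SRC_DIR first; "." also picks up hidden files such as .htaccess.
    tar -cf "$archive" -C "$src" . || return 1
    # Listing the archive reads it from start to end, a basic readability check.
    tar -tf "$archive" > /dev/null || return 1
    # Write the checksum next to the archive, using the bare file name so the
    # check still works after both files are downloaded to another folder.
    ( cd "$(dirname "$archive")" &&
      sha256sum "$(basename "$archive")" > "$(basename "$archive").sha256" )
}
```

For example, "backup_site /var/www/html /home/ec2-user/backup.tar". After downloading backup.tar and backup.tar.sha256 into the same folder, "sha256sum -c backup.tar.sha256" confirms the copy matches.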

Note that the WordPress media files may have been altered if you turned on compression with a plugin such as EWWW Image Optimizer. Your original files should be intact somewhere else as uncompressed images as a matter of protocol.

WordPress introduced a maximum of 2560 pixels on image width and will scale larger images down whether we like it or not. In my view this is really bad, but there may be settings in your WP Theme to change that limit, or you can use a plugin. Again, this points to the need to keep clean copies of your original media files somewhere else.

(3) Any special files you have stored, for instance under /home/user (e.g. /home/ec2-user). As an example, you may have some copies of shell scripts you use with crontab. Other files could be copies of httpd.conf, php.ini, www.conf, my.cnf, ssl.conf and various others such as 00-proxy.conf or phpMyAdmin.conf, /usr/share/phpMyAdmin/config.inc.php and so on. You simply use these files as reference, and would not copy them to a new installation. Also use “crontab -l” to list any crontab work you did and put it into a text file on your PC.
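Gathering these reference files can be scripted. A minimal sketch, assuming GNU cp (for the --parents option). The function name collect_reference is hypothetical, and the file locations in the usage line are typical guesses that will differ between systems.

```shell
# Hypothetical helper: collect_reference DEST FILE...
# Copies each existing FILE into DEST for reference and saves the crontab.
collect_reference() {
    dest=$1
    shift
    mkdir -p "$dest" || return 1
    for f in "$@"; do
        if [ -f "$f" ]; then
            # --parents (GNU cp) recreates the full source path under DEST,
            # so two files with the same name do not overwrite each other.
            cp -p --parents "$f" "$dest"
        else
            echo "skipped (not found): $f"
        fi
    done
    # Save the current user's crontab; having no crontab is not treated as an error.
    crontab -l > "$dest/crontab.txt" 2>/dev/null || true
}
```

For example, "collect_reference /home/ec2-user/dr-reference /etc/httpd/conf/httpd.conf /etc/php.ini /etc/my.cnf /usr/share/phpMyAdmin/config.inc.php". The resulting folder can then be added to a tar file and downloaded.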

(4) The WordPress database – an SQL database backup.

This has issues around MySQL versions that potentially cause imports to fail. We can check the database is not corrupted using a mysqlcheck command, but we have no database rollbacks available. We are not a corporation with expensive software. The All-in-One WP Migration plugin should be fine, but we can go further.
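As a sketch of the check followed by a plain SQL dump from the shell, assuming the standard MySQL client tools (mysqlcheck and mysqldump) are installed. The database name "wordpress", the user "root", the function names and the DRY_RUN switch are placeholders of mine, not values from any particular setup.

```shell
# Placeholders; override them in the environment to match your own database.
DB_NAME=${DB_NAME:-wordpress}
DB_USER=${DB_USER:-root}

# With DRY_RUN=1 the command is printed instead of executed (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: check_and_dump OUTFILE
check_and_dump() {
    outfile=$1
    # --check looks for errors in every table; -p prompts for the password.
    run mysqlcheck --check -u "$DB_USER" -p "$DB_NAME" || return 1
    # --single-transaction gives a consistent dump of InnoDB tables;
    # --result-file writes the SQL directly to OUTFILE.
    run mysqldump --single-transaction -u "$DB_USER" -p \
        --result-file="$outfile" "$DB_NAME"
}
```

For example, "check_and_dump /home/ec2-user/wordpress.sql". It is worth noting the MySQL version alongside the dump, since version differences are where imports tend to fail.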

If using the all-in-one plugin, I would suggest temporarily deactivating plugins related to caching and security so we only have the plugins activated that we really need for a restore.

We can also use phpMyAdmin (or Unix database backup commands) to download the database. However, when we restore a database it sometimes fails. In that scenario it can help to have all the fundamental WordPress tables backed up (you select only those tables for an export) and then make additional backups for the other groups of plugins. This is still no guarantee. What often occurs is that a restore fails to load the menus and widgets. This is where it helps to have a localhost or development copy of a website (such as with the all-in-one plugin) so you can compare a restoration with a copy of an original. You may notice that a localhost copy uses 127.0.0.1 instead of “localhost” in the wp-config.php file.
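To export only the fundamental WordPress tables from the shell instead of ticking them in phpMyAdmin, a small helper can print their names. This sketch assumes a standard single-site install with the default "wp_" table prefix; the function name core_tables is my own, and a multisite install or a custom prefix would need adjusting.

```shell
# Hypothetical helper: core_tables [PREFIX]
# Prints the twelve tables created by a standard single-site WordPress install.
core_tables() {
    prefix=${1:-wp_}
    for t in commentmeta comments links options postmeta posts \
             term_relationships term_taxonomy termmeta terms usermeta users; do
        printf '%s%s\n' "$prefix" "$t"
    done
}
```

The list can be passed straight to mysqldump, for example "mysqldump -u root -p wordpress $(core_tables wp_) > core-tables.sql", where the user and database name are again placeholders. Plugin tables would then go into separate dumps.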

I am still old school and I prefer to stick with my database being latin1_swedish_ci. It causes me no problems.

There are some WP plugins that for whatever reason will sometimes not restore, even if you hunt around the forums for a fix and edit the database file. There is no easy way to discuss this so we have different angles at which we approach the DR backup in case one thing fails and there is another to go to. Consider this – you have a live retail website that a business depends on. It fails, for whatever reason, and you cannot restore it. This is a dangerous cliff face.

(5) You can back up WordPress pages/posts, media, plugins, and your theme – these are sets of files we Export (or write down manually). We also look for a WP Theme Export file and Global CSS settings.

(6) Written notes about how we did a number of configurations related to the operating system and WordPress – sometimes things get complex so we can’t remember how we did something. Keeping specific notes lets us rework a configuration from scratch or compare a new configuration to what we have previously done. We may have another website that uses the same configurations we can look at.

(7) Admin and User login details. Sometimes we forget this and it is problematic.

(8) Any operating system setups you need to refer back to, such as your DNS configurations, CDN, and so on. Think this way, in terms of Disaster Recovery. Imagine your whole system is gone, period. You have a blank slate and nothing to help you except your own files.

What are some typical things that get lost during a restore or transfer? Global CSS, navigation menus and widgets, and surprisingly, Sliders. This is where a development site comes to the rescue.

One may even make screen dumps to show the details for menus and widgets. However, I always, always make a text file backup using cut and paste of any widgets that use blocks, custom CSS or text. You just do not want to lose that.

This may all seem like a lot of extra work, but it is not your day-to-day backup, rather a disaster recovery backup for the unforeseen. It is up to you how much time you have, your skills, your level of risk comfort and reasonable liability. There are some folks who do not care at all, so it does not matter to them when they lose client data.

I would note that the plugin called “WP-Optimize – Clean, Compress, Cache” will be able to remove old copies of your web pages and posts, which greatly reduces the size of the database backup and restore. I like this, but would advise only to use this when a website’s current content is fully stable and you do not need reference to old pages. You can always take an important page and make it private.

Backups may fail when moving to a different database version. I have seen severe problems when moving from one web hosting provider to another because you have no control over the level of quality given by a provider. This is where it is really important to have a localhost or development system that allows one to compare during a rebuild. This can get tricky if you have to start editing the database yourself, but the all-in-one plugin I mentioned should remove these complexities.

Is it worth recording global CSS and theme settings? Yes. My experience is that at some point one may accidentally delete settings, or make changes that are saved even though the SAVE button was never pressed! You need a precious copy of your CSS and theme settings. Each web page can usually have its own CSS settings too. A WordPress page (or post) export will keep these settings for you and can be viewed in an editor.

If you have any peculiar configurations, do keep a record. For example, you may modify a language file, or the permalink names for portfolios. Keep details on such things.

For day-to-day backups, we usually make a standard backup prior to minor software upgrades. I like using the snapshot method before significant changes.

At some point in our website administration, we will have site failures. This is a given. Even if we never upgrade software the hard disk could still corrupt files or a hacker could damage the service or WordPress, despite the security plugins etc. Even the hosting provider could issue a notice that either you or they need to move the site to new hardware, so a security upgrade is necessary.

DR is an approach and method that shows we have considered what we feel is sufficient. If we run into severe trouble, we are not in the same liable situation that would occur if we had not attempted to address DR backup as part of our work. We can demonstrate our DR approach as opposed to a designer acting dumb and saying they have no idea, that client data is lost!

Over the years in my IT work, whether for a business or an ASX listed company, I have seen backups fail. This is not the same as outages. Even preparations for various kinds of outages in data centers and companies have had incidents. For us, we are not building explosion-proof building columns, hiring companies to take data tapes off-site, reinforcing our roof structures to withstand a crash from a helicopter, using mirrored cascading computers in different geographic locations to manage aircraft landings, or keeping diesel engines ready to supply power.

We ensure that our backups are derived from well protected sites. What is the point of delivering a system with concerns around lack of best practice? For example, why place a Forum database onto the same hardware space as WordPress? The security risk is too high.

As a note, after a hard disk failure, Linux repair utilities are not likely to recover critical files. Many years ago things were pretty good using the fsck command, but these days other commands, for me anyway, have not restored working systems.

I have never lost a computer IT system that was literally in peril. But I personally have lost data and know that feeling. I have seen students lose their thesis or other lesser work and cry in front of the computer screen. These are real problems for today. I understand that loss. But I have less compassion for companies that think they have installed secure backups, go merrily on their way, and then find the backups do not work in a crisis. True DR takes effort, testing, and maintenance in some way within those corporations.

I recall a power outage in North Sydney in the late 1990s. The company had backup batteries, but the computer room was pitch black. There was no power in the batteries. Today we have other types of outages. For example, if our servers are not located in Australia, will the other country be exposed to typhoons? If our Internet service goes down, such as email failing for seven days, what would we have as backup to keep running? Again, in the 1990s a famous Australian company almost lost its computer and data, for the entire company. I restored that system but was at a point in the restoration where I knew that when I pressed the ENTER key it would either fail completely, forever, or recover. I was in discussion with other professionals during the process. It recovered. During the outage, the company was able to use written books and ledgers to continue their business.

Another business, a hotel group, almost lost its data too. These critical situations were largely about insufficient disaster recovery backups and old IT systems. I cannot overstate the need to keep reasonable pace with technology. If you are still stuck on WordPress version 3 (!!!) you have an exposure. If you are on PHP version 5, or even today continuing on PHP 7.2, you have exposure.

Even highly regarded WordPress plugins can fail when they are upgraded, so recovery may mean switching to alternative plugins until the authors fix theirs.

Another problem today is more important and obvious. That is the variable loads placed on systems. Most websites do not have this problem, but there are public sites of significant importance that have not built into them an architecture to sustain variable peak loads. To me this is not acceptable. Even some “lesser” websites have not been architected to meet reasonable public peak demand, so the sites are slow or freeze. Some sites may look good, but are they using caching tools and CDN? These are real problems. If your car sputtered at each set of lights or cross roads and you found it difficult to drive off the mark, you would not put up with that. This however is exactly what a growing number of websites in the public domain are doing. I have no answer to this other than to say there are system architects who can design responsive systems, and that certain categories or segments of industry and government should be held to account to meet specific standards – as there are no such standards in place at present. Imagine a service could put a label on their site to say it meets industry standard ABC level 1, or ABC level 2 and so on. Then we know where we stand.

As an example of tomfoolery, data centers must record metrics (data) on their outages, show how they address an outage problem with staff training and system changes, and pay penalties. These are real drivers. But, we find some entities advertising to the public that they have near 100% uptime and yet have no such processes in place, so the experience is that the services fail and people jump onto the forums to give their stories about excessive and painful downtime.

We need standards – but they are only met by those corporations who have a good IT culture, finances, architects, and best practices.