There were significant problems with administration G: drives this month. Many staff were unable to get access to their shared folders for several days and some files are still unavailable. The fault caused considerable difficulties for the staff affected and has been a particularly difficult problem to solve.
This article describes in detail what happened, the steps that IT Services have taken to address the problem and the work that is yet to be completed.
Hardware fault
During the evening of Monday 12 March, there was a hardware fault in the storage server used by a number of systems, including the G: drives for staff in central administration teams, student webmail and several other services.
The fault did not cause a complete failure. Instead, the storage system continued to function with corrupted data. By Tuesday morning, the corruption to files had begun to cause problems for users.
First steps to resolve the issues
The normal procedure would have been to switch over to the real-time copy of the G: drive, which is an exact copy of the disk. However, by this time the replication software had already copied many of the same data corruptions across to that copy.
So instead, IT Services staff attempted to repair the corrupted services. This took most of the day but by 6:45pm, we had fixed the majority of student email. One of the five mail stores was so badly affected that it could not be repaired and did not work reliably until Wednesday morning.
The process to repair the G: drive files was even more problematic. When we realised it would not be possible to fix the corruptions to the data, we began the process to restore the entire drive from backup tapes.
Restoring from backup
We make daily copies of all data on central storage systems. A complete backup, which takes a considerable time to create, is made during each vacation. We then take nightly copies of any files that have changed during the preceding day.
Because the problem with the storage system happened towards the end of term, we faced the worst-case scenario of having to retrieve data from over 50 backup tapes. The process began on Wednesday 14 March but, partly because of the very large number of files involved (just under 1.5 million), did not complete until Friday afternoon.
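To illustrate why a failure late in term is the worst case, here is a minimal sketch of restoring from one complete backup plus a series of nightly copies. It is not the backup software we use: the function name, the file names and the assumption of one tape per backup are all hypothetical, and deleted files are ignored for simplicity.

```python
def restore(full_backup, nightly_copies):
    """Rebuild the latest state of a drive (illustrative model only).

    full_backup: dict mapping file name -> contents at the last complete backup.
    nightly_copies: list of dicts, oldest first, each holding only the files
    that changed on that day.
    """
    state = dict(full_backup)      # start from the complete backup
    for changes in nightly_copies:
        state.update(changes)      # later copies overwrite earlier versions
    return state


# Hypothetical example data: one complete backup and three nightly copies.
full = {"budget.xls": "v1", "minutes.doc": "v1"}
nightlies = [
    {"budget.xls": "v2"},
    {"report.doc": "v1"},
    {"budget.xls": "v3", "minutes.doc": "v2"},
]

restored = restore(full, nightlies)

# Assuming one tape per backup, every night since the complete backup adds a tape.
tapes_read = 1 + len(nightlies)
```

In this simplified model, the number of tapes to read grows with every night since the last complete backup, which is why a fault near the end of term involves the longest chain.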
Missing files
During the restoration process, we attempted to give staff read-only access so that they could start working on recovered files as they were transferred from the tapes. This helped to some extent, but at the end of the process it became clear that some files had still not been restored.
We very rarely have to restore an entire drive from tape, and the process highlighted a flaw in the backup software that we use. Although all of the files had been safely recorded to tape, the index proved to be incomplete. As a result, a number of files could not be located on the backup tapes, leading to gaps in the restored G: drive.
Rebuilding the index
We have now begun a secondary process to rebuild the index by re-cataloguing the contents of the backup tapes. We are working on this with the providers of the backup software but anticipate it will take a considerable time to complete.
Additional recovery work
Because of the likely delay, we have also gone back and recovered as many files as possible from the damaged storage disk. This task was completed on Friday 23 March and as a result, the majority of G: drive files have been returned to staff and are now available as normal.
Outstanding files
However, a few files are still missing. These are files that were affected by both issues: they were corrupted on the original disk and are also missing from the backup index. We are in touch with a number of staff who have spotted gaps in their files, and we will be able to restore them when the re-indexing of the backup tapes is complete.
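The overlap between the two issues can be pictured with a small sketch. The function and the file names below are hypothetical and purely illustrative: a file stays outstanding only if neither recovery route could reach it.

```python
def outstanding_files(all_files, corrupted_on_disk, missing_from_index):
    """Return the files that neither recovery route can supply (illustrative)."""
    recovered_from_tape = all_files - missing_from_index  # located via the index
    recovered_from_disk = all_files - corrupted_on_disk   # read from the damaged disk
    return all_files - (recovered_from_tape | recovered_from_disk)


# Hypothetical example: four files, two affected by each issue.
all_files = {"a.doc", "b.doc", "c.doc", "d.doc"}
corrupted_on_disk = {"b.doc", "c.doc"}
missing_from_index = {"c.doc", "d.doc"}

outstanding = outstanding_files(all_files, corrupted_on_disk, missing_from_index)
```

Here only `c.doc` is outstanding, because it is the only file hit by both problems; `b.doc` comes back from tape and `d.doc` from the damaged disk.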
Improving systems
An incident like this is extremely rare. In fact, a fault affecting data on this scale has not happened before at Sussex. But the problem has highlighted a number of areas where systems could be improved, particularly the type of hardware that is used to store data, the indexing of backup data and the time it takes to restore files from tape.
Despite the problems of the last two weeks, the shared G: drive remains the best way to work with corporate files. Other methods, such as free online storage systems (e.g. Dropbox, Google Docs) or synchronising with external hard drives, have particular disadvantages. In the coming weeks, we will be working to improve the resilience of the central storage systems to ensure staff and students can use them confidently and effectively.