Storage Spaces recovery war story

Storage Spaces is a technology that is mainly used by large file servers, but Microsoft has also brought it to Windows 10 and to Windows Server Essentials – what used to be Small Business Server. You may use this technology when you need a large disk space that can dynamically grow, by striping many physical hard disks. It is a software RAID system that lets you create a Storage Pool out of many physical hard disks. You can have a simple striped configuration (similar to RAID 0), a mirror configuration (RAID 1) or a parity configuration (RAID 5). Storage Spaces lets you add more disks as well as remove old or broken ones. In a mirror configuration you are safe from a single disk failure; some configurations can survive multiple disk failures, but they require many hard disks. PowerShell commands as well as an administrative user interface let you manage Storage Pools and the Virtual Disks that you allocate from them.
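For example, creating a mirrored pool and allocating a virtual disk from it takes only a few PowerShell commands. This is a minimal sketch; the pool and disk names are illustrative, not the ones I use later:

# List the disks that are eligible to join a pool
Get-PhysicalDisk -CanPool $true
# Create a pool from all eligible disks, then a mirrored virtual disk on top of it
New-StoragePool -FriendlyName "DemoPool" -StorageSubSystemFriendlyName "Windows Storage*" -PhysicalDisks (Get-PhysicalDisk -CanPool $true)
New-VirtualDisk -StoragePoolFriendlyName "DemoPool" -FriendlyName "DemoDisk" -ResiliencySettingName Mirror -UseMaximumSize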

This is going to be a long, diary-like blog post that describes over a month of fighting to bring a 35TB logical disk (70TB of physical disk space) back to life. If you are in such a situation, I’m pretty sure that you’ll read every word 😊 – I read everything I can when I search for a cure to a problem.

Friday, November 30, 2018

I have never lost a single file. Hard drives are fragile, and internet bandwidth is still not fast enough to back up large amounts of data in a timely manner. My solution for file insurance: have local redundant storage as well as a cloud-based backup.

I have a Storage Spaces pool that has grown larger over the years. It has 12 hard disks in a mirror configuration. The smallest pair of disks is 3TB, then there is a group of 4TB, 6TB and 10TB disks, and lately I added two 12TB disks. Altogether this gives me about 70TB, which is 35TB in a mirror configuration.

You may ask why I need such a big disk space. Most of the space is devoted to computer backups, which I only keep locally as opposed to sending to a cloud backup – they are too big and change daily. Another big chunk of storage goes to my raw camcorder files and my video editing work. I also use it for other purposes such as my company’s financial documents, course materials that I develop, and more.

For cloud storage I use Code42 CrashPlan, but “only” 4TB is stored there. This includes the most important files, excluding local computer backups and files that were originally downloaded from the web and can be recovered by re-downloading them (MSDN subscription ISO files, for example). All my pictures and some of my final video files have a copy on OneDrive (for easy phone access) and on Amazon Drive (for the Amazon Echo Show).

My Storage Pool setup is a bit strange, but I have a good excuse for that. Originally, I was using Windows Home Server 2011 to run my small home office and my home computers – mainly as a file server, for computer backups and for internet remote access. When I moved to Windows Server 2012 Essentials, I created a new large Storage Pool, allocated a virtual drive and copied the files over. I bought several big hard disks, and for a convenient setup process I connected them to my PC. I created the storage pool using the Windows 10 Storage Spaces capability, which is like the Storage Spaces feature of Windows Server; using my PC made it easier to copy from the old server to the new storage space. After copying the files from the previous server over the network to the new storage space on my Windows 10 machine, I removed the old disks from the server and added them to the storage pool on my Windows 10 machine. This was the new pool, and there was no way back, since the original disks had joined the new pool.

I built a new server, much stronger than the old one. It had a server board and a Xeon CPU with 64GB of RAM. The board has 13 SATA connectors, which was very important for the ability to add more hard disks in the future.

On this powerful hardware I installed Windows Server 2012 Essentials as a virtual machine. I did it by the book, installing a Hyper-V server host and installing Server Essentials as a guest.

I moved all the storage pool disks from my Windows 10 machine to the Windows Server 2012 Essentials machine, but the system did not recognize the pool. I realized that the problem was a version mismatch: the Windows 10 Storage Spaces version was newer than the Server 2012 version. So I solved it in a very original way – I replaced the Hyper-V server host with a freshly installed Windows 10 client OS and made it the Hyper-V host machine. Windows 10 recognized the large disk! I used the Disk Management MMC snap-in to mark the Storage Spaces virtual drive offline and added it as a pass-through physical drive to the Server 2012 virtual machine. It worked!
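In PowerShell terms, the offline-and-pass-through step looks roughly like this (a hedged sketch; the disk number and VM name are illustrative):

# Take the Storage Spaces virtual disk offline on the host
Set-Disk -Number 14 -IsOffline $true
# Attach it to the guest VM as a pass-through physical disk
Add-VMHardDiskDrive -VMName "ServerEssentials" -ControllerType SCSI -DiskNumber 14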

When Server 2016 came out, I upgraded the system (the virtual machine) but left Windows 10 as the host that manages the storage pool. Over the years I added more and more disks. Each time I added a new disk, I shut down the server virtual machine, brought the storage pool drive online again, managed the storage pool, and took it offline again.

At some point, it became almost impossible to add more disks. I still had enough SATA sockets, but I had no room in the server enclosure. Therefore, I designed and built a 3D-printed case.

Eventually I ran out of room in the server enclosure, as well as SATA connectors on the board. I didn’t pay enough attention to the risk of adding disks to the storage pool over USB-C. I assumed that a 10Gb/s transfer rate would be enough; after all, Storage Spaces is a Windows 10 feature, and a home user can use it on any hardware configuration. I was wrong!

Bad things happen

Last week I decided to add more storage to the pool since it was almost full. Because I don’t have room in the server case for more disks, I bought a four-disk enclosure box from Amazon and ordered two 12TB disks.

I used a USB-C card that supports 10Gb/s to connect the external drive. After adding the disks, there were times when the server got slow. Looking into the event viewer revealed a problem with “UASPStor” – “Reset to device was issued” – every 15 seconds, and from time to time this warning: “The IO operation at logical block address 0x24e049300 for Disk 13 (PDO name: \Device\000001df) was retried.” I tried to install a newer driver but lost connectivity to the drive, so I rolled back the driver version. Since the USB-C card is based on an old chip, I ordered a new one. I also followed these steps that I thought would help resolve the problem.

I thought that my problems were over, but a few days later I had another issue with the storage space: the virtual disk disappeared. Looking at the Storage Spaces UI, I saw that the pool was unhealthy and one disk was missing. I tried to remove the disk and reconstruct the virtual drive, but the storage pool job was stuck at 0%. I read all I could find on the web about what can be done, just to find out that many people simply give up and restart from scratch. I was determined to fix the problem and not give up.

I tried every PowerShell command related to Storage Spaces that I could find, from Get-PhysicalDisk and Set-PhysicalDisk to starting and stopping recovery and optimization jobs. The problem was that the recovery job returned immediately with success without doing anything, and the other jobs were stuck.

PowerShell:
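These are representative examples of what I was running (a hedged sketch, not my exact session):

# Inspect pool, disk, and virtual disk health
Get-StoragePool | Select-Object FriendlyName, HealthStatus, OperationalStatus
Get-PhysicalDisk | Select-Object FriendlyName, HealthStatus, OperationalStatus, Usage
Get-VirtualDisk | Select-Object FriendlyName, HealthStatus, OperationalStatus
# Kick off a repair and watch the storage jobs
Get-VirtualDisk | Repair-VirtualDisk -AsJob
Get-StorageJob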


I couldn’t remove the missing drive, and “preparing for removal” did nothing. I also saw that the new drive I had added had the “Prepare for removal” command option while the other drives did not; it looked like the pool didn’t take the new drive as a replacement for the old faulty one.

I tried different commands, such as:

Repair-VirtualDisk > returns almost immediately and does nothing.

Remove-PhysicalDisk > can’t remove, since the current virtual disk is unhealthy.

After spending a whole day and most of the night, I crashed for four hours.

Saturday, December 1, 2018

Recovery actions

In the morning, I started by restoring the most critical files from Code42 CrashPlan.

I then decided to check a tool that I had found in one of the technical group answers: the ReclaiMe Storage Spaces Recovery tool. I also sent an email asking for support from friends at Microsoft. The problem was that by then it was already the weekend. I got some answers that suggested attempting a few PowerShell commands that I had already tried. They also asked me to send some Storage Spaces log files:

Microsoft-Windows-StorageSpaces-Driver%4Diagnostic.evtx

Microsoft-Windows-StorageSpaces-Driver%4Operational.evtx

I knew that it would take time to get the answers; I decided to continue trying to get my files back.

The ReclaiMe tool found the storage pool – in fact it found three of them, although I only had one. I decided to pay for the tool ($300) and pressed the “Find drives” option. This is the result:

I thought that this tool could fix the problem; however, the tool lets you restore the files by copying them to another drive, which requires extra storage. On my PC I have 1TB of free space, and I found additional free disk space on other computers on the network. I also added some 1TB and 2TB disks that used to be in the server and were sitting in my drawer. With this extra space I could start the restore process. I restored from CrashPlan as well as from the original disks, using the “ReclaiMe File Recovery Standard” tool that I had to purchase in addition to the Storage Spaces recovery tool. At first it seemed unfair – I had just spent $300 on a recovery tool, and now, to do the actual recovery, they asked me to pay more. But looking at the mail they sent after I acquired the Storage Spaces recovery tool, I saw that they give a huge discount, and the File Recovery tool is almost free.

The recovery process takes a lot of time, but it lets you copy files that it has already found, and it knows how to rebuild the original folder structure. I only wish it could repair the storage space instead of copying the files, since copying requires disk space similar to the original. I decided not to recover the computer backup files – they are 14TB; I’ll do a fresh backup once the server is up and running again.

After more than 10 hours it had scanned only 2% of the disk, but it had found most of the files, so I decided to recover some of the folders before letting it continue scanning. According to the documentation, this means that I may not be able to copy everything until the scan is over. Since these files originated from the web, I could download them later if I needed any of them. One thing to notice is that from time to time the ReclaiMe File Recovery tool complained about a missing hard disk:

This told me two things. First, the USB-C link was still having issues, and this may be the reason I couldn’t repair the storage space; I was planning to get the other USB-C card soon, or I might instead purchase a SATA expansion card. The second, annoying thing is that the recovery process waits for human interaction – pressing the OK button. To press the OK button automatically I tried three programs: Buzof, DialogDevil and ClickOff. None of them worked, so I wrote a piece of code that did it:

#include "pch.h"      // Visual Studio precompiled header
#include <iostream>
#include <Windows.h>

using namespace std;

int main()
{
    cout << "Starting...\n";
    int times = 0;
    while (true)
    {
        // Find ReclaiMe's "Drive offline" dialog by its window title.
        HWND hParent = FindWindow(nullptr, L"Drive offline");
        if (hParent == nullptr)
        {
            cerr << "Can't find parent window, retry in 10 seconds" << endl;
        }
        else
        {
            ++times;
            cout << "Posting message for the " << times << " time." << endl;
            // Closing the dialog has the same effect as pressing OK (retry).
            PostMessage(hParent, WM_CLOSE, 0, 0);
        }
        Sleep(10000);
    }
}

I compiled it as a statically linked C-runtime, x64 Win32 console app and ran it as an administrator. With this I no longer had to press OK manually. I think ReclaiMe should change the dialog and add a timer that automatically retries.

Since file recovery is not as robust a method as restoring from a backup, I decided to restore everything that I have on CrashPlan from CrashPlan, and to restore everything else using ReclaiMe.

Sunday, December 2, 2018

New SATA card – New Day?

I went to work, knowing that in the evening I would have to continue my Storage Spaces restore work. When I came home, a package from Amazon was waiting – a new SATA PCIe expansion card. The ReclaiMe File Recovery scan was at 13%, and my code had dismissed the warning dialog about 100 times. I decided to stop it. I just wished ReclaiMe had a way to stop without losing the progress – a reasonable feature when we talk about a process that takes days to finish. More about that later.

I shut down the machine, replaced the USB-C card with the new SATA card, and somehow managed to pack the additional disks into the server.

I was hoping that the power supply could handle the two additional hard disks – a total of 16 disks in one box. According to this PSU online calculator, my PSU could handle it, and even more drives if need be.

I turned the machine on, and when it was up I saw that it recognized the disks. This is a strong feature of Storage Spaces: as a software-based RAID, it knows how to rebuild the storage pool no matter whether you swap drive locations, move disks between SATA and USB connectors, or even move them to a whole different Windows machine. With a little more hope, I tried the PowerShell storage commands and the Storage Spaces UI; to my disappointment, nothing had changed.

I noticed that Repair-VirtualDisk shows something for a very short time. I used Camtasia Studio Recorder, captured a video, slowly moved frame by frame, and this is what I found:

It lasts only 3 frames of the video. Very strange behavior.

Back to plan A, I needed to restart the file recovery process. I lost almost two days of scanning, but now the scan should be faster, since the connection to the disks is via SATA and not a broken USB link.

After four hours it was at 0.27% – no faster than the previous scan – but the annoying disk-disconnected dialog box was gone.

I contemplated another approach to fixing the virtual disk: maybe I should install a new Windows Server 2019 as a dual-boot system and use its advanced tooling to repair the file system. This might work; however, it might also be destructive – would I be able to use the repaired disk back on Windows 10 and Server 2016? I sent another email to Microsoft, knowing that answers would come only on Monday afternoon, when it is morning in Redmond.

I couldn’t go to sleep yet; I decided to search those log files that I’d sent to Microsoft, and I found this:

Drives hosting data for virtual disk {8420A2E7-6021-4294-A856-3CF76D94B11E} have failed or are missing. As a result, no copy of data is available at offset [Slab 0x0 Column 0 Copy 0 State 4 Reason 5] [Slab 0x0 Column 0 Copy 1 State 2 Reason 1] [Slab 0x0 Column 0 Copy 2 State 3 Reason 7] [Slab 0x0 Column 0 Copy 3 State 3 Reason 7] [Slab 0x0 Column 0 Copy 4 State 3 Reason 7. ]

How does this happen on a mirror array? Is there a way to fix it – at least bring the disk back, even with some corrupted files – or is my last resort to wait a week until ReclaiMe finishes scanning everything and then restore the files to another set of disks? I started to realize that restoring from backup was probably my only hope.

Before I went to sleep, I checked the CrashPlan restore process; the Video folder was at 664GB out of 1TB. At least one process is going well.

Monday, December 3, 2018

It is 09:30 PM now; I am back from work and a family Hanukkah candle lighting. ReclaiMe has scanned 16% so far – much faster than the previous scan, almost twice the speed, so the SATA card does improve scanning performance. The CrashPlan restore of my video files is almost done; it says that it has downloaded 1TB of 1TB but that it needs 15 more hours to finish. I hope it will be faster.

Problem Hypothesis

During the day I was thinking about the root cause of the problem, and this is my hypothesis. The two new 12TB disks were connected via the USB-C PCI card that had a repeating reset problem. When the old 4TB disk failed and the server did not respond, after a while I had to turn it off. When it came back, the Storage Spaces subsystem realized that there was a problem: some data appeared in the Storage Spaces management blocks but was not found on any disk (the failed 4TB disk and the new 12TB disks with the reset problem). In such a case the virtual disk is unhealthy and the system will not bring it up. Is there a way to fix the file system, even with the risk of losing some files? Do I want a file system that is fixed at the cost of some lost files? Maybe, just for the sake of easy restoration. With ReFS there is no ChkDsk utility, since the file system is resilient: if you change a file, it writes a copy of the data to another disk location and uses a transaction-like mechanism to commit the change. ReclaiMe says that this behavior leaves many copies of the files on the disk and allows restoring even older versions of these files. Later I learned that there is a Task Scheduler task – a data integrity scan that scans fault-tolerant volumes for latent corruption – and that it had never run on my machine, since the disk is marked as offline…
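You can check when (or whether) that scrubber last ran with PowerShell; a small sketch, using the task path that appears later in this post:

# Show the data integrity scan tasks and when they last ran
Get-ScheduledTask -TaskPath "\Microsoft\Windows\Data Integrity Scan\" |
    Get-ScheduledTaskInfo |
    Select-Object TaskName, LastRunTime, LastTaskResult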

Power Outage – No way!

The power company left a note notifying us that there is a planned power outage for two hours tomorrow morning. I don’t remember the last time they shut off the power – this is something they don’t do often – but it is going to stop the file scanning. I decided to hibernate the machine in the morning and continue the scan when I’m back home in the afternoon. To do so I had to enable the Windows 10 Hibernate option; I usually disable hibernation to spare the SSD disk space that the Hiberfil.sys file occupies. The command powercfg /h /type full re-enables it. No, I don’t see the Hibernate option in the power menu. So let’s try one of the options here. Still no Hibernate option. Let’s try hibernating by issuing the shutdown /h command. The computer shuts down almost immediately. Oh no, it came back with a standard boot. Back to square one with the file scanning ☹

I got the Hibernate option 😊; at least I could restart the scan now, hibernate in the morning, and resume from noon.

I am trying Hibernate again, now from the power menu. No, a standard boot again… ☹

I checked the event log:

Windows failed to resume from hibernate with error status 0xC000007B.

Hibernation did take place, but the resume failed. Too many problems… I decided to stop for the day and resume the file scanning tomorrow, after the power outage.

I’ve got an email from Microsoft. They think the problem is happening because more than one drive is having errors. They told me that they still have not given up and that they will reach out later today to tell me what to do. Their assumption is like my hypothesis.

Tuesday, December 4, 2018

I am back at home; the power outage took place in the morning, and by noon the power was restored. My wife told me which UPSs worked and which turned off immediately. At least now I know which UPS needs a battery replacement.

In two hours the Microsoft Connect() online conference should start. Until then, I’ll restart the recovery process. On my way home I stopped at a computer store and bought another 10TB hard drive. I am going to use it as a file-restore destination.

I ran ReclaiMe File Recovery and opened the XML file that stores the Storage Spaces disk array discovered by the ReclaiMe Storage Spaces Recovery program, only to find that the file was no longer valid, since there is a new drive.

I had to run the Storage Spaces Recovery again.

I had to identify the disks that are in the array – or rather, those that are not. As you can see, they are Disks 5, 10 and 15.

The discovery process takes time:

And we have a go…

Checking my email, there is a message from Taylor, the Microsoft software engineer who is kindly helping me solve the problem:

Okay, looking at these logs, it looks like two other drives are intermittently failing IOs.  Both of your WDC WD12 1KRYZ-01W0RB devices are seeing IO errors including timeouts and errors indicating the device is no longer present.  It also seems that some of our resilient metadata got hit by these failures, and that’s why the space isn’t attaching.  Luckily this is something I can fix, but I’ll need to get a tool out to you.  It will probably take until tomorrow or Wednesday, but I should have something that at least will let you attach the space.

Thanks,

Taylor

Amazing – there might be a light at the end of the tunnel! Meanwhile I’ll continue with the restore process. I started to copy the files that were downloaded from CrashPlan – about 1.5TB by now. I also started to restore other files.

Wednesday, December 5, 2018

No sign from Microsoft yet; it is 1:30 PM in Redmond, so there is still a chance that I will get the repair tool today. When I looked at the CrashPlan restore process, I found out that one of the hard drives that I got from my drawer was failing. I removed it and continued using only the 2TB drive, which is newer. When I tried to restart the CrashPlan restore, I didn’t find any files. It took me some time to understand that CrashPlan had marked all the server backup files as deleted. I don’t know why – maybe because over a week had passed since it last had a connection with the server. I contacted CrashPlan support; they opened a ticket and will reply later. I asked the CrashPlan software to show deleted files, and I resumed the download process.

Up until now I have restored 419,428 files – 1.82TB of storage – from CrashPlan. ReclaiMe has found 2,542,741 files – 20TB – and it has only scanned 11% of the disk.

Thursday, December 6, 2018

Still no answer from Microsoft. The restore from CrashPlan continues. I’ve got an answer from CrashPlan support:

Hi Alon,

Thank you for contacting Code42 technical support.

My name is Lawrence and I am helping out Cecilia on some of her tickets as she is out of the office.

I am happy to see that the error message has gone away and that you can access your restore. For the files being marked as deleted, this happens when CrashPlan can no longer see files and folders during a file verification scan. This can happen if the files/folders have been deleted or are no longer accessible at the time of the scan. CrashPlan by default, does not remove these files and will update their status in the backup instead. You can verify that your settings are this way by checking your deleted file retention settings.

Once the files have been restored and a file scan can see and access them again, you should see the files updated properly in the backup and no longer marked as deleted files. Please let me know if you have any questions or comments.

Best regards,
Lawrence J
Customer Technical Support
Code42.

OK, no problem with CrashPlan. The ReclaiMe scan is also continuing; by the end of the day it was at 18%. I decided to copy some of the usable files from the content that ReclaiMe has found so far, just in case the scan stops again and I need those files. I will have to restore these files again after the scan is over, since there might be other files in those folders that the scan has not yet found.

I asked ReclaiMe support the following questions:

  1. Is there a way to know whether a scanned folder contains all its files, or do I need to wait until the scan is over before copying it?
  2. Will I know if there are files that the scan did not find?
  3. Can I save the progress, so that if the machine goes down because of an error or power failure, I can restart from the middle and not start the scanning process again (I have already had to restart it twice and lost three days)?

I hope that they reply soon.

Friday morning, December 7, 2018

My server has now been off for a week. I am planning to bring it online using the file system that I have already restored. It will work with one 10TB disk while I continue to scan and restore the other files. I can do that since the server runs as a virtual machine: I’ll change the disk setting (drive E) to use the new 10TB disk instead of the old Storage Spaces based virtual disk. I hope it will work. I will stop the client computer backup service; I need much more disk space for backups.

Friday afternoon:

I decided to postpone my intention to bring back the server today and wait another day or two, until I restore the “File History Backup”. It weighs about 1TB and is backed up to CrashPlan. I don’t want to have another unstable service, and client computers need this folder to push the history files from their local cache.

Checking the file history backup on CrashPlan, I found a problem; I need to change my plan of restoring “File History Backup”. According to CrashPlan, I didn’t back up all of it – I didn’t back up my own file history files:

I did that because I have a separate backup of my Windows 10 PC and my Surface Book 2 to CrashPlan. When you do a cloud backup and you have too many files that may change and need to be backed up, they compete over the network upload bandwidth, so you should choose a backup set that contains the most important files. Since my desktop and laptop are backed up to CrashPlan anyway, I decided not to back up file history. I am not sure it was a wise decision. File History keeps previous versions of files at a default granularity of one hour, i.e. every hour it saves all tracked files that have changed since the last time it saved them. If you write a document or edit source code, you can go back to previous versions. The server backup has a granularity of one day, and if you go to older backups, the granularity becomes weeks and months. Today most of our files end up in the cloud – in our mail server, SharePoint, Google/Amazon/OneDrive/Dropbox drives, social networks, GitHub… Some of those places also have versioning capability, but File History does it locally, every hour. CrashPlan also gives you history (old versions) backup, at a one-day granularity.

I had to decide whether to reinitialize the File History Backup (after all, I can always go to CrashPlan and look for old files there) or wait until I restore it with ReclaiMe.

Talking about ReclaiMe, I got a mail from their support:

Hello Alon,

In ReclaiMe File Recovery there is no capability to save the state of the software, but we do have such capability in our ReclaiMe Pro software (www.ReclaiMe-Pro.com) which is designed for data recovery technicians. I have issued a full key for you so that you can solve your case (35 TB refers more to complex cases rather than to “home user” cases):

XXXX-XXXX-XXXX-XXXX

Download the software at http://ReclaiMe-Pro.com (request the link) and then activate with the key above.

Please do not publish the key in the web, it is just for you)

Also, did you check the files the software found? I mean did you preview them, preferably images or pdf? Are they OK?
With ReFS, you need to wait till the end, we do have algorithms to bring ReFS data earlier into the scan (and most data are brought within 3-5 % of the scan) but still some files can be found at the end.

Best,
Yulja P.


ReclaiMe Support Team


They are so kind and helpful. Do I want to stop the scan and start over with the “Pro” version? No – I will continue with the standard version scan, and only if it fails will I switch to the Pro version.

The “Pro” version has other capabilities, such as Partition Recovery, even for Storage Spaces. I don’t think it will fix my problem, since the problem is not in the partition but in the slab metadata, but I can try it once the scan is over and after I copy all my files.

The CrashPlan restore is over. The only folder that I didn’t restore is File History Backup, which does not contain the history of my user. I decided to restore the backups of all other computers/users from CrashPlan and to restore the file history backup of my own user with ReclaiMe. When this restoration process is over, I will bring the server back.

Saturday, December 8, 2018

New day, new problem. One of the disks is causing problems again:

This time I’m a bit nervous. Last time the problem was the USB-C connection; now everything is connected via SATA. The SMART information says that there are many Interface CRC Errors:

The problem might be the SATA cable. I am not going to fix it until the scan is over. I again ran the code I wrote that automatically dismisses the error dialog. I’m almost sure that the problem is not the hard disk but the connection – or the ReclaiMe app itself, since there is nothing in the system event log that reports any disk problem. ReclaiMe reports two different disks getting disconnected, and having the same problem with two different disks seems unlikely to me. Anyway, my “dismiss dialog” application does the job.

I also decided to copy all the files that the scan has already found, just to be on the safe side, and not regret later that I didn’t do it when I could. The File History Backup files are not yet fully discovered by ReclaiMe; you see folder names as numbers:

ReclaiMe does not yet know the real folder names. This means that you can find files but can’t reconstruct the folder tree. I can’t use it until it finds everything.

I shot Microsoft another email, just to let them know that I still need their support.

Restoring files took all day; by now it has restored 2TB, and it will probably take the night to finish. I checked some of the files and the directory structure, and I feel good about the restoration results. Most of the folders contain all their files; only the File History Backup is not in good shape yet, but that’s not as important. Anyway, I’ll let ReclaiMe finish the scan and then restore the files again.

Sunday, December 9, 2018

The restore process is over. It restored about 4TB of data, and it looks good. I also looked again at the File History Backup folder and saw that the scan had found some of the folder structure, but not under my account – the one that I did not back up to CrashPlan. The scan process is now at almost 43%, so there is still a good chance that it will find all of it. I began thinking about restoring the Client Computer Backups.

It weighs 11.5TB, and it looks like the scan process has found all of it. There are two concerns: it will take about 3-4 days to copy it to another disk, and I don’t have such a large disk. I have several days until the scan is over to decide what to do.

Monday, December 10, 2018

It has passed 50 percent!

I’ve got an email from Taylor from Microsoft. He would like me to collect more information before he can send me the tool that will fix the problem. I replied, asking whether it is safe to run the data-collection commands while the scanning process is ongoing. I will try to fix the pool after the scan and recovery are over – to be on the safe side.

Tuesday, December 11, 2018

I’ve got a mail from Taylor; he agrees that it is best to finish the scan before doing anything else.

Wednesday, December 12, 2018

In the morning, the scanning process was at 67%. The auto-dismiss program has closed the disk-disconnected dialog 4,700 times, but the dialog now appears less frequently. I sent an email to Taylor asking whether he thinks the pool can be repaired to a state where it is safe to use, or whether I will only be able to attach the virtual drive and copy the files over to a new disk array. If I can fix the pool, I will take the risk and not copy the Client Computer Backup folder; otherwise I may buy an additional 12TB hard disk just for that.

The File History Backups folder tree is still not fully discovered by ReclaiMe. I think this happens because this folder changes often and contains many deleted files that ReclaiMe finds anyway. I’ll wait until the scan is over, but there is a good chance that only the repair tool I’m expecting from Microsoft will recover this folder.

Wednesday Noon

A phone call from my son: “Dad, there is a power outage, all the UPSs at home are beeping.” Oh no, not again ☹

Wednesday Evening

I’m back home – back to square one. This time I decided to run the tool that I got from Taylor from Microsoft. It worked and produced lots of data. I zipped it (1.7GB after zip) and uploaded it to OneDrive; once the upload is done, I’ll share it with Taylor. Meanwhile I installed ReclaiMe Pro. It has a different user interface and many more options. It looks a bit outdated, but I don’t need it for its looks – I need it so I can scan the disk and save the progress.

It runs much slower; maybe there is something I didn’t set correctly. The “Save state” button is disabled, but the documentation says that it will be enabled after a few minutes of scanning.

Thursday Morning, December 13, 2018

I’ve got an answer from ReclaiMe support saying that the scanning speed should be on a par with the standard version. And indeed, the scan became faster; it is now the same speed as the ReclaiMe standard edition. However, the “Save state” button is still disabled.

Thursday Afternoon

I’m back home. Back to square one. For some reason the scan stopped – the PC rebooted:

I thought the system should not update that often, because I had set the Update advanced option to not install updates unless they are very important:

I guess the 160 days had passed… Anyway, I have now asked the system not to install updates for 35 days.

Friday, December 14, 2018

The new scan has reached 7.8%; it found the main directory structure and most of the files, but the “Save state” button is still disabled. I really need to save state if I want this scan to come to an end.

Saturday, December 15, 2018

Scanning continues, still the “Save state” button is disabled.

Sunday, December 16, 2018

Scanning continues, still the “Save state” button is disabled.

Monday morning, December 17, 2018

Scanning continues; the “Save state” button is still disabled, and it is now at 64%. I’ve got an email from Taylor from Microsoft. The email contains a tool and instructions that are supposed to fix the problem. I really want to run the commands, but doing so will stop the file scanning, and if something goes wrong, ReclaiMe might not be able to work anymore. I decided to finish the scan, copy the files again (excluding the Client Computer Backup), and then run the tool. I must do it this week, before the people overseas go on Christmas vacation and there is no one who can help. I think that on Wednesday I can start the copy, and on Friday I will run the tool.

Monday afternoon

I’ve got an answer from ReclaiMe support. They asked me to try to pause the process and see whether the “Save state” button becomes enabled – and it did.

Tuesday evening, December 18, 2018

The scan is almost over. I saved the progress again.

Tomorrow I’ll copy the files!!!

There are still lots of unclassified files – files that were found but whose location in the directory structure is not known. I hope they will be sorted out when the scan is done. If not, I’ll copy them anyway, just to have them in case I find that an important file is missing. I hope I won’t need them because of the fix I got from Microsoft, but I’m doing it to be on the safe side. As Taylor from Microsoft said: “Better safe than sorry”.

Wednesday morning, December 19, 2018

The scan is over; to be on the safe side, I saved the final state. I started copying the files. It found 400GB of unclassified files, which is either strange or a sign of a real problem with the integrity of the file system. It says that the copy process of about 5TB is going to take about 32 hours; I hope that this is a wrong early estimate.

Wednesday evening

I’m back home. After more than 12 hours, the copy process had stopped because of an “overflow” problem. I resumed the copy, but I probably lost 10 hours. Now it says that it is going to take 7 days to copy the files – a wrong estimate for sure.

Thursday Afternoon, December 20, 2018

The copy is over – I hope, because there is no sign that the copy finished successfully. I checked the properties of the root folder and found that it probably contains a copy of all the restored files I needed. With 5.33TB and 1,962,497 files, it looks promising.

It’s time to run Microsoft’s tool that should fix everything.

Reboot, and: Get-VirtualDisk MainPool | Connect-VirtualDisk

Starting the virtual machine, the server is back.

Yes, the disk is back. Now I need to run:

Start-ScheduledTask “\Microsoft\Windows\Data Integrity Scan\Data Integrity Scan”

I’ve got this in the event log:

OK, I think I know what happened. I ran the last command on the server VM. According to this: https://github.com/MicrosoftDocs/windowsserverdocs/blob/master/WindowsServerDocs/storage/refs/integrity-streams.md, ReFS can’t fix corruption if the volume is mounted as a non-resilient disk, and for the server the ReFS file system is on a single non-Spaces disk, since it is a physical disk pass-through in the VM configuration. I shut down the server and ran the command on the host, and it was able to fix those problems:

Friday, December 21, 2018

The server is back, but the storage pool still lists a non-existent retired disk. I decided to go and fix it, along the lines of the sketch below:
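This is a hedged sketch of the fix; the status filter and pool name are assumptions, not my exact session:

# Find the disk the pool lost contact with and remove it from the pool
$missing = Get-PhysicalDisk | Where-Object OperationalStatus -eq "Lost Communication"
Remove-PhysicalDisk -StoragePoolFriendlyName "MainPool" -PhysicalDisks $missing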

It succeeded:

The pool is fixed, but I still find corrupt files that need to be fixed. I decided to take the safe path and copy all the files to another disk over the LAN. Since I don’t have enough disk space for “Client Computer Backup” and “File History”, for those I wrote a simple program that traverses the disk and reads each file in it; this ensures that if there is corruption, it will be discovered and either fixed by the ReFS integrity mechanism or reported as an error. I could have used one of the utilities that scan files, such as Everything, WinMerge or an antivirus scan, but I wanted to make sure that I read all the file data, so I developed my own. The sketch below shows the idea.
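This is not my actual program, just a minimal PowerShell sketch of the same idea (the drive letter is illustrative):

# Read every byte of every file so that ReFS integrity streams
# get a chance to detect, repair, or report corruption.
foreach ($file in Get-ChildItem "E:\" -Recurse -File -ErrorAction SilentlyContinue) {
    try {
        $fs = [System.IO.File]::OpenRead($file.FullName)
        $buf = New-Object byte[] (1MB)
        while ($fs.Read($buf, 0, $buf.Length) -gt 0) { }   # read to the end
        $fs.Dispose()
    } catch {
        Write-Warning "Corrupt or unreadable: $($file.FullName)"
    }
}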

Saturday, December 22, 2018

Houston, we have some problems:

Both processes found problems. I fixed the program that scans the files so it ignores errors and continues, and restarted it. Then I tried to figure out the problem with the file copy: it complains about a missing file, and the file is indeed not there. I copied this file from the CrashPlan restore folder, and the copy continued. I needed to do the same for another file.

This made me wonder: should I compare the original files with the files I restored from CrashPlan?

I downloaded WinMerge, and it confirmed my suspicion:

WinMerge can copy the missing files, but I decided to use Robocopy and check the result with WinMerge. I started with one sub-folder, with an invocation along the lines of the sketch below:
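Something like this (the paths are illustrative, not my real folder names):

robocopy "D:\CrashPlanRestore\Videos" "E:\ServerFiles\Videos" /E /R:1 /W:1 /LOG:restore-copy.log

/E copies subdirectories including empty ones, /R:1 /W:1 keeps Robocopy from stalling on an unreadable file, and the log makes it easy to see what was skipped.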

The result is good, so I did it for all other restored files.

Sunday, December 23, 2018

Continuing with the ReFS fixing, I decided to balance the pool. This is another process that takes time, and it may trigger ReFS healing:
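In PowerShell, balancing is a single cmdlet (a sketch; the pool’s friendly name is an assumption):

# Rebalance the pool's data across all its physical disks
Optimize-StoragePool -FriendlyName "MainPool"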

And my code that scans the files continues:

As you can see, it continues to find problems, and ReFS fixes them.

I did see some files that it could not fix, and trying to copy them resulted in this error:

When I copied the file again it succeeded, which shows that ReFS eventually fixed the problem. I think I am going to spend a couple more days recovering the file system before I bring back the server. Then I will try to fix the client computer backups and the file history.

Monday, December 24, 2018

There are some corruptions that ReFS can’t fix; for those I copy over the restored files. I am starting to think that I will need to create a new virtual disk and copy the files over, just to make sure that the file system has no corruption.

I decided to try to recreate the pool inside the Windows Server VM. To do so, I disabled the Storage Spaces driver: sc config spaceport start= disabled

I knew there was a risk that Server 2016 might change the storage space in a way that Windows 10 – the host – would have trouble using later. When the server started, it couldn’t recognize the storage space. All the disks were there, but each one was on its own. I decided to shut down and go back to my regular configuration.

Tuesday, December 25, 2018

I finished going over the Client Computer Backup files; 3 of them could not be restored and were not even found among the files that ReclaiMe restored. I decided to give it a try and brought the server back. I ran the Computer Backup Recovery process on the server, and it failed.

Going back to the host, I found this:

I continued to go through other files and found that I have too many corruptions that cannot be fixed:

Even when the event log shows that ReFS was able to fix the problem, the file is still corrupted. Sometimes the file is completely gone – deleted by the system with no warning – and sometimes it can’t be repaired. Maybe this forum post is true:

https://social.technet.microsoft.com/Forums/windowsserver/en-US/cd28095d-e421-4538-9a9f-a15260e79a75/refs-test-with-corrupt-data-does-it-work?forum=winserverfiles

At least for old versions of ReFS on Storage Spaces.

I decided that I must extract everything I can from the current ill file system and create a new one. The main problem is that I have no more hard disk space to copy it all. I already have the data that ReclaiMe restored and the data that I restored from CrashPlan – a total of 10TB. Now I need another 5-6TB for the copy of the original server file system – at least for the non-corrupted files – and I am not going to copy the Client Computer Backup. It is going to take another day or two until I finish copying everything. The final state will be three copies: the data restored by ReclaiMe, the data restored from CrashPlan, and the data copied from the original server – at least the part that is in good shape. Once I have the new storage space – or maybe, this time, more than one storage space – I will copy everything back. I’ll start with the files that came from CrashPlan – I trust the backup to be authentic; then the files from the original server – only those that are not in the CrashPlan backup; and then the files from ReclaiMe – those that don’t yet exist, the ones that got lost because of the corruption.

I might lose some files, but those are not as important; the important files are all backed up in CrashPlan.

I also decided to create a new host for the VM: I will install a Windows Server 2019 host and have the same Server 2016 VM hosted on the 2019 Hyper-V. I will also move from pass-through physical disks to virtual disks. I’ll create several VHDX virtual drives on top of ReFS, and inside the Server 2016 VM I’ll create NTFS file systems instead of ReFS. This provides much richer file system capabilities; for example, I will be able to use Azure Backup, which can’t be used with ReFS. Moving from a pass-through configuration to a VHDX-based one also enables VM checkpoint capabilities. The sketch below shows the idea.
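Roughly like this (a hedged sketch; the path, size and VM name are illustrative):

# On the 2019 host: create a dynamic VHDX on the ReFS volume
New-VHD -Path "D:\VMDisks\Data.vhdx" -SizeBytes 10TB -Dynamic
# Attach it to the Server 2016 guest (to be formatted as NTFS inside the VM)
Add-VMHardDiskDrive -VMName "Server2016E" -ControllerType SCSI -Path "D:\VMDisks\Data.vhdx"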

I will not upgrade Server 2016 itself, because in Windows Server 2019 Microsoft removed the Server Essentials Experience capability. Although I found this, it is not supported, and I will stick with Server 2016 for now.

Wednesday, December 26, 2018 – Thursday, December 27, 2018

I didn’t do much, just continued to copy the original files to a new disk using Robocopy. Some of the files are corrupted. At the beginning, for each corrupted file, I manually copied a good version from either the CrashPlan restore or the ReclaiMe restore. When I found out that there were too many corruptions, I decided to skip those files and later merge them from the restored folders, as I described before.

Friday, December 28, 2018

The copy continues; I think it will be over by the evening.

I created a list of the original server files, to be able to compare the restored version against it. The list is not complete, since ReFS deleted many corrupted files, but this is what I’ve got to work with.

A Week Later, January 8, 2019

After more than a month, I have a working server. I didn’t lose much: mainly I lost the computer backups, and I will soon finish backing up all the machines again. File History was restarted for the client machines; I can still search for old files, but the File History user interface shows only files from the last couple of days.

I reformatted all the hard disks and rearranged them in the server case. I found another disk with some problems and decided to take it out.

My current setup is much better than before. The host system is Windows Server 2019 instead of Windows 10. On the Windows Server 2019 host I created 3 storage pools, separating the computer backups from the main file system. Each storage pool’s virtual disk is formatted as ReFS. I created a VHDX virtual disk in each storage pool and formatted its file system as NTFS. This allows, for example, Azure Backup, which is not supported on ReFS. NTFS forced a limit of 16TB on one disk, but even the computer backup doesn’t need such a big disk. Since all drives are based on virtual disks, I can take a VM checkpoint, something that cannot be done with a pass-through physical disk. Windows Server 2019 handles the pools and the ReFS; the disks are not offline as they used to be in my previous configuration, hence corruption prevention is always on.

To make sure that I got all my files, I ran Karen’s Directory Printer on the new server file system. I then compared the two big file lists and found out which files were missing or modified. I found several of them among the files that ReclaiMe restored. Using MD5 file hashes, I could find files that had been changed; there were about 2000 such files. When I investigated their content, I found that the files copied from the original folder after applying the fix from Microsoft were corrupted, while the files that came from CrashPlan and from ReclaiMe were in good shape. I had to restore only two files from my OneDrive backup; these files were added to the server just before the failure occurred, and they did not exist yet on CrashPlan. A comparison along these lines can be scripted, as in the sketch below.
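This is not my exact procedure, just a minimal PowerShell sketch of the hash comparison (the paths are illustrative):

# Index the restored tree by relative path -> MD5 hash
$restoredRoot = "D:\Restored"
$restored = @{}
Get-ChildItem $restoredRoot -Recurse -File | ForEach-Object {
    $rel = $_.FullName.Substring($restoredRoot.Length)
    $restored[$rel] = (Get-FileHash $_.FullName -Algorithm MD5).Hash
}
# Walk the server copy and report files whose content differs
$serverRoot = "E:\ServerFiles"
Get-ChildItem $serverRoot -Recurse -File | ForEach-Object {
    $rel = $_.FullName.Substring($serverRoot.Length)
    if ($restored.ContainsKey($rel) -and
        $restored[$rel] -ne (Get-FileHash $_.FullName -Algorithm MD5).Hash) {
        Write-Output "Modified: $rel"
    }
}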

I also installed a disk monitor utility on the host machine:

Conclusions:

  1. Back up, back up, back up – everything, even files that come from the web. If you need them, back them up; it will save you a lot of time.
  2. Don’t create storage pools that are too large. Create two or more small pools – smaller troubles, faster recovery time.
  3. If there is a disk-related hardware problem, even if it is just a warning, fix it as fast as possible.
  4. Run disk monitoring software and replace disks as soon as there is a warning or error.
  5. Replace old disks. If they are out of warranty, just do it. Some disks can run for a long time; however, each year hard disks become cheaper and larger, so when you replace an old disk, you get a better and bigger one.
  6. Be careful when connecting a disk to a storage pool via USB; prefer a certified JBOD enclosure.
  7. On a Windows client operating system such as Windows 10, there is a good chance that Windows Update will kick in and reboot the system while you are in the middle of a restoration process. Suspend Windows Update if you can.
  8. There are good people in the world who really like to help – at Microsoft, at Code42 CrashPlan and at ReclaiMe.
  9. Be patient; recovery of a large storage pool takes time.
  10. Don’t use an unsupported setup, i.e. running Windows Server 2016 in a VM hosted on Windows 10, where the storage pool is created in Windows 10 and passed through as a physical disk to the Windows Server virtual machine…

Thanks:

I’d like to thank Yaron Bental, who was so kind as to technically review this article and provide his feedback and fixes.
