ROSE Btrfs - Recovering from a Failed Disk

Hi,

I’m currently evaluating ROSE’s Btrfs implementation under CHR running on Hyper-V, with the aim of eventually purchasing an RDS once I’m satisfied with the general usability. As part of this testing I wanted to deliberately remove a disk to degrade the Btrfs RAID and see how reliable recovery is. I’ve hit a bit of a brick wall, and I’m not sure whether this is a bug/missing feature in RouterOS or whether I’m missing something.

So far I’ve created a Btrfs filesystem with 10 virtual disks in a Btrfs RAID 10 layout and added some files through an SMB share; this works fine. I then removed one of the disks (from within Hyper-V) and rebooted CHR, and this is where the issues started.

Originally I was running RouterOS 7.18. With that version the Btrfs filesystem simply wouldn’t mount, and attempting any of the replace-device/remove-device/add-device commands threw an error saying the filesystem wasn’t mounted. I then updated to 7.19rc2 based on the release note “rose-storage - added degraded Btrfs mount option (CLI only);”. I’m still unable to find this option on the CLI (is it not in the autocomplete yet? Is there any documentation for hidden commands?), but with the new version the Btrfs filesystem does come up and work as expected, and it now shows the “MISSING DEVS” flag, which is what I’d expect.
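For comparison, on a mainline Linux box this situation is handled with btrfs’s `degraded` mount option, so presumably that’s what the new rose-storage option wraps. A rough sketch with btrfs-progs (the device path and mount point below are illustrative, not RouterOS syntax):

```shell
# Mount a btrfs filesystem that is missing a member device.
# Any surviving member can be named on the command line.
mount -o degraded /dev/sdb /mnt/btrfsraid

# List the member devices, including the missing one and its devid
btrfs filesystem show /mnt/btrfsraid
```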

At this point I now have an operational but degraded filesystem, as follows:

[admin@MikroTik] /disk/btrfs/filesystem> print
Flags: I - MISSING-DEVS
Columns: LABEL, DEV-IDS, DEVS, DEFAULT-SUBVOLUME, SPACES
#   LABEL      DEV-IDS  DEVS     DEFAULT-SUBVOLUME  SPACES                                                                                                                                            
0 I BtrfsRAID        1  sata1    <FS_ROOT>          sata1:10.7GB, used:88%                                                                                                                            
                     2  sata2                       sata2:10.7GB, used:86%                                                                                                                            
                     5  sata5                       sata5:21.5GB, used:74%                                                                                                                            
                     6  missing                     sata7:21.5GB, used:74%                                                                                                                            
                     7  sata7                       sata8:21.5GB, used:70%                                                                                                                            
                     8  sata8                       sata9:21.5GB, used:75%                                                                                                                            
                     9  sata9                       sata10:21.5GB, used:79%                                                                                                                           
                    10  sata10                      sata3:10.7GB, used:88%                                                                                                                            
                    11  sata3                       sata4:10.7GB, used:84%                                                                                                                            
                    12  sata4                       data,raid10:58.3GB sata1:35.0GB sata2:34.8GB sata5:56.2GB sata7:56.2GB sata8:53.4GB sata9:54.6GB sata10:58.3GB sata3:38.1GB sata4:36.2GB, used:99%
                                                    system,raid10:67.1MB sata1:67.1MB sata2:67.1MB sata5:67.1MB sata7:67.1MB sata8:67.1MB sata9:67.1MB sata10:67.1MB, used:0%                         
                                                    meta,raid10:671MB sata1:671MB sata2:671MB sata5:671MB sata7:671MB sata8:671MB sata9:671MB sata10:671MB, used:10%                                  
                                                    global-reserve:64.4MB, used:0%

As you can see here, the disk I removed (sata6) is now showing as “missing”, but this is where I’m stumped: I can’t find any way to replace it with a new disk.

Things I’ve tried:

Running “/disk/btrfs/filesystem/replace-device BtrfsRAID device-to-remove-id=6 device-to-add=sata6” with the replacement disk completely wiped. This results in the replace status becoming “BTRFS_IOC_DEV_REPLACE failed: Not a tty”.

Running “/disk/btrfs/filesystem/replace-device BtrfsRAID device-to-remove-id=6 device-to-add=sata6” with the replacement disk formatted as Btrfs. Before running the replace command the filesystems are as follows:

[admin@MikroTik] /disk/btrfs/filesystem> print
Flags: I - MISSING-DEVS
Columns: LABEL, DEV-IDS, DEVS, DEFAULT-SUBVOLUME, SPACES
#   LABEL      DEV-IDS  DEVS     DEFAULT-SUBVOLUME  SPACES                                                                                                                                            
0 I BtrfsRAID        1  sata1    <FS_ROOT>          sata1:10.7GB, used:88%                                                                                                                            
                     2  sata2                       sata2:10.7GB, used:86%                                                                                                                            
                     5  sata5                       sata5:21.5GB, used:74%                                                                                                                            
                     6  missing                     sata7:21.5GB, used:74%                                                                                                                            
                     7  sata7                       sata8:21.5GB, used:70%                                                                                                                            
                     8  sata8                       sata9:21.5GB, used:75%                                                                                                                            
                     9  sata9                       sata10:21.5GB, used:79%                                                                                                                           
                    10  sata10                      sata3:10.7GB, used:88%                                                                                                                            
                    11  sata3                       sata4:10.7GB, used:84%                                                                                                                            
                    12  sata4                       data,raid10:58.3GB sata1:35.0GB sata2:34.8GB sata5:56.2GB sata7:56.2GB sata8:53.4GB sata9:54.6GB sata10:58.3GB sata3:38.1GB sata4:36.2GB, used:99%
                                                    system,raid10:67.1MB sata1:67.1MB sata2:67.1MB sata5:67.1MB sata7:67.1MB sata8:67.1MB sata9:67.1MB sata10:67.1MB, used:0%                         
                                                    meta,raid10:671MB sata1:671MB sata2:671MB sata5:671MB sata7:671MB sata8:671MB sata9:671MB sata10:671MB, used:10%                                  
                                                    global-reserve:64.4MB, used:0%                                                                                                                    
1   sata6-fs         1  sata6    *FFFFFFFF          sata6:21.5GB, used:2%                                                                                                                             
                                                    data,single:8.39MB sata6:8.39MB, used:0%                                                                                                          
                                                    system,dup:8.39MB sata6:16.8MB, used:0%                                                                                                           
                                                    meta,dup:268MB sata6:537MB, used:0%                                                                                                               
                                                    global-reserve:3.41MB, used:0%

After running the command I get this:

[admin@MikroTik] /disk/btrfs/filesystem> print
Flags: I - MISSING-DEVS
Columns: LABEL, DEV-IDS, DEVS, DEFAULT-SUBVOLUME, SPACES
#   LABEL      DEV-IDS  DEVS     DEFAULT-SUBVOLUME  SPACES                                                                                                                                            
0 I BtrfsRAID        1  sata1    <FS_ROOT>          sata1:10.7GB, used:88%                                                                                                                            
                     2  sata2                       sata2:10.7GB, used:86%                                                                                                                            
                     5  sata5                       sata5:21.5GB, used:74%                                                                                                                            
                     6  missing                     sata7:21.5GB, used:74%                                                                                                                            
                     7  sata7                       sata8:21.5GB, used:70%                                                                                                                            
                     8  sata8                       sata9:21.5GB, used:75%                                                                                                                            
                     9  sata9                       sata10:21.5GB, used:79%                                                                                                                           
                    10  sata10                      sata3:10.7GB, used:88%                                                                                                                            
                    11  sata3                       sata4:10.7GB, used:84%                                                                                                                            
                    12  sata4                       data,raid10:58.3GB sata1:35.0GB sata2:34.8GB sata5:56.2GB sata7:56.2GB sata8:53.4GB sata9:54.6GB sata10:58.3GB sata3:38.1GB sata4:36.2GB, used:99%
                                                    system,raid10:67.1MB sata1:67.1MB sata2:67.1MB sata5:67.1MB sata7:67.1MB sata8:67.1MB sata9:67.1MB sata10:67.1MB, used:0%                         
                                                    meta,raid10:671MB sata1:671MB sata2:671MB sata5:671MB sata7:671MB sata8:671MB sata9:671MB sata10:671MB, used:10%                                  
                                                    global-reserve:64.4MB, used:0%                                                                                                                    
1   sata6-fs         1  sata6    <FS_ROOT>          sata6:21.5GB, used:2%                                                                                                                             
                                                    data,single:8.39MB sata6:8.39MB, used:0%                                                                                                          
                                                    system,dup:8.39MB sata6:16.8MB, used:0%                                                                                                           
                                                    meta,dup:268MB sata6:537MB, used:0%                                                                                                               
                                                    global-reserve:3.41MB, used:0%

It looks like the replace command has initialised something on sata6 (the default subvolume changed from *FFFFFFFF to <FS_ROOT>) but hasn’t actually added it to BtrfsRAID?

I then tried running “/disk/btrfs/filesystem/add-device BtrfsRAID device=sata6”. This added sata6 to the filesystem and increased the capacity by 10GB; however, the missing disk still shows up, so this just adds a disk rather than replacing the missing one.

[admin@MikroTik] /disk/btrfs/filesystem> print
Flags: I - MISSING-DEVS
Columns: LABEL, DEV-IDS, DEVS, DEFAULT-SUBVOLUME, SPACES
#   LABEL      DEV-IDS  DEVS     DEFAULT-SUBVOLUME  SPACES                                                                                                                                            
0 I BtrfsRAID        1  sata1    <FS_ROOT>          sata1:10.7GB, used:88%                                                                                                                            
                     2  sata2                       sata2:10.7GB, used:86%                                                                                                                            
                     5  sata5                       sata5:21.5GB, used:74%                                                                                                                            
                     6  missing                     sata7:21.5GB, used:74%                                                                                                                            
                     7  sata7                       sata8:21.5GB, used:70%                                                                                                                            
                     8  sata8                       sata9:21.5GB, used:75%                                                                                                                            
                     9  sata9                       sata10:21.5GB, used:79%                                                                                                                           
                    10  sata10                      sata3:10.7GB, used:88%                                                                                                                            
                    11  sata3                       sata4:10.7GB, used:84%                                                                                                                            
                    12  sata4                       sata6:21.5GB, used:0%                                                                                                                             
                    13  sata6                       data,raid10:58.3GB sata1:35.0GB sata2:34.8GB sata5:56.2GB sata7:56.2GB sata8:53.4GB sata9:54.6GB sata10:58.3GB sata3:38.1GB sata4:36.2GB, used:99%
                                                    system,raid10:67.1MB sata1:67.1MB sata2:67.1MB sata5:67.1MB sata7:67.1MB sata8:67.1MB sata9:67.1MB sata10:67.1MB, used:0%                         
                                                    meta,raid10:671MB sata1:671MB sata2:671MB sata5:671MB sata7:671MB sata8:671MB sata9:671MB sata10:671MB, used:10%                                  
                                                    global-reserve:64.4MB, used:0%
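For what it’s worth, on mainline btrfs-progs the state above (replacement added as a new devid while devid 6 is still missing) can be unwound with `btrfs device remove missing`, which re-replicates the missing device’s chunks onto the remaining members; I can’t find a RouterOS equivalent. A sketch, with an illustrative mount point:

```shell
# Drop the missing member; btrfs rebuilds its chunk copies onto
# the remaining devices (the newly added disk included).
btrfs device remove missing /mnt/btrfsraid

# Optionally rebalance data block groups that are under 50% full
# so allocation is spread evenly afterwards
btrfs balance start -dusage=50 /mnt/btrfsraid
```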

Finally, running “/disk/btrfs/filesystem/replace-device BtrfsRAID device-to-remove-id=6 device-to-add=sata6” in this state once again gives “BTRFS_IOC_DEV_REPLACE failed: Not a tty”.

So at this point I’m basically at a loss and just stabbing in the dark trying to get it to work! I’d be very appreciative of any advice or more in-depth documentation on the Btrfs implementation. Thankfully I’m only testing this in a lab environment (and I’m glad I did before ordering an RDS!), but this seems like a pretty critical issue for anyone using the Btrfs implementation in production.

Thanks,
Cameron

Just a quick additional update - I was able to replace the disk by exporting all the disks over NVMe-over-TCP, attaching them to a Debian VM with btrfs-progs installed, running the Btrfs replace commands under Debian, and then going back into RouterOS and disabling NVMe-over-TCP, which brought the Btrfs filesystem back up with the disk correctly replaced. The array then rebalanced without issue.
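In case it helps anyone, the Debian side of the workaround looked roughly like this (the NQN, address, and device paths below are illustrative, not my actual values):

```shell
# Attach the disks exported from RouterOS over NVMe-over-TCP
nvme connect -t tcp -a 192.0.2.10 -s 4420 -n nqn.2025-04.example:btrfsraid

# Mount the array degraded via any surviving member
mount -o degraded /dev/nvme1n1 /mnt

# Replace missing devid 6 with the blank disk, then watch progress
btrfs replace start 6 /dev/nvme2n1 /mnt
btrfs replace status /mnt

# Detach cleanly before handing the disks back to RouterOS
umount /mnt
nvme disconnect-all
```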

Of course this is not an acceptable solution, but I wanted to share it to show that (a) my test setup was definitely recoverable, and (b) in case anyone using an RDS experiences a disk failure and stumbles across this thread before a proper solution can be found!