I have a lot of tar and disk image backups, as well as raw photos, that I want to squeeze onto a hard drive for long-term offline archival. Since I want to make the most of the drive’s capacity, I plan to compress everything at the highest ratio supported by standard tools. I’ve zeroed out the free space in my disk images, so a compressed full image should only take up about as much space as the files actually on it, and in my experience raw photos can shrink by a third or even half at maximum compression (I assume that’s lossless, since file-level compression regenerates the original file in its entirety on extraction?)
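
To make the “lossless” part concrete, here is the kind of round-trip check I have in mind, using Python’s stdlib lzma module at its highest preset (preset 9 plus the extreme flag corresponds to xz -9e; the file names are just placeholders):

```python
import hashlib
import lzma
import shutil

def sha256(path: str) -> str:
    """Hash a file in chunks so large images don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

src = "photo_0001.raw"     # placeholder input file
arc = src + ".xz"          # compressed archive
rec = src + ".roundtrip"   # recovered copy for comparison

# Compress at the highest setting; lzma.open writes a .xz container,
# which carries its own integrity check (CRC64 by default).
with open(src, "rb") as fin, lzma.open(arc, "wb", preset=9 | lzma.PRESET_EXTREME) as fout:
    shutil.copyfileobj(fin, fout)

# Decompress again and confirm the bytes are identical to the original.
with lzma.open(arc, "rb") as fin, open(rec, "wb") as fout:
    shutil.copyfileobj(fin, fout)

assert sha256(src) == sha256(rec), "round trip did not reproduce the original"
print("lossless round trip OK")
```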

I’ve heard horror stories of compressed files being rendered completely unextractable by a single corrupted bit, but I don’t know how much of a risk that still is in 2025. Since I plan to leave the hard drive unplugged for long periods, I want the best possible chance of recovery if something does go wrong.

I also want the files to be extractable with nothing beyond the standard tools on a Linux/Unix system, because this is my disaster recovery plan: when my server dies, I want to be able to work with the archive from a Linux live image without installing any extra packages. Hence I’m only looking at gz, xz, or bz2.

So out of the three, which is generally considered the most stable and corruption-resistant when the compression ratio is turned all the way up? Can any of them recover from a bit flip, or at the very least detect with certainty whether the data is corrupted when extracting? Additionally, should I be generating separate checksum files for the original data, or do the compressed formats include checksumming themselves?
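
For reference, this is roughly the belt-and-braces approach I’m leaning towards regardless of the answer: keep a separate SHA-256 manifest alongside the archives, and do a full streaming read of each archive so that whatever integrity check the format itself embeds gets exercised. The mount point and manifest name below are placeholders:

```python
import bz2
import gzip
import hashlib
import lzma
import zlib
from pathlib import Path

OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(archive_dir: Path, manifest: Path) -> None:
    """Record a SHA-256 line for every compressed archive (sha256sum-compatible)."""
    with manifest.open("w") as out:
        for path in sorted(archive_dir.glob("*")):
            if path.suffix in OPENERS:
                out.write(f"{sha256(path)}  {path.name}\n")

def verify_archive(path: Path) -> bool:
    """Stream-decompress to nowhere; the format's embedded check is verified as we read."""
    opener = OPENERS[path.suffix]
    try:
        with opener(path, "rb") as f:
            while f.read(1 << 20):
                pass
        return True
    except (OSError, EOFError, zlib.error, lzma.LZMAError) as exc:
        print(f"{path.name}: corruption detected ({exc})")
        return False

if __name__ == "__main__":
    backups = Path("/mnt/archive")   # placeholder mount point
    write_manifest(backups, backups / "SHA256SUMS")
    results = [verify_archive(p) for p in sorted(backups.glob("*")) if p.suffix in OPENERS]
    print("all archives verified" if all(results) else "at least one archive failed verification")
```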

  • just_another_person@lemmy.world · 4 days ago

    Compressed files are just as susceptible to bitrot as any other file. The filesystem is where you want to start if you’re talking about archival: checksumming filesystems like BTRFS or ZFS, properly configured (ideally with some redundancy to heal from), can detect corruption and repair it before it eats your archives.

    That being said, if you store something on a medium and then don’t use it (lock it in a safe or whatever), the chance that you end up with corrupted files is very low. Bitrot and general file corruption mostly happen as the bits on a disk are shifted around, so by not using that disk you make it far less likely.

    • TerHu@lemmy.dbzer0.com · 4 days ago

      afaik that depends on the type of medium, with SSDs being more susceptible to rot than HDDs (and never use USB sticks). now this is just my guess, but i’d think that ZFS with frequent automatic checks and such will keep your data safer than an unplugged HDD

    • Blue_Morpho@lemmy.world · 4 days ago

      Bitrot happens even when sitting around. Magnetic domains flip. SSD cells leak electrons.

      Reading and rewriting with an ECC system is the only way to prevent bit rot. It’s particularly critical for SSDs.
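
      Short of a filesystem that scrubs for you, the closest thing with a cold drive is a manual scrub: plug it in once in a while, re-read everything, and compare it against checksums taken when the archive was made. That only detects rot rather than repairing it, but at least you know when to restore from a second copy. A rough sketch, assuming a sha256sum-style SHA256SUMS manifest sits next to the archives:

      ```python
      import hashlib
      from pathlib import Path

      def sha256(path: Path) -> str:
          h = hashlib.sha256()
          with path.open("rb") as f:
              for chunk in iter(lambda: f.read(1 << 20), b""):
                  h.update(chunk)
          return h.hexdigest()

      def scrub(archive_dir: Path) -> None:
          """Re-read every archived file and compare it to the stored manifest."""
          manifest = archive_dir / "SHA256SUMS"   # assumed manifest name and format
          for line in manifest.read_text().splitlines():
              expected, name = line.split(maxsplit=1)
              ok = sha256(archive_dir / name) == expected
              print(f"{name}: {'OK' if ok else 'MISMATCH, restore from another copy'}")

      scrub(Path("/mnt/archive"))                 # placeholder mount point
      ```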