Which of the 3 standard compression algorithms on Unix (gz, xz, or bz2) is best for long term data archival at their highest compression?

HiddenLayer555@lemmy.ml · 2 months ago

DasFaultier@sh.itjust.works · 2 months ago

“And I’m guessing from your username that you can speak German, haha.” Yep, that’s right. :D

I’ll stick with English anyway so that it’s understood in this English thread.

ENGLISH: Yeah, you’re right, I wasn’t particularly on-topic there. :D I tried to address your underlying assumptions as well as the actual file format question, and it kinda derailed from there.

Sooo, file format… I think you’re restricting yourself too much if you only consider the formats that come preinstalled on most Unix systems. Also, you have conflicting goals there: compression (make the most of your storage) vs. resilience (have a format that is stable in the long term). Someone here recommended lzip, which is definitely a good answer for compression ratio. The Wikipedia article I linked features a table that compares compressed archive formats, so that might be a good starting point for finding resilient formats. Look out for formats with at least an Integrity Check and possibly a Recovery Record, as these seem to be more important than compression ratio. Once you have settled on a format, run some tests to find the best compression algorithm for your material. You might also want to measure throughput/time while you’re at it to find variants that offer a reasonable compromise between compression and performance. If you’re so inclined, try reading a few format specs to find suitable candidates.
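To make the "run some tests" part a bit more concrete, here is a minimal sketch in Python that compares the three formats from the question on your own data. It only uses the standard library modules `gzip`, `bz2` and `lzma`; the `benchmark` helper and the sample data are made up for illustration, and lzip is not covered because the standard library has no module for it.

```python
import bz2
import gzip
import lzma
import time

# Highest compression setting for each standard-library codec.
CODECS = {
    "gz": lambda data: gzip.compress(data, compresslevel=9),
    "bz2": lambda data: bz2.compress(data, compresslevel=9),
    "xz": lambda data: lzma.compress(data, preset=9 | lzma.PRESET_EXTREME),
}


def benchmark(data: bytes) -> dict:
    """Compress `data` with every codec and record size, ratio and time."""
    if not data:
        raise ValueError("need non-empty input to compute a ratio")
    results = {}
    for name, compress in CODECS.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        results[name] = {
            "size": len(packed),
            "ratio": len(packed) / len(data),  # smaller is better
            "seconds": elapsed,
        }
    return results


if __name__ == "__main__":
    # Placeholder input; replace with bytes read from your real material.
    sample = b"replace this with a representative file from your archive\n" * 5000
    for name, stats in benchmark(sample).items():
        print(f"{name}: {stats['size']} bytes, "
              f"ratio {stats['ratio']:.4f}, {stats['seconds']:.3f} s")
```

Results depend heavily on the input, so it is worth running this on a representative sample of what you actually want to archive rather than trusting generic rankings.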

You’re generally looking for formats that:

are in widespread use
are specified/standardized publicly
are of low complexity
don’t have features like DRM/Encryption/anti-copy
are self-documenting
are robust
don’t have external dependencies (e.g. for other file formats)
are free of any restrictive licensing/patents
can be validated.
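As a rough illustration of the "can be validated" point, here is a small Python sketch using the standard `lzma` module. It writes an `.xz` stream with a SHA-256 integrity check embedded in the container and additionally keeps a SHA-256 digest of the original data on the side. The helper names `pack` and `is_intact` are hypothetical, and this is just one way to do it, not how any particular archive does it.

```python
import hashlib
import lzma


def pack(data: bytes) -> tuple[bytes, str]:
    """Compress to .xz with an embedded SHA-256 check; also return a digest
    of the uncompressed data to store separately from the archive."""
    blob = lzma.compress(data, format=lzma.FORMAT_XZ, check=lzma.CHECK_SHA256)
    return blob, hashlib.sha256(data).hexdigest()


def is_intact(blob: bytes, expected_sha256: str) -> bool:
    """Return True only if the stream decodes and matches the stored digest."""
    try:
        data = lzma.decompress(blob, format=lzma.FORMAT_XZ)
    except lzma.LZMAError:
        # Corrupt or truncated stream, or the embedded check failed.
        return False
    return hashlib.sha256(data).hexdigest() == expected_sha256


if __name__ == "__main__":
    blob, digest = pack(b"some document worth keeping\n" * 1000)
    print("fresh copy intact:", is_intact(blob, digest))

    damaged = bytearray(blob)
    damaged[len(damaged) // 2] ^= 0xFF  # simulate bit rot in the middle
    print("damaged copy intact:", is_intact(bytes(damaged), digest))
```

Note that this only detects damage. It does not repair anything, which is where formats or tools with a recovery record come in.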

You might want to read up on the technical details of how an actual archive handles these challenges at https://slubarchiv.slub-dresden.de/technische-standards-fuer-die-ablieferung-von-digitalen-dokumenten and in the PDF files with specifications linked there (all in German).

Ferk@lemmy.ml · 2 months ago

Just note that @RiverRabbits@lemmy.blahaj.zone wasn’t the one who opened the thread, that’s why they said they didn’t ask the question (I get the feeling there might have been some confusion here :P ).

Still, very informative comment.

RiverRabbits@lemmy.blahaj.zone · 2 months ago

Haha, yeah I’m not the OP! But the way my German was phrased here, and the way the replier interpreted it, it would read as super passive-aggressive (think “I didn’t ask that question but thanks”), and for that I apologize 😭 I just meant I’m not the OP 😌

DasFaultier@sh.itjust.works · 2 months ago

Oh yeah, there really was, thank you. :)