The Mathematics of RAID6 Revisited: RAID Storage for a Martian Future
Welcome to StreamScale: Building the Cosmic Ray Umbrella for Mars and Earth.
Patterson, Gibson, and Katz introduced RAID in 1988, defining levels 1–5, with RAID5 using a single parity block to survive one drive failure. Their focus was Earth-based systems, so the work did not account for cosmic rays or silent corruption, both of which are critical on Mars (10x the bit-flip rate, 230 mSv/year of radiation).
In 1997, James S. Plank published “A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems,” aiming to explain how Reed-Solomon codes could be applied to RAID for multiple failure tolerance. Plank suggested using a Vandermonde matrix to encode data into check symbols, computing syndromes as weighted sums of data symbols alone (e.g., for check symbols P and Q, using rows of the Vandermonde matrix). However, Plank misunderstood the role of the Vandermonde matrix: it is not for encoding but for defining the parity-check equations (the H matrix). In a correct Reed-Solomon code, a codeword c = (d₁, d₂, ..., dₖ, c₁, c₂, ..., cₘ) (data + check symbols) must satisfy Hc = 0, meaning the syndromes evaluate to zero at the polynomial roots (α⁰, α¹, ..., αᵐ⁻¹). Plank’s method produced check symbols from data alone, so the syndromes of the full codeword didn’t evaluate to zero, making error location impossible.
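The zero-syndrome condition above can be made concrete with a small sketch. It is an illustration only, not StreamScale's or Plank's code. It assumes GF(2⁸) with the field polynomial 0x11D and α = 2, and all function names are hypothetical. It builds a systematic codeword by polynomial division, confirms that the syndromes at α⁰ and α¹ are zero, and then shows how the two syndromes of a corrupted word report the position and value of a single unknown error.

```python
PRIM = 0x11D  # assumed field polynomial x^8 + x^4 + x^3 + x^2 + 1

def gf_mul(a, b):
    """Multiply two GF(2^8) elements by shift-and-xor."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM
        b >>= 1
    return r

def alpha_pow(p):
    """Return alpha^p with alpha = 2."""
    r = 1
    for _ in range(p):
        r = gf_mul(r, 2)
    return r

def generator_poly(m):
    """Coefficients of (x + a^0)(x + a^1)...(x + a^(m-1)), highest degree first."""
    g = [1]
    for j in range(m):
        root = alpha_pow(j)
        nxt = g + [0]                      # g(x) * x
        for i, c in enumerate(g):
            nxt[i + 1] ^= gf_mul(c, root)  # + g(x) * root
        g = nxt
    return g

def encode(data, m):
    """Systematic encoding: append the remainder of d(x) * x^m divided by g(x)."""
    g = generator_poly(m)
    rem = list(data) + [0] * m
    for i in range(len(data)):
        coef = rem[i]
        if coef:
            for j in range(1, len(g)):
                rem[i + j] ^= gf_mul(coef, g[j])
    return list(data) + rem[len(data):]

def syndromes(word, m):
    """Evaluate the whole word (data + check) at a^0 .. a^(m-1) by Horner's rule."""
    out = []
    for j in range(m):
        root, acc = alpha_pow(j), 0
        for sym in word:
            acc = gf_mul(acc, root) ^ sym
        out.append(acc)
    return out

def locate_single_error(word):
    """With two check symbols: S0 is the error value, S1 / S0 gives its position."""
    s0, s1 = syndromes(word, 2)
    if s0 == 0 and s1 == 0:
        return None                        # clean codeword
    n = len(word)
    for idx in range(n):
        if gf_mul(s0, alpha_pow(n - 1 - idx)) == s1:
            return idx, s0
    return None                            # not a single-symbol error

codeword = encode([1, 2, 3, 4], 2)
damaged = list(codeword)
damaged[2] ^= 0x55                         # silent corruption at index 2
print(syndromes(codeword, 2), locate_single_error(damaged))
```

Because the check symbols are the remainder of a division by g(x), the full codeword is a multiple of g(x) and therefore evaluates to zero at every root, which is what lets a non-zero syndrome pattern identify the error.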
Relevance to Mars: Plank’s flawed approach couldn’t reliably locate unknown errors, a fatal flaw on Mars where cosmic ray-induced bit flips (10x rate) and silent corruption require precise error detection and correction for life-critical data (e.g., oxygen levels).
In 2003, Plank and Ying Ding published a correction, “Note: Correction to the 1997 Tutorial on Reed-Solomon Coding,” addressing errors in the original paper (e.g., incorrect matrix operations in GF(2^w)). However, even with the correction, Plank’s approach still used an inverted and normalized Vandermonde matrix for encoding and computed check symbols from data alone, failing to produce codewords that evaluate to zero at the polynomial roots. The syndromes of the full codeword (data + check) were non-zero, so the code remained ineffective for unknown error location.
Relevance to Mars: The corrected paper still fails to address a critical flaw: without zero syndromes, unknown error positions remain undetectable, risking catastrophic data loss on Mars, where cosmic rays and DRAM errors exert far greater stress than on Earth. For instance, in 2013, just seven months into its mission, the Mars Curiosity rover suffered its most significant malfunction up to that point, likely due to high-energy solar and cosmic ray strikes. A National Geographic report, citing project manager Richard Cook, put it this way: “On other space missions, similar problems were caused by high-energy solar and cosmic ray strikes. He said that’s probably what happened this time.” This underscores the urgent need for robust error correction, like proper Reed-Solomon codewords with zero syndromes, to ensure data integrity for Mars missions facing such harsh conditions.
Peter Anvin’s “The Mathematics of RAID-6” (2004–2011) built on Plank’s work, citing Plank's 1997 paper as a reference. Anvin adopted the same flawed assumption that the Vandermonde matrix is used for encoding, defining P as the parity of data symbols only (P = d₁ ⊕ d₂ ⊕ ... ⊕ dₖ) and Q as a weighted sum of data, with weights that are successive powers of the generator g = 2 in GF(2⁸) (Q = g⁰·d₁ ⊕ g¹·d₂ ⊕ ... ⊕ gᵏ⁻¹·dₖ). Anvin also misunderstood syndromes, thinking they were computed over data alone, not the entire codeword. This ordering (P then Q) meant P didn’t include Q, so the codeword didn’t satisfy the parity-check equations (Hc ≠ 0). The syndromes didn’t evaluate to zero, making unknown error correction impossible.
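The data-only P and Q computation described above can be sketched as follows. This is a minimal illustration of the distinction the paragraph draws, not a reproduction of Anvin's paper or of any RAID implementation. It assumes GF(2⁸) with field polynomial 0x11D, Q weights that are successive powers of g = 2, and a stripe laid out as data followed by P and then Q; the function names are hypothetical. It then evaluates the whole stripe, check symbols included, at α⁰ = 1 and α¹ = 2.

```python
PRIM = 0x11D  # assumed field polynomial for GF(2^8)

def gf_mul(a, b):
    """Multiply two GF(2^8) elements by shift-and-xor."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM
        b >>= 1
    return r

def pq_from_data(data):
    """P is the XOR of the data; Q is the data weighted by 1, 2, 4, ... (powers of g = 2)."""
    p, q, weight = 0, 0, 1
    for d in data:
        p ^= d
        q ^= gf_mul(weight, d)
        weight = gf_mul(weight, 2)
    return p, q

def evaluate(word, root):
    """Treat the word as a polynomial (first symbol = highest degree) and evaluate it."""
    acc = 0
    for sym in word:
        acc = gf_mul(acc, root) ^ sym
    return acc

data = [1, 2, 3, 4]
p, q = pq_from_data(data)
stripe = data + [p, q]          # data, then P, then Q
s0 = evaluate(stripe, 1)        # XOR of every symbol in the stripe
s1 = evaluate(stripe, 2)
print(p, q, s0, s1)
```

In this layout the data symbols and P cancel in the evaluation at α⁰, which leaves Q itself rather than zero. That is the sense in which P "doesn't include Q".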
Relevance to Mars: Tucker’s claim that customer-identified errors in his work were “fundamental to the math” is incorrect. His proposed fix—limiting the size of the Vandermonde matrix for encoding—might have reduced the errors customers encountered, but it failed to address the core issue: large codewords didn't scale and couldn’t “self-report” error locations and values, a capability standard Reed-Solomon (RS) codewords have provided for decades. For instance, in 1977, RS codes were integral to the Voyager Program, ensuring robust error correction for data transmission across vast distances—a necessity for space missions. Yet, modern large-scale storage systems like Hadoop, Ceph, and Swift still rely on weaker, non-scalable “erasure codes” that lack RS’s superior error detection and correction. On Mars, where reliable data storage is critical for rovers and habitats facing harsh conditions, it’s imperative that modern storage systems adopt proper RS codewords to ensure data integrity and mission success.
The StreamScale Solution for Mars
Unlike previous work from Plank, Anvin, and Tucker, StreamScale's patented technology scales correctly for any combination of data and check symbols up to 255 in total. The patent shown below teaches how to use a Parallel LFSR sequencer to encode data and a Parallel Syndrome sequencer to decode codewords, and it requires fewer than half the instructions of Intel ISA-L to achieve a better result. This patented approach not only enhances performance but also enables seamless integration into existing storage ecosystems, making it a versatile solution for both Earth and Mars. By extending storage system software like Hadoop, Ceph, and Swift with real ECC, as the patent below teaches, we could expose the "Dark Matter" of error correction.
The ‘Dark Matter’ of error correction refers to hidden error data that, when logged and analyzed, reveals patterns of failure. Using that Dark Matter information, and the scalable features of StreamScale's patented technology, both storage and computing on Mars would be more efficient, robust and correct. Even in the presence of heavy "showers" of Cosmic Rays, the StreamScale solution provides an "umbrella" of protection for all the datasets essential to establish and maintain a colony on Mars. It protects not only "data at rest", but also the entire communications path between the storage and the application CPU, including DRAM errors, even without using hardware ECC protected DRAM. This technology ensures that critical data—like life support system readings or communication logs—remains intact, safeguarding the lives of Martian colonists 140 million miles from Earth.
Extremely Resilient Storage: ECC Upgrade for Mars and Earth
Leveraging US Patent 11848686, "Polynomial Encoding System and Method", StreamScale Storage for Mars corrects these errors and enhances data and application servers with robust error correction and “Dark Matter” logging. Storage system software like Hadoop, Ceph, or Swift can capture precise error details—e.g., “Error at byte 42, value 3, Mars Station 3, 3/22/25 14:03, Disk serial #1234567, block 246810.” This is vital on Mars, where cosmic rays target DRAM in drives, HBAs and servers, risking data corruption. Beyond space, it boosts Earth-based system reliability, offering a resilient upgrade for extreme environments.
Global Error Heat Map:
Servers equipped with patent US-11848686-B2 report errors to a cloud-based system, creating a real-time heat map of radiation spikes or hardware flaws. For instance: “Sudden increase in detected errors, Mars Station 2, 3/23/25 12:02, multiple disks, multiple racks, currently 3x historical rate.” This visibility helps mission control on Mars (or IT teams on Earth) respond swiftly to emerging threats.
AI-Driven Predictive Maintenance:
Train AI models on syndromes and metadata to predict disk failures with high accuracy. For example: “96% chance of failure, Mars Station 1, Rack 5, Enclosure 2, Disk 8, tomorrow—recent increase in historical error rate.”
On Mars, this capability ensures the reliability of critical systems like life support, while on Earth, it minimizes downtime in data centers.
Why It Matters for Mars
StreamScale’s technology scales to 255 drives (e.g., an RS(255,191) code with 191 data symbols and 64 check symbols, tolerating 64 erasures or 32 unknown errors), far surpassing Tucker's proposed limits on Vandermonde decoding. By logging “Dark Matter” errors, we enable AI to anticipate failures and map radiation events, ensuring data resilience in Mars’ harsh environment. This innovation not only protects Martian colonies but also enhances Earth-based systems, paving the way for a future where data survives the toughest conditions.
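The erasure and error figures quoted above follow from the standard Reed-Solomon bound: a code with n − k check symbols can correct any mix of errors at unknown positions and erasures at known positions as long as 2·errors + erasures ≤ n − k. A short sketch of that arithmetic, with hypothetical function names:

```python
def rs_budget(n, k):
    """Correction budget of an RS code with n total symbols and k data symbols."""
    check = n - k
    return {
        "check_symbols": check,
        "max_erasures": check,             # failures at known positions
        "max_unknown_errors": check // 2,  # corruption at unknown positions
    }

def correctable(n, k, errors, erasures):
    """True if this mix of unknown errors and known erasures fits the budget."""
    return 2 * errors + erasures <= n - k

print(rs_budget(255, 191))
print(correctable(255, 191, errors=20, erasures=24))
```

Each unknown error costs two check symbols because the decoder must solve for both its position and its value, while an erasure needs only the value.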
Accelerated Polynomial Coding marks a new evolutionary stage for digital storage, transforming outdated technology into a robust solution for the most challenging environments, like Mars.
StreamScale invites collaboration with SpaceX and other space innovators to integrate this technology into Martian infrastructure, ensuring a data-resilient future.
StreamScale offers very reasonable patent licenses and technical support to spacefaring companies actively developing reliable storage solutions for deployment on Mars, ensuring data integrity in extreme environments.
Let's go to Mars!
Working through Error Examples
For those of you who want to work through the "nitty gritty" details of ECC error correction, the paper below provides important context and useful examples that can be easily replicated. Special thanks to my friends at Baylor University for their contributions to this paper.
Encoding Tables for Zero-Summing Codewords
These tables enable calculation of Reed-Solomon codewords that sum to zero across a Vandermonde matrix, with MSB on the left and LSB (parity) on the right, Q before P. A "parity row" (all 1’s) appears only when the number of check symbols is exactly one (T=1, as in Patterson's RAID3-5); for T>1, it never recurs. Generated via an LFSR seeded with a generator polynomial (listed before each table), these tables align with the Parallel LFSR details for T=4 in the patent (Figures 3A, 3B).
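One way such a table could be generated is sketched below. This is an assumption-laden illustration and has not been checked against the tables or figures in the patent. It assumes GF(2⁸) with field polynomial 0x11D, α = 2, and a generator polynomial with roots α⁰ ... αᵀ⁻¹; all function names are hypothetical. Each table column is the LFSR state left behind after feeding a single 1 at one data position, so the check symbols of any data block are the table-weighted sum of its symbols.

```python
PRIM = 0x11D  # assumed field polynomial for GF(2^8)

def gf_mul(a, b):
    """Multiply two GF(2^8) elements by shift-and-xor."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM
        b >>= 1
    return r

def generator_poly(t):
    """Coefficients of (x + a^0)...(x + a^(t-1)), highest degree first, a = 2."""
    g, root = [1], 1
    for _ in range(t):
        nxt = g + [0]
        for i, c in enumerate(g):
            nxt[i + 1] ^= gf_mul(c, root)
        g, root = nxt, gf_mul(root, 2)
    return g

def lfsr_encode(data, t):
    """Shift data through a t-stage LFSR; the final state is the check symbols."""
    g = generator_poly(t)
    reg = [0] * t
    for d in data:
        fb = d ^ reg[0]  # feedback symbol
        reg = [reg[i + 1] ^ gf_mul(fb, g[i + 1]) for i in range(t - 1)]
        reg.append(gf_mul(fb, g[t]))
    return reg

def encoding_table(k, t):
    """Row r, column i: contribution of data symbol i to check symbol r."""
    cols = []
    for i in range(k):
        unit = [0] * k
        unit[i] = 1
        cols.append(lfsr_encode(unit, t))
    return [[col[r] for col in cols] for r in range(t)]

def encode_with_table(data, table):
    """Check symbols as table-weighted sums of the data symbols."""
    checks = []
    for row in table:
        acc = 0
        for weight, d in zip(row, data):
            acc ^= gf_mul(weight, d)
        checks.append(acc)
    return checks

print(encoding_table(4, 1))  # T=1: a single all-ones parity row
print(encoding_table(4, 2))  # T=2: no all-ones row
```

In this sketch the T=1 table is a single row of 1's, matching plain parity, and the T=2 table for four data symbols contains no all-ones row, which is consistent with the behaviour described above.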
Let us know if you'd like more information, especially if you are Mars bound! That's our mission.