Application and Practice of Erasure Code Technology in CubeFS

2023-07-31 Zeng Yuan

Author: Zeng Yuan, CubeFS Contributor, graduate student in the Advanced Data Systems Laboratory, University of Science and Technology of China

1. Introduction

1.1 Background of erasure codes

With the continuous growth of the Internet user base and the rapid development of emerging fields such as artificial intelligence and the mobile Internet of Things, the amount of data on the cloud is growing exponentially. To process and store ZB-scale data, large-scale distributed storage systems are increasingly being deployed on the cloud. However, a high-performance distributed storage system is composed of many components, and its operation involves hardware, software, networking, power supply, and more. This complexity means the storage system must cope with failures of all kinds, including disk damage, node failure, network failure, and cluster downtime, so as to avoid the losses a failure can cause, such as service interruption and loss of backup data.

Generally speaking, a storage system ensures reliability by storing a certain amount of redundant data. When a failure occurs, the redundant data allows the system to keep serving users and to repair the data lost in the failure. Mainstream fault-tolerance mechanisms fall roughly into two types: replication and erasure coding. Since redundant data increases storage cost, and the redundancy overhead of replication is much larger than that of erasure codes, erasure codes have become the mainstream data redundancy method in cloud storage systems and remain a research hotspot in both academia and industry.

Figure 1. Partial list of companies using erasure coding technology

1.2 Research directions of erasure codes

Research on erasure codes can be roughly divided into two categories: coding theory design and system-level optimization.

The former focuses on theoretical analysis of erasure codes, studying dimensions such as encoding/decoding complexity, repair traffic, and fault tolerance, and proposing new coding constructions. It emphasizes mathematical proofs and does not consider deployment cost in real systems. Such research is often published in IEEE Transactions on Information Theory and forms the theoretical basis for the various commercial erasure codes.

The latter focuses on system implementation. Based on coding theory and business requirements (cross-AZ deployment, heterogeneous devices, wide stripes, and other scenarios), this line of research adapts erasure codes, deploys them in real systems, and then demonstrates their reliability and usability with experimental data. Such studies are often published at major systems conferences, such as ATC, FAST, OSDI, and SOSP.

1.3 Structure of this article

The rest of this article is organized as follows: in Section 2, we briefly review the current state of research in the erasure code field and introduce related achievements in theory and system implementation; in Section 3, we focus on the coding principles of three typical erasure codes (RS, MSR, LRC) and compare their metrics; in Section 4, we share the application and practice of erasure coding technology in CubeFS.

2. Research status

At present, the most classic erasure code is the Reed-Solomon code [1]. Since the code length of the RS code is unrestricted and its storage overhead is the smallest for a given fault tolerance, it has been widely deployed in industry. RS codes come in two forms, Vandermonde [2] and Cauchy [3]. Among them, the Cauchy form has been widely studied because of its better computational properties, including but not limited to XOR transformations for Cauchy RS codes [4], the selection of the Cauchy matrix [5], and computational acceleration of Cauchy RS codes [6]. Beyond coding theory, academia and industry have also carried out considerable system optimization on storage systems deployed with RS codes, such as reducing repair network bandwidth through local parallel computing [7] and load balancing through task scheduling [8]. However, the RS code has a major defect in its coding theory: its repair traffic is too high.

To solve the problem of excessively high repair traffic in RS codes, academia proposed a coding theory called regenerating codes [9], which characterizes, from an information-theoretic perspective, the minimum amount of data required to repair a node. Under this framework, regenerating codes take two forms: minimum-storage regenerating codes (MSR codes) and minimum-bandwidth regenerating codes (MBR codes). Because real systems emphasize storage overhead, much of the academic research has focused on MSR codes.

Constructing MSR codes is a difficult task. To achieve the theoretical lower bound on repair traffic, MSR codes must divide data blocks into fine-grained sub-blocks, and the degree of this division is called the sub-packetization level. Many works [10-13] have discussed the lower bound of sub-packetization in detail, and based on these theoretical analyses academia has produced several classic MSR constructions, such as Zigzag codes [14], PM-RBT codes [15], HashTag codes [16], Butterfly codes [17], and Clay codes [18]. Similarly, beyond theoretical research, there is also considerable work on system optimization of MSR codes: a team from Tsinghua studied the disk I/O performance problems Clay codes face when deployed in object storage scenarios [19], and a team from Huazhong University of Science and Technology studied parallel acceleration of Clay code encoding and decoding [20].

However, due to their large sub-packetization and limited code length, MSR codes often face problems that cannot be effectively solved in actual deployment. For this reason, academia and industry have also looked for other codes to solve the problem of excessive RS repair traffic, and a coding theory named locally repairable codes (LRC) came into being.

LRC is a code that groups the data, introduces local parities, and reduces the number of nodes involved in repair at the cost of extra storage overhead. Because LRC is simple to implement, scales well, and fits the failure scenarios of real businesses, it has seen extensive practice in industry and has gradually become a research hotspot in the erasure code field in recent years. LRC was first proposed by Microsoft Azure in [21], and quite a few variants followed, such as optimal-LRC [22] and Xorbas [23]. Later, Kolosov et al. conducted a comprehensive evaluation and analysis of various LRCs in [24] and established evaluation criteria for them. A large number of LRC system optimizations have also been proposed, for example LRC deployment for wide stripes [25] and traffic optimization for cross-AZ LRC deployments [26]. At the most recent FAST conference in 2023, Google also shared its practical experience with LRC in high-availability scenarios [27]. Clearly, LRC has become a very important fault-tolerance option in mainstream distributed storage systems.

Next, we will introduce and analyze RS, MSR, and LRC codes in detail.

3. Technical principle

3.1 Basic principles

Erasure coding ensures system fault tolerance by generating redundant parity information. The original data is encoded into more data than the original amount of information and then stored persistently. An erasure code is usually configured by two parameters (n,k): n blocks of data are computed from k original data blocks, and we call these n blocks an erasure code stripe. When part of the data in the stripe is lost, the remaining accessible data can be decoded according to the encoding equations to restore the lost data. Figure 2 illustrates the fault tolerance of an (8,5) erasure code.

Figure 2. Schematic diagram of erasure code error tolerance
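
As a minimal illustration of this model, the simplest erasure code is (k+1,k) XOR parity: the single parity block is the XOR of the k data blocks, and any one lost block is recomputed by XORing the survivors. A short Go sketch:

package main

import "fmt"

// xorParity returns the XOR of all blocks: the parity block of a
// (k+1,k) erasure code, and also the recovery rule for any lost block.
func xorParity(blocks [][]byte) []byte {
	p := make([]byte, len(blocks[0]))
	for _, b := range blocks {
		for i := range b {
			p[i] ^= b[i]
		}
	}
	return p
}

func main() {
	data := [][]byte{{1, 2}, {3, 4}, {5, 6}} // k = 3
	parity := xorParity(data)
	// Lose data[1]; recover it from the survivors plus the parity.
	fmt.Println(xorParity([][]byte{data[0], data[2], parity})) // [3 4]
}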

In addition to the fault-tolerance model of erasure codes, we need to understand two more concepts: systematic codes and MDS codes.

If a stripe is stored as the original data plus parity data, that is, the original data is a subset of the stripe's data, we call the encoding a systematic code. A systematic code can serve read operations directly from the original data without extra decoding; decoding is only invoked when data is being restored. Because of this convenience, popular distributed storage systems all adopt erasure codes with the systematic property, so the discussion in this article covers only systematic codes.

An MDS (Maximum Distance Separable) code is a linear code whose minimum distance attains the theoretical maximum given by the Singleton bound: if the parameters of a linear code C are (n,k), then $d_{min}(C)=n-k+1$. In plain terms, for an (n,k) erasure code, any k blocks can be used to decode the remaining n-k blocks. Under the same fault tolerance, MDS codes are the linear codes with the least storage overhead.

Beyond these two properties, many metrics affect the quality of an erasure code (Table 1), and the design of the code directly determines the theoretical bounds of each metric. In this chapter we discuss three different types of erasure codes in detail, Reed-Solomon codes, minimum-storage regenerating codes, and local reconstruction codes, analyze their important metrics from a theoretical standpoint, and then clarify the advantages and disadvantages of each.

Table 1 Evaluation metrics of erasure codes

Storage redundancy: the ratio of actually stored data to original data; this metric directly determines storage cost.
Fault tolerance limit: the maximum number of bad blocks a single stripe can tolerate, which determines the fault tolerance of the whole system. (Under the same storage redundancy, MDS codes have the highest fault tolerance limit.)
Repair traffic: the amount of data that must be read and transferred to restore data when a node fails; the most important metric of an erasure code system, directly determining repair performance.
Encoding/decoding complexity: the computational complexity of the code; encoding complexity determines write latency, while decoding complexity affects degraded reads and repair latency.
Sub-packetization: some constructions split each block into multiple sub-blocks; the number of sub-blocks per block is the sub-packetization level, which affects the data's I/O performance.
Parameter restrictions: some constructions place restrictions on n and k, which determines the generality and usage scenarios of the code.

3.2 Reed-Solomon code

The Reed-Solomon code (RS for short) is a kind of MDS code with two unconstrained parameters (n,k). Its optimal MDS property, coupled with flexible parameter settings, makes the RS code the most widely used code in current erasure code systems. The design of the RS code is very simple: it computes the parity blocks through an encoding matrix, restores damaged data with Gaussian elimination, and performs all operations over Galois fields.

3.2.1 Encoding and decoding process

Next, let us introduce the encoding process of an (n,k) RS code. We use $[D_0,...,D_{k-1}]$ to represent the original data blocks and $[P_0,...,P_{n-k-1}]$ to represent the parity blocks. The encoding of an RS code can then be expressed as the matrix multiplication in formula (1). The encoding matrix has dimension $n*k$; to make the code systematic, its first k rows form an identity matrix, and to satisfy the MDS property, the coefficients of the remaining n-k rows must be carefully designed.

\left[\begin{matrix}1&0&...&0&0\\0&1&...&0&0\\.&.&.&.&.\\.&.&.&.&.\\.&.&.&.&.\\0&0&...&0&1\\\alpha_{0,0}&\alpha_{0,1}&...&\alpha_{0,k-2}&\alpha_{0,k-1}\\.&.&.&.&.\\.&.&.&.&.\\.&.&.&.&.\\\alpha_{n-k-1,0}&\alpha_{n-k-1,1}&...&\alpha_{n-k-1,k-2}&\alpha_{n-k-1,k-1}\\\end{matrix}\right]*\left[\begin{matrix}D_0\\D_1\\.\\.\\.\\D_{k-1}\end{matrix}\right]=\left[\begin{matrix}D_0\\D_1\\.\\.\\.\\D_{k-1}\\P_0\\.\\.\\.\\P_{n-k-1}\end{matrix}\right] \quad (1)

To illustrate the properties the coefficient design must satisfy, we first briefly explain the recovery process of RS codes. Without loss of generality, assume that n-k data blocks are damaged, say $[D_{2k-n},...,D_{k-1}]$. We now use the surviving blocks $[D_0,...,D_{2k-n-1},P_0,...,P_{n-k-1}]$ to repair all corrupted blocks. From formula (1), we can easily obtain:

\left[\begin{matrix}D_0\\D_1\\.\\.\\.\\D_{k-1}\end{matrix}\right]=\left[\begin{matrix}1&0&...&0&0\\0&1&...&0&0\\.&.&.&.&.\\.&.&.&.&.\\.&.&.&.&.\\0&...&1&...&0\\\alpha_{0,0}&...&\alpha_{0,2k-n-1}&...&\alpha_{0,k-1}\\.&.&.&.&.\\.&.&.&.&.\\.&.&.&.&.\\\alpha_{n-k-1,0}&...&\alpha_{n-k-1,2k-n-1}&...&\alpha_{n-k-1,k-1}\\\end{matrix}\right]^{-1}*\left[\begin{matrix}D_0\\D_1\\.\\.\\.\\D_{2k-n-1}\\P_0\\.\\.\\.\\P_{n-k-1}\end{matrix}\right] \quad (2)

We only need to ensure that the square matrix in formula (2) is invertible for all data blocks to be decodable. Clearly, n-k damaged data blocks is the most difficult repair scenario; if it can be repaired, any lesser damage can also be repaired by the RS code. Therefore, we place the following requirement on the encoding matrix of an RS code:

For the encoding matrix of an (n,k) RS code, any k rows of the n*k matrix must form an invertible k*k square matrix.

To satisfy this property, the encoding matrix of an RS code is usually constructed from a Vandermonde matrix or a Cauchy matrix. In the next subsection we introduce these two forms of RS codes in detail.

3.2.2 Encoding matrices

3.2.2.1 Vandermonde matrix

First, let us discuss the definition of the Vandermonde matrix: a matrix in which each row is a geometric progression. An n*k Vandermonde matrix looks as follows (n>k):

\left[\begin{matrix}\alpha_{0}^0&\alpha_{0}^1&\alpha_{0}^2&...&\alpha_{0}^{k-1}\\\alpha_{1}^0&\alpha_{1}^1&\alpha_{1}^2&...&\alpha_{1}^{k-1}\\.&.&.&.&.\\\alpha_{n-1}^0&\alpha_{n-1}^1&\alpha_{n-1}^2&...&\alpha_{n-1}^{k-1}\\\end{matrix}\right]

Through mathematical induction, we can easily prove that the determinant of an n-order Vandermonde square matrix is given by (3). Therefore, we only need the entries of the Vandermonde matrix to satisfy $\forall i\neq j,\ \alpha_i\neq \alpha_j$ for an n*k Vandermonde matrix to have the property that every k-order square matrix formed from its rows is invertible.

det(V)=\prod_{0\leq i<j\leq n-1}(\alpha_j-\alpha_i) \quad (3)

Therefore, an RS code based on the Vandermonde matrix is constructed by the following method, in which the first step guarantees the MDS property and the second step guarantees the systematic property:

  • The first step is to construct the corresponding n*k-dimensional Vandermonde matrix according to the parameters n and k of the RS code.
  • In the second step, elementary row and column transformations are applied to the Vandermonde matrix so that the square matrix formed by its first k rows becomes an identity matrix (a sketch of the first step follows below).
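
To make the first step concrete, here is a minimal Go sketch (not the CubeFS implementation) that builds the n*k Vandermonde matrix over GF(2^8), taking $\alpha_i=i$ so that all $\alpha_i$ are pairwise distinct; the elementary transformations of the second step are omitted:

package main

import "fmt"

// gfMul multiplies two elements of GF(2^8) using the primitive
// polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d).
func gfMul(a, b byte) byte {
	var p byte
	for b > 0 {
		if b&1 == 1 {
			p ^= a
		}
		carry := a & 0x80
		a <<= 1
		if carry != 0 {
			a ^= 0x1d
		}
		b >>= 1
	}
	return p
}

// vandermonde builds the n*k matrix V[i][j] = alpha_i^j with alpha_i = i,
// which keeps the alphas pairwise distinct for any n <= 256.
func vandermonde(n, k int) [][]byte {
	v := make([][]byte, n)
	for i := range v {
		v[i] = make([]byte, k)
		e := byte(1) // alpha_i^0
		for j := 0; j < k; j++ {
			v[i][j] = e
			e = gfMul(e, byte(i))
		}
	}
	return v
}

func main() {
	fmt.Println(vandermonde(5, 3)) // the 5*3 matrix of a (5,3) RS code
}
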
3.2.2.2 Cauchy matrix

The Cauchy matrix starts from the properties of finite fields and directly yields an encoding matrix that meets the requirements of the RS code. An n*k Cauchy matrix looks as follows (n>k):

\left[\begin{matrix}\frac{1}{x_0+y_0}&\frac{1}{x_0+y_1}&...&\frac{1}{x_0+y_{k-1}}\\\frac{1}{x_1+y_0}&\frac{1}{x_1+y_1}&...&\frac{1}{x_1+y_{k-1}}\\.&.&.&.\\\frac{1}{x_{n-1}+y_0}&\frac{1}{x_{n-1}+y_1}&...&\frac{1}{x_{n-1}+y_{k-1}}\end{matrix}\right] , \left\{\begin{array}{c} x_i\in X , |X|=n \\ y_j\in Y,|Y|=k \end{array}\right.

When the elements of $X$ and $Y$ are pairwise distinct, the Cauchy construction guarantees that every k-order square matrix is invertible; we do not expand the proof here. In addition, an RS code constructed from a Cauchy matrix can map operations over $GF(2^w)$ to $GF(2)$ through a linear mapping, converting the large number of finite-field multiplications into XOR computations to improve efficiency. The detailed mapping algorithm and XOR acceleration scheme are not expanded here; interested readers can consult reference [4].
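
For comparison, a sketch of the Cauchy construction under the same assumptions, meant to live in the same file as the Vandermonde sketch above (it reuses gfMul); choosing X = {0,...,n-1} and Y = {n,...,n+k-1} keeps all elements pairwise distinct, and since addition in GF(2^w) is XOR, $x_i+y_j$ is never zero:

// cauchy builds the n*k matrix M[i][j] = 1/(x_i + y_j) over GF(2^8)
// with x_i = i and y_j = n+j; it requires n+k <= 256.
func cauchy(n, k int) [][]byte {
	m := make([][]byte, n)
	for i := range m {
		m[i] = make([]byte, k)
		for j := 0; j < k; j++ {
			m[i][j] = gfInv(byte(i) ^ byte(n+j))
		}
	}
	return m
}

// gfInv finds a multiplicative inverse by brute force, which is
// perfectly adequate for the 255 nonzero elements of GF(2^8).
func gfInv(a byte) byte {
	for b := 1; b < 256; b++ {
		if gfMul(a, byte(b)) == 1 {
			return byte(b)
		}
	}
	return 0 // a == 0 has no inverse
}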

3.2.3 Summary

In general, the RS code is an optimal fault-tolerant code with the MDS property: an RS code with parameters (n,k) can use any k blocks in the stripe to decode all the remaining blocks. Moreover, the construction and engineering implementation of RS codes are relatively simple, with open source libraries such as Jerasure, ISA-L, and klauspost/reedsolomon, so RS codes are widely used in major distributed systems. From another perspective, however, even to repair a single block the RS code must read k surviving blocks, which means an erasure code system deployed with RS codes pays a huge repair traffic cost, and this cost becomes increasingly unacceptable as k grows. In a real system, single-block failure is the most common scenario, so optimizing the repair traffic of single-block failures is very necessary.

3.3 Minimum-storage regenerating codes

To solve the high traffic overhead of single-block failure repair in RS codes, Dimakis et al. proposed the concept of regenerating codes. From the perspective of information theory, regenerating codes describe the relationship between the lower bound on node repair traffic and the amount of information per node, laying a solid theoretical foundation for a variety of systems with theoretically optimal repair traffic. Among them, the minimum-storage regenerating code (MSR) branch best matches the requirements of real systems.

However, there is a huge gap between MSR theory and construction, and proposed MSR constructions often have defects; for example, Butterfly codes can only tolerate two failures. In this section, we introduce from a practical standpoint two MSR schemes with unrestricted parameters and relatively simple constructions.

3.3.1 Zigzag code

The design of the Zigzag code combines data layering, interleaved encoding, and data reuse. We use the simple (5,3) Zigzag code in Figure 3 as an example to build an intuition for its subtle design.

3.3.1.1 Example

First, the meaning of Figure 3: in (5,3) Zigzag there are 3 data blocks, numbered 0, 1, 2, corresponding to the first 3 columns; there are 2 parity blocks, labeled R and Z, corresponding to the last two columns; and each block is split into 4 sub-blocks (a sub-packetization of 4), corresponding to the 4 rows. The R parity is computed from data sub-blocks of the same layer, while the Z parity is computed by interleaving data across layers. For example, the sub-blocks marked ♣ at positions (0,0), (2,1), (1,2) jointly participate in computing the parity at (0,Z).

Figure 3. Encoding and decoding of (5,3) Zigzag code

Now, temporarily ignoring the coefficients, consider the single-block failure scenario to see how Zigzag codes optimize repair traffic. Suppose data block 1 is damaged. During repair, (5,3) Zigzag only needs to read the shaded data in Figure 3 to reconstruct block 1: the sub-blocks at (0,1) and (1,1) are restored via the R parity, and those at (2,1) and (3,1) via the Z parity, namely the sub-blocks marked ♣ and ♡. Since the ♣ and ♡ data on all surviving data blocks are reused during repair, participating simultaneously in the reconstructions led by the R and Z parity blocks, the single-node repair traffic of the (5,3) Zigzag code is reduced to 2/3 of a traditional RS code's, exactly meeting the theoretical lower bound on MSR repair traffic. Zigzag achieves the same ratio for any other single-node failure. Next, we describe how to systematically construct a Zigzag code with the MSR property.

3.3.1.2 Construction process

From the example in Figure 3, we can see that the core idea of Zigzag design is to find a good suit-assignment rule that creates opportunities for data reuse during repair, thereby optimizing repair traffic. Here we briefly give the construction of (k+2,k) Zigzag codes; its generalization to arbitrary parameters and the selection of encoding coefficients are left to interested readers in reference [14].

We now reduce the entire Zigzag construction problem to two subproblems:

  1. How to design the permutation function set $\mathbb{F} = \{f_0,...,f_{k-1}\}, f_i : \{0,1\}^{k-1} \rightarrow \{0,1\}^{k-1}$. This determines the coding structure of the entire Zigzag code, that is, how the suits are arranged.
  2. How to design the repair set partition $\mathbb{X} = \{X_i|i\in[0,k-1]\}$. This specifies, during repair, which parity block each suit should use. In a (k+2,k) Zigzag code, the sub-packetization of a data block is $2^{k-1}$, and the total number of suits equals the sub-packetization. To ease the formal description, we use (k-1)-dimensional 0-1 vectors to index both the suits and the layers of the data sub-blocks, and write $f_i(x)=y$ to mean that in data block $i$, the sub-block at layer $x$ is assigned suit $y$.

We then define the unit vector $e_i$, which has a 1 in position $i$ and 0 elsewhere ($e_0$ denotes the zero vector). We can now give a concise construction of the permutation function set of the (k+2,k) Zigzag code:

F={fi(x)=x+eii=0,..,k1}\mathbb{F} = \{f_i(x)=x+e_i|i=0,..,k-1\}

Taking (5,3) Zigzag as an example, let 0-3 denote the suits $\clubsuit,\heartsuit,\spadesuit,\diamondsuit$. To assign suits to data block 1, as shown in Figure 4, we first convert the position of each sub-block into a 0-1 vector, then map each layer to a new 0-1 vector with the corresponding permutation function, and finally convert the resulting vector back into a suit. The suit assignment of data block 1 is therefore, from top to bottom, $\spadesuit,\diamondsuit,\clubsuit,\heartsuit$, consistent with Figure 3.

Figure 4. Suit division process for block 1 in (5,3)-Zigzag

Next, let us define the repair set partition: $\mathbb{X} = \{X_i|i=0,...,k-1\}, X_i=\{x|x\cdot e_i=0\}$. When data block $j$ fails, all layers belonging to $S_r=\{x|x\in X_j\}$ are recovered through the R parity block, and all layers belonging to $S_z=\{x|x\notin X_j\}$ are recovered through the Z parity block. Still taking (5,3) Zigzag as an example, when data block 1 is damaged, the repair set partition gives:

X1={xxe1=x(1,0)=0}={(0,0),(0,1)}X_1 = \{x|x\cdot e_1 = x\cdot(1,0) = 0\} = \{(0,0),(0,1)\}

Therefore, the sub-blocks at layers 0 and 1 are restored using the R parity block, and those at layers 2 and 3 using the Z parity block, consistent with the repair process in Figure 3. This completes the structural design of the (k+2,k) Zigzag code; due to space constraints, the selection of encoding coefficients and the generalization of parameters are not expanded here.
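
To make the formulas concrete, here is a small illustrative Go program (not CubeFS code) that encodes layers as (k-1)-bit integers, implements $f_i(x)=x+e_i$ as an XOR, and evaluates repair set membership; mapping $e_i$ to bit k-1-i is our own indexing convention:

package main

import "fmt"

// eVec maps the unit vector e_i to an integer bitmask over the
// (k-1)-bit layer indices; e_0 is the zero vector by definition.
func eVec(i, k int) int {
	if i == 0 {
		return 0
	}
	return 1 << (k - 1 - i)
}

// suit implements f_i(x) = x + e_i over GF(2): the suit assigned
// to layer x of data block i.
func suit(i, x, k int) int {
	return x ^ eVec(i, k)
}

// usesRParity reports whether layer x of failed block i is repaired
// via the R parity block, i.e. whether x lies in X_i = {x | x·e_i = 0}.
func usesRParity(i, x, k int) bool {
	return x&eVec(i, k) == 0
}

func main() {
	const k = 3 // the (5,3) Zigzag code
	names := []string{"♣", "♡", "♠", "♦"}
	for x := 0; x < 1<<(k-1); x++ {
		fmt.Printf("block 1, layer %d: suit %s, R-repair=%v\n",
			x, names[suit(1, x, k)], usesRParity(1, x, k))
	}
}

Running it reproduces the assignment ♠, ♦, ♣, ♡ for block 1 and the split of layers 0-1 to the R parity and layers 2-3 to the Z parity from Figures 3 and 4.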

In summary, the Zigzag code is an erasure code that cleverly exploits data layering, interleaved encoding, and data reuse. However, generating its encoding coefficients involves many mathematical difficulties, and designing a good Zigzag code requires a strong mathematical foundation. Fortunately, academic research on MSR codes did not stop there: the Clay code published at FAST'18 neatly avoids the coefficient generation problem. The Clay code is an MSR design orthogonal to coefficient generation; it can build on any existing MDS code and transform it into an erasure code system with the MSR property. Next, we introduce the design ideas of the Clay code in detail.

3.3.2 Clay code

3.3.2.1 Overview

The full name of the Clay code is Coupled-Layer code. Its characteristic is that, on top of data layering, the system adds a large number of coupled pairs; these coupled pairs provide cross-layer data association, similar in spirit to the Z parity in the Zigzag code. Clay codes use a conventional MDS code within each layer, such as a traditional RS code, and perform invertible transformations between coupled pairs, which is where the cross-layer connections between coupled pairs arise. After encoding in this way, when a data block is damaged, the sub-blocks hosting coupled pairs can simultaneously provide same-layer and cross-layer information, thereby reducing the amount of data transmitted during repair.

As shown in Figure 5, all data in a Clay code has two forms: red is the original data form and blue is the intermediate data form. When encoding, the Clay code first applies the invertible matrix (4) to the original data within each coupled pair; this process is called the pairwise reverse transform (PRT). When the PRT finishes and all data is in the blue form, the Clay code performs intra-layer MDS encoding on each layer; finally, all coupled pairs undergo the pairwise forward transform (PFT) to restore the original data form.

\left[\begin{matrix}C(p)\\C^*(p)\end{matrix}\right]=\left[\begin{matrix}1&\gamma\\\gamma&1\end{matrix}\right]^{-1}*\left[\begin{matrix}U(p)\\U^*(p)\end{matrix}\right] \quad (4)
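
Formula (4) is just an invertible 2x2 transform over GF(2^w); a minimal sketch of both directions, assuming the gfMul/gfInv helpers from the GF(2^8) sketches in Section 3.2.2 and a coefficient gamma with gamma^2 != 1 (in characteristic 2, subtraction is XOR and the matrix inverse is (1+gamma^2)^{-1} times the matrix itself):

// prt applies the pairwise reverse transform of formula (4),
// taking the original pair (U, U*) to the intermediate pair (C, C*).
func prt(u, uStar, gamma byte) (c, cStar byte) {
	d := gfInv(1 ^ gfMul(gamma, gamma)) // 1/(1 + gamma^2)
	c = gfMul(d, u^gfMul(gamma, uStar))
	cStar = gfMul(d, gfMul(gamma, u)^uStar)
	return
}

// pft is the inverse pairwise forward transform, restoring (U, U*).
func pft(c, cStar, gamma byte) (u, uStar byte) {
	u = c ^ gfMul(gamma, cStar)
	uStar = gfMul(gamma, c) ^ cStar
	return
}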

Figure 5. Schematic diagram of (4,2) Clay code structure

Next, we will take the (4,2) Clay code in Figure 5 as an example to show the restoration process of the Clay code and reveal the reason why it can achieve the MSR property.

3.3.2.2 Example

Assume the (4,2) Clay code suffers a single-node failure, as shown in Figure 6. The Clay code first performs the PRT on the surviving coupled pairs to obtain their intermediate form. It then runs intra-layer MDS decoding on the surviving data, restoring the failed node's data in those layers to intermediate form, as shown in the lower right of Figure 6. Finally, because the coupled-pair transform is invertible, the failed coupled-pair data not covered by MDS decoding can be computed from its associated data; at this point the failed node has been reconstructed. Throughout the recovery, the Clay code reads only the layers where surviving coupled pairs reside, and thanks to the ingenious coupled-pair design, the amount of data required exactly achieves the theoretical repair traffic of MSR codes.

Figure 6. Repair process of (4,2) Clay code

Next we will carefully analyze the subtle design of Clay codes for coupled pairs.

3.3.2.3 Core design

The Clay code describes a node's position in three dimensions. A node's index contains the two axes $x$ and $y$, where $x$ ranges over $[0,n-k-1]$ and $y$ over $[0,\lceil\frac{n}{n-k}\rceil-1]$. We write their sizes as $|x|=q=n-k$ and $|y|=t=\lceil\frac{n}{n-k}\rceil$. The layer of a node is then described by a $t$-dimensional vector, each component of which ranges over $[0,q-1]$. The whole Clay code can therefore be viewed as a $q*t*q^t$ three-dimensional cube, and the coupled pairs are designed on top of this position description.

For any node $(x,y,z_0,...,z_{t-1})$ in the Clay code, $(z_y,y,z_0,...,z_{y-1},x,z_{y+1},...,z_{t-1})$ is its coupled node. If a node's coupled node is itself, it does not belong to any coupled pair, as with the red nodes in Figure 5. Examining this definition, we find that the coupled pairs are partitioned by the $y$ axis: a coupled pair shares the same $y$ component, and its association span grows with $y$. For $y=0$ the coupled pairs connect only across single layers, while for $y=1$ they connect across two layers. This design is analogous to the exponentially growing offsets $e_i$ of the permutation functions in Zigzag: it interleaves the coupled pairs of nodes from different regions across different layers of the system, so that during repair fewer layers read can be associated with more otherwise-unused data.
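
The pairing rule itself is a simple index swap; a small runnable sketch (illustrative naming), where applying the function twice returns the original node:

package main

import "fmt"

// coupledNode returns the partner of node (x, y, z) in a Clay code,
// where z is the t-dimensional layer vector: x and z[y] swap places.
// A node with z[y] == x is its own partner and belongs to no pair.
func coupledNode(x, y int, z []int) (int, int, []int) {
	cz := append([]int(nil), z...)
	cz[y] = x
	return z[y], y, cz
}

func main() {
	// (4,2) Clay code: q = 2, t = 2, so layers are 2-dimensional vectors.
	x, y, z := coupledNode(0, 1, []int{1, 1})
	fmt.Println(x, y, z) // 1 1 [1 0]: node (0,1,(1,1)) pairs with (1,1,(1,0))
}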

To sum up, starting from multiple representations of the data, the Clay code uses coupled pairs and data transforms to give the data in storage nodes multi-layer information associations, thereby bypassing coefficient designs like the Zigzag code's and extending any MDS code into an MSR code. It is an excellent erasure code design.

3.3.3 Summary

Section 3.3 introduced two MSR constructions with unrestricted parameters. Both create cross-layer associations on top of data layering through clever combinatorial design, then reuse data so that as little as possible is read to repair damage. However, neither the Zigzag code nor the Clay code can avoid the challenge inherent in MSR theory: to achieve the theoretical MSR repair traffic, the sub-packetization of the code must reach at least $(n-k)^{\lceil\frac{n}{n-k}\rceil}$. This means that when the coding parameters grow even slightly, the number of sub-blocks per data block grows exponentially; even for common parameters such as (14,4), the sub-packetization reaches 256. In a system, this manifests as greatly reduced I/O performance during data restoration due to lost access continuity: disk I/O becomes the bottleneck of the entire system and seriously affects its normal operation. This, besides encoding and decoding complexity, is one of the reasons regenerating codes are not used at large scale.

3.4 Local Reconstruction Code

To optimize single-disk repair traffic, the industry proposed another code: LRC, first proposed and applied by Microsoft Azure. LRC groups the data blocks and generates a local parity block for each group, so that single-disk repair only requires the participation of nodes within the group. Because of LRC's simplicity and its fit to business needs, the industry has extensive practical experience with it. In this chapter, we explain the implementation principle of LRC in detail and introduce several different construction methods.

3.4.1 Core design

LRC usually has 3 positive integer parameters $k,l,g$. A (k,l,g) LRC divides the $k$ data blocks into $l$ groups, generates 1 local parity block within each group, and generates $g$ global parity blocks from all $k$ data blocks. The $k+l+g$ blocks of a stripe are stored on separate storage devices. When a single block in a stripe fails, LRC starts the local repair algorithm, decoding with only the other blocks in the same group as the faulty block; when the number of failures is greater than 1, LRC decodes using the global plus local parity blocks, a repair process similar to that of RS codes.

Taking Figure 7 as an example, the local parity blocks $p_x,p_y$ are generated from the data blocks $x_0,x_1,x_2$ and $y_0,y_1,y_2$ respectively, and are responsible for the local repair of the data blocks within their groups. The global parity blocks $p_0,p_1$ are generated from $x_0,x_1,x_2,y_0,y_1,y_2$ to guarantee the fault tolerance of the entire stripe.

Figure 7. Schematic diagram of (k=6,l=2,g=2) LRC code

From this structure, when only one data block is lost, LRC only needs to read the other data blocks in the faulty block's group plus the local parity block to decode. Compared with an (n,k) RS code, which must read k blocks to complete the repair, the data transfer overhead of a (k,l,g) LRC in single-block repair is only about k/l data blocks; at the same time, the computation during repair drops along with the amount of participating data, ensuring that the erasure code system can restore damaged data faster and keep serving users. However, because of the added local parity blocks, LRC has higher storage overhead and no longer has the MDS property. With good coefficient design, a (k,l,g) LRC can tolerate any g+1 concurrent failures, and can also repair some failure scenarios with more than g+1 but fewer than g+l bad blocks.
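
The repair policy above can be sketched as a block-selection function. The layout here is hypothetical for illustration: data blocks 0..k-1 with each group owning k/l consecutive blocks, local parities at k..k+l-1, global parities after that, and a failed global parity handled Azure-LRC style with a full k-block read:

package main

import "fmt"

// repairSources lists the blocks to read for a single-block failure
// in a (k,l,g)-LRC; it assumes k is divisible by l.
func repairSources(failed, k, l, g int) []int {
	groupSize := k / l
	switch {
	case failed < k: // data block: rest of its group plus the local parity
		grp := failed / groupSize
		src := []int{}
		for b := grp * groupSize; b < (grp+1)*groupSize; b++ {
			if b != failed {
				src = append(src, b)
			}
		}
		return append(src, k+grp)
	case failed < k+l: // local parity: re-encode it from its group
		grp := failed - k
		src := []int{}
		for b := grp * groupSize; b < (grp+1)*groupSize; b++ {
			src = append(src, b)
		}
		return src
	default: // global parity (Azure-LRC): read k survivors, e.g. the data
		src := make([]int, k)
		for i := range src {
			src[i] = i
		}
		return src
	}
}

func main() {
	// (k=6,l=2,g=2) as in Figure 7: repairing x1 reads x0, x2 and px.
	fmt.Println(repairSources(1, 6, 2, 2)) // [0 2 6]
	fmt.Println(repairSources(9, 6, 2, 2)) // a global parity needs k blocks
}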

It is worth mentioning that, in addition to the (k,l,g) parameters, LRC can also be described by another parameter set (n,k,r), where n is the total number of blocks in the stripe, k the number of data blocks, and r the upper limit on the number of non-local-parity blocks in a single group. This parameter set describes LRC's grouping better and eases coding-theoretic analysis. There are currently many LRC construction methods with different fault tolerance capabilities; we briefly introduce them in the next subsection.

3.4.2 Various construction methods

In this section we discuss three different LRC constructions: Azure-LRC, Azure-LRC+1, and Xorbas. Their fault tolerance is summarized in Table 2, and we briefly introduce each construction next.

Table 2 Fault tolerance of different types of LRC

LRC type: minimum code distance d (the code tolerates any d-1 bad blocks)
Azure-LRC: d=n-k-\lceil \frac{k}{r} \rceil+2
Azure-LRC+1: d=n-k-\lceil \frac{k}{r} \rceil+1
Xorbas: d=g+1

3.4.2.1 Azure-LRC

Azure-LRC is the original LRC construction method described in Section 3.4.1, so we do not repeat it here. Taking Figure 8 as an example, the maximum number of arbitrary bad blocks it can tolerate is $10-6-\lceil \frac{6}{3} \rceil + 2 - 1 = 3$. However, this construction has a flaw: for the repair of a global parity block, even a single-block failure forces Azure-LRC to pay a large repair cost, which may turn the nodes hosting global parity blocks into hotspots of the entire system and harm its load balance.

Figure 8. Schematic diagram of (n=10,k=6,r=3)Azure-LRC

3.4.2.2 Azure-LRC+1

To solve the high repair cost of Azure-LRC's global parity blocks, its improved version Azure-LRC+1 came into being. On top of Azure-LRC, this construction adds an extra local parity block for the global parity blocks. Taking Figure 9 as an example, it adds the L2 local parity on the basis of Figure 8; its fault tolerance is the same as Azure-LRC's, namely $11-6-\lceil \frac{6}{3} \rceil + 1 - 1 = 3$. In contrast to the unequal status of data blocks and parity blocks in Azure-LRC, the repair costs of the two are exactly the same in Azure-LRC+1, which means Azure-LRC+1 removes the uneven stripe repair overhead at the cost of only one extra block of storage while maintaining the same fault tolerance.

Figure 9. Schematic diagram of (n=11,k=6,r=3) Azure-LRC+1

3.4.2.3 Xorbas

Xorbas solves the high repair overhead of Azure-LRC's global parity blocks from the angle of encoding coefficients. In the Xorbas design, the XOR sum of the local parity blocks equals the XOR sum of the global parity blocks. When a single global parity block fails, Xorbas can gather the l local parity blocks, instead of k surviving blocks, to restore the bad block (l is usually much smaller than k).

Taking Figure 10 as an example, its coding structure is exactly the same as Azure-LRC's, but the coefficients are specially designed. The encoding matrix is as follows; for the selection of coefficients, see the appendix of reference [23]:

\left[\begin{matrix}1&0&0&0&0&0\\0&1&0&0&0&0\\0&0&1&0&0&0\\0&0&0&1&0&0\\0&0&0&0&1&0\\0&0&0&0&0&1\\\alpha_{0,0}&\alpha_{0,1}&\alpha_{0,2}&\alpha_{0,3}&\alpha_{0,4}&\alpha_{0,5}\\\alpha_{1,0}&\alpha_{1,1}&\alpha_{1,2}&\alpha_{1,3}&\alpha_{1,4}&\alpha_{1,5}\\\alpha_{0,0}+\alpha_{1,0}&\alpha_{0,1}+\alpha_{1,1}&\alpha_{0,2}+\alpha_{1,2}&0&0&0\\0&0&0&\alpha_{0,3}+\alpha_{1,3}&\alpha_{0,4}+\alpha_{1,4}&\alpha_{0,5}+\alpha_{1,5}\\\end{matrix}\right]*\left[\begin{matrix}D_0\\D_1\\D_2\\D_3\\D_4\\D_5\end{matrix}\right]=\left[\begin{matrix}D_0\\D_1\\D_2\\D_3\\D_4\\D_5\\G_0\\G_1\\L_0\\L_1\end{matrix}\right]

Figure 10. Schematic diagram of (k=6,l=2,g=2)Xorbas

3.4.3 Summary

In general, LRC is an erasure code optimized for single-disk repair traffic and well suited to engineering implementation. By sacrificing the MDS property, some parity blocks are used as local parities within groups to achieve rapid recovery from single-disk failures. Because it does not have the MDS property, a (k,l,g) LRC can only guarantee repair of any g+1 concurrent failures, and in large-scale clusters the frequency of more than g+1 concurrent failures cannot be ignored, which limits the application of LRC in scenarios requiring high fault tolerance.

3.5 Summary and Comparison

Table 3 summarizes several important metrics of all the erasure codes introduced above, including storage redundancy, fault tolerance limit, the systematic property, the MDS property, single-block repair traffic, and sub-packetization. In a real system, enterprises should evaluate different erasure codes against business requirements and engineering difficulty, then choose the most suitable type to deploy to meet the system's fault tolerance requirements.

Table 3 Metric comparison of different erasure codes

Encoding type | Parameters | Storage redundancy | Fault tolerance limit | Systematic | MDS | Single-block repair traffic | Sub-packetization
Vandermonde RS code | (n,k) | n/k | n-k | ✓ | ✓ | k | 1
Cauchy RS code | (n,k) | n/k | n-k | ✓ | ✓ | k | 1
Cauchy RS code (XOR version) | (n,k) | n/k | n-k | ✓ | ✓ | k | w (over GF(2^w))
Zigzag code | (n,k) | n/k | n-k | ✓ | ✓ | (n-1)/(n-k) | (n-k)^(k-1)
Clay code | (n,k) | n/k | n-k | ✓ | ✓ | (n-1)/(n-k) | (n-k)^⌈n/(n-k)⌉
Azure-LRC | (n,k,r) | n/k | n-k-⌈k/r⌉+1 | ✓ | × | k/l | 1
Azure-LRC+1 | (n,k,r) | n/k | n-k-⌈k/r⌉ | ✓ | × | k/l | 1
Xorbas code | (k,l,g) | (k+l+g)/k | g | ✓ | × | k/l | 1

4. Application and practice of erasure codes in CubeFS

The erasure code system of CubeFS supports deployment in multiple erasure code modes: according to the storage redundancy required by the business, the need for AZ-level disaster recovery, and the cost of cross-AZ repair traffic, it can provide CubeFS-LRC, Azure-LRC+1, optimized RS, and other codes. When selecting a code, multiple factors must be weighed, such as storage redundancy, I/O fan-out, and whether AZ-level disaster recovery is required. If AZ disaster recovery is required, a multi-AZ coding mode is needed. At the same redundancy, I/O fan-out must be considered: for example, RS(4,2) and RS(8,4) have the same redundancy, but the latter has a larger fan-out, meaning a file of the same size is split into more pieces, each piece carries less data, and the probability of long-tail latency is higher.

Figure 11. Multi-mode erasure coding deployment of CubeFS

To mitigate the long-tail latency problems that the high fan-out of erasure codes may cause, we adopt a quorum read-write mechanism: a write returns success once any n+t copies have been written (t can be configured for the deployment). A read issues n+t read requests (t is configurable; a larger t requires more bandwidth, and we recommend t=1 in production), and as soon as the first n pieces of data return, the original data can be recovered by decoding. Readers may wonder whether decoding on every read hurts performance: in practice each piece of data is generally 8 MB, and the average decoding speed can exceed 5 GB/s, so decoding does not become the bottleneck of the whole I/O path. Of course, we also support reading specific data blocks directly; for the detailed process, see the earlier CubeFS storage technology demystification series of articles.

The codec engine library used by CubeFS is https://github.com/klauspost/reedsolomon, which supports instruction-set acceleration.
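
For reference, a minimal end-to-end example with that library, encoding the (8,5) stripe of Figure 2 and reconstructing three lost shards (the payload and shard sizes are arbitrary):

package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	// An (8,5) stripe: 5 data shards plus 3 parity shards.
	enc, err := reedsolomon.New(5, 3)
	if err != nil {
		log.Fatal(err)
	}
	// Split the payload into 5 equal data shards (parity slots included).
	shards, err := enc.Split(bytes.Repeat([]byte("cubefs"), 100))
	if err != nil {
		log.Fatal(err)
	}
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}
	// Lose any 3 shards, then reconstruct them from the survivors.
	shards[0], shards[4], shards[6] = nil, nil, nil
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}
	ok, _ := enc.Verify(shards)
	fmt.Println("stripe verified after repair:", ok)
}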

4.1 CubeFS-LRC

In CubeFS we implemented a simple LRC encoding with two RS computations. In CubeFS-LRC, we treat the global parity blocks and the data blocks as having equal status: encoding first distributes the global parity blocks across different AZs, and then, inside each AZ, the data blocks and global parity blocks placed there are RS-encoded once more to obtain the local parity block. The encoder therefore holds two RS engines:

type lrcEncoder struct {
	engine      reedsolomon.Encoder // computes the global parity blocks
	localEngine reedsolomon.Encoder // computes the local parity blocks
}

Figure 12. Simple schematic of CubeFS-LRC

The advantage of this approach is that the engineering implementation is relatively simple, plain RS encoding and decoding suffice, and a bad disk can be repaired entirely inside its AZ. The disadvantage is that in some scenarios m+1 damaged blocks cannot be recovered, and at the same time updating a data block generates substantial cross-AZ traffic. But m+1 concurrent damaged blocks is a very rare scenario; as long as data is repaired promptly, this situation generally does not occur.
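
Below is a minimal sketch of this two-pass encoding. It is not the CubeFS source: the (k,g,l) values and the AZ layout (two data blocks, one global parity and one local parity per AZ) are assumptions for illustration.

package main

import (
	"log"

	"github.com/klauspost/reedsolomon"
)

// Hypothetical layout: AZ i hosts data blocks D_{2i}, D_{2i+1},
// global parity G_i and local parity L_i.
const (
	k = 6 // data blocks
	g = 3 // global parity blocks
	l = 3 // AZs, one local parity block each
)

// encodeLRC runs the two RS passes: one for the global parities,
// then one per AZ for its local parity. shards must hold k+g+l
// pre-allocated, equal-sized slices, with the data in the first k.
func encodeLRC(shards [][]byte) error {
	global, err := reedsolomon.New(k, g)
	if err != nil {
		return err
	}
	// Pass 1: compute G_0..G_2 from D_0..D_5.
	if err := global.Encode(shards[:k+g]); err != nil {
		return err
	}
	local, err := reedsolomon.New(3, 1)
	if err != nil {
		return err
	}
	// Pass 2: in each AZ, RS-encode {2 data blocks, 1 global parity}
	// into that AZ's single local parity block.
	for az := 0; az < l; az++ {
		group := [][]byte{
			shards[2*az], shards[2*az+1], // data blocks in this AZ
			shards[k+az],   // global parity in this AZ
			shards[k+g+az], // local parity slot (output)
		}
		if err := local.Encode(group); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	shards := make([][]byte, k+g+l)
	for i := range shards {
		shards[i] = make([]byte, 1024)
	}
	if err := encodeLRC(shards); err != nil {
		log.Fatal(err)
	}
}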

4.2 Azure-LRC+1

CubeFS-LRC implements LRC by computing the RS code twice; its disadvantage is that it cannot guarantee repair when m+1 blocks are damaged. Azure-LRC+1 encoding can tolerate arbitrary m+1 block corruption. In the actual production environment, we limit each AZ to generating only one local parity block, because in most scenarios only a single disk fails, and repair is initiated immediately after the failure.

In the implementation of Azure-LRC+1, we generate a special matrix and then use the encoding interface of the engine library to produce the global and local parity blocks in a single encoding pass. We construct the generator matrix of Azure-LRC+1 by extending the Jerasure-style Vandermonde matrix. For (n,m,l)-Azure-LRC+1, we first construct the Jerasure-style Vandermonde matrix of an (n,m+1) RS code, then split the first parity coefficient row (the all-ones row) by group to obtain the local parity coefficients of the data blocks, and finally sum all parity coefficient rows except the first to obtain the local parity coefficients of the global parity blocks. The figure below shows the structure of the generator matrix of (8,4,3)-Azure-LRC+1.

\left[\begin{matrix}1&0&0&0&0&0&0&0\\0&1&0&0&0&0&0&0\\0&0&1&0&0&0&0&0\\0&0&0&1&0&0&0&0\\0&0&0&0&1&0&0&0\\0&0&0&0&0&1&0&0\\0&0&0&0&0&0&1&0\\0&0&0&0&0&0&0&1\\1&1&1&1&1&1&1&1\\1&55&39&73&84&181&225&217\\1&39&217&161&92&60&172&90\\1&73&161&239&220&132&15&24\\1&172&70&235&143&34&200&101\\\end{matrix}\right]\Rightarrow\left[\begin{matrix}1&0&0&0&0&0&0&0\\0&1&0&0&0&0&0&0\\0&0&1&0&0&0&0&0\\0&0&0&1&0&0&0&0\\0&0&0&0&1&0&0&0\\0&0&0&0&0&1&0&0\\0&0&0&0&0&0&1&0\\0&0&0&0&0&0&0&1\\1&55&39&73&84&181&225&217\\1&39&217&161&92&60&172&90\\1&73&161&239&220&132&15&24\\1&172&70&235&143&34&200&101\\1&1&1&1&0&0&0&0\\0&0&0&0&1&1&1&1\\0&245&25&236&91&47&138&254\\\end{matrix}\right]

// buildMatrixSpecialJerasure creates the incomplete encoding matrix from buildMatrixJerasure.
//
// The top square of the matrix is guaranteed to be an identity
// matrix, which means that the data shards are unchanged after
// encoding.
//
// We first generate a (totalShards+1) * dataShards Jerasure matrix,
// then delete the XOR-sum row to fit the property of AzureLrc+1's
// encoding matrix.
//
// E.g. to encode the global parity of AzureLrc+1 (n=4, m=2, l=3)
// we use the matrix on the left side below as the basic Vandermonde
// matrix ((4+2+1)*4):
//
//	 	[[1, 1,  1,   1],			   [[1,   0,   0,   0],
//		 [1, 2,  4,   8],				[0,   1,   0,   0],
//		 [1, 3,  5,  15],				[0,   0,   1,   0],
//		 [1, 4, 16,  64],	 ----->     [0,   0,   0,   1],
//		 [1, 5, 17,  85],				[1,   1,   1,   1],
//		 [1, 6, 20, 120],				[1, 123, 166, 244],
//		 [1, 7, 21, 107]]				[1,  82, 245, 167]]
//
// Then we apply elementary transformations to get the Jerasure-style
// matrix on the right side above and take its last two rows as the
// encoding coefficients. This means we use [[1,123,166,244],[1,82,245,167]]
// to encode the global parity.
//
// This function is used to encode the GLOBAL parity of AzureLrc+1.
func buildMatrixSpecialJerasure(dataShards, totalShards int) (matrix, error)

// buildMatrixAzureLrcP1 creates the entire encoding matrix
// with dimensions (n+m+l)*n for (n,m,l)-AzureLrc+1.
//
// The top (n+m)*n rows of the matrix are generated by
// buildMatrixSpecialJerasure to calculate the global parity.
//
// Rows n+m+1 through n+m+l-1 are 0-1 vectors used to generate the
// DATA-AZ's local parity. (This is the same as a Jerasure-style
// Vandermonde matrix whose parity number equals 1.)
//
// The last row is the sum of rows n+1 through n+m, which means
// that the PARITY-AZ's local parity is the XOR-sum of all global parity.
//
// Warning: this function has the following limitation on its arguments:
// * localParityShards is actually equal to the AZ count, which means
// we have only one LOCAL parity in each AZ.
func buildMatrixAzureLrcP1(dataShards, globalParityShards, localParityShards int) (matrix, error)

If a disk is damaged, the repair process inside the AZ is the same as the CubeFS-LRC process and is not repeated here. When m+1 blocks are damaged in Azure-LRC+1, we need to select an appropriate n blocks from the remaining n+l-1 surviving blocks for data repair. CubeFS first calls the GetSurvivalShards() interface to obtain the correct surviving block IDs, then uses those n surviving blocks for the decoding operation. The only difference between this process and RS is that an RS code can decode from any n surviving blocks, while Azure-LRC+1 must select an appropriate n.

// GetSurvivalShards allows the system to select the
// CORRECT n shards to form an invertible decode
// matrix.
//
// The first []int is the selected survival index,
// which determines the decode matrix.
//
// The second []int is the real read shards index,
// which actually participates in the decoding.
//
// This function is specially designed for LRC.
func GetSurvivalShards(badIndex []int, azLayout [][]int) ([]int, []int, error)

4.3 Cross-AZ traffic optimization of RS code

LRC encoding can save cross-AZ network traffic, but under the same conditions it increases storage cost by one local parity block per AZ. If you do not want to increase storage cost yet still want to save cross-AZ network traffic, we provide a compromise: on the basis of traditional RS codes, local decoding is used to reduce cross-AZ traffic. We use the (12,4) RS code in Figure 13 as an example to demonstrate the engineering application of this technique.

Figure 13 Schematic diagram of local decoding of RS codes in multi-AZ scenarios

Let the data layout be as shown in Figure 13. When the data block D1 fails, a simple gather-and-reconstruct repair generates cross-AZ traffic of 5 to 8 data blocks (depending on the distribution of the surviving blocks involved in the repair). Such a simple repair process obviously introduces huge cross-AZ traffic, and this cost can be cut.

We use the linearity of RS encoding to split the repair equation of the damaged data into multiple sub-equations, each corresponding only to computation within one AZ; only the intermediate parity data computed by each sub-equation is then transmitted across AZs. Taking Figure 13 as an example, suppose the linear expression of D1 is:

D_1 = \sum_{i=4}^7 \alpha_{1,i}*D_i +\sum_{i=0}^3 \beta_{1,i}*P_i

Then we only need to compute the first component of the formula in AZ1 and the second component in AZ2, send them over the network to the new node in AZ0, and add the two components to restore the damaged data D1. With this implementation, we reduced RS cross-AZ repair traffic from 5~8 blocks to 2 blocks.
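
Each per-AZ sub-equation is just a GF(2^8) linear combination of the shards stored in that AZ; a minimal sketch, assuming the gfMul helper from the sketch in Section 3.2.2:

// partialSum computes one AZ-local component of the repair equation:
// the linear combination of that AZ's surviving shards with the
// decoding coefficients assigned to them.
func partialSum(coeffs []byte, shards [][]byte) []byte {
	out := make([]byte, len(shards[0]))
	for i, s := range shards {
		for j, v := range s {
			out[j] ^= gfMul(coeffs[i], v)
		}
	}
	return out
}

The destination node in AZ0 then restores D1 by XORing the two intermediate blocks received from AZ1 and AZ2.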

At the implementation level, we provide the corresponding function PartialReconstruct() for partial decoding; the intermediate parity block can be computed by passing in the specified parameters.

// PartialReconstruct is specifically designed for reconstruction
// optimization, which supports partial decoding with the
// selected survival shards in an AZ.
//
// Assume we need D0 = x1*D1 + x2*D2 + x3*D3 + x4*P0 to reconstruct
// D0, while D1, D2 are stored in AZ0 and D3, P0 are stored in AZ1.
// Using PartialReconstruct() allows the upper-level application to
// compute the intermediate parity from the specified shards.
// For example, if we pass (shards[][], [D1,D2,D3,P0], [D0]) to
// PartialReconstruct() and shards[][] only contains D1 and D2, this
// function will only compute x1*D1 + x2*D2 and ADD it into D0.
// P.S. [D1,D2,D3,P0] is used to determine the decoding matrix.
//
// The length of shards must equal the total number of shards.
// You indicate that a shard is missing by setting it to nil.
func PartialReconstruct(shards [][]byte, survivalIdx, badIdx []int) error

4.4 Summary

From the perspective of engineering implementation and code maintenance, CubeFS-LRC is a very simple scheme: it only needs to reuse the RS encoding engine for two passes, and it solves the cross-AZ traffic problem of single-disk RS repair well. However, unlike the global parity blocks, the local parity blocks introduced by CubeFS-LRC can only be used to optimize bad-disk repair inside an AZ and provide no additional fault tolerance; the maximum number of arbitrary bad disks CubeFS-LRC can tolerate equals only the number of global parity blocks. And if erasure code update operations must be implemented, CubeFS-LRC bears more overhead on data updates.

If you want to introduce local parity blocks that also contribute fault tolerance while still solving the cross-AZ single-disk repair traffic problem, then Azure-LRC+1 is a better choice than CubeFS-LRC. In addition, because data blocks and parity blocks have equal status in Azure-LRC+1, the overheads of repair and update are roughly the same, which is more conducive to the system's load balance.

If you are unwilling to bear the extra storage overhead of the local parity blocks introduced by LRC encoding and want to minimize cross-AZ repair traffic in bad-disk scenarios, you can consider traditional RS codes supplemented with local decoding as a compromise, finding a business-acceptable balance between storage overhead and repair traffic.

References

  1. I.S. Reed and G. Solomon, "Polynomial Codes over Certain Finite Fields." Journal of the Society for Industrial and Applied Mathematics, 1960, 8(2):300-304.
  2. J.S. Plank and Y. Ding, "Note: Correction to the 1997 Tutorial on Reed-Solomon Coding." Software: Practice and Experience, 2005, 35(2):189-194.
  3. L. Rizzo, "Effective Erasure Codes for Reliable Computer Communication Protocols." ACM SIGCOMM Computer Communication Review, 1997, 27(2):24-36.
  4. J. Blomer, M. Kalfane, M. Karpinski, et al., "An XOR-based Erasure-resilient Coding Scheme." Technical Report TR-95-048, International Computer Science Institute, 1995.
  5. J.S. Plank and Lihao Xu, "Optimizing Cauchy Reed-Solomon Codes for Fault-Tolerant Network Storage Applications." Fifth IEEE International Symposium on Network Computing and Applications, 2006, pp. 173-180.
  6. Y. Uezato, "Accelerating XOR-Based Erasure Coding using Program Optimization Techniques." International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1-15.
  7. S. Mitra, R. Panta, M. Ra, and S. Bagchi, "Partial-Parallel-Repair (PPR): A Distributed Technique for Repairing Erasure Coded Storage." Eleventh European Conference on Computer Systems, 2016, pp. 1-16.
  8. S. Lin, G. Gong, Z. Shen, "Boosting Full-Node Repair in Erasure-Coded Storage." USENIX Annual Technical Conference, 2021, pp. 641-655.
  9. A. Dimakis, P. Godfrey, Y. Wu, M. Wainwright, and K. Ramchandran, "Network Coding for Distributed Storage Systems." IEEE Transactions on Information Theory, 2010, 56(9):4539-4551.
  10. V. Cadambe, A. Mazumdar, "An Upper Bound on the Size of Locally Recoverable Codes." International Symposium on Network Coding, 2013, pp. 1-5.
  11. S. Goparaju, I. Tamo, R. Calderbank, "An Improved Sub-packetization Bound for Minimum Storage Regenerating Codes." IEEE Transactions on Information Theory, 2014, 60(5):2770-2779.
  12. B. Sasidharan, G.K. Agarwal, P.V. Kumar, "A High-rate MSR Code with Polynomial Sub-packetization Level." IEEE International Symposium on Information Theory, 2015, pp. 2051-2055.
  13. M. Ye, A. Barg, "Explicit Constructions of Optimal-access MDS Codes with Nearly Optimal Sub-packetization." IEEE Transactions on Information Theory, 2017, 63(10):6307-6317.
  14. I. Tamo, Z. Wang, J. Bruck, "Zigzag Codes: MDS Array Codes with Optimal Rebuilding." IEEE Transactions on Information Theory, 2013, 59(3):1597-1616.
  15. K.V. Rashmi, P. Nakkiran, J. Wang, et al., "Having Your Cake and Eating It Too: Jointly Optimal Erasure Codes for I/O, Storage and Network-bandwidth." The 13th USENIX Conference on File and Storage Technologies, 2015, pp. 81-94.
  16. K. Kralevska, D. Gligoroski, R.E. Jensen, et al., "HashTag Erasure Codes: From Theory to Practice." IEEE Transactions on Big Data, 2018, 4(4):516-529.
  17. L. Pamies-Juarez, F. Blagojević, R. Mateescu, et al., "Opening the Chrysalis: On the Real Repair Performance of MSR Codes." The 14th USENIX Conference on File and Storage Technologies, 2016, pp. 81-94.
  18. M. Vajha, V. Ramkumar, B. Puranik, et al., "Clay Codes: Moulding MDS Codes to Yield an MSR Code." The 16th USENIX Conference on File and Storage Technologies, 2018, pp. 139-153.
  19. Y. Shan, K. Chen, T. Gong, et al., "Geometric Partitioning: Explore the Boundary of Optimal Erasure Code Repair." The ACM SIGOPS 28th Symposium on Operating Systems Principles, 2021, pp. 457-471.
  20. X. Li, K. Cheng, P.P.C. Lee, et al., "ParaRC: Embracing Sub-Packetization for Repair Parallelization in MSR-Coded Storage." The 21st USENIX Conference on File and Storage Technologies, 2023, pp. 17-31.
  21. C. Huang, H. Simitci, Y. Xu, et al., "Erasure Coding in Windows Azure Storage." USENIX Annual Technical Conference, 2012.
  22. I. Tamo and A. Barg, "A Family of Optimal Locally Recoverable Codes." IEEE International Symposium on Information Theory, 2014, pp. 686-690.
  23. M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, et al., "XORing Elephants: Novel Erasure Codes for Big Data." 39th International Conference on Very Large Data Bases, 2013.
  24. O. Kolosov, G. Yadgar, M. Liram, et al., "On Fault Tolerance, Locality, and Optimality in Locally Repairable Codes." ACM Transactions on Storage, 2020, 16(2):1-32.
  25. Y. Hu, L. Cheng, Q. Yao, et al., "Exploiting Combined Locality for Wide-Stripe Erasure Coding in Distributed Storage." 19th USENIX Conference on File and Storage Technologies, 2021, pp. 233-248.
  26. Z. Shen, J. Shu, P.P.C. Lee, "Reconsidering Single Failure Recovery in Clustered File Systems." 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2016, pp. 323-334.
  27. S. Kadekodi, S. Silas, D. Clausen, et al., "Practical Design Considerations for Wide Locally Recoverable Codes (LRCs)." 21st USENIX Conference on File and Storage Technologies, 2023, pp. 1-16.