RAID TECHNOLOGY
RAID (redundant array of independent disks,
originally redundant array of inexpensive disks) is a storage technology
that combines multiple disk
drive components into a logical unit.
Data is distributed across the drives in one of
several ways called "RAID levels", depending on the level of redundancy and performance required.
The term "RAID" was first defined by David Patterson, Garth A. Gibson,
and Randy Katz at the University of California, Berkeley in 1987.
Marketers representing industry RAID manufacturers
later attempted to reinvent the term to describe a redundant array of
independent disks as a means of disassociating a low-cost expectation from
RAID technology.
RAID is now used as an umbrella term for computer data storage schemes that can divide and replicate data among multiple physical drives. RAID is an example of storage virtualization, and the array is accessed by the operating system as
one single drive.
The different
schemes or architectures are named by the word RAID followed by a number (e.g.
RAID 0, RAID 1). Each scheme provides a different balance between the
key goals: reliability and availability, performance, and capacity. RAID
levels greater than RAID 0 provide protection against unrecoverable
(sector) read errors, as well as whole disk failure.
History
Norman Ken Ouchi at IBM was awarded a 1978 U.S. patent 4,092,732 titled "System for recovering data
stored in failed memory unit." The claims for this
patent describe what would later be termed RAID 5 with full stripe writes. This
1978 patent also mentions that drive mirroring or duplexing (what would later
be termed RAID 1) and protection with dedicated parity (that would later
be termed RAID 4) were prior
art at that time.
The term RAID was first defined by David A. Patterson, Garth A. Gibson and
Randy Katz at the
University of California, Berkeley, in 1987. They studied the possibility of
using two or more drives to appear as a single device to the host system and
published a paper: "A Case for Redundant Arrays of Inexpensive Disks
(RAID)" in June 1988 at the SIGMOD conference.
Standard levels
A number of standard schemes have evolved which are
referred to as levels. There were five RAID levels originally conceived,
but many more variations have evolved, notably several nested levels and many non-standard levels (mostly proprietary). RAID levels and their associated data formats
are standardized by the Storage Networking Industry Association
(SNIA) in the Common RAID Disk Drive Format (DDF) standard:
RAID 0
RAID 0 (block-level striping without parity or mirroring) has no
(or zero) redundancy. It provides improved performance and additional storage
but no fault tolerance. Any drive failure destroys the array, and the
likelihood of failure increases with more drives in the array.
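As a rough illustration of block-level striping (not the on-disk layout of any particular controller), the following Python sketch shows how logical blocks might be distributed round-robin across the member drives of a RAID 0 array.

# Minimal sketch of RAID 0 block placement (illustrative only): logical
# blocks are distributed round-robin across the member drives.

def raid0_locate(logical_block: int, num_drives: int) -> tuple[int, int]:
    """Return (drive index, block offset on that drive) for a logical block."""
    return logical_block % num_drives, logical_block // num_drives

# Example: with 3 drives, logical blocks 0..5 land on drives 0, 1, 2, 0, 1, 2.
for block in range(6):
    drive, offset = raid0_locate(block, 3)
    print(f"logical block {block} -> drive {drive}, offset {offset}")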
RAID 1
In RAID 1 (mirroring without parity or
striping), data is written identically to two drives, thereby producing a
"mirrored set"; the read request is serviced by either of the two
drives containing the requested data, whichever one involves least seek time plus rotational latency. Similarly, a write request
updates the stripes of both drives. The write performance depends on the slower
of the two writes (i.e. the one that involves larger seek time and rotational latency). At least two drives are
required to constitute such an array. While more constituent drives may be
employed, many implementations deal with a maximum of only two. The array
continues to operate as long as at least one drive is functioning.
RAID 2
In RAID 2 (bit-level striping with
dedicated Hamming-code parity), all disk spindle rotation is synchronized, and
data is striped such that each sequential bit is on a different drive. Hamming-code parity
is calculated across corresponding bits and stored on at least one parity
drive. This theoretical RAID level is not used in practice.
RAID 3
In RAID 3 (byte-level striping with
dedicated parity), all disk spindle rotation is synchronized, and data are
striped so each sequential byte
is on a different drive. Parity is calculated across corresponding bytes and
stored on a dedicated parity drive. Although implementations exist, RAID 3
is not commonly used in practice.
RAID 4
RAID 4 (block-level striping with
dedicated parity) is equivalent to RAID 5 (see below) except that all
parity data are stored on a single drive. In this arrangement files may be
distributed between multiple drives. Each drive operates independently,
allowing I/O requests to be performed in parallel.
RAID 4 was previously used primarily by NetApp, but has now been largely replaced
by an implementation of RAID 6 (RAID-DP).
RAID 5
RAID 5 (block-level striping with
distributed parity) distributes parity along with the data and requires all
drives but one to be present to operate; the array is not destroyed by a single
drive failure. Upon drive failure, any subsequent reads can be calculated from
the distributed parity such that the drive failure is masked from the end user.
RAID 5 requires at least three disks.[5]
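The following Python sketch illustrates one possible way parity can be rotated across the drives of a RAID 5 array; the rotation convention shown is only an example, and real controllers and software implementations differ in their layout details.

# Illustrative sketch of how RAID 5 distributes (rotates) parity across drives.

def raid5_stripe_layout(stripe: int, num_drives: int) -> list[str]:
    """Return the role of each drive in a stripe: 'P' for the parity block,
    or 'D<k>' for the k-th data block of that stripe."""
    parity_drive = (num_drives - 1 - stripe) % num_drives  # rotate parity
    roles, data_index = [], 0
    for drive in range(num_drives):
        if drive == parity_drive:
            roles.append("P")
        else:
            roles.append(f"D{data_index}")
            data_index += 1
    return roles

# Example: four stripes on a 4-drive RAID 5; the parity block moves each stripe.
for s in range(4):
    print(f"stripe {s}: {raid5_stripe_layout(s, 4)}")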
RAID 6
RAID 6 (block-level striping with double
distributed parity) provides fault tolerance up to two failed drives. This
makes larger RAID groups more practical, especially for high-availability
systems. This becomes increasingly important as large-capacity drives lengthen
the time needed to recover from the failure of a single drive. Like
RAID 5, a single drive failure results in reduced performance of the
entire array until the failed drive has been replaced and the associated data
rebuilt.[5]
RAID 10
In RAID 10, often referred to as RAID 1+0
(mirroring and striping), data is written in stripes across primary disks that
have been mirrored to the secondary disks.
Comparison
The following table provides an overview of the most
important parameters of standard RAID levels. In each case:
- Array space efficiency is given as an expression in terms of the number of drives, n; this expression designates a fractional value between zero and one, representing the fraction of the sum of the drives' capacities that is available for use. For example, if three drives are arranged in RAID 3, this gives an array space efficiency of 1 − 1/n = 2/3 (approximately 66%); thus, if each drive in this example has a capacity of 250 GB, then the array has a total capacity of 750 GB but the capacity that is usable for data storage is only 500 GB.
- Array failure rate is given as an expression in terms of the number of drives, n, and the drive failure rate, r (which is assumed to be identical and independent for each drive). For example, if each of three drives has a failure rate of 5% over the next three years, and these drives are arranged in RAID 3, then this gives an array failure rate of ½n(n − 1)r² = ½ · 3 · 2 · (5%)² ≈ 0.75% over the next three years.
Level | Description | Minimum # of drives** | Space efficiency | Fault tolerance | Array failure rate*** | Read performance | Write performance
RAID 0 | Block-level striping without parity or mirroring | 2 | 1 | 0 (none) | 1 − (1 − r)ⁿ | nX | nX
RAID 1 | Mirroring without parity or striping | 2 | 1/n | n − 1 drives | rⁿ | nX | 1X
RAID 2 | Bit-level striping with dedicated Hamming-code parity | 3 | 1 − 1/n · log₂(n − 1) | Can recover from one drive failure, or repair corrupt data or parity when a corrupted bit's corresponding data and parity are good | Variable | Variable | Variable
RAID 3 | Byte-level striping with dedicated parity | 3 | 1 − 1/n | 1 drive | ½n(n − 1)r² | (n − 1)X | (n − 1)X*
RAID 4 | Block-level striping with dedicated parity | 3 | 1 − 1/n | 1 drive | ½n(n − 1)r² | (n − 1)X | (n − 1)X*
RAID 5 | Block-level striping with distributed parity | 3 | 1 − 1/n | 1 drive | ½n(n − 1)r² | (n − 1)X* | (n − 1)X*
RAID 6 | Block-level striping with double distributed parity | 4 | 1 − 2/n | 2 drives | ⅙n(n − 1)(n − 2)r³ | (n − 2)X* | (n − 2)X*
RAID 1+0 | Mirroring without parity, and block-level striping | 4 | 2/n | 1 drive per span**** | — | nX | (n/2)X

* Assumes the hardware is fast enough to support the required parity calculations
** Assumes a non-degenerate minimum number of drives
*** Assumes an independent, identical rate of failure amongst drives
**** RAID 10 can only lose 1 drive per span, up to a maximum of n/2 drives
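For readers who prefer code, the space-efficiency and approximate failure-rate expressions from the table above can be evaluated with a short Python sketch. It assumes, as the table's footnotes do, an independent and identical failure rate r for each of n drives, and it only covers the closed-form entries.

# Space efficiency and approximate array failure rate, per the table above.
from math import comb

def space_efficiency(level: str, n: int) -> float:
    return {
        "RAID0": 1.0,
        "RAID1": 1.0 / n,
        "RAID3": 1 - 1.0 / n,
        "RAID4": 1 - 1.0 / n,
        "RAID5": 1 - 1.0 / n,
        "RAID6": 1 - 2.0 / n,
    }[level]

def array_failure_rate(level: str, n: int, r: float) -> float:
    # Approximations from the table: the parity levels use the leading term of
    # the probability that more drives fail than the level can tolerate.
    if level == "RAID0":
        return 1 - (1 - r) ** n          # any single failure destroys the array
    if level == "RAID1":
        return r ** n                    # all mirrored drives must fail
    if level in ("RAID3", "RAID4", "RAID5"):
        return comb(n, 2) * r ** 2       # ½·n·(n−1)·r², i.e. two failures
    if level == "RAID6":
        return comb(n, 3) * r ** 3       # ⅙·n·(n−1)·(n−2)·r³, i.e. three failures
    raise ValueError(level)

# The worked example from the text: three 250 GB drives in RAID 3, r = 5%.
print(space_efficiency("RAID3", 3) * 3 * 250)   # ~500 GB usable
print(array_failure_rate("RAID3", 3, 0.05))     # ~0.0075, i.e. ~0.75%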
Nested (hybrid) RAID
In what was originally termed hybrid RAID,
many storage controllers allow RAID levels to be nested. The elements of a RAID
may be either individual drives or RAIDs themselves. However, if a RAID is
itself an element of a larger RAID, it is unusual for its elements to be
themselves RAIDs.
As there is no basic RAID level numbered larger than
9, nested RAIDs are usually clearly described by attaching the numbers
indicating the RAID levels, sometimes with a "+" in between. The
order of the digits in a nested RAID designation is the order in which the
nested array is built: For a RAID 1+0, drives are first combined into
multiple level 1 RAIDs that are themselves treated as single drives to be
combined into a single RAID 0; the reverse structure is also possible
(RAID 0+1).
The final RAID is known as the top array. When the
top array is a RAID 0 (such as in RAID 1+0 and RAID 5+0), most
vendors omit the "+" (yielding RAID 10 and RAID 50, respectively).
- RAID 0+1: striped sets in a mirrored set (minimum four drives; even number of drives) provides fault tolerance and improved performance but increases complexity.
The key difference from RAID 1+0
is that RAID 0+1 creates a second striped set to mirror a primary striped
set. The array continues to operate with one or more drives failed in the same
mirror set, but if drives fail on both sides of the mirror the data on the RAID
system is lost.
- RAID 1+0: (a.k.a. RAID 10) mirrored sets in a striped set (minimum four drives; even number of drives) provides fault tolerance and improved performance but increases complexity.
The key difference from RAID 0+1
is that RAID 1+0 creates a striped set from a series of mirrored drives.
The array can sustain multiple drive losses so long as no mirror loses all its
drives; the sketch following this list illustrates the difference.
- RAID 5+3: mirrored striped set with distributed parity (some manufacturers label this as RAID 53)
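To make the difference concrete, here is a minimal, hypothetical Python sketch of how the two layouts respond to drive failures, assuming four drives A–D arranged as two 2-way mirrors (for RAID 1+0) or two 2-drive stripes (for RAID 0+1); the drive names and groupings are illustrative only.

# Hypothetical four-drive example contrasting RAID 1+0 and RAID 0+1 fault tolerance.

def raid10_survives(failed: set[str]) -> bool:
    # RAID 1+0: mirror pairs (A,B) and (C,D) are striped together.
    # The array survives as long as no mirror pair loses both of its drives.
    return not ({"A", "B"} <= failed or {"C", "D"} <= failed)

def raid01_survives(failed: set[str]) -> bool:
    # RAID 0+1: stripes (A,B) and (C,D) are mirrored against each other.
    # The array survives only if at least one whole stripe is intact.
    return not (failed & {"A", "B"}) or not (failed & {"C", "D"})

# Losing one drive from each mirror pair: RAID 1+0 survives, RAID 0+1 does not.
print(raid10_survives({"A", "C"}))  # True
print(raid01_survives({"A", "C"}))  # False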
RAID parity
Many RAID levels employ an error protection scheme
called "parity", a widely used method in information technology to
provide fault tolerance in a given set of data. Most use the simple XOR parity
described in this section, but RAID 6 uses two separate parities based
respectively on addition and multiplication in a particular Galois Field or Reed–Solomon error correction.
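As an illustration of the XOR parity just described, the following minimal Python sketch computes a parity block for one stripe and rebuilds a lost data block from the survivors; the block contents and stripe size are arbitrary examples, not any particular implementation.

# The parity block is the bytewise XOR of the data blocks, so any single
# missing block can be reconstructed by XOR-ing the parity with the survivors.

def xor_blocks(*blocks: bytes) -> bytes:
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

data = [b"AAAA", b"BBBB", b"CCCC"]   # three data blocks in one stripe
parity = xor_blocks(*data)           # stored on the parity drive

# Simulate losing the second data block and rebuilding it from the others.
rebuilt = xor_blocks(data[0], data[2], parity)
assert rebuilt == data[1]
print("reconstructed block:", rebuilt)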
Non-standard levels
Many configurations other than the basic numbered
RAID levels are possible, and many companies, organizations, and groups have
created their own non-standard configurations, in many cases designed to meet
the specialized needs of a small niche group. Most non-standard RAID levels are
proprietary:
- Linux MD RAID10 (RAID 10) implements a general RAID driver that defaults to a standard RAID 1 with two drives, and a standard RAID 1+0 with four drives, but can have any number of drives, including odd numbers. MD RAID 10 can run striped and mirrored, even with only two drives with the f2 layout (mirroring with striped reads, giving the read performance of RAID 0; normal Linux software RAID 1 does not stripe reads, but can read in parallel).
- Hadoop has a RAID system that generates a parity file by xor-ing a stripe of blocks in a single HDFS file.
Data backup
A RAID system used as secondary storage is not an alternative to backing up data. In RAID levels > 0, a
RAID protects from catastrophic data loss caused by physical damage or errors
on a single drive within the array (or two drives in, say, RAID 6).
However, a true backup system has other important features such as the ability
to restore an earlier version of data, which is needed both to protect against software errors that write unwanted data to
secondary storage, and also to recover from user error and
malicious data deletion.
A RAID can be overwhelmed by catastrophic failure
that exceeds its recovery capacity and, of course, the entire array is at risk
of physical damage by fire, natural disaster, and human forces, while backups
can be stored off-site.
A RAID is also vulnerable to controller failure
because it is not always possible to migrate a RAID to a new, different
controller without data loss.
Implementations
The distribution of data across multiple drives can
be managed either by dedicated computer hardware
or by software. A software solution may be part of
the operating system, or it may be part of the firmware and drivers supplied
with a hardware RAID controller.
Software-based RAID
Software RAID implementations are now provided by
many operating systems. Software RAID can be
implemented as:
- A layer that abstracts multiple devices, thereby providing a single virtual device (e.g. Linux's md)
- A more generic logical volume manager (provided with most server-class operating systems, e.g. Veritas or LVM)
- A component of the file system (e.g. ZFS or Btrfs)
Volume manager support
Server-class operating systems typically provide logical volume management, which allows a system to use
logical volumes that can be resized or moved. Often, features such as RAID or
snapshots are also supported.
- Vinum is a logical volume manager supporting RAID 0, RAID 1, and RAID 5. Vinum is part of the base distribution of the FreeBSD operating system, and versions exist for NetBSD, OpenBSD, and DragonFly BSD.
- Solaris SVM supports RAID 1 for the boot filesystem, and adds RAID 0 and RAID 5 support (and various nested combinations) for data drives.
- Linux LVM supports RAID 0 and RAID 1.
- HP's OpenVMS provides a form of RAID 1 called "Volume shadowing", giving the possibility to mirror data locally and at remote cluster systems.
File-system support
Some advanced file systems are
designed to organize data across multiple storage devices directly (without
needing the help of a third-party logical volume manager).
- ZFS supports equivalents of RAID 0, RAID 1, RAID 5 (RAID Z), RAID 6 (RAID Z2), and a triple parity version RAID Z3, and any nested combination of those like 1+0. ZFS is the native file system on Solaris, and also available on FreeBSD.
- Btrfs supports RAID 0, RAID 1, and RAID 10 (RAID 5 and 6 are under development).
Operating-system support
Many operating systems provide basic RAID
functionality independently of volume management:
- Apple's OS X and OS X Server support RAID 0, RAID 1, and RAID 1+0.
- FreeBSD supports RAID 0, RAID 1, RAID 3, and RAID 5, and all nestings via GEOM modules and ccd.
- Linux's md supports RAID 0, RAID 1, RAID 4, RAID 5, RAID 6, and all nestings. Certain reshaping/resizing/expanding operations are also supported.
- Microsoft's server operating systems support RAID 0, RAID 1, and RAID 5. Some of the Microsoft desktop operating systems support RAID. For example, Windows XP Professional supports RAID level 0, in addition to spanning multiple drives, but only if using dynamic disks and volumes. Windows XP can be modified to support RAID 0, 1, and 5. Windows 8 and Windows Server 2012 introduce a RAID-like feature known as Storage Spaces, which also allows users to specify mirroring, parity, or no redundancy on a folder-by-folder basis.
- NetBSD supports RAID 0, 1, 4, and 5 via its software implementation, named RAIDframe.
Over time, the increase in commodity CPU speed has
been consistently greater than the increase in drive throughput; the percentage
of host CPU time required to saturate a given number of drives has decreased.
For instance, under 100% usage of a single core on a 2.1 GHz Intel
"Core2" CPU, the Linux software RAID subsystem (md) as of
version 2.6.26 is capable of calculating parity information at 6 GB/s;
however, a three-drive RAID 5 array using drives capable of sustaining a
write operation at 100 MB/s only requires parity to be calculated at the
rate of 200 MB/s, which requires the resources of just over 3% of a single
CPU core.
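A quick back-of-the-envelope check of those figures (using only the numbers quoted above):

# A three-drive RAID 5 writing at 100 MB/s per drive must generate parity for
# two data drives' worth of writes, i.e. 200 MB/s, against a ~6 GB/s XOR rate.
parity_rate_needed = 2 * 100   # MB/s of parity input for two data drives
md_xor_rate = 6000             # MB/s, the cited md parity-calculation rate
print(f"{parity_rate_needed / md_xor_rate:.1%} of one core")  # ~3.3%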
Another concern with software implementations is the
process of booting the associated operating system. For instance, consider a
computer being booted from a RAID 1 (mirrored drives); if the first drive
in the RAID 1 fails, then a first-stage boot loader
might not be sophisticated enough to attempt loading the second-stage
boot loader from the second drive as a fallback. The second-stage
boot loader for FreeBSD is capable of loading a kernel from a RAID 1.
Hardware-based RAID
Hardware RAID controllers use proprietary data
layouts, so it is not usually possible to span controllers from different
manufacturers. They do not require processor resources, the BIOS can boot from
them, and tighter integration with the device driver may offer better error handling.
On a desktop system, a hardware RAID controller may
be an expansion card connected to a bus (e.g. PCI or PCIe) or a component
integrated into the motherboard;
controllers exist for most types of drive technology, such as IDE/ATA, SATA, SCSI, SSA, Fibre Channel, and
sometimes even a combination. The controller and drives may be in a stand-alone
enclosure, rather
than inside a computer, and the enclosure may be directly attached to a computer, or connected via a SAN.
Firmware/driver-based RAID
A RAID implemented at the level of an operating
system is not always compatible with the system's boot process, and it is
generally impractical for desktop versions of Windows (as described above).
However, hardware RAID controllers are expensive and proprietary. To fill this
gap, cheap "RAID controllers" were introduced that do not contain a
dedicated RAID controller chip, but simply a standard drive controller chip
with special firmware and drivers; during early stage bootup, the RAID is
implemented by the firmware, and once the operating system has been more completely
loaded, then the drivers take over control. Consequently, such controllers may
not work when driver support is not available for the host operating system.
Data scrubbing / Patrol read
Data scrubbing is periodic reading and checking by
the RAID controller of all the blocks in a RAID, including those not otherwise
accessed. This allows bad blocks to be detected before they are used.
An alternate name for this is patrol read: a check for bad blocks on each storage device in an array that also uses the redundancy of the array to recover bad blocks on a single drive and to reassign the recovered data to spare blocks elsewhere on the drive.
Problems with RAID
Correlated failures
In practice, the drives are often the same age (with
similar wear) and subject to the same environment. Since many drive failures
are due to mechanical issues (which are more likely on older drives), this
violates the assumption of independent, identical failure rates; failures are in fact statistically correlated.[5] In
practice, the chance of a second failure before the first has been recovered
(causing data loss) is higher than the chance of random, independent failures would suggest. In a study
including about 100,000 drives, the probability of two drives in the same
cluster failing within one hour was observed to be four times larger than was
predicted by the exponential statistical distribution which characterizes
processes in which events occur continuously and independently at a constant
average rate. The probability of two failures within the same 10-hour period
was twice as large as that which was predicted by an exponential distribution.
A common assumption is that "server-grade"
drives fail less frequently than consumer-grade drives. Two independent studies
(one by Carnegie Mellon University and the other by Google) have shown that the
"grade" of a drive does not relate to the drive's failure rate.
Unrecoverable Read Errors (URE) during rebuild
Unrecoverable Read Errors present as sector read
failures. The UBE (Unrecoverable Bit Error) rate is typically specified at 1
bit in 10¹⁵ for enterprise-class drives (SCSI, FC, SAS), and 1 bit in 10¹⁴ for desktop-class drives
(IDE/ATA/PATA, SATA). Increasing drive capacities and large RAID 5
redundancy groups have led to an increasing inability to successfully rebuild a
RAID group after a drive failure because an unrecoverable sector is found on
the remaining drives. Parity schemes such as RAID 5 when rebuilding are
particularly prone to the effects of UREs as they will affect not only the
sector where they occur but also reconstructed blocks using that sector for
parity computation; typically a URE during a RAID 5 rebuild will lead to
a complete rebuild failure.
Double-protection schemes such as RAID 6 attempt to address this issue, but suffer from a very high write penalty.
Non-parity (mirrored) schemes such as RAID 10 have a lower risk from UREs.
Background scrubbing can be used to detect UREs, which would otherwise remain latent, and to compensate for them invisibly as a background process by reconstructing the affected data from the redundant RAID data and then re-writing and re-mapping it to a new sector, thereby reducing the risk of
double-failures to the RAID system.
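To see why UREs matter at this scale, here is a deliberately simplistic Python model that assumes independent bit errors at the quoted UBE rates; real failures are not independent, so the numbers illustrate the trend rather than a prediction.

# Chance that a RAID 5 rebuild reads every surviving bit without a URE,
# under a naive independent-bit-error model at a given UBE rate.

def rebuild_success_probability(surviving_drives: int, drive_tb: float,
                                bits_per_ure: float) -> float:
    bits_to_read = surviving_drives * drive_tb * 1e12 * 8
    return (1 - 1 / bits_per_ure) ** bits_to_read

# Example: rebuilding a 4-drive RAID 5 of 2 TB desktop drives (1 URE per 1e14
# bits) requires reading the 3 surviving drives in full.
print(rebuild_success_probability(3, 2.0, 1e14))   # roughly 0.62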
Recovery time is increasing
Drive capacity has grown at a much faster rate than
transfer speed, and error rates have only fallen a little in comparison.
Therefore, larger capacity drives may take hours, if not days, to rebuild. The
rebuild time is also constrained if the entire array is still in operation at
reduced capacity.[42] Given a
RAID with only one drive of redundancy (RAIDs 3, 4, and 5), a second failure
would cause complete failure of the array. Even though individual drives' mean time between failures (MTBF) has increased over time,
this increase has not kept pace with the increased storage capacity of the
drives. The time to rebuild the array after a single drive failure, as well as
the chance of a second failure during a rebuild, have increased over time.[43] Mirroring
schemes such as RAID 10 have a bounded recovery time, as they only require
copying the contents of a single failed drive, compared with parity schemes such as RAID 6,
which require reading all blocks of the drives in the array set. Triple
parity schemes, or triple mirroring, have been suggested as one approach to
improve resilience to an additional drive failure during this large rebuild
time.[44]
Atomicity: including parity inconsistency due to system crashes
A system crash or other interruption of a write
operation can result in states where the parity is inconsistent with the data
due to non-atomicity of the write process, such that the parity cannot be used
for recovery in the case of a disk failure (the so-called RAID 5 write hole).
This is a little-understood and rarely mentioned
failure mode for redundant storage systems that do not utilize transactional
features. Database researcher Jim Gray wrote "Update in Place is a
Poison Apple" during the early days of relational database
commercialization.
RAID write hole
The RAID write hole is a known data corruption issue
in older and low-end RAIDs, caused by interrupted destaging of writes to disk.
Write cache reliability
A concern about write cache reliability exists,
specifically regarding devices equipped with a write-back cache—a caching
system which reports the data as written as soon as it is written to cache, as
opposed to the non-volatile medium.
Drive error recovery algorithms
Many modern drives have internal error recovery
algorithms that can take upwards of a minute to recover and re-map data that
the drive fails to read easily. Frequently, a RAID controller is configured to
drop a component drive (that is, to assume a component drive has failed) if
the drive has been unresponsive for 8 seconds or so; this might cause the array
controller to drop a good drive because that drive has not been given enough
time to complete its internal error recovery procedure. Consequently, desktop
drives can be quite risky when used in a RAID, and so-called enterprise class
drives limit this error recovery time in order to obviate the problem.
A fix specific to Western Digital's desktop drives
used to be known: A utility called WDTLER.exe could limit a drive's error
recovery time; the utility enabled TLER (time limited error recovery), which
limits the error recovery time to 7 seconds. Around September 2009, Western
Digital disabled this feature in their desktop drives (e.g. the Caviar Black
line), making such drives unsuitable for use in a RAID.
However, Western Digital enterprise class drives are
shipped from the factory with TLER enabled. Similar technologies are used by
Seagate, Samsung, and Hitachi. Conversely, for non-RAID usage, an enterprise-class drive with a short error recovery timeout that cannot be changed is less suitable than a desktop drive.
In late 2010, the Smartmontools
program began supporting the configuration of ATA Error Recovery Control,
allowing the tool to configure many desktop class hard drives for use in a
RAID.
Scenarios other than disk failure
While RAID may protect against physical drive
failure, the data are still exposed to operator, software, hardware, and virus
destruction. Many studies cite operator fault as the most common source of
malfunction,[49] such as
a server operator replacing the incorrect drive in a faulty RAID, and disabling
the system (even temporarily) in the process.
RAID 5 in enterprise environments
Rebuilding a RAID 5 array after a failure adds
stress to all of the working drives, because every area on every
disk marked as "in use" must be read to rebuild the redundancy
that has been lost. If drives are close to failure, the stress of rebuilding
the array can be enough to cause another drive to fail before the rebuild has
been finished, and even more so if the server is still accessing the drives to
provide data to clients, users, applications, etc. Even without complete loss
of an additional drive during rebuild, an unrecoverable read error (URE) is
likely for large arrays which will typically lead to a failed rebuild. Thus, it
is during this rebuild of the "missing" drive that the entire
RAID 5 array is at risk of a catastrophic failure. The rebuild of an array
on a busy and large system can take hours and sometimes days. Therefore, it is
not surprising that, when systems need to be highly available and highly
reliable or fault tolerant, other levels, including RAID 6 or
RAID 10, are chosen.
With a RAID 6 array, using drives from multiple
sources and manufacturers, it is possible to mitigate most of the problems
associated with RAID 5. The larger the drive capacities and the larger the
array size, the more important it becomes to choose RAID 6 instead of
RAID 5. RAID 10 also minimises these problems.
As of August 2012, Dell, Hitachi, Seagate, Netapp,
EMC, HDS, SUN Fishworks and IBM have current advisories against the use of
RAID 5 with high capacity drives and in large arrays.
Non-RAID drive architectures
Non-RAID drive architectures also exist, and
are often referred to, similarly to RAID, by standard acronyms, several
tongue-in-cheek. A single drive is referred to as a SLED (Single Large
Expensive Disk/Drive), by contrast with RAID, while an array of drives without
any additional control (accessed simply as independent drives) is referred to,
even in a formal context such as equipment specification, as a JBOD (Just a
Bunch Of Disks). Simple concatenation is referred to as a "span".