Ring File Formats¶
The ring is the most important data structure in Swift. How this data structure been serialized to disk has changed over the years.
Initially ring files contain three key pieces of information:
the part_power value (often stored as
part_shift := 32 - part_power
)which determines how many partitions are in the ring,
the device list
which includes all the disks participating in the ring, and
the replica-to-part-to-device table
which has all
replica_count * (2 ** part_power)
partition assignments.
But the ability to extend the serialization format to add more data structures to the ring serialization format has meant a new ring v2 format has been created.
Ring files have always been gzipped when serialized, though the inner, raw format has evolved over the years.
Ring v0¶
Initially, rings were simply pickle dumps of the RingData object. With Swift 1.3.0, this changed to pickling a pure-stdlib data structure, but the core concept was the same.
Ring v1¶
Pickle presented some problems, however. While there are security concerns around unpickling untrusted data, security boundaries are generally drawn such that rings are assumed to be trusted. Ultimately, what pushed us to a new format were performance considerations.
Starting in Swift 1.7.0, Swift began using a new format (while still being willing to read the old one). The new format starts with some magic so we may identify it as such:
+---------------+-------+
|'R' '1' 'N' 'G'| <vrs> |
+---------------+-------+
where <vrs>
is a network-order two-byte version number (which is always 1).
After that, a JSON object is serialized as:
+---------------+-------...---+
| <data-length> | <data ... > |
+---------------+-------...---+
where <data-length>
is the network-order four-byte length (in bytes) of
<data>
, which is the ASCII-encoded JSON-serialized object. This object
has at minimum three keys:
devs
for the device listpart_shift
(i.e.,32 - part_power
)replica_count
for the integer number of part-to-device rows to read
The replica-to-part-to-device table then follows:
+-------+-------+...+-------+-------+
| <dev> | <dev> |...| <dev> | <dev> |
+-------+-------+...+-------+-------+
| <dev> | <dev> |...| <dev> | <dev> |
+-------+-------+...+-------+-------+
| ... |
+-------+-------+...+-------+-------+
| <dev> | <dev> |...|
+-------+-------+...+
Each <dev>
is a host-order two-byte index into the devs
list. Every row
except the last has exactly 2 ** part_power
entries; the last row may
have the same or fewer.
The metadata object has proven quite versatile: new keys have been added to provide additional information while remaining backwards-compatible. In order, the following new fields have been added:
byteorder
specifies whether the host-order for the replica-to-part-to-device table is “big” or “little” endian. Added in Swift 2.12.0, this allows rings written on big-endian machines to be read on little-endian machines and vice-versa.next_part_power
indicates whether a partition-power increase is in progress. Added in Swift 2.15.0, this will have one of two values, if present: the ring’s currentpart_power
, indicating that there may be hardlinks to clean up, orpart_power + 1
indicating that hardlinks may need to be created. See the documentation for more information.version
specifies the version number of the ring-builder that was used to write this ring. Added in Swift 2.24.0, this allows the comparing of rings from different machines to determine which is newer.
Ring v2¶
The way that v1 rings dealt with fractional replicas made it impossible to reliably serialize additional large data structures after the replica-to-part-to-device table. The v2 format has been designed to be extensable.
The new format starts with magic similar to v1:
+---------------+-------+
|'R' '1' 'N' 'G'| <vrs> |
+---------------+-------+
where <vrs> is again a network-order two-byte version number (which is now 2). By bumping the version number, we ensure that old versions of Swift refuse to read the ring, rather than misinterpret the content.
After that, a series of BLOBs are serialized, each as:
+-------------------------------+-------...---+
| <data-length> | <data ... > |
+-------------------------------+-------...---+
where <data-length>
is the network-order eight-byte length (in bytes) of
<data>
. Each BLOB is preceded by a Z_FULL_FLUSH
to allow it to be
decompressed without reading the whole file.
The order of the BLOBs isn’t important, although they do tend to be written in the order Swift will read them while loading. This reduces the disk seeks necessary to load.
The final BLOB is an index: a JSON object mapping named sections to an array of offsets within the file, like
{
section: [
compressed start,
uncompressed start,
compressed end,
uncompressed end,
checksum method,
checksum value
],
...
}
Section names may be arbitrary strings, but the “swift/” prefix is reserved
for upstream use. The start/end values mark the beginning and ending of the
section’s BLOB. Note that some end values may be null
if they were not
known when the index was written – in particular, this will be true for
the index itself. The checksum method should be one of "md5"
, "sha1"
,
"sha256"
, or "sha512"
; other values will be ignored in anticipation
of a need to support further algorithms. The checksum value will be the
hex-encoded digest of the uncompressed section’s bytes. Like end values,
checksum data may be null
if not known when the index is written.
Finally, a “tail” is written:
the gzip stream is flushed with another
Z_FULL_FLUSH
,the stream is switched to uncompressed,
the eight-byte offset of the uncompressed start of the index is written,
the gzip stream is flushed with another
Z_FULL_FLUSH
,the eight-byte offset of the compressed start of the index is written,
the gzip stream is flushed with another
Z_FULL_FLUSH
, andthe gzip stream is closed; this involves:
flushing the underlying deflate stream with
Z_FINISH
writing
CRC32
(of the full uncompressed data)writing
ISIZE
(the length of the full uncompressed datamod 2 ** 32
)
By switching to uncompressed, we can know exactly how many bytes will be written in the tail, so that when reading we can quickly seek to and read the index offset, seek to the index start, and read the index. From there we can do similar things for any other section.
Seek to the end of the file
Go back 31 bytes in the underlying file; this should leave us at the start of the deflate block containing the offset for the compressed start
Decompress 8 bytes from the deflate stream to get the location of the compressed start of the index BLOB
Seek to that location
Read/decompress the size of the index BLOB
Read/decompress the json serialized index.
Note
This 31 bytes is the deflate block containing the 8 byte location,
a Z_FULL_FLUSH
block, the Z_FINISH
block, and the CRC32
and
ISIZE
. For more information, see RFC 1951 (for the deflate stream)
and RFC 1952 (for the gzip format).
The currently defined section and section names upstream are as follows:
swift/index
- The swift indexswift/ring/metadata
- Ring metadata serialized as jsonswift/ring/devices
- Devices json serialized data structure.This has been seperated from the ring metadata structure in v1 as it gets large
swift/ring/assignments
- The ring replica2part2dev_id data structure
Note
Third-parties may find it useful to add their own sections; however,
the swift/
prefix is reserved for future upstream enhancements.
swift/ring/metadata¶
This BLOB is an ASCII-encoded JSON object full of metadata, similar to v1 rings. It has the following required keys:
part_shift
dev_id_bytes
specifies the number of bytes used for each<dev>
in the replica-to-part-to-device table; will be one of 2, 4, or 8
Additionally, there are several optional keys which may be present:
next_part_power
version
Notice that two keys are no longer present: replica_count
is no longer
needed as the size of the replica-to-part-to-device table is explicit, and
byteorder
is not needed as all data in v2 rings should be written using
network-order.
swift/ring/devices¶
This BLOB contains a list of swift device dictionarys. And was seperated out from the metadata BLOB as this can become a large structure in it’s own right.
swift/ring/assignments¶
This BLOB is the replica-to-part-to-device table. It’s length will be
replicas * (2 ** part_power) * dev_id_bytes
, where replicas
is the exact
(potentially fractional) replica count for the ring. Unlike in v1, each
<dev>
is written using network-order.
Note that this is why we increased the size of <data-length>
as compared to
the v1 format – otherwise, we may not be able to represent rings with both
high replica_count
and high part_power
.