
Wayne's New Rsync Protocol

Why?

The main reason I'm working on a new protocol is to address rsync's huge memory footprint on large file transfers. The new protocol handles this by scanning directories incrementally, which allows even a really big hierarchy of files to start transferring immediately, working on files in between adding new directories to the transfer.

Another design goal was to make a simple tool that could be controlled by an external process, in the traditional Unix spirit of combining simple tools to create more powerful solutions. The new protocol allows things like the often-proposed --file-list option right out of the box. It also allows files to be transferred in both directions over the same connection.

I'd also like to improve the preservation of hard-links. I have an idea for how to make this much more efficient than rsync, but I haven't implemented it yet.

Something else that I'd like to do is to provide better include/exclude control. I'm planning to use PCRE so the user can specify patterns as regular expressions, while also translating ordinary wildcard syntax into regexes so that patterns can still be given in a more basic form. Of course, this only comes into play if you're not controlling things yourself, file-by-file (in which case you can apply whatever exclude control you like).

End Result

The end result of my work will, I hope, be something that can either be integrated back into rsync as a next-generation file-transfer protocol or that can assist Martin Pool in the design of his new superlifter project.

Current Test-app Release

You'll find the current version of my rZync test-app here:

http://www.blorf.net/rZync-0.09.tar.bz2 (June 12, 2003)

Use the included "rs" perl script to drive the "rzync" executable, or code something that controls rzync yourself. See the README file for the list of available commands.

To build it, you should use at least version 0.9.6 of librsync, found via the download link on this page:

http://librsync.sourceforge.net/

The Protocol

Overview

The basic idea of the new protocol is fairly simple: send a message # (1 - 63), any associated data (from 0 - 32767 bytes), and an optional name-cache reference, which I refer to as a sequence number, or "SEQ".
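As a concrete sketch of that triple, here is a minimal illustration in Python; the Message class and its field names are my own invention, not rZync's actual structures:

```python
# Illustrative only: a message is a number (1-63), a data payload
# (0-32767 bytes), and an optional name-cache reference (SEQ).
from dataclasses import dataclass
from typing import Optional

NO_SEQ = None  # stand-in for "no name-cache reference"

@dataclass
class Message:
    msg: int                      # message number, 1-63
    data: bytes                   # associated data, 0-32767 bytes
    seq: Optional[int] = NO_SEQ   # optional name-cache reference ("SEQ")

    def __post_init__(self):
        if not 1 <= self.msg <= 63:
            raise ValueError("message number must be in 1..63")
        if not 0 <= len(self.data) <= 32767:
            raise ValueError("data must be 0..32767 bytes")

# e.g. a message referring to slave-side cache entry -12:
m = Message(msg=5, data=b"some-name", seq=-12)
```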

The name-cache keeps track of all directories and file names that we're currently processing, allowing the protocol to refer to each one by number. Unlike rsync's cache, the names are created incrementally during the run, and they also die after they are no longer needed (rsync creates the list at the start of processing, transfers it in its entirety over the connection, and keeps it around on both sides for the whole run).

The available messages handle the things you'd expect, such as being able to change directories, send info/error messages, start a file-transfer, create directories, delete files, create new cache-names, etc.

Message Interface

Some messages allow you to call them with either just a SEQ, or with a combination of a SEQ and a string. For instance, if you send a MSG_DELETE with just a SEQ, that cache-name item is deleted. However, if you include a string (which is what the data portion of the message is assumed to be), the name in the string is combined with the SEQ item (which must be a directory reference) to delete a file that is not in the name-cache.
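The two forms of MSG_DELETE described above can be sketched like this; the dict standing in for the name-cache and the file names are made up for illustration:

```python
# Hypothetical sketch: a SEQ alone deletes the cached item; a SEQ plus
# a string deletes an un-cached file inside the referenced directory.
name_cache = {7: "photos", 12: "photos/old.jpg"}
deleted = []

def msg_delete(seq, data=b""):
    if data:
        # SEQ must refer to a directory; the string names a file in it
        deleted.append(name_cache[seq] + "/" + data.decode())
    else:
        # no string: the cache-name item itself is deleted
        deleted.append(name_cache.pop(seq))

msg_delete(12)                # deletes the cached name photos/old.jpg
msg_delete(7, b"extra.jpg")   # deletes photos/extra.jpg, which isn't cached
```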

To the programmer, the communications interface is thus fairly simple: calls are made to send_msg() with a message #, a data-pointer/data-length pair, and either a name-cache sequence number or "NO_SEQ". These values show up on the other side as a call to the handle_msg() function with the same args. In between is a set of heuristics designed to make the message-byte overhead as small as possible (messages can be as small as 1 byte -- see the "Gory Details" discussion below).
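To make the pairing concrete, here is a toy round trip; the framing below is a deliberately naive stand-in, not the compact wire format described under "Gory Details":

```python
# Toy framing (NOT the real encoding): one byte for msg # plus a
# has-SEQ flag bit, two bytes of data length, optional 4-byte SEQ.
import struct

NO_SEQ = None

def send_msg(pipe, msg, data, seq=NO_SEQ):
    """Frame one message onto the pipe (a list standing in for a socket)."""
    has_seq = seq is not None
    header = struct.pack("!BH", msg | (0x80 if has_seq else 0), len(data))
    pipe.append(header + (struct.pack("!i", seq) if has_seq else b"") + data)

def handle_msg(pipe):
    """Decode the next message; the same (msg, data, seq) args reappear."""
    raw = pipe.pop(0)
    first, dlen = struct.unpack("!BH", raw[:3])
    msg, has_seq, rest = first & 0x3F, bool(first & 0x80), raw[3:]
    seq = struct.unpack("!i", rest[:4])[0] if has_seq else NO_SEQ
    return msg, rest[4:] if has_seq else rest, seq

pipe = []
send_msg(pipe, 7, b"dirname", seq=-3)
msg, data, seq = handle_msg(pipe)   # same triple pops out on the other side
```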

The Name Cache

The name-cache is one of the things that keeps the transferred-byte-count overhead down. It is fairly simple in concept, but has some extra complexities: the sides are separated by a communications delay, yet they need to be able to create unique SEQ IDs that are shared by both sides of the connection without using an elaborate (and slow) handshake to allocate them. My solution (and there may certainly be better ones -- let me know) is to give each side its own "namespace" (positive values for the master side, negative values for the slave side) and let each side create new IDs only in its own namespace, using an incrementing/decrementing 31-bit counter. After one side creates a new ID, the very next message sent by that side must either directly or indirectly cause the opposite side to note the creation of the same ID. This causes both sides to create the ID at the same point in the datastream without having to actually send the value of the ID over the connection (since both sides are counting from the same starting point). Both sides can thereafter send messages that refer to the just-created name by its SEQ value.
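The two-namespace allocation can be sketched in a few lines (the class name is my own; the point is just that the ID value itself never needs to cross the wire):

```python
# Each side allocates only in its own half of the ID space, so IDs can
# never collide: the master counts up from +1, the slave down from -1.
class SeqAllocator:
    def __init__(self, master):
        self.step = 1 if master else -1
        self.last = 0

    def next_seq(self):
        self.last += self.step
        return self.last

master_ids = SeqAllocator(master=True)    # yields 1, 2, 3, ...
slave_ids = SeqAllocator(master=False)    # yields -1, -2, -3, ...
# Each side also tracks the OTHER side's counter; when a message implies
# that a new name was created, the receiver bumps its copy of the
# creator's counter to the identical value -- no ID is ever transmitted.
```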

To handle the removal of IDs from the name-cache, a reference count is maintained. In a simple world, the creating side would control the deletion of its own names, decrementing the ref count and sending a message that tells the opposite side to do the same (thus assuring that the cache stays perfectly in sync). However, this would add extra "I'm done" messages to the protocol that are not strictly needed. What I have done instead is to allow either side to decrement the ref count on a name that has reached the end of that side's processing, but if the ref count reaches zero the name is not immediately freed. This allows the other side to "resurrect" the name in certain circumstances (e.g. if a resend is needed). The code keeps the names in a simple hash-like structure, and ages names out of a dead-list after an appropriate delay. While this means that the aging of old entries does not occur exactly at the same time on both sides, this does not have an adverse effect on the protocol (all we care about is that we don't discard the name too soon).
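A rough sketch of that ref-count/dead-list behavior (the class and method names are invented; the real code uses a hash-like structure):

```python
# Dropping the last reference moves a name to the dead-list instead of
# freeing it, so the other side can still "resurrect" it for a while.
class NameCache:
    def __init__(self, max_age=2):
        self.live = {}   # seq -> [name, refcount]
        self.dead = {}   # seq -> [name, age in "ticks"]
        self.max_age = max_age

    def add(self, seq, name):
        self.live[seq] = [name, 1]

    def unref(self, seq):
        entry = self.live[seq]
        entry[1] -= 1
        if entry[1] == 0:                    # last reference gone...
            del self.live[seq]
            self.dead[seq] = [entry[0], 0]   # ...but don't free it yet

    def resurrect(self, seq):
        name, _age = self.dead.pop(seq)      # e.g. a resend is needed
        self.live[seq] = [name, 1]

    def tick(self):
        """Age the dead-list; each side runs this on its own schedule,
        so the two sides discard old names at slightly different times."""
        for seq in list(self.dead):
            self.dead[seq][1] += 1
            if self.dead[seq][1] > self.max_age:
                del self.dead[seq]           # finally discard the name
```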

Directory-Contents Cache

When both sides are synchronizing the contents of a directory, the list of directory items is a useful resource to keep down the transferred byte-count. What we do is to save the sorted list of names for each currently-active directory and allow the receiver to refer to names by the directory's SEQ ID and the index of a name within the associated list.
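In other words, once both sides hold the same sorted name list for a directory, a file can be referred to by two small integers instead of its name string. A minimal sketch (names and helper functions invented):

```python
# Refer to a file as (dir SEQ, index into the directory's sorted list)
# rather than sending its name string over the connection.
import bisect

dir_cache = {}   # dir SEQ -> sorted list of names in that directory

def store_dir(seq, names):
    dir_cache[seq] = sorted(names)

def name_to_ref(seq, name):
    """Encode a name as (dir SEQ, index) -- two small integers."""
    return (seq, bisect.bisect_left(dir_cache[seq], name))

def ref_to_name(seq, index):
    return dir_cache[seq][index]

store_dir(4, ["b.txt", "a.txt", "c.txt"])
# "b.txt" is now just (4, 1) on the wire.
```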

UIDs and GIDs

Another shared data item is the mapping of UIDs and GIDs between systems. We use a similar positive/negative index value to store the IDs in a way that transfers uniquely (and compactly) between systems. A complicating factor arises when a name/group value does not exist on both sides. My current UID/GID code still needs to be improved to handle this better, and has not been properly tested (since I haven't yet run the app as root).

Multi-part Messages for File Transfers

Many of the messages are acted on immediately by the other side and that's the end of it. However, several messages start a sequence of events that cause data to be generated, processing to happen on the other side, and then data to be received back on the first side. For instance, the "start a signature" message that initiates a file transfer sends a "start delta" message to the other side, followed by all the signature data. Since only one stream of signature data is generated at a time, the data does not have to contain any identifying markings (the start-delta message causes the other side to note which file all the arriving data messages belong to). This does not mean that the sig-generating side waits to receive back the results of this signature stream before starting a new signature, just that we never interleave the data from multiple signatures into the data stream. It is very common for the signature-generating code to be working on a completely different file from what the patch-handling code is working on (both of which happen on the receiving side of the connection).

Directories are handled in a manner very similar to files: we use the rsync protocol to send a signature of the receiving side's sorted directory data, and receive back delta data from the sending side that lets the receiver reconstruct the sender's sorted directory data. This results in significant savings in bytes transferred as the directories get more and more similar.

Still to be determined here is whether we would benefit from using a different/variable rsync block size for different-sized directories, and exactly how directory information should be compressed for transfer. We currently use the same (fixed) block size for both files and directories, and a slightly improved version of the compression scheme that rsync uses.

(Note that up until version 0.05, only one directory was handled at a time. This was way too slow if the only thing being sent was up-to-date directories, so the code was enhanced to be able to start multiple directory signatures before the first delta result makes its way back to be "patched".)

Data Flow

Also of importance is data flow on the pipe/socket and the avoidance of deadlock. On the receiving side, we send a lot of data down the output pipe in the form of signature data. We try to keep this pipe fairly full without monopolizing all of our internal output buffer. On the sending side we read signature data, compute deltas, and output delta data. We only handle one delta-generation process at a time, so the sending side can stop reading messages on the incoming pipe until the data on the outgoing pipe is flushed. This makes it imperative that the receiver always be able to process its input pipe without having to wait on its output to flush, or we could deadlock. Fortunately, most of the data coming in on the receiver's input pipe is destined for the disk, so this is not normally a problem. However, there is some data that loops back around, and we need to control this carefully.

The Result

The end result is a protocol that has enough flexibility to allow an external process to control what files/dirs get sent/received, making it possible to write your own controlling process OR use the built-in (recursive) directory scanning.

The Gory Details of the Message Byte-sequence

Here's the gory details about how the two sides create the actual messages that get sent over the pipe/socket to the other side:

What this means to the receiving code is that it takes from 1 to 3 bytes to figure out how long the message is going to be. The current implementation tries to read 3 bytes, which may return from 1 to 3 bytes. A quick scan of the message # lets us know whether we managed to read the whole message already, and if not, we're typically able to set up immediately to read the rest of the message in one more go.
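Since the exact byte layout isn't reproduced here, the following is a hypothetical variable-length header in the same spirit, just to illustrate the "peek at the first byte to learn how many header bytes follow" idea; the bit assignments are my own, not rZync's:

```python
# Hypothetical layout: low 6 bits of the first byte hold the message #
# (1-63); the top 2 bits say how many length bytes follow (0, 1, or 2).
def header_len(first_byte):
    """Total header bytes this message uses: 1, 2, or 3."""
    return 1 + (first_byte >> 6)

def parse_header(buf):
    """Given 1-3 header bytes, return (msg, data_len, header_len),
    or None if we haven't read enough bytes to decode it yet."""
    need = header_len(buf[0])
    if len(buf) < need:
        return None                                # read more, then retry
    msg = buf[0] & 0x3F                            # message number
    data_len = int.from_bytes(buf[1:need], "big")  # zero extra bytes -> 0
    return msg, data_len, need

# A 1-byte message carries msg # only; longer headers add a data length.
```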

Wayne's home page