* Instead of the sender generating checksums on demand, that work is performed once when the file is "published" and saved in a zsync metadata file
* This zsync metadata file is fetched (simple copy) and the receiver uses it to decide which portions of the file it needs to request. It then requests only those portions.
* Because of this simplification, the protocol can be reduced to work over plain stateless HTTP. Any HTTPD that supports range requests can be a zsync server; remote zsync files are represented by HTTP URLs.
* Note, this all but removes the CPU requirement of the sender/server.
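The publish-once flow above can be sketched in a few lines. This is a simplified model (the block size, function names and use of MD5 are illustrative; real zsync also uses a rolling weak checksum so it can match blocks at arbitrary offsets, not just aligned ones):

```python
import hashlib

BLOCK = 4096  # block size that would be recorded in the metadata file

def publish_checksums(data: bytes):
    """Sender-side work, done once at publish time: checksum every block."""
    return [hashlib.md5(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

def ranges_to_fetch(local: bytes, published):
    """Receiver side: compare local blocks against the published checksums
    and return the HTTP byte ranges that still need to be requested."""
    ranges = []
    for n, digest in enumerate(published):
        block = local[n * BLOCK:(n + 1) * BLOCK]
        if hashlib.md5(block).hexdigest() != digest:
            ranges.append((n * BLOCK, n * BLOCK + BLOCK - 1))
    return ranges
```

The receiver then issues ordinary `Range:` requests for just those byte spans, which is why any range-capable HTTPD works as the server.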
I've used zsync in some very large systems to efficiently distribute write-few read-often files with only partial changes to many endpoints. Much more scalable than rsync due to the lack of CPU cost for the server/sender.
I also maintain a fork of zsync which runs using libcurl rather than the original author's custom http client code. This fork is primarily to support SSL: https://github.com/eam/zsync
It's a cool project, check it out!
Use zsync to distribute a small number of large files that have small changes. If you need to rsync hierarchies with lots of files, rsync is still king.
The additional work is to generate and send a list of filenames and metadata attributes (which rsync must do as well) and to invoke zsync per file only if an update is necessary. For large trees of files which are largely unchanged this is very efficient - much more so than fetching a zsync manifest per file.
The file path is generally the largest amount of data sent per-file, prior to sending the zsync manifest. This is similar to rsync.
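A minimal sketch of that file-list step, assuming size+mtime is enough to decide whether a file changed (the function names are made up, and a real implementation might compare checksums too):

```python
import os

def manifest(root):
    """Walk a tree and record (size, mtime) per relative path - roughly
    the per-file metadata that gets sent before any zsync runs."""
    entries = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            entries[os.path.relpath(path, root)] = (st.st_size, int(st.st_mtime))
    return entries

def files_needing_update(local, remote):
    """Only these paths would get a per-file zsync fetch."""
    return sorted(p for p, meta in remote.items() if local.get(p) != meta)
```

Everything that matches is skipped outright, so the cost per unchanged file is just its manifest entry.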
"A well-designed communications protocol has a number of characteristics."
<list of characteristics>
"Rsync's protocol has none of these good characteristics."
...
"It unfortunately makes the protocol extremely difficult to document, debug or extend. Each version of the protocol will have subtle differences on the wire that can only be anticipated by knowing the exact protocol version."
This is why it is very hard to implement a client program that can communicate with the standard rsync daemon on a server. You can always use the rsync program itself to communicate with the server, but this is not always an option - and even when it is, it can get ugly. On Windows, you need Cygwin or similar to run rsync.exe, which can complicate the deployment of your desktop app or shell extension.
An easy rsync client API would be useful if you were building an app that can store files on an rsync server, because the rsync utility and the rsync algorithm are great ways to efficiently synchronize files.
I tried deploying updates to a (pre-existing already deployed) website to a Windows-server machine using rsync once.
The site, which had been running fine, instantly stopped working, because rsync didn't merely copy the files over - it completely reset the existing ACLs and permissions on all the files. The result was that the webserver no longer had permission to access the website's files. This was repeatable on every sync.
Needless to say, I found it less than optimal.
rsync 3 does not need to create or transfer the entire file list up front - in fact, it will start immediately and have no idea how many files are left; it's not uncommon for it to say "just 1000 more files left" the whole time while working through a million files. You can force it to prescan all files with -m (--prune-empty-dirs), which disables incremental recursion as a side effect, if you insist.
Also, I might be mistaken, but I think rsync 3 doesn't even transfer the entire file list to the other side - it will treat the directory like a file (one which contains file names, attributes, and checksums), and transfer that using rsync. If nothing changed, this takes a few bytes. If something did, the entire directory listing is rsynced to the other side, and it is determined recursively which files and directories actually need to be transferred - with every directory that doesn't have any changes skipped just like a file that doesn't need any changes.
I have often wondered why it is that rsync is so life-saving-ly quick and how it is that a few small changes to a massive file (e.g. from mysqldump) can be copied up to a server from the slow end of an ADSL line so quickly. Now I know about the 'rolling checksum' I can see what is going on.
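The trick is that the weak checksum can be "rolled" forward one byte at a time instead of being recomputed. A sketch of an Adler-style rolling checksum in the spirit of rsync's weak checksum (the constants and names are illustrative, not rsync's actual code):

```python
M = 1 << 16  # both halves of the checksum are kept modulo 2^16

def weak(block: bytes):
    """Full computation: a is the plain byte sum, b weights earlier bytes more."""
    a = sum(block) % M
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, blen):
    """Slide the window one byte to the right in O(1) instead of rescanning."""
    a = (a - out_byte + in_byte) % M
    b = (b - blen * out_byte + a) % M
    return a, b
```

Because sliding the window is O(1), the receiver can check every offset of a huge file cheaply, so a few changed bytes in a mysqldump only force re-sending the blocks they touch.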
Note that I work with people who use 'FTP' to copy files, or even worse, people who find FTP too complicated and have to send me files on a 'Dropbox' thing so I can download them and upload them for them, notionally with 'FTP'. (I will use rsync instead, not least for the bandwidth control options).
I have even had micro-managers get me to get FTP to work on the server for them, despite my protestations about it being insecure (which it really is if you use a Windows PC and something like Filezilla).
Obviously I only use rsync and scp. Without aforementioned micro-managed requests I would not even know if FTP was installed on the server side.
My point is that it may be easy for a few folks here to criticise rsync, however, there are a lot of people, from clients to managers and even talented programmers that just don't have a clue about rsync and are stuck in some stone age of using things like FTP.
What does Windows as a client OS have to do with it? FTP is insecure because it transmits credentials in the clear and because it opens additional ports for the actual transfer of data. Neither of which are a concern of the client.
However, of the notable attacks I have witnessed recently, FileZilla's plain-text credentials file was the attack vector. Get hold of that and away you go!
https://github.com/lloeki/rsync/blob/master/rsync.py
[0]: from http://blog.liw.fi/posts/rsync-in-python/ but this site has been on and off regularly, hence the scavenging straight from my browser cache. As of today, the site is up again but the bzr repo is out of order (and bzr is not exactly popular).
EDIT: Also since we're talking about rsync, do you think the following options are sufficient for syncing a folder hierarchy from the local disk to an external flash drive?
rsync -aW --delete /source /destination
My main concern is the -W option (--whole-file), which skips the delta-transfer algorithm, not compression (compression is -z, and is off by default). Skipping the delta step avoids CPU work during an already long sync, but it might end up writing a lot of bytes and wearing out the memory cells of the flash storage.
Note that I haven't touched the configuration since I set it up. It's really great.
[1] - http://www.rsnapshot.org
rsync -aSH --delete --update /source/ /destination
if both file systems are Unix/Linux.

-S: sparse files remain sparse.
-H: hard links are preserved. Caveat: O(inodes) memory cost.
Note the trailing slash on /source/ unless you want to copy to /destination/source/.
-W does not seem like a win for your local-to-local use case, but it is already the default when both the source and destination are local paths.
Unison is very slow compared to rsync, and the version at both ends must match (which means you'll likely need to compile your own unless all your machines run the same distro and version).
Don't rsync directly to the location you are running your application from. Instead, upload to a staging directory and then use a symlink to change from one version of your code to the next. Changing a symlink is an atomic operation.
We have a user called something like ~packages which holds all the static code and assets. This user's data is read-only to the users that run the actual services. Inside that user's home directory, we have version directories like tags/0.11.1/1, tags/0.11.1/2 and tags/0.11.2/1. These directories correspond to tags from our version control system.
Switching over to a new build just means stop service, change symlink, start. Some services don't need the stop and start part.
You can use hard links to make this process even better. Our build system uses the "--link-dest" option to specify the last build's directory when uploading a new build. This means that files that have not changed from the last build don't consume any extra space on the disk. Since the inodes are the same, they even stay in the file system cache after the deploy.
You can have lots of past versions sitting there on the server without taking up any space. If you have a bad deploy, and need to revert to a past version, just change the symlink again.
Here are some of the neat features of Rsync you can take advantage of for deployments:
* Fault tolerance: when an error happens at any layer (network, local i/o, remote i/o, etc), Rsync will report it to you. Trapping these errors will give you better insight into the status of your deployments.
* Authentication: the Rsync daemon supports its own authentication schemes.
* Logging: report various logs about the transfer process to syslog, and collect from these logs to learn about the deployment status.
* Fine-grained file access: use a 'filter', 'exclude' or 'include' to specify what files a user can read or write, so complex sets of access can be granted for multiple accounts to use the same set of files (you can also specify specific operations that will always be blocked by the daemon)
* Proper permissions: force the permissions of files being transferred, so your clients don't fuck up and transfer them with mode 0000 perms ("My deploy succeeded, but the files won't load on the server! Wtf?")
* Pre/post hooks: you can specify a command to run before the transfer, and after, making deployment set-up and clean-up a breeze.
* Checksums on file transfers for integrity
* Preserves all kinds of file types, ownership and modes, with tons of options to deal with different kinds of local/remote/relative paths, even if you aren't the super-user (including acls/xattrs)
* Tons of options for when to delete files and when to apply the files on the remote side (before, during or after transfer, depending on your needs)
* Custom user and group mapping
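Several of these features map directly onto rsyncd.conf module options. A sketch (the module name, paths and hook scripts are invented):

```
# /etc/rsyncd.conf
log file = /var/log/rsyncd.log

[deploy]
    path = /srv/deploy
    read only = false
    auth users = deployer
    secrets file = /etc/rsyncd.secrets
    incoming chmod = D755,F644      # force sane perms on upload
    exclude = *.tmp
    pre-xfer exec = /usr/local/bin/pre-deploy
    post-xfer exec = /usr/local/bin/post-deploy
```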
Are you sure rsync doesn't already do this?
Where I've found rsync really valuable is for good ol' regular file copying ("I just need to stick this one file or directory on a server"). I've pretty much stopped using scp and replaced it with rsync. rsync is awesome because:
1) you can resume interrupted transfers
2) it's much faster than scp when sending lots of small files
3) it's actually, you know, a sync tool, as opposed to just a copy tool
If you miss the little progress bar that scp gives you, you can also use --progress with rsync, and then it's basically a drop-in replacement.
And the file can change on the server just as easily as on the client. So how can it tell this without sending the complete list?