Khanlou | How to build a system that syncs

February 17, 2012

How to build a system that syncs

Syncing is an incredibly hard problem to solve. If you have a small operation, many people will tell you to try to work around it somehow, or to outsource your syncing to iCloud or to Google. I wasn’t able to find any resources when I was working on the syncing in Fireside, so I thought about the problem for a while and solved it in a pretty good way. This post won’t have a bunch of code to just copy and paste, but, rather, instructions and a framework for how to create a system that works for you.

Considerations

I’m assuming if you’re looking beyond iCloud and other such syncing services, you’ve already got a pretty complex problem to solve. Maybe you want a website where users can access their data, maybe you’re looking towards other platforms, like Android. No matter what, there are several considerations that you want to take into account when creating this system.

The basics are pretty straightforward. You need a language for the computers to use to talk to each other. I like JSON; it’s light, fast, and parsers exist for every platform. Other options are XML or even binary plists, if you’re working with a WebObjects server and serving data mostly to Apple devices. You need an API, which should be really well thought-through, since you can’t really control when your app gets updated by the user, and you’ll need to essentially support those API endpoints forever. And finally, you need to decide how much ephemerality you want your data to have. Twitter is a very transient medium, and so Twitter clients might not want to use data that’s more than a few days old. Other platforms will need all of the data all of the time. A user might want a Basecamp client that always has all their data. This is a client side problem, but, nevertheless, deciding this will help you design your APIs.

The amount of data being synced

The first thing to think about when syncing is how much data a user might want to keep in sync. If it’s just a few preferences and text strings, it’s much easier to just serve them all of the data they need every time than to try to figure out what data the device needs at a given point. This is how subscription syncing in Fireside works. Even a user with hundreds of subscriptions might amount to an API call of a couple of kilobytes. Since the majority of the lag on wireless mobile networks is latency, a couple of KB of served JSON is pretty inconsequential.

If you have any more data than that, you should start thinking about how to segment your data such that the device doesn’t have to download more data than it needs to. This kind of segmentation is usually done using time, i.e., only download data since the last time the client synced.
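
Here’s a rough sketch of time-based segmentation in Python. None of this is Fireside’s actual code; `Record`, `changes_since`, and `sync_response` are names I made up for illustration, and I’m assuming the server stamps every record with the time it last changed.

```python
from dataclasses import dataclass


@dataclass
class Record:
    id: str
    updated_at: float  # server-side time of the last change (assumed field)
    payload: dict


def changes_since(records, since):
    """Return the records changed strictly after `since`, oldest first."""
    changed = [r for r in records if r.updated_at > since]
    return sorted(changed, key=lambda r: r.updated_at)


def sync_response(records, since, now):
    """Build a response body for a hypothetical 'give me what changed' endpoint.

    The client stores `synced_at` and sends it back as `since` next time,
    so it never has to trust its own clock.
    """
    return {
        "synced_at": now,
        "changes": [
            {"id": r.id, "updated_at": r.updated_at, **r.payload}
            for r in changes_since(records, since)
        ],
    }
```

A client that last synced at time 150 would only be sent the records whose `updated_at` is later than 150, and would remember the returned `synced_at` for its next request.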

Also, consider the amount of data you want on the client. Twitter clients don’t (and reliably can’t) access all the data in your Twitter account, and Twitter is designed to show you what’s fresh and relevant, so it’s not even important for them to download old stuff to your device.

Quantization

The next step is to quantize your data. A quantum of data is essentially the smallest unit that needs to stay in sync. For a service like Fireside, these are the properties of the podcast, such as any detail about the user’s interactions with that podcast (e.g., stars, playback location, archival status). For Twitter, it’s a tweet. Separating your data into units like this is probably already done, but it’s crucial for updating these units on the server in a way that ensures the user’s intents are properly synced. We’ll discuss this more in conflict resolution.

Divorce the metadata from the data

Until Instapaper 4.0, the app didn’t show articles until the article itself was downloaded onto the device. This meant that the process was slow and the user had no idea how many articles were coming in until the app was done downloading the content of the article itself. Currently, as of 4.0.1, it shows the list, and lets the user know which articles haven’t downloaded yet. If you are syncing something with a significant amount of data (articles, podcasts, images), separate it from the metadata, since metadata downloads much faster, and you can present the user with that information before actually having the content on the device. Things like tweets or Facebook wall posts are probably small enough to not worry about.
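
One way to picture the split on the client, as a minimal Python sketch. I have no idea how Instapaper actually models this; `Article`, `apply_metadata`, and `pending_downloads` are hypothetical names, and the metadata fields are just examples.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Article:
    id: str
    title: str                  # metadata: small, arrives with the list
    byte_size: int              # metadata: lets the UI show what is still to come
    body: Optional[str] = None  # content: large, fetched separately

    @property
    def downloaded(self):
        return self.body is not None


def apply_metadata(library, items):
    """Add list entries right away, so the user sees them before any body arrives."""
    for item in items:
        library.setdefault(
            item["id"], Article(item["id"], item["title"], item["byte_size"])
        )


def pending_downloads(library):
    """Articles the UI can already list but whose content still has to be fetched."""
    return [a.id for a in library.values() if not a.downloaded]
```

The list view reads from `library` as soon as the metadata call returns, and a background job works through `pending_downloads` at its own pace.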

Offline data

Storing the data in a persistent way on the client side is crucial. It’s nice to pretend that we have always-connected devices, but many users live in rural areas with slower mobile networks, have bandwidth limitations, use non-cellular devices (like iPods and iPads), and are on planes and subways. I can’t count the number of times I’ve tried to use a service like Tweet Marker in an area with very high latency. It’s unbearable. Assuming that the server will always be available and will always be quick is a huge mistake. Clever usage of local databases and sync queues can make the user feel like your service is faster and more available than it actually is.

Queueing, offline syncing, and persistence

Not only should you worry about the data you read being available offline, you also should consider the actions the user might take while offline. On the client, especially on wireless mobile networks, you can’t always rely on your HTTP connection successfully connecting. To solve this problem, you can queue the user’s intent to perform an action (like starring a tweet: I’m looking at you, Tweetbot), and only remove that action from the queue when you’ve seen that the HTTP connection has been completed. If your client is designed on iOS, using NSOperation and NSOperationQueue can make your life much easier, but it sort of behaves as a black box, and you have to use KVO to get any useful information out of it.

You also want to think about what happens when the user performs a bunch of actions (in the subway, for example) and closes the app, which gets quit by the system at some point. When the app opens again, the user will expect to see their actions having propagated across your system, so you have to store the user’s actions in a persistent way (serialized on disk) so that the system works the way the user expects. Offline actions aren’t strictly necessary, but can delight the user, and should at least be considered.
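
To make the idea concrete, here’s a small Python sketch of a queue that is serialized to disk and only drops an intent once it has been sent. It’s an illustration rather than a drop-in component: `IntentQueue` is a made-up class, and `send` stands in for whatever function performs your HTTP request and returns `True` once the server has confirmed it.

```python
import json
import os


class IntentQueue:
    """A first-in, first-out queue of user intents that survives app restarts."""

    def __init__(self, path):
        self.path = path
        self.intents = []
        # Reload whatever was still waiting when the app last quit.
        if os.path.exists(path):
            with open(path) as f:
                self.intents = json.load(f)

    def _save(self):
        # Write to a temporary file, then swap it in, so a crash mid-write
        # doesn't leave a half-written queue behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.intents, f)
        os.replace(tmp, self.path)

    def enqueue(self, intent):
        self.intents.append(intent)
        self._save()

    def flush(self, send):
        """Send intents in order; stop at the first failure and keep the rest."""
        while self.intents:
            if not send(self.intents[0]):
                break
            self.intents.pop(0)
            self._save()
```

Call `enqueue` when the user acts, and call `flush` whenever the network looks usable (at launch, on a timer, or when connectivity changes). Anything that fails stays in the file for next time.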

Conflict Resolution

Here’s where creating a sync API gets fun. If you have an offline, persistent queue, then the user might create conflicting states. She might change one of her objects A to a state B, at a point when the system doesn’t have network access; later, perhaps on a different device, she updates object A to state C successfully; and finally, the original device regains a network connection and tries to change the state of object A to state B. There are several ways to handle this.

Do nothing. When the delayed state change B finally comes through, it overwrites the state C.
Last action wins. Sync intents come with a timestamp of when the user intended them to go through; if that timestamp is before the timestamp on the server for when that object was last changed, the sync intent is ignored. (NB: This is how Fireside works, and this is also why the quanta of Fireside’s podcasts are the properties about the podcast and not the podcast object itself. This way, the user can star a podcast on one device and update the progress on another, and because the timestamps of starring and progress are stored separately, both changes will go through as the user expects.)
Some other metric. You might have a progress property that should always be at the furthest location. Check this on the server, and set the state of the object to whatever is appropriate.
Versioning. With versioning, when a change to the state of an object conflicts with another change, the older change gets folded into the stack of changes and the user can return to it if she desires. I’ve heard whisperings that this is how iCloud works, but of course, we haven’t seen any access to these versions on the user side.
Prompt the user. Find a concise way of describing the differences, and then ask the user what they would like as the final state of the object. This is how MobileMe worked. I would recommend against this; I try to bother the user as little as possible.
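
Two of these strategies, “last action wins” with a timestamp per property and “some other metric” for a progress value, can be sketched in a few lines of Python. This is not Fireside’s server code; `apply_intent` and the dictionary shapes are assumptions made for the example. An intent is ignored when it is older than the change already stored for that property, and progress only ever moves forward.

```python
def apply_intent(server_obj, intent):
    """Apply one sync intent to a server-side object; return True if it changed anything.

    server_obj maps a property name to {"value": ..., "changed_at": timestamp}.
    intent is {"property": ..., "value": ..., "at": timestamp}.
    """
    prop = intent["property"]
    current = server_obj.get(prop)

    if prop == "progress":
        # "Some other metric": keep the furthest position, whatever the timestamps say.
        if current is not None and intent["value"] <= current["value"]:
            return False
    elif current is not None and intent["at"] <= current["changed_at"]:
        # "Last action wins": an intent older than the stored change is ignored.
        return False

    server_obj[prop] = {"value": intent["value"], "changed_at": intent["at"]}
    return True
```

Because each property carries its own timestamp, a delayed change to one property can’t be blocked by a more recent change to a different property on the same object.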

Naming

What do you want to call your service? Sync implies keeping things in harmony; historically, it also has a bad rap. Sync is unreliable. Sync breaks. Sync is another thing to worry about. Apple used the word “sync” very little when debuting iCloud, opting for more active words like “push” and “send”. You might want to use the word “cloud”, if you don’t find that too offensive.

Redundancy

Your users are trusting you with their data. Having your server go down will make them unhappy, and losing their data will make them angry. Consider multiple forms of backup, security, and redundancy. These are annoying things for a small startup to worry about, but it’s better than losing all your users’ data. Netflix uses a tool called the Chaos Monkey that randomly takes down servers and services that Netflix relies on and generally wreaks havoc on their services. It’s excessive, but when half of the Internet died in April 2011 due to an AWS outage, Netflix carried on just fine, even if they did have a slightly lower bitrate on their movies. Netflix expected failure, and that helped them avoid it.

Canon

Where is the canonical copy of the data? In most situations, this is really easy to answer: it sits on the web server. However, if you are running a cloud storage or backup system, and you know that the user is only using one client device, the canonical copy of the data is on the device, and the server should never overwrite it.

Falling out of sync

The user’s data will fall out of sync, and this is a bug you will have to squash. You can try to correct it by providing the user with a way of clearing data out and redownloading from the canonical copy. You might use a hash or CRC to verify that the data is identical on multiple platforms, and create some kind of error correction.
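
A minimal Python sketch of the hash check, assuming your records are plain JSON-style dictionaries with an `id` key. `fingerprint` and `needs_full_resync` are names I invented for the example; the one real requirement is that every platform serializes the data in exactly the same way before hashing, which gets trickier once floats or dates are involved.

```python
import hashlib
import json


def fingerprint(records):
    """Hash a collection of records so that equal data gives equal digests."""
    # Sort the records and their keys so ordering differences don't change the digest.
    canonical = json.dumps(
        sorted(records, key=lambda r: r["id"]),
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_full_resync(client_records, server_digest):
    """True when the client's copy no longer matches the canonical copy."""
    return fingerprint(client_records) != server_digest
```

The server can send its digest along with each sync response; when the client’s own digest disagrees, it clears its local data and redownloads from the canonical copy.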

Conclusions

Sync is a hard problem to solve, but it’s not impossible, and its benefits are obvious: it’s tremendously useful to users and can really differentiate you from your competitors. If there are things that I haven’t included in this short guide, please feel free to send me an email at soroush@khanlou.com or a tweet at @khanlou.