Khanlou | Streaming Multipart Requests

November 14, 2018

Streaming Multipart Requests

Foundation’s URL loading is robust. iOS 7 brought the new URLSession architecture, making it even more robust. However, one thing that it’s never been able to do natively is multipart file uploads.

What is a multipart request?

Multipart encoding is the de facto way of sending large files over the internet. When you select a file as part of a form in a browser, that file is uploaded via a multipart request.

A multipart request mostly looks like a normal request, but it specifies a unique encoding for the HTTP request’s body. Unlike JSON encoding ({ "key": "value" }) or URL string encoding (key=value), multipart encoding does something a little bit different. Because of the body of a request is just a long stream of bytes, the entity parsing the data on the other side needs to be able to determine when one part ends and another begins. Multipart requests solve this problem using a concept of “boundaries”. In the Content-Type header of the request, you define a boundary:

Accept: application/json
Content-Type: multipart/form-data; boundary=khanlou.comNczcJGcxe

The exact content of the boundary is not important, it just needs to be a series of bytes that is not present in the rest of the body (so that it can meaningfully act as a boundary). You can use a UUID if you like.

Each part can be data (say, an image) or metadata (usually text, associated with a name, to form a key-value pair). If it’s an image, it looks something like this:

--<boundary>
Content-Disposition: form-data; name=<name>; filename=<filename.jpg>
Content-Type: image/jpeg

<image data>

And if it’s simple text:

--<boundary>
Content-Disposition: form-data; name=<name>
Content-Type: text/plain

<some text>

After the last part in the request, there’s one more boundary, which includes an extra two hyphens, --<boundary>--. (Also, note that the new lines all have to be CRLF.)

That’s pretty much it. It’s not a particularly complicated spec. In fact, when I was writing my first client-side implementation of multipart encoding, I was a bit scared to read the RFC for multipart/form-data. Once I took a look at it, however, I understood the protocol a lot better. It’s surprisingly readable, and nice to go straight to the source for stuff like this.

That first implementation was in the Backchannel SDK, which is still open source. You can see the BAKUploadAttachmentRequest and the BAKMultipartRequestBuilder, which combine to perform the bulk of the multipart handling code. This particular implementation only handles a single file and no metadata, but it serves as an example of how this request can be constructed. You could modify it to support metadata or multiple files by adding extra “parts”.

However, if you want to upload multiple files, whether your service expects one request with many files or many requests with one file each, you’ll soon run into a problem. If you try to upload too many files at once, your app will crash. Because this version of the code loads the data that will be uploaded into memory, a few dozen full-size images can crash even the most modern phones.

Streaming files off the disk

The typical solution to this is to “stream” the files off the disk. The idea is that the bytes for any given file stay on the disk, until the moment they’re needed, at which point they’re read off the disk and sent straight to the network machinery. Only a small amount of the bytes for the images is in memory at any given time.

I came up with two broad approaches to solving this problem. First, I could write the entire data for the multipart request body to a new file on disk, and use an API like URLSession’s uploadTask(with request: URLRequest, fromFile fileURL: URL) to stream that file to the API. This would work, but it seemed like it should be possible to accomplish this without writing a new file to the disk for each request. There would also be a bit of cleanup to handle after the request has been sent.

The second approach is to build a way to mix data in memory with data on the disk, and present an interface to the network loading architecture that looks like a chunk of data from a single source, rather than many.

If you’re thinking this sounds like a case for class clusters, you’re dead on. Many common Cocoa classes allow you to make a subclass that can act as the super class by implementing a few “primitive” methods. Think -count and -objectAtIndex: on NSArray. Since every other method on NSArray is implemented in terms of those two, you can write optimized NSArray subclasses very easily.

You can create an NSData that doesn’t actually read any data off the disk, but rather creates a pointer directly to the data on disk, so that it never has to load the data into memory to read it. This is called memory mapping, and it’s built on a Unix function called mmap. You can use this feature of NSData by using the .mappedIfSafe or .alwaysMapped reading options. And NSData is a class cluster, so if we could make a ConcatenatedData subclass (that works like FlattenCollection in Swift) that allows you treat many NSData objects as one continuous NSData, we would have everything we need to solve this problem.

However, when we look at the primitive methods of NSData, we see that they are -count and -bytes. -count isn’t a big deal, since we can easily sum up the counts of all the child datas; -bytes, however, poses a problem. -bytes should return a pointer to a single continuous buffer, which we won’t have.

To handle discontinuous data, Foundation gives us NSInputStream. Fortunately, NSInputStream is also a class cluster. We can create a subclass that makes many streams look like one. And with APIs like +inputStreamWithData: and +inputStreamWithURL:, we can easily create input streams that represent the files on the disk and the data in memory (such as our boundaries).

If you look at the major third-party networking libraries, you’ll see that this is how AFNetworking actually works. (Alamofire, the Swift version of AFNetworking, loads everything into memory but writes it to a file on disk if it is too big.)

Putting all the pieces together

You can see my implementation of a serial InputStream here in Objective-C and here in Swift.

Once you have something like SKSerialInputStream, you can combine streams together. Given a way to make the prefix and postfix data for a single part:

extension MultipartComponent {
    var prefixData: Data {
        let string = """
        \(self.boundary)
        Content-Disposition: form-data; name="\(self.name); filename="\(self.filename)"
        """
        return string.data(using: .utf8)
    }
    
    var postfixData: Data {
        return "\r\n".data(using: .utf8)
    }
}

You can compose the metadata for the upload and the dataStream for the file to upload together into one big stream.

extension MultipartComponent {
    var inputStream: NSInputStream {
        
        let streams = [
            NSInputStream(data: prefixData),
            self.fileDataStream,
            NSInputStream(data: postfixData),
        ]
    
        return SKSerialInputStream(inputStreams: streams)
    }
}

Once you can create an input stream for each part, you can join them all together to make an input stream for the whole request, with one extra boundary at the end to signal the end of the request:

extension RequestBuilder {
    var bodyInputStream: NSInputStream {
        let stream = parts
            .map({ $0.inputStream })
            + [NSInputStream(data: "--\(self.boundary)--".data(using: .utf8))]
    
        return SKSerialInputStream(inputStreams: streams)
    }
}

Finally, assigning that to the httpBodyStream of your URL request completes the process:

let urlRequest = URLRequest(url: url)

urlRequest.httpBodyStream = requestBuilder.bodyInputStream;

The httpBodyStream and httpBody fields are exclusive — only one can be set. Setting the input stream invalidates the Data version and vice versa.

The key to streaming file uploads is being able to compose many disparate input streams into one. Something like SKSerialInputStream makes the process a lot more manageable. Even though subclassing NSInputStream can be a bit perilous, as we’ll see, once it’s solved the solution to the problem falls out more or less naturally.

Subclassing notes

The actual process of subclassing NSInputStream wasn’t fun. Broadly put, subclassing NSInputStream is hard. You have to implement 9 methods, 7 of which can have trivial default implementations that the superclass should handle. The documentation only mentions 3 of the 9 methods, and you have to dig to find out that you have to implement 6 more methods from NSStream (the superclass of NSInputStream) including two run loop methods where empty implementations are valid. There also used to be 3 private methods that you had to implement, though they don’t seem to be a concern any longer. Once you’ve implemented these methods, you also have to support several readonly properties as well: streamStatus, streamError, and delegate need to be defined.

Once you get all the details about subclassing right, you get to the challenge of providing an NSInputStream that behaves in a way that other consumers would expect. The state of this class is heavily coupled in non-obvious ways.

There are a few pieces of state, and they all need to act in concert at all times. For example, hasBytesAvailable is defined separately from the other state, but there are still subtle interactions. I found a late-breaking bug where my serial stream would ideally return self.currentIndex != self.inputStreams.count for hasBytesAvailable, but that caused a bug where the stream stayed open forever and would cause a timeout for the request. I patched over this bug by always returning YES, but I never figured out the root of it.

Another piece of state, streamStatus, has many possible values, but the only ones that seem to matter are NSStreamStatusOpen and NSStreamStatusClosed.

The final interesting piece of state is the byte count, returned from the read method. In addition to returning a positive integer for the byte count, a stream can return a byte count of -1, which indicates an error, which then has to be checked by via the streamError property, which must no longer be nil. It can also return a byte count of 0, which, according to the documentation, is another way that to indicate that the end of the stream has been reached.

The documentation doesn’t guide you what combinations of states are valid. If my stream has a streamError, but the status is NSStreamStatusClosed rather than NSStreamStatusError, will that cause a problem? Managing all this state is harder than it needs to be, but it can be done.

I’m still not fully confident that SKSerialStream behaves perfectly in all situations, but it seems to support uploading multipart data via URLSession well. If you use this code and find any issues, please reach out and we can make the class better for everyone.

Update, August 2019: One final note I want to add on this code. I was seeing some small bugs with it, and I spent a few nights debugging. This was pretty painful, so I want to leave this note for readers and of course, future versions of myself. If you give you URLRequest a Content-Length, NSURLSession will use that to determine exactly how many bytes to pull out of your input stream. If the content length is too low, it will truncate your stream precisely at that byte. If the content length is too high, your request will timeout as it waits for bytes that will never come because the stream is exhausted. This, sadly, is not documented anywhere.