Khanlou | Linear Probing

August 16, 2019

Linear Probing

A few years ago, I wrote Implementing Dictionary In Swift, which was a crash course on how to create something similar to Swift’s Dictionary type.

Since our dictionary will be backed with an array, we need a way to convert the key that’s given into an integer, and then a way to force that integer into the bounds of our array. These two methods are the hashing function and the modulus operation, respectively. By hashing, we can consistently turn our key (usually a string, but can be any type that is Hashable) into a number, and by taking the modulus with respect to the length of the array, we’ll get a consistent slot in the array to set and retrieve the value.

Because there can be collisions, each “bucket” needs to potentially contain multiple items, so the whole thing was modeled as an array of arrays. This was a less-than-performant shortcut to get code that was working up quickly. In the post, I noted that:

A nice implementation of this Dictionary might use a linked list for the values property of the Placeholder.

Since writing that post, I’ve learned a lot of things, but germane to the discussion here: this is a bad strategy. Linked lists are very good when memory is tightly limited, and you can’t reliably secure a long, unbroken block of memory. As memory in our computers has gotten bigger, CPUs have learned to predict and preload memory in long sequences, so working with continuous blocks of memory is remarkably faster to jumping around the memory space through a discontinuous chain.

So if we don’t want to use an array of arrays, and we don’t want to use an array of linked lists, what else can we do? It turns out that there is a Dictionary storage technique that avoids the problems with both solutions.

If you remember that any array of arrays can be “flattened” into one large array, a solution begins to take shape. If the bucket for a given hash value is taken, instead of storing a second value at the same location, put the new value in the neighboring bucket. Then, when searching for that element, if you see something where it should be, check the neighboring buckets until you find it (or an empty slot).

This technique is called “linear probing”. It was invented in 1954 and is the way that Swift’s Dictionary is actually implemented!

In this post, we’re going to writing a Dictionary that’s backed by linear probing. I’m going to use a test-driven development approach for a few reasons.

First, a key thing in Swift has changed since I wrote the last dictionary blog post. Swift 4.2 changed Swift’s hashing to be non-deterministic. This makes it really hard to write to simple tests in a playground, since any value’s hash changes between runs. We need to tightly control how our hash values are produced so that we can be sure that our Dictionary behaves appropriately as we refactor it.

Second, this algorithm behaves interestingly at the boundaries — that is, as it runs up against the end of the backing array and needs to roll over to the start of the array — and we need to make sure we handle that behavior correctly.

Third, deletion is really, really hard to get right with linear probing, and there are quite a few edge cases that need to be examined closely.

The technique also lends itself to some refactoring and restructuring, and having tests will be a great safety net to ensure that we don’t break old behavior as we make the code cleaner and simpler.

Getting Things Compiling

So, let’s get started. Test-driven development requires that you write a failing test before writing the code that makes it work, so let’s do that here.

func testGettingFromAnEmptyDictionary() {
    var dictionary = ProbingDictionary<String, String>()
    XCTAssertEqual(dictionary["a"], nil)
}

This test doesn’t compile, so we can call that failing. Let’s build the basics of the type and get this test passing.

struct ProbingDictionary<Key: Hashable, Value> {

    subscript(key: Key) -> Value? {
        get {
            return nil
        }
        set {
        }
    }

}

Not a very good dictionary, but at least it compiles and passes the first test.

Getting and Setting a Single Value

Next, let’s get it to be able to set and then get a single value. Test first:

func testSettingAndGettingAValue() {
    var dictionary = ProbingDictionary<String, String>()
    dictionary["a"] = "123"
    XCTAssertEqual(dictionary["a"], "123")
}

(The most hardcore TDD acolytes will change the implementation so that it can store exactly one value, probably in an instance variable, since that’s the bare minimum of code that you need to make this test pass. I think that’s kind of silly, so I will skip that step and jump straight to an implementation that works with multiple values.)

For the implementation, we need a few things already. We need some kind of Placeholder type again, like in the last post. However, instead of being able to store an array of key/value pairs, it’ll just store a key and a value, or nothing:

enum Item {
    case element(Key, Value)
    case empty
}

I also added some helpers for grabbing stuff like the key and value easily:

extension Item {
    var value: Value? {
        if case let .element(_, value) = self {
            return value
        }
        return nil
    }
}

We also need storage for the data in the dictionary:

private var storage: [Item] = Array(repeating: .empty, count: 8)

(8 is the initial size of the dictionary. In a real implementation, you might want to start with a bigger value, but for this example, smaller is easier to understand. We’ll implement resizing soon anyway.)

Next, we need to rewrite the get and set subscript methods. This looks a lot like the version of the dictionary from the last post:

subscript(key: Key) -> Value? {
    get {
        let index = abs(key.hashValue % storage.count)
        return storage[index].value
    }
    set {
        if let newValue = newValue {
            let index = abs(key.hashValue % storage.count)
            storage[index] = .element(key, newValue)
        }
    }
}

The abs ensures that negative values are handled correctly. This simple implementation makes our new test pass, but doesn’t handle collisions, so it’s time to write a test for that.

Handling Collision

We’d like to handle collisions, but we have a problem in setting up a test for them. As I mentioned earlier, Swift’s hashing is non-deterministic as of Swift 4.2, which means that if you run the same code twice, you’ll get two different hash values for the same object. That makes forcing a collision to happen really difficult.

We need a way to ensure that our test is perfectly reproducible and actually tests the thing we’re trying to test, which is a collision that has to occur every time we run this test. There two options here. First, you can set an environment variable called SWIFT_DETERMINISTIC_HASHING and find two values that happen to hash to the same bucket.

However, according to Karoy in the Swift forum thread announcing this new feature,

setting SWIFT_DETERMINISTIC_HASHING does not guarantee that hash values remain stable across different versions of the compiler/stdlib – please do not rely on specific hash values and do not save them across executions.

This means that we might have to update our tests every time the Swift toolchain is updated, and that’s not great. Instead, we can write a small type that can perfectly control what it hashes to. It needs fields for its underlying value and its hash value separately, so that a collision (two values that aren’t equal but have the same hash value) can be forced:

struct Hash: Hashable {

    let value: String

    let hashValue: Int

    init(value: String, hashValue: Int) {
        self.value = value
        self.hashValue = hashValue
    }

    static func == (lhs: Hash, rhs: Hash) -> Bool {
        return lhs.value == rhs.value
    }
}

(This type has a warning telling us not to use hashValue directly. Nothing can really be done about that.) Given this small type, we can now write a test with a forced collision:

func testACollision() {
    var dictionary = ProbingDictionary<Hash, String>()
    dictionary[Hash(value: "a", hashValue: 1)] = "123"
    dictionary[Hash(value: "b", hashValue: 1)] = "abc"
    XCTAssertEqual(dictionary[Hash(value: "a", hashValue: 1)], "123")
    XCTAssertEqual(dictionary[Hash(value: "b", hashValue: 1)], "abc")
}

The first and second items in the Dictionary have different values, but the same hash value, which will cause a collision. Running this test, we can observe the failure. The second item is overwriting the first. To fix this, we need to rewrite our get and set functions again.

Instead of returning the value at the hashed index, we need to check if it contains the right key. If it doesn’t, we need to step to the next value and check if it contains the right key. We keep doing this until we find the right key or hit an .empty element. Here’s an implementation that gets the job done.

get {
    let index = abs(key.hashValue % storage.count)
    return storage[index...]
        .prefix(while: { !$0.isEmpty })
        .first(where: { $0.key == key })?
        .value
}
set {
    if let newValue = newValue {
        let index = abs(key.hashValue % storage.count)
        if let insertionIndex = storage[index...].firstIndex(where: { $0.isEmpty || $0.key == key }) {
            storage[insertionIndex % storage.count] = .element(key, newValue)
        }
    }
}

It can be nice to notice that while the prefix + first(where:) in the get and the firstIndex(where:) look different, they can actually be both written the same. By rewriting the get to use .first(where: { $0.isEmpty || $0.key == key }), you can make the two blocks look the same, which suggests that there might be some shared code we can extract from here. We’ll come back to that soon.

Collisions Near the Rollover Boundary

Now that our test is passing, time to write our next failing test. Careful readers will note that we’re not handling the rollover case correctly. If the probing happens at the boundary where the storage array rolls over it’ll return nil even though there is a value in the dictionary at that spot. Let’s write a test for that.

func testARolloverCollision() {
    var dictionary = ProbingDictionary<Hash, String>()
    dictionary[Hash(value: "a", hashValue: 7)] = "123"
    dictionary[Hash(value: "b", hashValue: 7)] = "abc"
    XCTAssertEqual(dictionary[Hash(value: "a", hashValue: 7)], "123")
    XCTAssertEqual(dictionary[Hash(value: "b", hashValue: 7)], "abc")
}

We happen to know that our dictionary’s original size is 8, so we know to put the colliding elements at index 7 of the storage. However, to make this more robust, you might expose the size of the storage as an internal detail so your tests can rely on it but consumer code can’t. Either way, we have another failing test.

Fixing this one is kind of easy, but also kind of hard. Instead of just checking storage[index...], we need to follow it up with a check storage[..<index] as well. The naive solution for this is just to append the two together with a +.

let rotated = storage[index...] + storage[..<index]

This causes a major performance issue, which we have to address.

By examining the types, we can see what’s going on. rotated is an ArraySlice, which usually represents a view on an array — it stores the original array, a start index and an end index. However, this is not really one slice of an array, but rather two concatenated together, which means that because the internal storage of ArraySlice (the array and the two indices) can’t store the components of two different arrays, it will have to copy and modify one of them. Because ArraySlice is a copy-on-write type, this means that the whole storage array has to be copied into new storage when you execute that + operator. This has the ultimate effect of making every read from the dictionary be an O(n) operation, which is unacceptable. There are a few ways to fix this.

The apple/swift repo has a “Prototypes” folder which contains lots of interest and useful things that aren’t shipped with Swift or its standard library, but are useful nonetheless. Two of these are Concatenated and RotatedCollection. The first allows you to concatenate two collection together without copying, and the second enables you to “rotate” a collection around a pivot, which is exactly what we’re trying to do here. However, because we don’t have this type available to us, I’m going to take a different approach.

We discussed earlier that there’s a shared function we can extract from here, called firstOrEmptyIndex(matchingKey:), which finds the first index that matches our key or is empty. We can use that in both the get (grab the value of that Item, which return nil if it’s empty) and set (overwrite the key-value pair at that index). Let’s take a look at that function. We can add support for our new behavior here, checking all indices before the pivot as well.

func firstOrEmptyIndex(matchingKey key: Key) -> Int {
    let block: (Item) -> Bool = { $0.isEmpty || $0.key == key }
    let initialIndex = abs(key.hashValue % storage.count)
    let index = storage[initialIndex...].firstIndex(where: block)
        ?? storage[..<initialIndex].firstIndex(where: block)
     return index! % storage.count
}

Getting and setting become a lot simpler now:

subscript(key: Key) -> Value? {
    get {
        let index = firstOrEmptyIndex(forKey: key)
        return storage[index].value
    }
    set {
        if let newValue = newValue {
            let insertionIndex = firstOrEmptyIndex(forKey: key)
            storage[insertionIndex] = .element(key, newValue)
        }
    }
}

Tests are now passing. We can now handle rollover collisions.

At this point, I added a few more tests that didn’t require code changes. What if there’s a triple collision? What if there’s two or three objects in the dictionary that don’t collide? What about neighboring non-colliding objects? They all passed without requiring any code changes, so I’ll elide them here.

Resizing

The next feature I want to build is resizing. You might have noticed the force unwrap in that new function we just added. If none of our slots are .empty, we risk a crash. Resizing helps us ensure that there’s always free space in the dictionary. Again, test first! I think the best approach here is to store a large number of items in the dictionary (in this case 1000 items) and then assume that that’s enough items to force a resize.

func testResizing() {
    let letters = "abcdefghijklmnopqrstuvwxyz"
    func randomString() -> String {
        return String((0...10).compactMap({ _ in letters.randomElement() }))
    }
    let values = (1...1000).map({ (randomString(), $0) })
    var dictionary = ProbingDictionary<Hash, String>()
    for (key, hash) in values {
        dictionary[Hash(value: key, hashValue: hash)] = key
    }

    for (key, hash) in values {
        XCTAssertEqual(dictionary[Hash(value: key, hashValue: hash)], key)
    }
}

This test fails, so we know that it tests resizing correctly. Resizing is a pretty straightforward feature, and it’s very similar to how the resizing code in the previous blog post worked:

private mutating func resizeIfNeeded() {
    guard self.storage.filter({ $0.hasElement }).count > storage.count / 2 else { return }
    let originalStorage = self.storage
    self.storage = Array(repeating: .empty, count: originalStorage.count * 2)
    for case let .element(key, value) in originalStorage {
        self[key] = value
    }
}

Copy the storage, create a new storage that’s twice as big, and then copy all the elements into the new storage. I even got to use for case syntax in Swift, which I’ve never used before. If we call this method before trying to set any values, our new test will pass.

Collection Conformance

Let’s implement Collection conformance next. Instead of having a separate test for Collection stuff, we can just add an extra assert at the bottom of each test, asserting how many items should be in the Dictionary by the end of the test.

func testACollision() {
    var dictionary = ProbingDictionary<Hash, String>()
    dictionary[Hash(value: "a", hashValue: 1)] = "123"
    dictionary[Hash(value: "b", hashValue: 1)] = "abc"
    XCTAssertEqual(dictionary[Hash(value: "a", hashValue: 1)], "123")
    XCTAssertEqual(dictionary[Hash(value: "b", hashValue: 1)], "abc")
    XCTAssertEqual(dictionary.count, 2)
}

This is this simplest I could get the conformance to be, code-wise:

extension ProbingDictionary: Collection {
    var startIndex: Int {
        return storage.firstIndex(where: { $0.hasElement }) ?? storage.endIndex
    }

    var endIndex: Int {
        return storage.endIndex
    }

    func index(after i: Int) -> Int {
        return storage[i...].dropFirst().firstIndex(where: { $0.hasElement }) ?? endIndex
    }

    subscript(index: Int) -> (Key, Value) {
        return (storage[index].key!, storage[index].value!)
    }
}

While this works, it doesn’t quite meet the performance characteristics that Collection expects. startIndex should be an O(1) operation, but here it’s O(n), since it has to run through the storage array to find the first non-empty element. A better implementation would store the startIndex (and probably the current count, too) so that those things could be returned in O(1) time. The start index would store the first index in the storage array that is not .empty. Although it would add to the layout of the Dictionary, it’s probably worth it in order to deliver complexity guarantees that consumers would expect.

Now is also a fine time to throw in helpers for keys and values, which users also expect:

extension ProbingDictionary {
    var keys: AnySequence<Key> {
        return AnySequence(storage.lazy.compactMap({ $0.key }))
    }

    var values: AnySequence<Value> {
        return AnySequence(storage.lazy.compactMap({ $0.value }))
    }
}

Deletion

Okay. It’s time to write the code to handle deletion from the dictionary. I won’t lie to you — deletion with linear probing is hard. Think about it: you have to find the index (easy enough), replace it with .empty (fine), and then slide down every other element until the next .empty, unless the element is already in the lowest possible slot, in which case skip it and look for another one you can swap down (yikes!).

As always, test first:

func testDeletion() {
    var dictionary = ProbingDictionary<Hash, String>()
    dictionary[Hash(value: "a", hashValue: 7)] = "123"
    dictionary[Hash(value: "b", hashValue: 4)] = "abc"
    dictionary[Hash(value: "a", hashValue: 7)] = nil

    XCTAssertEqual(dictionary[Hash(value: "a", hashValue: 7)], nil)
    XCTAssertEqual(dictionary[Hash(value: "b", hashValue: 4)], "abc")
    XCTAssertEqual(dictionary.count, 1)
}

(Adding collection conformance is quite nice here, since it lets us check that size of the dictionary is accurate after the deletion.)

Writing this code frankly sucked. There’s no cute code that could dig me out of this one:

var deletionIndex = firstOrEmptyIndex(forKey: key)
repeat {
    let potentialNextIndex = (storage[deletionIndex...] + storage[..<deletionIndex])
        .prefix(while: { !$0.isEmpty })
        .dropFirst()
        .firstIndex(where: { abs($0.key!.hashValue % storage.count) <= deletionIndex })
    storage[deletionIndex] = .empty
    if let nextIndex = potentialNextIndex {
        let nextIndex = nextIndex % storage.count
        storage[deletionIndex] = storage[nextIndex]
        deletionIndex = nextIndex
    }
} while !storage[deletionIndex].isEmpty

Truly the stuff of nightmares. Does it work? I think so. It’s hard to be confident when the algorithm is this complicated. At this point, I added tests to cover the cases I could think of — deletion with neighboring non-colliding elements, deletion with colliding elements, deletion with neighboring elements that both collide and don’t collide, deleting elements that don’t exist, deleting an element and writing back to that same key, and so on. I’m fairly confident at this point that the code works, but I much prefer code that I can understand by reading, rather than code whose working state depends on what failures I can think of and write tests for.

Tombstones

Fortunately, there is a better way to handle deletion. Right now, we only have two cases in our Item enum: either it has an element, or it doesn’t. But if we could add a little bit of information to that, we could make a distinction between slots that have never held an element, and slots that used to hold an element but don’t anymore. We call these tombstones (💀), since they represent a marker for something that lived once, but is now dead. Tombstones help us because when we are doing a set, we can treat tombstones the same as empty elements (because they are writable slots) but when doing a get, we can treat them as a filled slot that doesn’t match our key, so we always pass over them and continue. This simplifies things greatly.

Our “first or empty” function needs to change its signature a bit:

func firstOrEmptyIndex(forKey key: Key, includeTombstones: Bool = false) -> Int

Since we now make a distinction between .tombstone and .empty when reading and writing, each of those two operations will pass a different value for this Bool.

This makes our reading, writing, and deleting code super simple:

subscript(key: Key) -> Value? {
    get {
        let index = firstOrEmptyIndex(forKey: key)
        return storage[index].value
    }
    set {
        resizeIfNeeded()
        if let newValue = newValue {
            let insertionIndex = firstOrEmptyIndex(forKey: key, includeTombstones: true)
            storage[insertionIndex] = .element(key, newValue)
        } else {
            let index = firstOrEmptyIndex(forKey: key)
            if !storage[index].isEmpty {
                storage[index] = .tombstone
            }
        }
    }
}

I like simple, because I can understand simple. All the tests still pass, and thus we know our refactor was successful.

As an aside, I’d like to look at our original Item enum again:

enum Item {
    case element(Key, Value)
    case empty
}

When you first see this type, it can be tempting to try to replace it with (Key, Value)?, since this type holds the same amount of information as the optional tuple. However, not only does a custom enum provide more information to the reader of the code, it also is more easily extended to add the .tombstone case. With the optional tuple, it’s tough to see how the tombstone can be added. Using an enum makes it easy it add that extra case and get to the refactored state.

Linear probing is a super interesting approach for building dictionaries. Given that this post is nearly 3000 words, it’s not an easy technique, but as far as I know, it’s the gold standard for fast, reliable dictionaries. It’s used in Python, Rust, Swift, and many other languages besides. Understanding how it works is key to appreciating the speed and structure of a dictionary.