Khanlou | Bloom Filters and Swift 4.2

September 19, 2018

Bloom Filters and Swift 4.2

Swift 4.2 brings some new changes to how hashing works in Swift. Previously, hashing was fully delegated to the object itself. You would ask an object for its hashValue, and it would respond with the fully cooked integer representing the hash. Now, objects that conform to Hashable describe how to combine their constituent components into a Hasher object that’s passed in as a parameter. This is great for a few reasons.

Good hashing algorithms are hard to write. End users of Swift shouldn’t be responsible for knowing the best way to combine their constituent components into a good hash value.
Hash values should be different for each execution of the program, to discourage users from storing them in any way and for security reasons. Descriptive hashing allows the seeds of the hashing function to change each time you run the program.
It makes more interesting data structures possible, which is what we’re going to focus on in this post.

I’ve written here before about how to use Swift’s Hashable protocol to write a Dictionary from scratch (and it will help to read that post before this one, but isn’t necessary). Today, I want to talk about a different kind of data structure, one that’s probabilistic rather than definite: the Bloom filter. As we build it, we’ll use a few new features in Swift 4.2, including the new hashing model.

Bloom filters are weird. Imagine a data structure where:

you can insert values into it
you can query it to see if a value is contained in it
it can use a tiny amount of storage to a store a huge amount of objects

but:

you can’t enumerate the objects in it
it sometimes has false positives (but never false negatives)
you can’t remove values from them

When would you want something like this? Medium uses them to keep track of the read status of posts. Bing uses them for search indexes. You could use them to build a cache for if a username is valid (in an @-mention, for example) without having to hit the database. They’re particularly useful on the server, where you could have immense scale but not necessarily immense resources.

(If you’ve ever done any graphics work, you might be wondering how this relates to the other bloom filter. The answer is, it doesn’t. That bloom filter is a lowercase b bloom, like a flower. This one is named after a person named Bloom. Total coincidence.)

How do they work?

Putting an object into a Bloom filter is sort of like putting it into a set or a dictionary: you need to calculate a hash value for the object and mod the hash value by the size of your storage array. At that point, instead of storing the value at that position in the index (like you would with a set or a dictionary), you simply flip a bit at that index: change a value from false to true.

Let’s take a simple example to understand how the filter works, and then we’ll crank it up to make it web scale. Imagine a boolean array (sometimes called a bit array) of 8 false values:

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---------------------------------
|   |   |   |   |   |   |   |   |

This represents our Bloom filter. I’d like to insert the string “soroush” into it. The string hashes to 9192644045266971309, which, when modded by 8, yields 5. I’ll flip the bit at that index.

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---------------------------------
|   |   |   |   |   | * |   |   |

Next, I’d like to insert the string, “swift”. It yields a hash of 7052914221234348788, which we then mod by 8: flip the bit at index 4.

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---------------------------------
|   |   |   |   | * | * |   |   |

To test if the Bloom filter contains the string “soroush”, I hash and mod again, yielding 5 again, which is set to true. “soroush” is indeed in the Bloom filter.

You can’t just test the cases where your thing works though, so let’s write some negative tests. Testing the string “khanlou” yields an index of 2, so we know it’s not in the bloom filter. So far, so good. Next test: the string “hashable” yields an index of 5, which…causes a collision! It returns true even though the value was never inserted into the Bloom filter. This is an example of a Bloom filter giving you a false positive.

Since we want to minimize these false positives, there are two main strategies we can use. First, and the more interesting of the two: instead of hashing the value once, we can hash it twice, or three times, with different hashing functions. As long as they are all well-behaved hashing functions (evenly distributed, consistent, minimal collisions), we can generate multiple indices to flip for each value. Now, when we hash “soroush”, we do it twice, and generate two indices, and flip the bools there. But now, when we go to check if “khanlou” is included in the Bloom filter, one of its hashes will collide with one of the hashes for “soroush”, but the chance that both of its hashes will collide has become smaller. You can scale this number up. I’ll use the 3 hashes in the code below, but you could use more.

Of course, if you use more hashes, each element will take up more space in the bit array. However, we’re using practically nothing right now. Our data array is 8 booleans (or bits) which is 1 byte. This is second tactic for minimizing false positives: scale the bit array up. We can make it pretty big before we even start noticing it. We’ll use a default of 512 bits for the code example below.

Now, even using both of these strategies, you’ll still get collisions, and thus false positives, but you’ll get fewer of them. This is one of the downsides of using Bloom filters, but in the right circumstances, the speed and space savings are a worthwhile tradeoff.

I want to touch on three other things to note before we get into code. First, you can’t resize a Bloom filter. When you mod the hash value, you’re destroying information, and you can’t get that information back unless you keep the original hashes around — defeating the dramatic space savings you can get with this data structure.

Secondly, you can see how it would be impossible to enumerate the elements in a Bloom filter. You don’t have them anymore, just some of their artifacts (in the form of hashes).

Lastly, you can also see how you can’t remove elements from a Bloom filter. If you try to flip the bits back to false, you don’t know who had originally flipped them to true. The value you’re trying to remove, or a different value? You’d start causing false negatives as well as false positives. (This may actually be a worthwhile tradeoff for you!) You could get around this by keeping a count at each index instead of a boolean, which would also have storage implications, but again, could be worthwhile depending on your use case.

So, let’s get to some code. There are some choices that I’ll make here that you might choose to make differently, starting with whether or not to make the object generic. I think it makes sense to have the object have a little more metadata about what it’s supposed to be storing, but if you find that too restrictive, you can change it.

struct BloomFilter<Element: Hashable> {
	// ...
}

We need to store two main things in this type. First, data, which represents our bit array. This will store all the flags that correspond to the hashes:

private var data: [Bool]

Next, we need our different hash functions. Some Bloom filters use actually different methods of computing the hash, but I think it’s easier to use the same algorithm, but mix in one additional randomly generated value.

private let seeds: [Int]

When setting up the Bloom filter, we need to set up both instance variables. Our bit array is going to simply repeat the false value, and our seeds will use Swift 4.2’s new Int.random API to generate the seeds we need.

init(size: Int, hashCount: Int) {
	data = Array(repeating: false, count: size)
	seeds = (0..<hashCount).map({ _ in Int.random(in: 0..<Int.max) })
}

I also included a convenience initializer with some nice defaults.

init() {
	self.init(size: 512, hashCount: 3)
}

There are two main functions we need to write: insert and contains. They both need to take the element and calculate one hash value for each of the seeds. Great fodder for a private helper function. We need to create one hash value for each seed, because our seed values represent the “different” hashing functions.

private func hashes(for element: Element) -> [Int] {
	return seeds.map({ seed -> Int in
		// ...
	})
}

To write the body of this function, we need to create a Hasher (new in Swift 4.2), and pass in the object we want to hash. Mixing in the seed as well ensures that the hashes won’t collide.

private func hashes(for element: Element) -> [Int] {
	return seeds.map({ seed -> Int in
		var hasher = Hasher()
		hasher.combine(element)
		hasher.combine(seed)
		let hashValue = abs(hasher.finalize())
		return hashValue
	})
}

Also, note the absolute value on the hash value. Hashes can come back as negative, which will crash our array access later. This removes one bit of entropy, which is probably fine.

Ideally, you’d be able to initialize the Hasher with a seed, instead of mixing it in. Swift’s Hasher uses a different seed for each launch of the application (unless you set an environment variable which they added for consistent hashing between launch, mostly for testing purposes), meaning you can’t write these values to disk. If we controlled the seed of the Hasher, then we could write these values to disk as well. As this Bloom filter currently stands, it should only be used for in-memory caches.

The hashes(for:) function does a lot of the heavy lifting, and makes our insert method very straightforward:

mutating func insert(_ element: Element) {
	hashes(for: element)
		.forEach({ hash in
			data[hash % data.count] = true
		})
}

Generate all the hashes, mod each one by the length of the data array, and set the bit at that index to true.

contains is likewise simple, and gives us a chance to use another new Swift 4.2 feature, allSatisfy. This new function tells you if all of the objects in the sequence pass some test, represented by a block:

func contains(_ element: Element) -> Bool {
	return hashes(for: element)
		.allSatisfy({ hash in
			data[hash % data.count]
		})
}

Because the result of data[hash % size] is already a boolean, it fits quite nicely into allSatisfy.

You can also add an isEmpty method that checks if all of the values in the data are false.

Bloom filters are one of the weirder data structures. Most of the data structures we deal with are deterministic. When you put an object in a dictionary, you know it’ll be there later. Bloom filters, on the other hand, are probabilistic, sacrificing certainty for space and speed. Bloom filters aren’t something that you’ll use every day, but when you do need them, you’ll be happy to know they exist.