iosdev

Swift techniques to filter duplicates out of large JSON arrays

reduce, used properly, can be lightning fast. Here's where it yielded 200x improvement over nested for-in loop.

Imagine you have incomingJSONs, an array of 1000+ JSON objects like this:

struct Entity {
	let id: Int64
	var name: String
	var flags: [Int]
	...
}

In the array, some of the JSON objects may be duplicated where id is the same but contents of the other Entity properties (say flags) may be slightly different.

How would you remove those duplicates? End result should be a set of objects where each id appears only once.

Note: all measurements were done in the Simulator on fastest 2016 MacBook Pro. I expect that on some older iPhones differences would be larger.

Approach 1

Extract all IDs first (I’m using Marshal library here):

let jsonKey = "id"
let incomingIDs: Set<Int64> = Set(incomingJSONs.compactMap {
	try? $0.value(for: jsonKey) 
})

and then build a new array using filter().first:

if incomingIDs.count != incomingJSONs.count {
	var arr: [JSON] = []

	for id in incomingIDs {
		guard let jsonObj = incomingJSONs.filter({
			guard
				let jsonID: Int64 = try? $0.value(for: jsonKey)
			else { return false }

			return id == jsonID
		}).first else { continue }

		arr.append( jsonObj )
	}

	incomingJSONs = arr
}

This approach is something I expect someone new to Swift would do. In my example, this took ~8s to complete for about 1300 JSON objects.

This is very, very slow. The reason for this is obvious once you analyze what it’s doing: for each value of id, it’s processing each element in entire array only to take the first element from the resulting filter.

Approach 2

This is such a common pattern – filter a collection then take the first element from the result – that Swift has a more efficient algorithm in its API that does exactly that: first(where:)

if incomingIDs.count != incomingJSONs.count {
	var arr: [JSON] = []

	for id in incomingIDs {
		guard let jsonObj = incomingJSONs.first(where: {
			guard
				let jsonID: Int64 = try? $0.value(for: jsonKey)
			else { return false }

			return id == jsonID
		}) else { continue }

		arr.append( jsonObj )
	}

	incomingJSONs = arr
}

This gave me about 2x speed improvements in my (fairly limited) testing, finishing up in ~4s.

This shows just how on point Dave Abrahams is in Embracing Algorithms session (223) on WWDC 2018. Always look to see if there’s an existing algorithm that does what you need before you try to re-implement it.

However, 4s is still way too long, thus I poked deeper.

Approach 3

The third approach I tried is to build resulting array by using reduce and ever-smaller IDs set:

if incomingIDs.count != incomingJSONs.count {
	var leftoverIDs = incomingIDs

	let arr: [JSON] = incomingJSONs.reduce([]) {
		result, obj in
		guard
			let jsonID: Int64 = try? obj.value(for: jsonKey),
			leftoverIDs.contains(jsonID)
		else { return result }

		leftoverIDs.remove(jsonID)
		return result + [obj]
	}

	incomingJSONs = arr
}

This finished in 0.04s or ~100x faster than previous solution, 200x faster than what I started with.

···

The first approach using filter is sort-of correct one – filter is what I’m doing here – but the closure I used was very inefficient. I should not have used filter inside for-in; writing a better filter closure applied to the starting array is the correct way. I’m just not sure how to write it.

Third approach was me building my own filter algorithm. I extracted it into a method on Collection, which I could then immediately apply to all other similar JSON cleanups in the app thus improving app performance immensely.

extension Collection where Self.Element: MarshaledObject {
	func filter(duplicated id: String) -> [Element] {
		let incomingIDs: Set<Int64> = Set(self.compactMap { try? $0.value(for: id) })
		if incomingIDs.count == self.count {
			return self as! [Self.Element]
		}

		var leftoverIDs = incomingIDs
		let arr: [Element] = self.reduce([]) {
			result, obj in
			guard
				let jsonID: Int64 = try? obj.value(for: id),
				leftoverIDs.contains(jsonID)
			else { return result }

			leftoverIDs.remove(jsonID)
			return result + [obj]
		}

		return arr
	}
}

I don’t know nearly enough how Swift internally works to understand where is such a large improvement coming from. But I’ll get there.