Sparse Sets
In this post: I describe sparse sets for efficient storage of unsigned integers, followed by a C++ implementation, including custom iterators.
Useful background: C++ templates, classes.
When creating data structures and algorithms, there may appear to be a hard limit in their runtime complexity. For instance, searching a linear collection can only be done in no better than linear time O(n), and a comparison-based search no better than linearithmic time O(n log n). However, if we can make assumptions about special qualities possessed by our data, we can define more specialised and efficient behaviour; for example, if our linear collection is sorted we can perform a binary search in logarithmic time O(log n). We could use a binary search tree to store these sortable elements instead, or -- if the elements are hashable -- go one step further and store them in a hash map instead, achieving constant time outside of the worst-case.
The sparse set is a useful storage container designed to store unsigned integers. It is essentially a simplified concept of a hash map, allowing for guaranteed constant-time insertion and lookup. Clearing the set is also extremely efficient whilst removing individual elements, while not quite as efficient as the aforementioned operations, is also constant-time. In a modern setting in which a sparse set like this could be outperformed on the whole by a more sophisticated compressed bitset, like Roaring, clearing the set in constant time can form a significant niche for systems which need to do so very regularly. An additional benefit of the sparse set is that it provides a dense collection/iteration of all integers in the set 'for free', due to the way it stores them under the hood.
When an integer, x, is inserted, it is appended to array D. Its resultant index in D is then inserted into S at index x. For example, if inserting the integer 5 into the set in the image above, we would set D[3] to 5 and S[5] to 3.
To test if x is in the set, we simply need to test if D[S[x]] equals x. That's it! In order to append into D, we can keep track of the current size of the set, n (not the capacity!). We simply increment n for every insertion, and clear the set by setting n to zero: deleting the actual elements is thus not even necessary!
Finally, D already gives us a dense, linear collection of all elements currently in the set, making iteration very efficient.
is_unsigned is provided by <type_traits>, so don't forget to include it. This is just a simple addition to provide a much more concise and useful error messages from the compiler. Next, let's define our data layout:
We use two vectors for our two underlying structures. We also initialise our size, n, to zero, as well as our capacity, u. As you may appreciate already, the sparse set achieves constant time in a trade-off with space consumption linear to u, our universe of integers. If we want to store integers in the range 0-99 inclusive, both D and S must possess size 100.
Next, we begin to define our public operations, starting with the simplest ones:
As mentioned previously, we can clear the set by just resetting the size, without worrying about deleting any integers; the reasoning for this will become apparent when we implement the lookup operation.
The reserve method extends the capacity of the set to the specified u. This can be manually invoked by the user, but is also used by the insertion operation, similarly to the dynamic resizing of the vector. It's worth noting that we are calling resize on the vectors, not reserve. This is due to the fact that we must set the sizes of the vectors to u, not their capacities; we cannot insert an element at x if x-1 has not been filled, so we just fill the vectors with zeroes.
The implementation of the has method actually tests three predicates, rather than just the one aforementioned:
We only need to insert the value if it doesn't already exist. We must also increase the capacity of the set if it doesn't currently support the value. Finally, we can set our indices and increment the size.
Finally, although we can clear the whole set very easily, we probably also want to be able to remove individual values. This a tad stickier than the above operations but can still be done in constant time. To do this, we can use a trick commonly used for removal from linear containers wherein the order of elements does not actually matter: swap-and-pop. We swap the element to be removed with the element at the end of the list, then pop that element off the end:
This is the full implementation for the core sparse set container. Instantiate it with something such as SparseSet<unsigned> and test it out!
The first step is to define iterator and const_iterator member types. As our container will be iterating over a vector, it simply delegates these types. We cannot allow direct editing of the values in D, only observation, so we delegate both to const_iterator.
Note: The typename keyword is needed here just due to the use of a template parameter T as a dependent type; without it, the compiler incorrectly assumes it is a field, not a type. In addition, C++11 provides an alternative syntax to typedef: the type alias, demonstrated below. This is equivalent but preferable, not only due to its more intuitive appearance, but also due to the fact it supports templating.
After defining the iterator types, we must provide begin() and end() iterators. This is an easy task, as they merely delegate to the iterators of D; however, we must manually restrict the range to our inserted values, as D is actually filled to capacity with zeroes:
begin() simply provides an iterator to the beginning of D. We define end() to be begin() incremented by the number of contained values in D, i.e. the set's size. This is actually all we need to iterate over our container:
Test:
Output:
For situations in which unsigned integers must be stored with guaranteed constant-time access, lookup and clearing (a particular niche for the structure), the sparse set may prove very useful indeed.
Useful background: C++ templates, classes.
When creating data structures and algorithms, there may appear to be a hard limit in their runtime complexity. For instance, searching a linear collection can only be done in no better than linear time O(n), and a comparison-based search no better than linearithmic time O(n log n). However, if we can make assumptions about special qualities possessed by our data, we can define more specialised and efficient behaviour; for example, if our linear collection is sorted we can perform a binary search in logarithmic time O(log n). We could use a binary search tree to store these sortable elements instead, or -- if the elements are hashable -- go one step further and store them in a hash map instead, achieving constant time outside of the worst-case.
The sparse set is a useful storage container designed to store unsigned integers. It is essentially a simplified concept of a hash map, allowing for guaranteed constant-time insertion and lookup. Clearing the set is also extremely efficient whilst removing individual elements, while not quite as efficient as the aforementioned operations, is also constant-time. In a modern setting in which a sparse set like this could be outperformed on the whole by a more sophisticated compressed bitset, like Roaring, clearing the set in constant time can form a significant niche for systems which need to do so very regularly. An additional benefit of the sparse set is that it provides a dense collection/iteration of all integers in the set 'for free', due to the way it stores them under the hood.
Theory
The underlying secret of the structure is that it comprises two structures: a dense set, D, and a sparse set, S.When an integer, x, is inserted, it is appended to array D. Its resultant index in D is then inserted into S at index x. For example, if inserting the integer 5 into the set in the image above, we would set D[3] to 5 and S[5] to 3.
To test if x is in the set, we simply need to test if D[S[x]] equals x. That's it! In order to append into D, we can keep track of the current size of the set, n (not the capacity!). We simply increment n for every insertion, and clear the set by setting n to zero: deleting the actual elements is thus not even necessary!
Finally, D already gives us a dense, linear collection of all elements currently in the set, making iteration very efficient.
Core implementation
Let's implement a sparse set container in C++. The first thing to note is that we could simply assume our container is to store unsigned int, but this set is supposed to work with any unsigned integral type, so we should ideally make this a class template. For now, we can provide a useful way to ensure the value type is unsigned using static_assert, which tests and raises errors at compile-time.template <typename T> class SparseSet { static_assert(std::is_unsigned<T>::value, "SparseSet can only contain unsigned integers");
is_unsigned is provided by <type_traits>, so don't forget to include it. This is just a simple addition to provide a much more concise and useful error messages from the compiler. Next, let's define our data layout:
private: std::vector<T> dense; //Dense set of elements std::vector<T> sparse; //Map of elements to dense set indices size_t size_ = 0; //Current size (number of elements) size_t capacity_ = 0; //Current capacity (maximum value + 1)
We use two vectors for our two underlying structures. We also initialise our size, n, to zero, as well as our capacity, u. As you may appreciate already, the sparse set achieves constant time in a trade-off with space consumption linear to u, our universe of integers. If we want to store integers in the range 0-99 inclusive, both D and S must possess size 100.
Next, we begin to define our public operations, starting with the simplest ones:
public: size_t size() const { return size_; } size_t capacity() const { return capacity_; } bool empty() const { return size_ == 0; } void clear() { size_ = 0; }
As mentioned previously, we can clear the set by just resetting the size, without worrying about deleting any integers; the reasoning for this will become apparent when we implement the lookup operation.
void reserve(size_t u) { if (u > capacity_) { dense.resize(u, 0); sparse.resize(u, 0); capacity_ = u; } }
The reserve method extends the capacity of the set to the specified u. This can be manually invoked by the user, but is also used by the insertion operation, similarly to the dynamic resizing of the vector. It's worth noting that we are calling resize on the vectors, not reserve. This is due to the fact that we must set the sizes of the vectors to u, not their capacities; we cannot insert an element at x if x-1 has not been filled, so we just fill the vectors with zeroes.
bool has(const T &val) const { return val < capacity_ && sparse[val] < size_ && dense[sparse[val]] == val; }
The implementation of the has method actually tests three predicates, rather than just the one aforementioned:
- If x is not less than the capacity u, it cannot exist.
- If the index suggested by S[x] is not less than the size n, it cannot exist.
- If D[S[x]] is equal to x, it exists.
void insert(const T &val) { if (!has(val)) { if (val >= capacity_) reserve(val + 1); dense[size_] = val; sparse[val] = size_; ++size_; } }
We only need to insert the value if it doesn't already exist. We must also increase the capacity of the set if it doesn't currently support the value. Finally, we can set our indices and increment the size.
Finally, although we can clear the whole set very easily, we probably also want to be able to remove individual values. This a tad stickier than the above operations but can still be done in constant time. To do this, we can use a trick commonly used for removal from linear containers wherein the order of elements does not actually matter: swap-and-pop. We swap the element to be removed with the element at the end of the list, then pop that element off the end:
void erase(const T &val) { if (has(val)) { dense[sparse[val]] = dense[size_ - 1]; sparse[dense[size_ - 1]] = sparse[val]; --size_; } }
This is the full implementation for the core sparse set container. Instantiate it with something such as SparseSet<unsigned> and test it out!
Iteration
As previously mentioned, the guaranteed constant-time operations are not the only benefit of the sparse set: the array D provides a list of elements in the order there were added (unless elements were moved around due to the erase method). This is a very useful feature, so let's expose it by defining iterators for our container.public: typedef typename std::vector<T>::const_iterator iterator; typedef typename std::vector<T>::const_iterator const_iterator;
The first step is to define iterator and const_iterator member types. As our container will be iterating over a vector, it simply delegates these types. We cannot allow direct editing of the values in D, only observation, so we delegate both to const_iterator.
Note: The typename keyword is needed here just due to the use of a template parameter T as a dependent type; without it, the compiler incorrectly assumes it is a field, not a type. In addition, C++11 provides an alternative syntax to typedef: the type alias, demonstrated below. This is equivalent but preferable, not only due to its more intuitive appearance, but also due to the fact it supports templating.
using iterator = typename std::vector<T>::const_iterator; using const_iterator = typename std::vector<T>::const_iterator;
After defining the iterator types, we must provide begin() and end() iterators. This is an easy task, as they merely delegate to the iterators of D; however, we must manually restrict the range to our inserted values, as D is actually filled to capacity with zeroes:
iterator begin() { return dense.begin(); } const_iterator begin() const { return dense.begin(); } iterator end() { return dense.begin() + size_; } const_iterator end() const { return dense.begin() + size_; }
begin() simply provides an iterator to the beginning of D. We define end() to be begin() incremented by the number of contained values in D, i.e. the set's size. This is actually all we need to iterate over our container:
Test:
SparseSet<unsigned> ss; ss.insert(5); ss.insert(1000); ss.insert(54); ss.erase(5); ss.insert(28); //Range-based for loop for (auto &x : ss) cout << x << " "; cout << endl; //Equivalent to: for (auto it = ss.begin(); it != ss.end(); ++it) cout << *it << " "; cout << endl;
54 1000 28
Conclusion
The sparse set is a slightly obscure but highly useful data structure for unsigned integers wherein efficiency in both insertion/lookup and iteration is desired. It highlights the tradeoff between time and space complexity very nicely: for large integers, the space required can grow very large. If, for example, you want to store integers in the range 200000-300000, the 0-200000 capacity of both D and S is entirely unnecessary. In this case, one can simply normalise the values: to store a value x wherein 200000 < x < 300000, just offset the calls such as ss.insert(x - 200000).For situations in which unsigned integers must be stored with guaranteed constant-time access, lookup and clearing (a particular niche for the structure), the sparse set may prove very useful indeed.
Source code
Citations
- Briggs, Preston, and Linda Torczon. "An efficient representation for sparse sets." ACM Letters on Programming Languages and Systems (LOPLAS) 2.1-4 (1993): 59-69
Comments
Post a Comment