Sometimes, while programming, we find ourselves in situations where a sorted collection of (key, value) pairs is needed. C# provides a class that implements such a sorted collection: the System.Collections.SortedList class. However, in some situations this standard class may not sufice for us. Because of some restrictions or special requirements we may have to implement our own sorted (key, value) collection.
A well known way to implement it is using Binary Trees. There is a lot of literature about Binary Trees, so we will not detail them here. They provide very good performace for operations that modify the collection and also for search operations.
Another, less known, way to implement a sorted collection is using a Skip List structure. The skip list structure is somehow more intuitive and easier to implement than Binary Trees, and it performs almost as good as Binary Trees. Thus, in situations where a fast implementation is needed, with fairly good performance, the Skip List structure can be used with success.
Overview of the Skip List structure
The Skip List structure is an extension of the Linear List structure, with the purpose to speed up the searching process.
A classical Linear List structure is composed of nodes where each node holds some useful information and also links towards the next node in the list. The nodes are kept in sorted order all the time. When a new node is inserted, it's place is found by comparing it with all nodes in the list, until either the end of the list is reached (in this case the new node goes at the end of the list) either a node with a greater value is found (in this case the new node is inserted right before the node with a greater value). It is possible that the new node is smaller than the first node of the list, and in this case it will go to the beginning of the list.
For an easier handling, we add two sentinel nodes. One sentinel node goes at the beginning of the list and holds a value that is smaller that any possible value for the nodes. The other sentinel goes at the end of the list and holds a value that is greater than any possible value for the nodes. Let's say that the first sentinel holds the -INF value and the second sentinel holds the +INF value.
With these, a linked list look something like this:
The nodes of the displayed list hold integer numbers as useful information. The numbers in the nodes are always sorted. Each node also holds a pointer towards the next node in the list. The last node does not hold any pointer (or we can say that it holds the NULL pointer).
The problem is that at every insertion we have to parse the list node by node in order to find out where to place the new node. This linear search has a O(n) complexity and, if many inserts are made, it seriously slows down the execution.
This is the point where the Skip List structure helps us. The Skip List structure improves the search process so that a search is performed almost as fast as in a Binary Tree. Let's see how this can be achieved.
Imagine that instead keeping only one linked list, we keep two. One of them holds all elements that are added, and the other one "skips" every second element. These two lists would look like:
Besides the number and the pointer towards the right neighbor, each node also holds a pointer towards its downward neighbor.
Now, imagine that we want to insert a number into the Skip List. We have to find the place where to insert the value so that we keep the numbers sorted. We can start searching in the top list, from the first sentinel (the -INF one) and go rightwards through the nodes, until we find a value that is greater than the one we want to insert. Because not all nodes are stored in the top list, we will move two times faster than we would move in the bottom list. When we find a node that is greater than the one we want to insert, we go down to the bottom list and cotinue searching there. We keep moving right until we find here too a node that is greater than the new one.
For example, let's suppose we want to insert element 1800. We start from the top list from -INF node and visit nodes 45, 468, 982. At this moment we realize that the next node (2114) is greater than the new one (1800) so we stop moving right and instead we go down to the bottom list. Here we continue with node 1684. At this moment we realize that the next node (2114) is again greater than 1800, so we stop. This is the place where we will add 1800.
If we count the steps that we did, we see that we visited 3 nodes in the top list and 1 nodes in the bottom list, thus a total of 4 nodes. If we would have done the same in the bottom list, we would have needed to visit 7 nodes. So we already have an improvement here.
But let's go on. We can add another list which holds only every fourth element. We would get a structure like this:
As we see, we now have three parallel lists. The nodes from the top list link towards the corresponding nodes in the list below it. And the nodes from the middle list link towards the corresponding nodes in the bottom list.
Now if we want to perform a search, it would be even faster. We start at the top list, then we get to the middle list and in the end we refine the search in the bottom list.
Let's consider again the situation when we want to insert element 1800. We start in the top list from node -INF and visit node 468. Then we realize that the next node 2114 is too great, so we go down to the middle list. Here we visit node 982, then we realize that again the next node (2114) is too great. So we go down to the bottom list and visit node 1684. At this moment we see that the next node is too great (2114), so this will be the place where 1800 should be inserted.
If we count the number of steps we did, we see that we had 1 node in the top list, 1 node in the middle list and 1 node in the bottom list, a total of 3 nodes. This is even better than before.
We could add another list which would hold one single node, but we can see that it would not be of more help than the list which holds two nodes. So we won't add another list. In general, if the bottom list holds N nodes, then we can add log(N) usefull lists. In our example N = 8 and we have log(8) = 3 lists.
In such a structure, every search is performed in log(n) time, exactly like in a binary tree.
The problem is that in practice we cannot build such a structure. In order to build such a structure, we need to know which is the element that represents the middle of the bottom list. This element has to be added to all N lists. Then we need to know the elements that represent the quarters of the bottom list. These elements have to be added to N-1 lists. And so on.
But because we are using linked lists, we cannot know which is the middle element, or which are the quarter elements.
So, in practice we use a structure that is generated probabilistically and performs very close to the ideal case.
How do we do it? It's simple. When we add a node, we have to make a decision: into how many of the paralel lists we insert the new node? We have several possibilities: we can add it to the bottom list, or the bottom list and the one above it, or to the bottom list and another two lists above it, or even to all lists. How do we decide this? If we would know that the added node is exactly the middle of the bottom list, then we would add it to all lists. If we would know that the added node is exactly the quarter of the bottom list, then we would add it to all lists except the top one. And so on. But we do not know. So we decide probabilistically, by flipping a coin.
We add the new node to the bottom list. This is certain. Then we flip a coin. If it falls on head, then we add the node to one list above. Then we flip the coin again. If it falls again on head, then we add the node to another list above. And so on. As long as the coin falls on head, we keep adding the node to lists. When if falls tail, we stop.
We have to also impose a limit to the maximum number of lists. If we knew N, the total number of elements, we could say the we keep log(N) lists. But we do not know the total number of elements, so we choose a value that we will think should be enough.
Let's take an example. We choose the maximum number of lists to be 6. We start with six empty lists.
And then we add nodes. The first node added is 11. We add it to the bottom list and then we start flipping the coin. The coin falls on head 3 times, so we add the node to another three lists above the bottom list.
Next node is 5. After findind the place where to insert it, we start flipping again. The coin flips head 5 times, so we add the node to the bottom list plus another 5 lists. We obtain:
Next node is 3. We flip the coin and it falls on head 0 times. So we add the node to the bottom list only. We obtain:
Next is 23. We flip the coin and it falls on head 1 time. So we add the node to the bottom list and also to one list above it. We obtain.
This structure was obtained using probabilities. It can be proved mathematically that its search performance in the average case is log(N), the same as for a Binary Tree.
When deleting a node, we first find it in the list, and then we delete it from the bottom list and also from the lists above. In order to make the deletion easier, it is good to store for each node a pointer towards the node above it.
Implementation in C#
The implementation is pretty straight ahead, without much difficulties.
In order to make the structure as general as possible, we do not use the -INF and +INF values, because sometimes it can be hard to determine those values. For example if our keys are string, then we cannot be sure which is are the -INF and the +INF values. Instead we will hold sentinels with NULL values and make the comparisons accordingly.
Here is a C# class that implements a Skip List structure. The code is well commented, so there should be no difficulty in understanding it: SkipList.cs.
We made a benchmark and compared our Skip List implementation with the SortedList class that C# provides. For 50,000 insertions, searches and removals, our SkipList class performs about 10 times faster than the standard SortedList class. So it may be worth using it in situations when speed is important. Here is the benchmark code: MainClass.cs.
It would be interesting to write some benchmark code and compare the performance of the SkipList class against a BinaryTree implementation class. But this can be the subject of another article.
Thank you and good luck with using Skip Lists. ;)