A normal map lets us look up items based on an exact match for the key, but here is a situation where we want to match on a superset of the search key. We generate some expensive objects related to sets of tags. For example:

{ "foo", "bar", "baz" } => #{expensive object 0}
{ "apple", "cat" }      => #{expensive object 1}
{ "dog" }               => #{expensive object 2}


Example lookups:

{ "foo", "bar", "baz" } => #{expensive object 0}
{ "bar", "baz" }        => #{expensive object 0}
{ "apple }              => #{expensive object 1}
{ "foo", "dog" }        => none


The simplest way to implement this is to iterate through a [ ([String], ExpensiveObject) ] and check for list containment. This is very slow.

A better idea is to iterate isSubsetOf over a list of sets. This can be fairly quick because Data.Set is implemented using balanced binary trees and simple heuristics remove expensive recursive calls in isSubsetOf:

Another idea is to augment a trie to allow for key-superset search. The paper Index Data Structure for Fast Subset and Superset Queries (Savnik) shows how to do this.

Following the paper we can implement a superset trie like so:

The Maybe b is a place to store an ExpensiveObject, and the Map gives the children of this node.

Inserting into the STrie is essentially the same as in a normal trie but we first sort the elements of the set. Searching also takes advantage of the sorted search key:

Here are some benchmarks. I used criterion and re-used the generators I defined for the QuickCheck tests.

benchmarking subset-find/subset-trie
time                 495.7 ms   (476.1 ms .. 537.9 ms)
0.999 R²   (0.998 R² .. 1.000 R²)
mean                 481.1 ms   (477.2 ms .. 488.8 ms)
std dev              7.641 ms   (94.67 μs .. 9.084 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking subset-find/Data.Set
time                 2.606 s    (2.127 s .. 3.647 s)
0.981 R²   (0.967 R² .. 1.000 R²)
mean                 2.687 s    (2.476 s .. 2.860 s)
std dev              231.5 ms   (120.8 ms .. 314.1 ms)
variance introduced by outliers: 22% (moderately inflated)

benchmarking subset-find/naive-lists
time                 8.774 s    (6.096 s .. 9.930 s)
0.989 R²   (0.972 R² .. 1.000 R²)
mean                 10.22 s    (9.482 s .. 10.65 s)
std dev              724.3 ms   (294.8 ms .. 995.7 ms)
variance introduced by outliers: 20% (moderately inflated)


Full source: https://github.com/carlohamalainen/superset-trie