A normal map lets us look up items based on an exact match for the key, but here is a situation where we want to match on a superset of the search key. We generate some expensive objects related to sets of tags. For example:

{ "foo", "bar", "baz" } => #{expensive object 0}
{ "apple", "cat" }      => #{expensive object 1}
{ "dog" }               => #{expensive object 2}


Example lookups:

{ "foo", "bar", "baz" } => #{expensive object 0}
{ "bar", "baz" }        => #{expensive object 0}
{ "apple }              => #{expensive object 1}
{ "foo", "dog" }        => none


The simplest way to implement this is to iterate through a [([String], ExpensiveObject)] and check for list containment. This is very slow.

A better idea is to iterate isSubsetOf over a list of sets. This can be fairly quick because Data.Set is implemented using balanced binary trees and simple heuristics remove expensive recursive calls in isSubsetOf:



isSubsetOfX :: Ord a => Set a -> Set a -> Bool
isSubsetOfX Tip _ = True
isSubsetOfX _ Tip = False
-- Skip the final split when we hit a singleton.
isSubsetOfX (Bin 1 x _ _) t = member x t
isSubsetOfX (Bin _ x l r) t
= found &&
-- Cheap size checks can sometimes save expensive recursive calls when the
-- result will be False. Suppose we check whether [1..10] (with root 4) is
-- a subset of [0..9]. After the first split, we have to check if [1..3] is
-- a subset of [0..3] and if [5..10] is a subset of [5..9]. But we can bail
-- immediately because size [5..10] > size [5..9].


Another idea is to augment a trie to allow for key-superset search. The paper Index Data Structure for Fast Subset and Superset Queries (Savnik) shows how to do this.

Following the paper we can implement a superset trie like so:

data STrie a b = STrie (Maybe b) (M.Map a (STrie a b))


The Maybe b is a place to store an ExpensiveObject, and the Map gives the children of this node.

Inserting into the STrie is essentially the same as in a normal trie but we first sort the elements of the set. Searching also takes advantage of the sorted search key:

member :: (Ord a) => [a] -> STrie a b -> Bool
member [] _ = False
member k strie = member' (sortNub k) strie
where
member' [] (STrie _ _) = True

member' (x:xs) (STrie _ nodes) = with || without
where

-- (1) Use x, which means finding a child trie labeled x.
with = Just True == (member' xs <$> M.lookup x nodes) -- (2) Skip x, which means finding some child tree where -- the recursive call works. Since x:xs is sorted and the -- values inserted into the trie are sorted, we don't -- need to look at values where x is >= the label. -- -- For example if x = "foo" and the children are -- -- "bar", "egg", "zed" -- -- then there's no point following the "zed" sub-trie -- since we will never find a "foo" there, due -- to the fact that "zed" > "foo". nodes' = M.filterWithKey (\k _ -> k < x) nodes without = any (member' (x:xs) . snd)$ M.toList nodes'


Here are some benchmarks. I used criterion and re-used the generators I defined for the QuickCheck tests.

benchmarking subset-find/subset-trie
time                 495.7 ms   (476.1 ms .. 537.9 ms)
0.999 R²   (0.998 R² .. 1.000 R²)
mean                 481.1 ms   (477.2 ms .. 488.8 ms)
std dev              7.641 ms   (94.67 μs .. 9.084 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking subset-find/Data.Set
time                 2.606 s    (2.127 s .. 3.647 s)
0.981 R²   (0.967 R² .. 1.000 R²)
mean                 2.687 s    (2.476 s .. 2.860 s)
std dev              231.5 ms   (120.8 ms .. 314.1 ms)
variance introduced by outliers: 22% (moderately inflated)

benchmarking subset-find/naive-lists
time                 8.774 s    (6.096 s .. 9.930 s)
0.989 R²   (0.972 R² .. 1.000 R²)
mean                 10.22 s    (9.482 s .. 10.65 s)
std dev              724.3 ms   (294.8 ms .. 995.7 ms)
variance introduced by outliers: 20% (moderately inflated)


Full source: https://github.com/carlohamalainen/superset-trie