Some thoughts on LSH

This week I have been working on testing different ANN-libraries, reading documentation and understanding what all this ANN-search is about, specifically Locality-sensitive hashing (LSH). In this post I will try to explain which are the bases of LSH and resume the different capabilities of the most important libraries in the field.

How does LSH work?

The idea behind Locality-sensitive hashing (LSH) is to find similar data structures (3D brain maps, in this case). The fundamental idea is that similar brain maps will have similar hash-codes. For this purpose, K-bit hash-codes are generated, that allow us to insert a brain map into a hash-table that has several buckets (with the idea that having similar hash code, an image will end up in the same bucket).

The main problem is that having similar hash-code is incompatible with falling in the same bucket by definition. The way to get rid of this is to repeat L times the procedure of indexing with different hash-tables / hash-codes (randomised process). This is an essential step to avoid missing potential neighbours.

How is this accomplished?

So, the goal is to detect the similar hash-codes for nearby points (nearby in the sense of a given distance). We first have to generate a series of random hyperplanes through our space of points. These planes will divide the data in up and below the plane. (0|1)! This procedure divides the space in 2^k buckets and each bucket will have a binary hash that represents that bucket. As the hyperplanes are randomly generated, this procedure has to be repeated several times (L times = number of hash tables). When completed, a intersection procedure has to be calculated to detect points that are close, but could have been in separate buckets by chance in some of the buckets.

Which is the computational cost for LSH?

If we have:

N points (brain images).
D dimensions of the data (~28000 fur us, using reduced representation of the images. Could be less if any dimensionality reduction algorithm was applied).
K hyperplanes (or bit-codes) and…
L hash-tables

We turn into:

We need DK operations to find the bucket where apoint lands on.
N/2^K expected number of collisions.
DN/2^K cost of comparisons (the more the Ks, the less the cost).
But, we have to repeat everything L times:

so…

LSH = LDK + LDN/2^K → O(log N) (if K ~ log N) else, if K is fixed, O(1)
Indexing = D (ND)/sqrt(ND) → O(sqrt(N))

to make an idea, Brute-force solving of the problem would be = DN → O(N).

You can find more considerations on the LSH problem in this Youtube’s playlist.

Libraries comparison

We have several considerations with respect to the different libraries; At the moment we are using as vectors a ~28000 dimensions reduced representation of the images, which is quite high dimensional (we would probably have to reduce this representation to a lower dimensional solution, but while the bench times keep being efficient/small, we won’t). Other important considerations are: if it is possible to index new vectors after building the engine or how to delete previously indexed image and where to save the ANN engine (Redis, Memory… this could be a major challenge, plus we have to bare in mind the RAM consumption).

I have prepared a table that summarizes the main characteristics of the different candidate libraries for this project. As it can be seen, no one of them fulfills our needs completely, so it is possible that we have to balance pros and cons or we decide to make changes to the ones that best fit us in order to complete meet our needs.

Table 1. Comparison of different ANN-libraries
Libraries	Customization (options)	Index after fitting	Delete indexes	Memory storage	Bench	Lang
Nearpy	Medium	Yes	No (deletes buckets)	Redis / Mongo / Pickle / RAM	Medium - Poor	Python
RPForest	Low (gives no distance back)	Yes	No	Pickle	Medium - Poor	Python / C++
FALCONN	High	No ?	No ?	Unknown	High	C++ (Python wrapper)
NMSLIB	High	No	No	HD / Query server option (Apache)	High	C++ (Python wrapper)
Annoy	Medium	No	No	HD	High - Medium	Python / C++
FLANN	Medium	No	No	No ?	High - Medium	C++ (Python wrapper)
Panns	Medium - Poor	No	No	HD	Medium - Poor	Python
scikit-learn	Medium - Poor (Several Algorithms)	No	No	No ?	Poor	Python
KGraph	High-Medium	No	No	HD	Poor	C++

For the next week I will work on finding a way of testing the performance of the libraries in terms of accuracy at the time of evaluating "similarity". This is a major problem, since I have some questions about what is to be "similar" for 2 given brainmaps.