Nearest neighbours with a subset structure

An answer to this question on the Scientific Computing Stack Exchange.

Question

I have a set of points of $\mathbb{R}^2$ organised in subsets: $$ \bigcup_{0\leq i<N} \left\{ P_i^j \in \mathbb{R}^2,\; 0\leq j<M \right\} $$ For every $(i,j)$, I want to find the point $P_k^l$ in another subset (i.e. with $k\not= i$) such that $d(P_k^l,P_i^j)$ is minimal. And "of course" (*), many points with $k=i$ are in general much closer to $P_i^j$ than any point with $k\not=i$ (**).

Does that evoke any well known variant of nearest neighbour search problems?

If I use a standard R-tree approach, the leaves of the search tree will be full of useless $P_i^l$ points, for which costly distance calculations will be performed and which I will then have to filter out. Of course this shouldn't alter the logarithmic $O(\log(NM))$ cost if the leaf size is appropriate, but I expect the constant to be much larger than with a tailored algorithm. Before deciding whether to code my own implementation, I would like to make sure that there isn't a classical algorithm that handles this case efficiently.
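For concreteness, the filter-after-query approach can be sketched with scikit-learn's `NearestNeighbors` (the `labels` array, the `nearest_other` helper, and the choice of `n_neighbors=50` are illustrative assumptions, not part of the question):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
N, M = 4, 100  # small sizes for illustration
points = rng.random((N * M, 2))          # all points in one flat array
labels = np.repeat(np.arange(N), M)      # labels[idx] = subset index i

# One tree over all N*M points: over-fetch neighbours, then discard
# those that belong to the query point's own subset.
nn = NearestNeighbors(n_neighbors=50).fit(points)

def nearest_other(idx):
    """First neighbour of points[idx] whose subset differs from labels[idx]."""
    dist, ind = nn.kneighbors(points[idx:idx + 1])
    for d, j in zip(dist[0], ind[0]):
        if labels[j] != labels[idx]:
            return j, d
    raise RuntimeError("n_neighbors too small; increase and retry")

j, d = nearest_other(0)
```

The weakness the question points out is visible here: `n_neighbors` must be guessed large enough to survive the filtering (the question estimates roughly 40 same-subset points may come first), and every discarded candidate still paid for a distance computation.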

Notes: $N$ is of order 10, while $M$ is of order $10^3$ to $10^4$. I'm using Python, currently with scikit-learn's nearest-neighbour search. (*) Not because it's natural, but because otherwise there would be no reason to look for a variant algorithm. (**) The estimate is that about 40 points $P_i^l$ will in general be closer to $P_i^j$ than any $P_{k\not=i}^m$ for a usual meshing, and this number will increase linearly with mesh refinement. This is because the points of a given subset discretise a curve of $\mathbb{R}^2$, and the $N$ curves don't intersect or touch.

Answer

Make a separate R-tree for each subset. A nearest-neighbour lookup is then $O(N \log M)$: query each of the $N-1$ other trees and keep the overall minimum. If the number of falsely proximal points exceeds $N$, you should see a time saving versus a single R-tree customised to discriminate by subset. You can also parallelise the queries across the $N$ trees.
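A minimal sketch of this per-subset scheme, using SciPy's `cKDTree` as the tree structure (SciPy ships no R-tree, but a KD-tree plays the same role here; the synthetic parallel-line "curves" and the `nearest_in_other_subsets` helper are illustrative assumptions):

```python
import numpy as np
from scipy.spatial import cKDTree

N, M = 10, 1000  # orders of magnitude from the question

# Hypothetical data: N non-intersecting curves, each sampled at M points.
# Here curve i is simply the horizontal line y = i, for illustration.
subsets = [np.column_stack([np.linspace(0.0, 1.0, M),
                            np.full(M, float(i))]) for i in range(N)]

# One tree per subset.
trees = [cKDTree(pts) for pts in subsets]

def nearest_in_other_subsets(i, j):
    """Return (k, l, d): the closest point P_k^l with k != i to P_i^j."""
    p = subsets[i][j]
    best = (None, None, np.inf)
    for k, tree in enumerate(trees):
        if k == i:
            continue            # never query the point's own subset
        d, l = tree.query(p)    # O(log M) per tree, N-1 trees in total
        if d < best[2]:
            best = (k, l, d)
    return best

k, l, d = nearest_in_other_subsets(0, 500)
```

Because each tree contains only one subset, no candidate is ever filtered out after the fact, and the $N-1$ independent `query` calls are trivially parallelisable (e.g. with `multiprocessing` or by batching the query points per tree).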