Geographical Dispersion Heuristic

An answer to this question on Stack Overflow.

Question

I am trying to implement a hill climbing algorithm to decide which locations to choose from a set of locations based on specific criteria. There are up to 5000 locations to choose from.

One of these criteria is geographic dispersion, thus I need to be able to assign any subset of my locations a value representing dispersion.

Each location has latitude and longitude data.

Speed is an issue, and that is why I need some heuristic that will estimate how dispersed a specific set of locations (i.e. a possible solution) is.

I have tried summing the pairwise distances between all the locations in my potential solution, but this proves too slow.

I then tried the sum of the distances from the centre of all the locations in my potential solution. This proved faster, but it doesn't work as effectively: this approach favours solutions made up of a few clusters of locations.

Any other suggestions will be greatly appreciated.

Answer

Consider three situations:

  1. All your nodes are in a dense cluster.
  2. All of your nodes are in a cluster, but the cluster is wide.
  3. Many of your nodes are in a cluster, but a few nodes are located far from the cluster.

Your sum across all pairwise distances captures (1) and (2) well: close clusters give smaller results than large clusters. How does it do for (3)? Here, a proportion e of the total number of nodes N are far away, at an average distance D. The other (1-e)*N nodes are clustered, at an average distance of d from each other.
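To make the baseline concrete, here is a minimal sketch of the full pairwise metric (function names are illustrative, not the asker's code; it assumes haversine great-circle distances in kilometres):

```python
import math

def haversine(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, a + b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def pairwise_dispersion(points):
    """O(N^2) baseline: mean distance over all unordered pairs."""
    n = len(points)
    total = sum(haversine(points[i], points[j])
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) // 2)
```

This is the metric that proved too slow at 5000 locations, since the number of pairs grows quadratically.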

Now, how many pairwise connections must be considered? For the clustered nodes, there are ((1-e)*N)^2 = N^2-2*e*N^2+e^2*N^2 such connections. For the distant nodes, there are (e*N)^2 = e^2*N^2 connections.

Now, multiply these counts by the average distances and divide by the total number of pairs, N^2, to get the pairwise average: (d*(N^2-2*e*N^2+e^2*N^2)+D*(e^2*N^2))/N^2. Now, assume that e is small; we can then neglect terms incorporating e^2. Thus, the average is approximately d*(1-2*e): the outliers contribute only at order e^2.

Now, consider your second metric: everyone's distance from the average centre point. This does fine on (1) and (2) as well: close clusters have smaller results than large clusters. How does it do on (3)? Use the same e as above to represent the proportion of outliers. Now, the average distance of the nodes from a central point is given by (d*(1-e)*N+D*e*N)/N, which simplifies to d*(1-e)+D*e. The outliers now contribute at order e, so the clustered nodes are no longer weighted as heavily.
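For comparison, the centroid metric can be sketched like so (again illustrative names; it averages raw latitude/longitude for the centre, which is a rough approximation but fine for a heuristic):

```python
import math

def haversine(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, a + b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def centroid_dispersion(points):
    """O(N) metric: mean distance from the coordinate-average centre."""
    n = len(points)
    centre = (sum(p[0] for p in points) / n,
              sum(p[1] for p in points) / n)
    return sum(haversine(p, centre) for p in points) / n
```

This is fast, but as argued above, each outlier shifts the result in proportion to e rather than e^2.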

Is there a way you can have a lightweight computation and still weight the clustered nodes more appropriately? I think so.

My suggestion is that you choose random pairs of nodes from your list, calculate the internode distance for each pair, and then average the results. For (1), tight clusters, all the nodes will be close together, so all of the random pairs you draw will be close to the pairwise average you would have got with your full calculation. For (2), loose clusters, the same will be true. For (3), a cluster with outliers, you are more likely to draw nodes from within the cluster than from without, so the outliers are mostly ignored.

As you increase the number of pairs sampled, the cluster will tend to dominate the random sampling. I am guessing this will provide decent accuracy in O(N) time (sampling a number of pairs proportional to N) rather than O(N^2).
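The sampling suggestion above can be sketched as follows (a minimal illustration; the sample count and function names are my own, and a seeded generator is used so results are reproducible):

```python
import math
import random

def haversine(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, a + b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def sampled_dispersion(points, n_samples=1000, rng=random):
    """Estimate the mean pairwise distance from random pairs: O(n_samples) work,
    independent of the number of points."""
    total = 0.0
    for _ in range(n_samples):
        a, b = rng.sample(points, 2)
        total += haversine(a, b)
    return total / n_samples
```

Inside a hill-climbing loop, you would call this on each candidate subset; a fixed seed per comparison keeps the noise consistent between candidates.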