Optimal line such that maximum points are between an upper and lower boundary

2019-10-27

An answer to this question on the Scientific Computing Stack Exchange.

Question

I have some 2D data and would like to find a line $y = mx + b$ such that a maximum number of points from the data is captured within the area between $y = mx + b + margin$ and $y = mx + b - margin$. Please see the image for an example (the yellow lines are boundaries, where margin = 20).

If it helps, the values on the y-axis are necessarily integers.

Currently, I have used linear regression (from scikit-learn) to find a line. However, this is not guaranteed to put the most points between the boundaries. I have also looked a bit into support vector regression (SVR), which seems to try to put every point within a certain epsilon of the line, but I'm not sure if this is the right application or if SVR is overkill.

Any Python code would be appreciated, but more generally I would just like a conceptual solution. Does anyone have any insights that would help solve this problem? Thank you!

Edit: the number of points $n$ is in the range of 5000-10000, and theoretically I only need to compute this once.

Answer

You can approximate the answer in $O(N log N + kN)$ time where $k$ is the number of sample lines you wish to test. To do so:

Choose a uniform set of $k$ points $\theta$ spanning the range [0,π). These are your sample lines.
Do a loop starting with the line for $\theta=0$. ($O(N)$)
Get a unit vector $u$ in the direction of $\theta$. ($O(N)$)
Get a vector $\hat u$ perpendicular to this vector. ($O(N)$)
Project all the points onto $\hat u$. ($O(N)$)
Sort the points along $\hat u$. ($O(N)$)
Slide a window the size of your margin along the sorted points and note when the maximal number of points are within this window along with the midpoint of the window when this occurs. ($O(N)$)
Return information about the line which had the maximum point count.

The algorithm takes $O(N log N)$ time for initially sorting the points. However, after this we are projecting the points onto a slowly rotating line. The slow rotation means that each reprojection results in the points already being very close to their previously sorted positions:

Most sorting algorithms are able to deal with this situation in nearly-linear time. Note also that as the number of samples increases, the degree to which the points are already sorted increases as well.

The full algorithm looks like this:

#include <algorithm>
#include <cassert>
#include <cmath>
#include <functional>
#include <iostream>
#include <random>
#include <vector>
class FracPoint {
 public:
  float frac = std::numeric_limits<float>::quiet_NaN();
  float x;
  float y;
  FracPoint(float x0, float y0) : x(x0), y(y0) {}
  bool operator<(const FracPoint &o) const {
    return frac<o.frac;
  }
  bool operator>(const FracPoint &o) const {
    return frac>o.frac;
  }
};
struct BestLine {
  int    count;
  double slope;
  double intercept;
};
typedef std::vector<FracPoint> Points;
typedef std::pair<int,double> CountIntercept;
///Projects points to their closest orthogonal point on a line at an angle theta
///to the x-axis. Sets each point's `frac` property to the number of unit
///vectors their projection is from the origin. This can be used to sort points
///along the length of the line.
void GetFractionalDistPointsProjectedToVector(
  Points &pts,
  const double theta
){
  const double tvx = std::cos(theta);
  const double tvy = std::sin(theta);
  //Get a line perpendicular to the one we just generated. This is the line we
  //sort the points along to find the maximal margin.
  const auto vx = -tvy;
  const auto vy = tvx;
  //Do a dot product (v dot s)/(s dot s) and multiply by s to get projection
  for(auto &pt: pts)
    pt.frac = (pt.x*vx+pt.y*vy)/(vx*vx+vy*vy);
}
CountIntercept FindBestIntercept(Points &pts, const double margin){
  Points::iterator upper=pts.begin();
  Points::iterator lower=pts.begin();
  int best_count = 0;
  double best_pos = 0;
  for(;upper!=pts.end();++upper){
    while(upper->frac-lower->frac>margin)
      ++lower;
    const auto count = upper-lower+1;
    if(count>best_count){
      best_count = count;
      best_pos = (upper->frac-lower->frac)/2;
    }
  }
  return CountIntercept(best_count,best_pos);
}
BestLine FindBestLine(Points &pts, const int samples, const double margin){
  int    best_count = 0;
  double best_theta;
  for(int i=0;i<samples;i++){
    const double theta = i*(M_PI/samples);
    //Project points onto the line defined by theta
    GetFractionalDistPointsProjectedToVector(pts, theta);
    //Since points are already mostly sorted from the last projection, use
    //insertion sort's O(N) best case scenario to make them totally sorted.
    // ska_sort(pts.begin(),pts.end(),[](const FracPoint &pt){return pt.frac;});
    std::sort(pts.begin(),pts.end());
    const auto count_intercept = FindBestIntercept(pts, margin);
    if(count_intercept.first>best_count){
      best_count = count_intercept.first;
      best_theta = theta;
    }
  }
  //Project points onto the line defined by the best theta
  GetFractionalDistPointsProjectedToVector(pts, best_theta);
  //Since points are being projected onto a "randomly" chosen line, we use an
  //O(N log N) sort
  std::sort(pts.begin(),pts.end());
  const auto count_intercept = FindBestIntercept(pts, margin);  
  //`count_intercept.second` isn't actually the intercept, it's the distance
  //along a line perpendicular to the one defined theta at which the best
  //intercept is located. Let's find the real intercept now.
  double intercept;
  {
    //Get vector along theta
    const double tvx = std::cos(best_theta);
    const double tvy = std::sin(best_theta);
    //Get perpendicular vector
    auto vx = -tvy;
    auto vy = tvx;
    //Multiply by `counter_intercept.second` to get absolute position of best
    //point.
    vx *= count_intercept.second;
    vy *= count_intercept.second;
    //Get slope of line
    double slope = std::tan(best_theta);
    //Use point-slope equation of line y-y1=m*(x-x1) and reformulate to y=m*x+b
    //to recover the actual intercept
    intercept = -slope*vx+vy;
  }
  return {best_count, std::tan(best_theta), intercept};
}
int main(int argc, char **argv){
  if(argc!=3){
    std::cout<<"Syntax: "<<argv[0]<<" <N> <Samples>"<<std::endl;
    return -1;
  }
  const int N       = std::stoi(std::string(argv[1]));
  const int samples = std::stoi(std::string(argv[2]));
  //Poor way of initializing PRNG. Fine for this application.
  std::mt19937 gen(1234567); 
  std::uniform_real_distribution<> dis(-100, 100);
  auto dice = std::bind(dis,gen);
  Points pts;
  for(int i=0;i<N;i++)
    pts.emplace_back(dice(),dice());
  const auto ret = FindBestLine(pts, samples, 25);
  std::cout<<"Count          = "<<ret.count<<std::endl;
  std::cout<<"Best slope     = "<<ret.slope<<std::endl;
  std::cout<<"Best intercept = "<<ret.intercept<<std::endl;
  return 0;
}

Now, we time the algorithm for 100 samples at various numbers of points, which gives the following:

1000,       0.014
10000,      0.083
50000,      0.445
100000,     0.943
500000,     5.615
1000000,   12.092
5000000,   68.748
10000000, 144.155

which, indeed, is almost linear (slope of 1.028 on a log-log graph with R2=0.9962).

OpenMP can be used as a simple way to parallelize the operation:

#pragma omp parallel for firstprivate(pts)
  for(int i=0;i<samples;i++){
    const double theta = i*(M_PI/samples);
    //Project points onto the line defined by theta
    GetFractionalDistPointsProjectedToVector(pts, theta);
    //Since points are already mostly sorted from the last projection, use
    //insertion sort's O(N) best case scenario to make them totally sorted.
    // ska_sort(pts.begin(),pts.end(),[](const FracPoint &pt){return pt.frac;});
    std::sort(pts.begin(),pts.end());
    const auto count_intercept = FindBestIntercept(pts, margin);
    #pragma omp critical
    if(count_intercept.first>best_count){
      best_count = count_intercept.first;
      best_theta = theta;
    }
  }