Skip to content

Finding the largest array index with negative value

An answer to this question on Stack Overflow.

Question

Given an array of positive elements (1 based indexing), you have to process two types of queries:

  1. (V) find the sum of numbers in the range 1:V (both inclusive)
  2. (V, X) subtract the number X to from all in the range 1:V and report the largest index i in range 1:V such that the value at that index is negative, where the answer for this query is 0 if no such index exists.

I can do the first query using fenwick tree or segment tree but how do i support second query? I have already tried an O(n) time per query approach just checking each element in range 1...V but it times out. I need to process 10^5 such queries over an array of size 10^5.

Answer

The most straight-forward approach is to find the element in O(n) time by simply searching through the array, like so:

arr = [0,5,2,3,10]
largest_i = -1
X = 7
for i in range(len(arr)):
  if arr[i]-X<0:
    largest_i = i
largest_i+1 #Your answer (shifting from a 0-based to a 1-based index)

Given the small size of the array and the time constraints, this should work in any compiled and most interpreted languages.

EDIT

I stand corrected (and it should have been obvious that the worst-case of 10^10 was pretty bad). Since you say this times out, here's a more a sophisticated approach.

  1. Create an AVL tree, or another self-balancing binary tree which supports insertion and removal.
  2. Create key items with attributes value and index. value is the value of the item whereas index is the item's position in the flat array.
  3. Create a hashmap that links to tree nodes based on their index.
  4. Add items to the tree so that it balances by value, but remove items based on index.
  5. When your flat array is updated, update the tree accordingly.
  6. To answer Query 2, do an in-order traversal of the tree until value-X>=0. Return the index of the last nodes for which this was false.

Insertion and deletion of the tree are both in O(log n) whereas the in-order traversal has a worst-case of O(n), but is guaranteed to check only elements which may be negative.

A possible implementation for this is as follows:

#include <map>
#include <vector>
#include <unordered_map>
#include <algorithm>
#include <cstdlib>
#include <cassert>
class MagicArray {
 private:
  typedef std::multimap<int, int> arr_idx_t;
  std::vector<int> arr; //Array of possibly negative integers
  arr_idx_t arr_sorted; //Self-balancing tree of (Integer, Index) pairs
  std::unordered_map<int, arr_idx_t::iterator> arr_idx; //Hash table linking Index to a (Integer, Index)
  void indexElement(const int idx){
    auto ret = arr_sorted.emplace(arr.at(idx),idx);
    arr_idx[idx] = ret;
  }
 public:
  void insert(const int i){
    arr.emplace_back(i);
    const auto idx = arr.size()-1; //Index of most recently inserted element
    indexElement(idx);
  }
  void alter(const int idx, const int newval){
    arr.at(idx) = newval;
    arr_sorted.erase(arr_idx[idx]); //Remove old value from tree
    indexElement(idx);
  }
  int findMatch(const int X){
    //The next two lines reduce run-time from 3s to 0.031s
    if(arr_sorted.rbegin()->first-X<0) //Even largest element is less than zero
      return arr.size();
    int foundi = -1;
    for(const auto &kv: arr_sorted){
      if(kv.first-X<0){
        foundi = std::max(foundi,kv.second);
      } else {
        break;
      }
    }
    return foundi+1;
  }
};
int main(){
  assert(RAND_MAX>10000000); //Otherwise code below will not work
  MagicArray ma;
  for(unsigned int i=0;i<10000;i++)
    ma.insert(rand()%10000);
  for(unsigned int i=0;i<10000;i++){
    ma.alter(rand()%10000,rand()%10000);
    ma.findMatch(rand()%1000000);
  }
}

If you leave out the first two code lines of findMatch this takes 3s on my Intel(R) Core(TM) i5 CPU M480@2.67GHz and 2.359s on an Intel(R) Xeon(R) CPU E5-2680v3@2.50GHz on a supercomputer.

If you include the first two code lines of findMatch then the code takes <0.035s on both machines.

This is why it's very important to consider the ranges of the values. You said that the array includes values in the range [0,10⁵] whereas X is in the range [0,10⁷], this means that 99% of the values X takes will be larger than any value in the array and the answer will therefore simply be the size of the array.

So the trick is to use an inexpensive check to see if we know the answer simply and, if not, to then perform the more expensive search.