Set and Cluster Similarity in R
The following set similarity functions are written in C++ for easy integration
with R using the sourceCpp
command.
The first function computes the Jaccard index assuming inputs are unique, sorted integers where each integer denotes an element of a set. L1 and L2 are both sets.
/**
* The Jaccard Similarity Coefficient or Jaccard Index is used to compare the
* similarity/diversity of sample sets. It is defined as the size of the
* intersection of the sets divided by the size of the union of the sets.
*
* INPUTS:
* L1: A list of unique, sorted integers denoting set elements.
* L2: A list of unique, sorted integers denoting set elements.
*
* RETURNS:
* Size(Intersection(L1,L2))/Size(Union(L1,L2))
*
* SIDE-EFFECTS:
* None: this is the 21st-century.
*
* COMPLEXITY:
* Time: O(N1+N2)
* Space: O(1)
*/
// [[Rcpp::export]]
NumericVector JaccardSimilarity(const NumericVector L1, const NumericVector L2){
int n1 = L1.size(); //Number of elmeents in set 1
int n2 = L2.size(); //Number of elmeents in set 2
int i1 = 0; //Current element of interest in set 1
int i2 = 0; //Current element of interest in set 2
int size_intersection = 0; //Result: to be incremented in algorithm
int size_union = 0; //Result: to be incremented in algorithm
while(true){
if(i1==n1){
size_union += n2-i2;
break;
} else if (i2==n2){
size_union += n1-i1;
break;
}
if(L1[i1]==L2[i2]){ //Element is in both sets.
i1++; //Therefore, we discard both the current element
i2++; //from both sets and mark the current element as
size_union++; //being part of both the union and the
size_intersection++; //intersection
} else if(L1[i1]<L2[i2]){ //Element is only in the first set.
i1++; //Throw it out
size_union++; //and mark it as a member of only the union
} else if(L1[i1]>L2[i2]){ //Element is only in the second set.
i2++; //Throw it out
size_union++; //and mark it as a member of only the union
}
}
return NumericVector::create((double)size_intersection/(double)size_union);
}
The second function computes the Jaccard index assuming only that L1 and L2 are both sets whose integers indicate set members.
/**
* The Jaccard Similarity Coefficient or Jaccard Index is used to compare the
* similarity/diversity of sample sets. It is defined as the size of the
* intersection of the sets divided by the size of the union of the sets.
*
* INPUTS:
* L1: A list of integers denoting set elements.
* L2: A list of integers denoting set elements.
*
* RETURNS:
* Size(Intersection(L1,L2))/Size(Union(L1,L2))
*
* SIDE-EFFECTS:
* None: this is the 21st-century.
*
* COMPLEXITY:
* Time: O(N1+N2)
* Space: O(N1+N2) worst-case
*/
#include <unordered_map>
// [[Rcpp::export]]
NumericVector JaccardSimilarityUnsorted(const NumericVector L1, const NumericVector L2){
int n1 = L1.size(); //Number of elmeents in set 1
int n2 = L2.size(); //Number of elmeents in set 2
int size_intersection = 0; //Result: to be incremented in algorithm
int size_union = 0; //Result: to be incremented in algorithm
std::unordered_map<int, unsigned char> seen;
for(int i1=0;i1<n1;i1++){ //Loop through all elements in Set 1
if(!seen.count(L1[i1])){ //If we haven't seen the element before
seen[L1[i1]] = 1; //Mark it as having been seen in Set 1
size_union++; //And as being part of the union
}
}
for(int i2=0;i2<n2;i2++){ //Loop through all elements in Set 2
if(!seen.count(L2[i2])){ //If we have not seen the element before
size_union++; //then it is part of the union.
seen[L2[i2]] = 2; //Mark it as being done.
} else if(seen[L2[i2]]==1){ //Elsewise, we've seen the element in the
size_intersection++; //Set 1, it is part of the intersection
//But not the union, because that would
//be double counting.
seen[L2[i2]] = 2; //Mark it as being done.
}
}
return NumericVector::create((double)size_intersection/(double)size_union);
}
This function, due to Jamie Murdoch, quickly computes the Fowlkes-Mallows index of a cluster assignment.
/**
* Third iteration of the Fowlkes and Mallows similarity in C++.
* For each cluster i, track the number of elements in that cluster.
* For each pair of clusters (x, y), track the number of elements assigned to x
* in L1 and to y in L2. This is the "overlap" between x and y.
* <C1, C1> is the sum of squares of cluster sizes. To see this, note that
* for all pairs of elements i, j in cluster x, c_{ij} = 1. For all pairs
* of elements in different clusters, c_{ij} = 0. Then, for each cluster x,
* the number of elements with value 1 in C_{ij} is |x|^2, the square of the
* cluster size.
* Another way to see this is that C_{ij} is an adjacency matrix for a graph, in
* which the clusters represent separate cliques. We want to count the number
* of edges for each clique, counting self-edges once and undirected edges
* between elements twice.
* <C1, C2> is the sum of squares of overlap sizes. To see this, consider the
* overlap O_{xy} between clusters x and y. This is the number of elements that
* belong to x in L1 and to y in L2. For any pair (i,j) of these elements,
* C_{ij} = 1 in both C1 and C2. This ordered pair contributes 1 to <C1, C2>.
* We sum these contributions across all pairs in the overlap (of which there
* are |O_{xy}|^2) and all overlaps.
*
* INPUTS:
* L1: Each item of the list is a point. The number stored at the element
* is the cluster the point is assigned to.
* L2: The same as L1. Must be the same length as L1.
*
* RETURNS:
* The Fowlkes and Mallows index
*
* SIDE-EFFECTS:
* None
*
* COMPLEXITY:
* Time: O(K^2+n), where K = number of clusters
* Space: O(K^2)
*
* SOURCES:
* Asa Ben-Hur, Andre Elisseeff, and Isabelle Guyon (2001) A stability based
* method for discovering structure in clustered data. Biocomputing 2002: pp.
* 6-17.
*/
// [[Rcpp::export]]
NumericVector FowlkesMallows(const NumericVector L1, const NumericVector L2){
int n = L1.size();
int K = max(L1);
ind overlaps[K][K];
ind cluster_sizes1[K], cluster_sizes2[K];
for(int i = 0; i < K; i++){
cluster_sizes1[i] = 0;
cluster_sizes2[i] = 0;
for(int j = 0; j < K; j++)
overlaps[i][j] = 0;
}
//O(n) time. O(K^2) space. Determine the size of each cluster as well as the
//size of the overlaps between the clusters.
for(int i = 0; i < n; i++){
cluster_sizes1[(int)L1[i] - 1]++; // -1's account for zero-based indexing
cluster_sizes2[(int)L2[i] - 1]++;
overlaps[(int)L1[i] - 1][(int)L2[i] - 1]++;
}
// O(K^2) time. O(1) space. Square the overlap values.
int C1dotC2 = 0;
for(int j = 0; j < K; j++){
for(int k = 0; k < K; k++){
C1dotC2 += pow(overlaps[j][k], 2);
}
}
// O(K) time. O(1) space. Square the cluster sizes
int C1dotC1 = 0, C2dotC2 = 0;
for(int i = 0; i < K; i++){
C1dotC1 += pow(cluster_sizes1[i], 2);
C2dotC2 += pow(cluster_sizes2[i], 2);
}
return NumericVector::create((double)C1dotC2 / (sqrt(C1dotC1) * sqrt(C2dotC2)));
}
We can modify the above to quickly compute the cluster assignment similarity using the Jaccard index, as suggested by Ben-Hur, Elisseeff, and Guyon (2002).
/**
* The Jaccard Similarity Coefficient or Jaccard Index is used to compare the
* similarity/diversity of sample sets. It is defined as the size of the
* intersection of the sets divided by the size of the union of the sets. Here,
* it is used to determine how similar to clustering assignments are.
*
* INPUTS:
* L1: A list. Each element of the list is a number indicating the cluster
* assignment of that number.
* L2: The same as L1. Must be the same length as L1.
*
* RETURNS:
* The Jaccard Similarity Index
*
* SIDE-EFFECTS:
* None
*
* COMPLEXITY:
* Time: O(K^2+n), where K = number of clusters
* Space: O(K^2)
*
* SOURCES:
* Asa Ben-Hur, Andre Elisseeff, and Isabelle Guyon (2001) A stability based
* method for discovering structure in clustered data. Biocomputing 2002: pp.
* 6-17.
*/
// [[Rcpp::export]]
NumericVector JaccardIndex(const NumericVector L1, const NumericVector L2){
int n = L1.size();
int K = max(L1);
ind overlaps[K][K];
ind cluster_sizes1[K], cluster_sizes2[K];
for(int i = 0; i < K; i++){
cluster_sizes1[i] = 0;
cluster_sizes2[i] = 0;
for(int j = 0; j < K; j++)
overlaps[i][j] = 0;
}
//O(n) time. O(K^2) space. Determine the size of each cluster as well as the
//size of the overlaps between the clusters.
for(int i = 0; i < n; i++){
cluster_sizes1[(int)L1[i] - 1]++; // -1's account for zero-based indexing
cluster_sizes2[(int)L2[i] - 1]++;
overlaps[(int)L1[i] - 1][(int)L2[i] - 1]++;
}
// O(K^2) time. O(1) space. Square the overlap values.
int C1dotC2 = 0;
for(int j = 0; j < K; j++){
for(int k = 0; k < K; k++){
C1dotC2 += pow(overlaps[j][k], 2);
}
}
// O(K) time. O(1) space. Square the cluster sizes
int C1dotC1 = 0, C2dotC2 = 0;
for(int i = 0; i < K; i++){
C1dotC1 += pow(cluster_sizes1[i], 2);
C2dotC2 += pow(cluster_sizes2[i], 2);
}
return NumericVector::create((double)C1dotC2/(double)(C1dotC1+C2dotC2-C1dotC2));
}