Computing the inverse of a singular matrix for your ML model in NumPy/PyTorch/TensorFlow
The fix also works for Mahalanobis distance calculation in Python (NumPy/TensorFlow/PyTorch), or any other distance measure that requires inverting a singular (or near-singular) matrix.
TL;DR. Use np.linalg.pinv() instead of np.linalg.inv() to calculate the inverse of your matrix!
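To see the difference in isolation, here is a minimal sketch on a toy singular matrix of my own making (not data from my project):

import numpy as np

# The second row is a multiple of the first, so A is singular.
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])

try:
    np.linalg.inv(A)
except np.linalg.LinAlgError as err:
    print(err)            # "Singular matrix"

print(np.linalg.pinv(A))  # Moore-Penrose pseudo-inverse, no exception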
I was calculating the pair-wise distance matrix using scipy.spatial.distance.pdist(), since I wanted to feed my sklearn-based classifier a custom loss function and/or distance metric. I wrote a custom function, customDistanceFunction(), that is called like scipy.spatial.distance.pdist(data, customDistanceFunction).
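For reference, pdist() hands the callable two 1-D sample vectors at a time and expects a scalar back; a toy call with a stand-in Euclidean metric (the data shapes here are made up) looks like this:

import numpy as np
from scipy.spatial.distance import pdist, squareform

data = np.random.rand(5, 3)  # 5 samples, 3 features (stand-in data)

# pdist collects the scalars into a condensed distance vector;
# squareform expands it into the full 5x5 symmetric matrix.
condensed = pdist(data, lambda u, v: np.sqrt(((u - v) ** 2).sum()))
dist_matrix = squareform(condensed)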
I intended to calculate the Mahalanobis pair-wise distance matrix using just Pandas or NumPy, because the rest of my classes use only those (and sklearn). When I looked for existing Python implementations of the Mahalanobis distance, the TensorFlow/PyTorch implementations I found would have needed more modification than I would like: my existing code has predefined formats and data types that I have tried to keep consistent so far.
So I decided to write my own TensorFlow/PyTorch version of the Mahalanobis distance calculation inside customDistanceFunction().
Problem I ran into: Creating a covariance matrix!
TensorFlow, SciPy, and NumPy each provide their own covariance (matrix) function. In fact, the TensorFlow ecosystem provides THREE: tfp.stats.covariance, tft.covariance, and tfp.stats.cholesky_covariance. The last one uses Cholesky decomposition and can substitute for the traditional covariance matrix calculation methods.
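The dimension conventions alone deserve a sketch: both np.cov and torch.cov treat rows as variables by default, which is the opposite of the usual samples-in-rows layout (the shapes below are made up for illustration):

import numpy as np
import torch

X = np.random.rand(100, 4)  # 100 samples, 4 features

# np.cov treats ROWS as variables, so pass rowvar=False (or transpose)
# to get the 4x4 feature covariance matrix.
cov_np = np.cov(X, rowvar=False)

# torch.cov has the same rows-as-variables convention, hence the .T
cov_torch = torch.cov(torch.tensor(X).T)

print(cov_np.shape)  # (4, 4)
print(torch.allclose(torch.tensor(cov_np), cov_torch))  # True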
Nuances in data types, dimensions, formats, and precision levels between the three libraries took more of my time than I am proud to admit. The actual calculation itself is straightforward. Each commented-out line in customDistanceFunction() below marks a single source of problems (hence it is commented out).
import numpy as np
import torch

def customDistanceFunction(x, y):
    # x and y arrive from pdist() as 1-D NumPy sample vectors
    diff = torch.tensor(x - y)
    #cov_mat = tfp.stats.covariance(x, y, sample_axis=0, event_axis=1) #### Issue with dimensions
    #z = torch.cat((x, y), 0) #### Issue with alignment
    #cov_mat = torch.cov(z, correction=1, fweights=None, aweights=None)
    #inv_cov_mat = torch.linalg.inv(cov_mat)
    M = np.multiply.outer(x, y).T
    V = np.cov(M.T)
    #IV = np.linalg.inv(np.matrix(V)) #### Singular matrix: my data samples in x and y are scaled to very low precision (eg: 0.000000057), and inv() takes no atol= argument
    IV = np.linalg.pinv(V, rcond=1e-15)  # pseudo-inverse tolerates a singular V
    #IV = torch.linalg.inv(torch.tensor(tf.convert_to_tensor(V)))
    IV = torch.tensor(IV)
    m = torch.dot(diff, torch.matmul(IV, diff))  # squared Mahalanobis distance
    c = torch.sqrt(m)
    return c
I ended up using NumPy for the base calculations and made the necessary conversions to torch tensors where needed (the uncommented lines).
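With the conversions in place, the function plugs straight into pdist(); a sketch of how I call it (with a stand-in data array):

import numpy as np
from scipy.spatial.distance import pdist, squareform

data = np.random.rand(20, 6)  # stand-in for the real samples
pairwise = squareform(pdist(data, customDistanceFunction))  # 20x20 distance matrix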
The error I encountered when I used np.linalg.inv() to calculate the matrix inverse was “LinAlgError: Singular matrix”. After printing out variables and abusing my Jupyter notebook, I found that my matrix determinant was ZERO!
NumPy (and Python in general) is precise and can operate on very, very small values, but np.linalg.inv() was effectively rounding my tiny matrix elements to ZERO. Unfortunately, np.linalg.inv() does not take a tolerance argument (atol= or tol=) the way some other NumPy functions do.
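The diagnosis is easy to reproduce on a toy matrix with tiny, linearly dependent rows (values invented for illustration):

import numpy as np

V = np.array([[1e-8, 2e-8],
              [2e-8, 4e-8]])  # rows are linearly dependent

print(np.linalg.det(V))  # 0.0, since the rows are dependent
np.linalg.inv(V)         # raises LinAlgError: Singular matrix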
Solution:
Use np.linalg.pinv(), because it takes a tolerance as an argument via the parameter rcond=. As you can see, I kept the rcond value very low so that np.linalg.pinv() does not zero out the small singular values of my matrix.
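Concretely, pinv() computes an SVD and discards every singular value smaller than rcond times the largest one, so a very small rcond keeps even tiny singular values. A toy demonstration (numbers invented, echoing my ~5.7e-8 scale):

import numpy as np

A = np.diag([1.0, 5.7e-8])  # one healthy and one tiny singular value

# rcond=1e-15: cutoff is 1e-15 * 1.0, so 5.7e-8 survives and is inverted
print(np.linalg.pinv(A, rcond=1e-15))  # [[1, 0], [0, ~1.75e7]]

# rcond=1e-3: cutoff is 1e-3 * 1.0, so 5.7e-8 is treated as zero
print(np.linalg.pinv(A, rcond=1e-3))   # [[1, 0], [0, 0]]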
My customDistanceFunction() worked like a charm, and I went off to finish the rest of my program.
BONUS: Naturally, the same issue shows up if you run kNN with Mahalanobis as the metric, i.e. KNN(metric='mahalanobis').
Solution: Calculate the (pseudo-)inverse of the covariance matrix yourself and pass it to KNN() (scikit-learn's KNeighborsClassifier) via the metric_params option 'VI', like below:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier as KNN

XtrainCov = np.cov(Xtrain, rowvar=False) ##calculate the covariance of the input set (features in columns)
XtrainInv = np.linalg.pinv(XtrainCov, rcond=1e-15) ##calculate the pseudo-inverse of the covariance matrix
knn = KNN(algorithm='auto', metric='mahalanobis', metric_params={'VI': XtrainInv}) ##initialize your KNN classifier
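And here is a hedged end-to-end sketch with random stand-in data (shapes and labels invented), just to show the pieces fitting together:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier as KNN

Xtrain = np.random.rand(50, 4)        ##stand-in: 50 samples, 4 features
ytrain = np.random.randint(0, 2, 50)  ##stand-in binary labels

VI = np.linalg.pinv(np.cov(Xtrain, rowvar=False), rcond=1e-15)
clf = KNN(metric='mahalanobis', metric_params={'VI': VI})
clf.fit(Xtrain, ytrain)
print(clf.predict(Xtrain[:5]))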