GSoC'23 Week 9 & 10: Implementing a Classdef for GAM and Comparison with Other Software
Classdef in Octave
For MATLAB compatibility and a better representation of the GAM model, we need a classdef RegressionGAM that stores the training data and other model parameters. The classdef serves as a representation of the training data, stores the trained model parameters (the pp structures containing the spline coefficients), and the resulting object can be passed directly to different functions to calculate and predict new values.
In Octave, class containers are defined with classdef. A classdef should live in a separate folder named @classname, and this folder contains the code for the class. Member functions can either be defined within the classdef block itself or implemented as separate files inside the @classname folder. Functions implemented as separate files inside the folder are still considered member functions of the class.
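For example, the layout of the class implemented here could look like this (the file names below are illustrative):

@RegressionGAM/
  RegressionGAM.m    # classdef block with properties, constructor, inline methods
  predict.m          # member function implemented as a separate file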
In Octave, a classdef follows this structure:
classdef RegressionGAM

  properties (Access = private)
    property1;
    property2;
    ...
  endproperties

  methods (Access = public)
    function y = method1 (params)
      ...
    endfunction
  endmethods

endclassdef
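As a minimal sketch of what this could look like for our class (the property names X, Y, Intercept, and ppf are illustrative, not the final interface):

classdef RegressionGAM

  properties (Access = public)
    X = [];          # training predictors, one column per predictor
    Y = [];          # training response
    Intercept = [];  # b0 of the fitted model
    ppf = [];        # pp structures of the fitted splines, one per predictor
  endproperties

  methods (Access = public)
    ## constructor: store the training data and basic model parameters
    function obj = RegressionGAM (X, Y)
      obj.X = X;
      obj.Y = Y;
      obj.Intercept = mean (Y);
    endfunction
  endmethods

endclassdef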
The predict function will be essentially the same as gampredict, with the difference that the properties used for the calculation are passed through the RegressionGAM object.
function yhat = predict (obj, Xnew)
  ...
endfunction
In predict we have to evaluate the same model equation with link function $g(\mu)$ to predict new values:
$$g(\mu) = b_0 + f_1(x_1) + f_2(x_2) + \ldots + f_p(x_p)$$
Here $b_0$ is the intercept, and the functions $f_1(x_1)$ to $f_p(x_p)$ are the pp structures of the splines fitted for each predictor $X_1$ to $X_p$.
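A minimal sketch of how predict could evaluate this sum, assuming the object stores the intercept in obj.Intercept and the fitted splines in obj.ppf (illustrative names):

function yhat = predict (obj, Xnew)
  ## start from the intercept b0 ...
  yhat = obj.Intercept * ones (rows (Xnew), 1);
  ## ... and add the contribution of each fitted spline f_j (x_j)
  for j = 1:columns (Xnew)
    yhat = yhat + ppval (obj.ppf (j), Xnew (:, j));
  endfor
endfunction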
Displaying one of the fitted pp structures returns:
ans =

  scalar structure containing the fields:

    form = pp
    breaks =

      -0.9986  -0.6037  -0.2553   0.1376   0.5728   0.9461

    coefs =

      -3.5718   5.4168   0.6050  -1.1549
      -4.9675   1.1851   3.2122  -0.2911
      -0.9905  -4.0068   2.2292   0.7618
       4.8098  -5.1743  -1.3780   0.9590
       2.0628   1.1050  -3.1489  -0.2241

    pieces = 5
    order = 4
    dim = 1
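Such a pp structure is produced by splinefit and evaluated with ppval. For example (the sample data and noise level here are arbitrary):

x = linspace (-1, 1, 100)';              # 100 sample points
y = sin (3 * x) + 0.1 * randn (100, 1);  # noisy samples of sin (3x)
pp = splinefit (x, y, 5, 'order', 3);    # cubic spline with 5 pieces
disp (pp)                                # prints form, breaks, coefs, pieces, ...
yhat = ppval (pp, x);                    # evaluate the fitted spline at x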
Comparison with R and MATLAB
MATLAB uses gradient-boosted trees instead of the backfitting algorithm: it fits each predictor with regression trees rather than splines and uses ensembling methods to train the model.
Comparing results for the two functions used to generate the data:

$$f_1(x) = \sin(3x), \qquad f_2(x) = x^3$$
For the comparison, I generated 100 samples for each of $f_1$ and $f_2$ with some added noise.
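The exact generating script is not shown here, but a typical setup could look like this (the way f1 and f2 are combined and the noise level are assumptions):

n  = 100;
x1 = 2 * rand (n, 1) - 1;    # predictor for f1, uniform on [-1, 1]
x2 = 2 * rand (n, 1) - 1;    # predictor for f2
y  = sin (3 * x1) + x2 .^ 3 + 0.1 * randn (n, 1);   # additive model plus noise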
| # | x1 | x2 | y | Y-pred (MATLAB) | Y-pred (R gam) | Y-pred (R mgcv) | Y-pred (gamfit1) | Y-pred (gamfit2) |
|---|------|------|------|------|------|------|------|------|
| 1 | 0.6294 | -0.6756 | -0.7006 | -0.7114 | -0.62843 | -0.66078 | -0.66318 | -0.67384 |
| 2 | 0.8116 | 0.5886 | -0.5987 | -0.6001 | -0.50462 | -0.58415 | -0.59238 | -0.59944 |
| 3 | -0.746 | -0.3776 | -0.7819 | -0.7761 | -0.67733 | -0.73824 | -0.74523 | -0.75526 |
| 4 | 0.8268 | 0.0571 | -0.8731 | -0.8197 | -0.80212 | -0.83763 | -0.8523 | -0.86255 |
| 5 | 0.2647 | -0.6687 | 0.4301 | 0.4308 | 0.37021 | 0.46247 | 0.46694 | 0.45924 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 98 | 0.0215 | 1.1233 | 1.1233 | 1.1302 | 0.99829 | 1.07974 | 1.0693 | 1.0588 |
| 99 | 0.6353 | -0.7932 | -0.7932 | -0.8229 | -0.86954 | -0.82239 | -0.79388 | -0.80125 |
| 100 | 0.5897 | 0.8363 | 0.8363 | 0.8245 | 0.83821 | 0.83826 | 0.82298 | 0.81463 |
In GAM regression, the convergence criterion of the algorithm is not defined in the original paper by Hastie and Tibshirani. The choice of centering the coefficients of the fitted splines is also optional. In the gampredict1 implementation the coefficients are not centered, whereas in gampredict2 the coefficients are centered and the residuals are calculated by predicting values excluding the jth predictor.
The convergence criterion of the backfitting algorithm also varies. The algorithm is considered converged when the changes in the coefficients are zero or smaller than a small tolerance (e.g. 0.00001), or when the RSS goes to zero:
$$\mathrm{RSS} = \frac{1}{n} \left\| Y - s_0 - \sum_{j=1}^{p} s_j^{(m)} (X_j) \right\|^2$$
Iteration continues until the RSS no longer decreases or the convergence criterion is satisfied. In the code below, gampredict1 uses the RSS to check convergence, while gampredict2 checks the change in the spline coefficients.
Further Improvements
Implementing GAM with regression trees and gradient boosting instead of splines would give gampredict a large runtime advantage.
In the main loop of gampredict1, an outer while loop runs one backfitting iteration per pass until convergence, and an inner for loop traverses all the predictors. This makes the runtime considerable: with more data samples and a lower tolerance, the number of iterations needed to reach convergence increases, and so does the runtime.
while (! converged)
  iterations += 1;
  for j = 1:columns (X)
    ## from the second iteration on, add the current estimate of the
    ## jth smooth back into the residuals before refitting it
    if (iterations > 1)
      res = res + ppval (ppf (j), X (:, j));
    endif
    ## fit a cubic spline with 5 pieces to the partial residuals
    gk = splinefit (X (:, j), res, 5, 'order', 3);
    ## center the spline coefficients
    gk.coefs = gk.coefs - sum (sum (gk.coefs)) / rows (X);
    ## residual sum of squares for the jth predictor (m is the intercept)
    RSSk (j) = abs (sum (abs (y - ppval (gk, X (:, j)) - m)) .^ 2) / rows (y);
    ppf (j) = gk;
    ## remove the updated jth smooth from the residuals
    res = res - ppval (ppf (j), X (:, j));
  endfor
  ## converged when the change in RSS for every predictor is within tol
  if (all (abs (RSS - RSSk) <= tol))
    converged = true;
  endif
  RSS = RSSk;
endwhile
The implementation of gampredict2 uses the same double loop for its main loop, but the residuals are calculated by zeroing out the jth predictor and predicting the values with the remaining splines. This adds extra work and makes the runtime of gampredict2 greater than that of gampredict1.
while (! converged)
  n_itn += 1;
  for j = 1:columns (X)
    ## zero out the jth smooth and predict with the remaining ones
    gk = ppfk;
    gk (j).coefs (:, :) = 0;
    yk = res - predict_val (gk, X, 0);
    ## refit the jth smooth on the partial residuals
    gk = splinefit (X (:, j), yk, 5, 'order', 2);
    ## center the spline coefficients
    gk.coefs = gk.coefs - sum (sum (gk.coefs)) / rows (X);
    ## count the coefficients that still change by more than tol
    dif (j) = sum (sum (abs (ppfk (j).coefs - gk.coefs) >= tol));
    ppfk (j) = gk;
  endfor
  ## converged when no coefficient changes by more than tol
  if (all (dif <= 0))
    converged = true;
  endif
endwhile
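predict_val is a helper function not shown in this post; a plausible sketch of what it does, assuming its third argument is the intercept added to the summed spline contributions:

function yhat = predict_val (ppf, X, intercept)
  ## sum the contribution of every fitted spline, plus the intercept
  yhat = intercept * ones (rows (X), 1);
  for j = 1:columns (X)
    yhat = yhat + ppval (ppf (j), X (:, j));
  endfor
endfunction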