L2R Gradients

Computation of gradient updates for the L2R models

The following notation is borrowed from *From RankNet to LambdaRank to LambdaMART: An Overview*.

Let \(I\) denote the set of index pairs \(\{i, j\}\) for which we desire token_i to be ranked differently from token_j (for a given label group). Since each pair must be included just once, it is convenient to consider only pairs \(\{i, j\}\) for which token_i is more relevant than token_j.

\[\lambda_{ij} = \sigma \left( \frac{1}{2}(1 - S_{ij}) - \frac{1}{1+e^{ \sigma(p_i - p_j)}} \right) \textsf{ Eq: 3},\] where \(\sigma\) is a hyper-parameter that controls the shape of the sigmoid, \(p_i, p_j\) are the predictions made by the model for token_i and token_j respectively, and

\[S_{ij} = \begin{cases} 1, & \text{if token_i is more relevant than token_j} \\ 0, & \text{if token_i is as relevant as token_j} \\ -1, & \text{if token_j is more relevant} \end{cases}\]

The weight update rule in gradient descent is given by: \[\delta w_k = -\eta \sum_{\{i,j\} \in I} \left(\lambda_{ij} \frac{\partial p_i}{\partial w_k} - \lambda_{ij} \frac{\partial p_j}{\partial w_k}\right) = -\eta \sum_i \lambda_i \frac{\partial p_i}{\partial w_k},\] where

\[\lambda_i = \sum_{j: \{i,j\} \in I} \lambda_{ij} - \sum_{j: \{j,i\} \in I} \lambda_{ji} \textsf{ Eq: 4}.\]
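To make Eq. 3 and Eq. 4 concrete, here is a minimal numpy sketch that accumulates the per-token \(\lambda_i\) for one label group. The function name `lambda_i` and the use of integer relevance labels are assumptions for illustration; this is not the library's API. Because \(I\) contains only pairs where token_i is more relevant than token_j, \(S_{ij} = 1\) for every included pair.

```python
import numpy as np

def lambda_i(preds, rels, sigma=0.5):
    """Compute per-token lambdas (Eq. 3 folded into Eq. 4) for one group.

    preds: 1-D array of model scores p_i.
    rels:  1-D array of relevance labels (higher = more relevant).
    Only pairs with rels[i] > rels[j] are in I, so S_ij = 1 and the
    (1 - S_ij)/2 term in Eq. 3 vanishes.
    """
    n = len(preds)
    lam = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if rels[i] > rels[j]:                       # {i, j} ∈ I
                l_ij = -sigma / (1.0 + np.exp(sigma * (preds[i] - preds[j])))
                lam[i] += l_ij                          # i is the first index of the pair
                lam[j] -= l_ij                          # j is the second index of the pair
    return lam
```

Note that the lambdas of a group always sum to zero, since each \(\lambda_{ij}\) is added to \(\lambda_i\) and subtracted from \(\lambda_j\); and a more relevant token that is under-scored receives a negative \(\lambda_i\), which (via \(\delta w_k = -\eta \lambda_i \partial p_i / \partial w_k\)) pushes its score up.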

Implementing the above equations:

(Handcrafted Gradients)

We can think of the tensor returned by _summation as essentially the summation notation in Eq. 4 above. It has three dimensions: the length of the zeroth dimension is the number of tokens, and each token holds a 2-d tensor. For each token, the columns of that 2-d tensor have the following interpretation.

For each token in a sequence (i.e. the i's) it contains the following information about the other tokens (i.e. the j's):

1. The first column value tells us the row number to index into the pairs array.
2. The last column value tells us whether i is more relevant or less relevant than j. In other words, it determines the sign when computing \(\lambda_i\) in Eq. 4.
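A minimal sketch of how such an index structure could be built. The helper `summation_index` is hypothetical (it is not the library's `_summation`); it only illustrates the (row number, sign) bookkeeping described above, using Python lists of small arrays rather than a single padded 3-d tensor.

```python
import numpy as np

def summation_index(rels):
    """For each token i, build a 2-column array: column 0 is the row
    number of the pair in the pairs array; column 1 is +1 if i is the
    more relevant member of the pair (first index, contributes +λ_ij
    to λ_i) and -1 otherwise (second index, contributes -λ_ij)."""
    n = len(rels)
    pairs = [(i, j) for i in range(n) for j in range(n) if rels[i] > rels[j]]
    out = [[] for _ in range(n)]
    for row, (i, j) in enumerate(pairs):
        out[i].append((row, +1))
        out[j].append((row, -1))
    return pairs, [np.array(o).reshape(-1, 2) for o in out]
```

Given the \(\lambda_{ij}\) for each row of the pairs array, \(\lambda_i\) is then the sign-weighted sum of the rows listed for token i.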


source

rank_loss2

 rank_loss2 (preds, xb, sigma=0.5, lambrank=False, gain_fn=None, k=6)

source

rank_loss3

 rank_loss3 (preds, xb, sigma=0.5, lambrank=False, gain_fn=None, k=6)

If we were to use a loss function instead of hand-crafted gradients:

\[C = \sum_{\{i,j\} \in I} \frac{1}{2}(1 - S_{ij})\sigma(p_i-p_j) + \log(1 + e^{-\sigma(p_i - p_j)})\]
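A small numpy sketch of this cost, under the same convention as above that \(I\) contains only pairs with \(S_{ij} = 1\), so the first term vanishes. The name `pairwise_cost` is hypothetical, not the library's `loss_fn`; a useful sanity check is that the numerical derivative \(\partial C / \partial p_i\) recovers the \(\lambda_i\) of Eq. 4.

```python
import numpy as np

def pairwise_cost(preds, rels, sigma=0.5):
    """Pairwise cross-entropy cost C, summed over pairs {i, j} ∈ I
    (rels[i] > rels[j], hence S_ij = 1 and the (1 - S_ij)/2 term is 0)."""
    C = 0.0
    for i in range(len(preds)):
        for j in range(len(preds)):
            if rels[i] > rels[j]:
                d = sigma * (preds[i] - preds[j])
                C += np.log1p(np.exp(-d))   # log(1 + e^{-σ(p_i - p_j)})
    return C
```

Differentiating this cost with respect to \(p_i\) gives exactly \(\lambda_i\), which is why training with this loss and autograd is equivalent to applying the handcrafted gradients directly.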


source

loss_fn2

 loss_fn2 (preds, xb, sigma=0.5)

Computes average pairwise cross-entropy loss


source

loss_fn

 loss_fn (preds, xb, sigma=0.5)