Custom loss functions in LightGBM and XGBoost.

Training loss and Validation loss

Training loss: This is the function that is optimized on the training data. For example, in a neural network binary classifier, this is usually the binary cross entropy. For the random forest classifier, this is the Gini impurity. The training loss is often called the “objective function” as well. The training loss in LightGBM is called objective.

Validation loss: This is the function that we use to evaluate the performance of our trained model on unseen data. This is often not the same as the training loss. For example, in the case of a classifier, this is often the area under the curve of the receiver operating characteristic (ROC), though this is never directly optimized, because it is not differentiable. This is often called the “performance or evaluation metric”. The validation loss is often used to tune hyper-parameters. It is often easier to customize, as it doesn’t have as many functional requirements as the training loss does. The validation loss can be non-convex, non-differentiable, and discontinuous. The validation loss in LightGBM is called metric.

We can use the validation loss for early stopping: as the number of iterations (boosting rounds, i.e. the number of trees) increases, training stops once the loss has not improved for early_stopping_rounds consecutive rounds.
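For example, with the native API this looks as follows (a minimal sketch; lgb_train and lgb_eval are placeholder lgb.Dataset objects, and the early_stopping_rounds argument follows the older lgb.train signature used throughout this post):

import lightgbm as lgb

params = {'objective': 'regression', 'metric': 'l2', 'verbose': -1}
gbm = lgb.train(params,
                lgb_train,                 # placeholder training Dataset
                num_boost_round=1000,
                valid_sets=[lgb_eval],     # placeholder validation Dataset
                early_stopping_rounds=50)  # stop if l2 has not improved for 50 rounds
print(gbm.best_iteration)                  # number of boosting rounds actually kept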

However, if the validation loss function is twice differentiable, we can consider using it directly as the training loss to optimize the model.

Training loss: Customizing the training loss in LightGBM requires defining a function that takes in two arrays, the targets and their predictions. In turn, the function should return two arrays, the gradient and the hessian for each observation. As noted above, we need to use calculus to derive the gradient and hessian and then implement them in Python.

Validation loss: Customizing the validation loss in LightGBM requires defining a function that takes in the same two arrays, but returns three values: a string with the name of the metric to print, the loss itself, and a boolean indicating whether higher is better.

Official example: a custom log likelihood loss in LightGBM:

import numpy as np

# self-defined objective function
# f(preds: array, train_data: Dataset) -> grad: array, hess: array
# log likelihood loss
def loglikelihood(preds, train_data):
    labels = train_data.get_label()
    # with a custom objective, preds are raw scores (margins), so apply the sigmoid first
    preds = 1. / (1. + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1. - preds)
    return grad, hess

# self-defined eval metric
# f(preds: array, train_data: Dataset) -> name: string, eval_result: float, is_higher_better: bool
# binary error
def binary_error(preds, train_data):
    labels = train_data.get_label()
    # preds are raw scores here as well, so convert to probabilities before thresholding
    preds = 1. / (1. + np.exp(-preds))
    return 'error', np.mean(labels != (preds > 0.5)), False

Examples

(1) Custom MSE

Consider the following scenario: we are catching a train or a flight and want to predict our departure time so that our waiting time is minimized. The penalties for arriving early and arriving late are not the same. Arriving at the airport or train station early is no big deal, but arriving late causes real trouble... So obviously, we need to increase the penalty for being late when modeling. In the customMSE formula below, being late is penalized 10 times more heavily.

$$ customMSE = \frac{1}{N}\sum_{i} g_i \\ g_i =\left\{\begin{array}{ll}{(y_i - \hat{y_i})^2,} & {y_i \geq \hat{y_i}} \\ {10 \times (y_i - \hat{y_i})^2,} & {y_i < \hat{y_i}}\end{array}\right. $$

The loss, its gradient, and its hessian can be plotted to see the asymmetry.
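The gradient and hessian used in custom_asymmetric_train below follow directly from differentiating $g_i$ with respect to the prediction $\hat{y_i}$:

$$ \frac{\partial g_i}{\partial \hat{y_i}}=\left\{\begin{array}{ll}{-2(y_i-\hat{y_i}),} & {y_i \geq \hat{y_i}} \\ {-20(y_i-\hat{y_i}),} & {y_i < \hat{y_i}}\end{array}\right. \qquad \frac{\partial^{2} g_i}{\partial \hat{y_i}^{2}}=\left\{\begin{array}{ll}{2,} & {y_i \geq \hat{y_i}} \\ {20,} & {y_i < \hat{y_i}}\end{array}\right. $$

The code below works with residual = y_true - y_pred, so the residual < 0 branch is exactly the "arriving late" case with the 10x penalty.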

We can then define the custom training loss and validation loss:

# custom training loss: gradient and hessian of the asymmetric squared error w.r.t. the prediction
def custom_asymmetric_train(y_true, y_pred):
    residual = (y_true - y_pred).astype("float")
    grad = np.where(residual < 0, -2*10.0*residual, -2*residual)
    hess = np.where(residual < 0, 2*10.0, 2.0)
    return grad, hess

# custom validation loss: returns (name, value, is_higher_better)
def custom_asymmetric_valid(y_true, y_pred):
    residual = (y_true - y_pred).astype("float")
    loss = np.where(residual < 0, (residual**2)*10.0, residual**2)
    return "custom_asymmetric_eval", np.mean(loss), False
import lightgbm as lgb
# ********* Sklearn API **********
# default lightgbm model with sklearn api
gbm = lgb.LGBMRegressor()
# updating objective function to custom
# default is "regression"
# also adding metrics to check different scores
gbm.set_params(**{'objective': custom_asymmetric_train}, metrics=["mse", 'mae'])
# fitting model 
gbm.fit(
    X_train,
    y_train,
    eval_set=[(X_valid, y_valid)],
    eval_metric=custom_asymmetric_valid,
    verbose=False)
y_pred = gbm.predict(X_valid)

# ********* Python API **********
# create dataset for lightgbm
# if you want to re-use data, remember to set free_raw_data=False
lgb_train = lgb.Dataset(X_train, y_train, free_raw_data=False)
lgb_eval = lgb.Dataset(X_valid, y_valid, reference=lgb_train, free_raw_data=False)
# specify your configurations as a dict
params = {'objective': 'regression', 'verbose': 0}
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model=gbm,
                fobj=custom_asymmetric_train,
                feval=custom_asymmetric_valid,
                valid_sets=lgb_eval)           
y_pred = gbm.predict(X_valid)
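To see whether the custom objective actually helps on this metric, we can score both a default model and the custom one with custom_asymmetric_valid. A minimal sketch (gbm_default is a hypothetical baseline trained with the default squared-error objective):

# hypothetical baseline with the default 'regression' objective
gbm_default = lgb.LGBMRegressor().fit(X_train, y_train)

_, default_score, _ = custom_asymmetric_valid(y_valid, gbm_default.predict(X_valid))
_, custom_score, _ = custom_asymmetric_valid(y_valid, gbm.predict(X_valid))
print(f"asymmetric MSE - default: {default_score:.4f}, custom: {custom_score:.4f}")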

The complete code is available here.

(2) Cost-sensitive Logloss

Similar to the example above, in disease diagnosis the penalties for FN and FP should also differ. Diagnosing a healthy person as sick (FP: False Positive) is tolerable, but diagnosing a sick person as healthy (FN: False Negative) is far worse, for example missing a cancer diagnosis and losing the best window for treatment. So we can define a loss that penalizes FN more heavily:

$$ \begin{aligned} customLogLoss &= -\frac{1}{N} \sum_{i} (5 \times FN+FP) \\ FN &= y \times \log (\hat{y}) \\ FP &= (1-y) \times \log (1-\hat{y}) \\ \hat{y} &=min(max(p, 10^{-7}), 1-10^{-7})\\ p &=\frac{1}{1+e^{-x}} \end{aligned} $$

The gradient and hessian of this loss function:

$$ \begin{array}{l}{\frac{d\,\mathrm{Loss}}{dx}=4py+p-5y} \\ {\frac{d^{2}\,\mathrm{Loss}}{dx^{2}}=(4y+1)\,p(1-p)}\end{array} $$

where the derivative of the sigmoid function is:

$$ \begin{array}{l}{p = \frac{e^x}{e^x+1}\;,\ 1-p = \frac{1}{e^x+1}} \\ {\frac{dp}{dx}= \frac{e^x(e^x+1)-(e^x*e^x)}{(e^x+1)^2} = \frac{e^x}{(e^x+1)^2} = p*(1-p)}\end{array} $$
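Applying the chain rule to $L = -(5y \log p + (1-y)\log(1-p))$ reproduces the gradient and hessian above:

$$ \frac{dL}{dx}=\frac{dL}{dp}\cdot\frac{dp}{dx}=\left(-\frac{5y}{p}+\frac{1-y}{1-p}\right)p(1-p)=-5y(1-p)+(1-y)p=4py+p-5y $$

$$ \frac{d^{2}L}{dx^{2}}=\frac{d}{dx}(4py+p-5y)=(4y+1)\frac{dp}{dx}=(4y+1)\,p(1-p) $$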

The custom training loss and validation loss in Python:

def logistic_obj(y_hat, dtrain):
    y = dtrain.get_label()
    p = y_hat
    # p = 1. / (1. + np.exp(-y_hat))  # apply the sigmoid if y_hat are raw scores (avoids many zeros in the hessian)
    # standard log loss for reference: grad = p - y, hess = p * (1. - p)
    grad = 4 * p * y + p - 5 * y
    hess = (4 * y + 1) * (p * (1.0 - p))
    return grad, hess

def err_rate(y_hat, dtrain):
    y = dtrain.get_label()
    # y_hat = 1.0 / (1.0 + np.exp(-y_hat))  # apply the sigmoid if y_hat are raw scores
    y_hat = np.clip(y_hat, 1e-7, 1 - 1e-7)  # matches the 10^-7 clipping in the formula above
    loss_fn = y * np.log(y_hat)
    loss_fp = (1.0 - y) * np.log(1.0 - y_hat)
    return 'error', np.sum(-(5*loss_fn + loss_fp)) / len(y), False
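Both functions use the (preds, dtrain) signature, so they can also be used with XGBoost's native training API. A minimal sketch, assuming the classic xgb.train interface with obj/feval keyword arguments (dtrain/dvalid, the parameter values, and X_train/y_train etc. are placeholders):

import xgboost as xgb

# hypothetical DMatrix objects built from your training / validation data
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

params = {'max_depth': 4, 'eta': 0.1}
bst = xgb.train(params, dtrain,
                num_boost_round=100,
                evals=[(dvalid, 'valid')],
                # preds passed to a custom objective are raw scores, so the
                # commented-out sigmoid lines in logistic_obj / err_rate apply here
                obj=logistic_obj,
                # XGBoost's feval expects (name, value), so drop the is_higher_better flag
                feval=lambda preds, dmat: err_rate(preds, dmat)[:2])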

Likewise, if we want to penalize FP more heavily:

Suppose the loss of the i-th sample is L, p(x) is the sigmoid function, and we use $\beta\,(>1)$ to denote the weight on FP.

$$ \begin{array}{c}{L=-y \ln p-\beta(1-y) \ln (1-p)} \\ {\text { grad }=\frac{\partial L}{\partial x}=\frac{\partial L}{\partial p} \frac{\partial p}{\partial x}=p(\beta+y-\beta y)-y} \\ {\quad \text { hess }=\frac{\partial^{2} L}{\partial x^{2}}=p(1-p)(\beta+y-\beta y)}\end{array} $$

# FP-weighted log loss: beta > 1 increases the penalty on false positives
def weighted_logloss(y_hat, dtrain):
    y = dtrain.get_label()
    p = y_hat  # assumes y_hat are probabilities; apply a sigmoid if they are raw scores
    beta = 5
    grad = p * (beta + y - beta*y) - y
    hess = p * (1 - p) * (beta + y - beta*y)
    return grad, hess
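A quick way to sanity-check a hand-derived gradient is to compare it with a numerical derivative on a few toy values. A minimal sketch using scipy.misc.derivative (the same helper used in the focal loss example further down; the values below are arbitrary):

import numpy as np
from scipy.misc import derivative  # deprecated in newer scipy versions

beta = 5
y = np.array([0., 1., 1., 0.])     # toy labels
x = np.array([-2., -0.5, 1., 3.])  # toy raw scores
p = 1. / (1. + np.exp(-x))

def loss(x):
    p = 1. / (1. + np.exp(-x))
    return -y*np.log(p) - beta*(1 - y)*np.log(1 - p)

grad_analytic = p * (beta + y - beta*y) - y          # same formula as weighted_logloss
grad_numeric = derivative(loss, x, n=1, dx=1e-6)     # central finite difference
print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # True if the derivation is correct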

(3) A custom loss from the JDATA competition

The 2018 JDATA "如期而至" user purchase time prediction competition asked participants to use anonymized real JD user behavior history to build models that predict when users will purchase items in hot-selling categories. Its evaluation metric is as follows:

$$ \begin{array}{c}{S_{2}=\frac{\sum_{u \in U_{r}} f(u)}{\left|U_{r}\right|}} \\ {f(u)=\left\{\begin{array}{l}{0, u \notin U_{r}} \\ {\frac{10}{10+d_{u}^{2}}, u \in U_{r}}\end{array}\right.}\end{array} $$ where $U_r$ is the set of users in the ground truth and $d_u$ is the distance between the predicted date and the true date for user $u$.

One of the participating teams used a custom loss to optimize the model:

# the S2 metric as a custom eval
def my_loss(preds, train_data):
    labels = train_data.get_label()
    return 's2_error', np.mean(10/(10+np.square(preds-labels))), True

# first- and second-order derivatives of the per-sample S2 score
def my_objective(preds, train_data):
    labels = train_data.get_label()
    d = preds-labels
    x = (10.+np.square(d))
    grad = -20*d/np.square(x)
    hess = 80*np.square(d)*np.power(x,-3)-20*np.power(x,-2)
    return -grad, -hess

# training
model_day = lgb.train(cu_params, rtrain_set, num_boost_round=20000,
                      valid_sets=[rvalidation_set], fobj=my_objective,
                      feval=my_loss, valid_names=['valid'],
                      early_stopping_rounds=150, verbose_eval=200)
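The sign flips at the end of my_objective come from the fact that $S_2$ is a score to be maximized: per sample the model minimizes $-s(d)$ with $s(d)=\frac{10}{10+d^{2}}$ and $d=\hat{y}-y$, whose derivatives are

$$ \frac{ds}{dd}=-\frac{20d}{(10+d^{2})^{2}}, \qquad \frac{d^{2}s}{dd^{2}}=\frac{80d^{2}}{(10+d^{2})^{3}}-\frac{20}{(10+d^{2})^{2}} $$

so the function returns $-\frac{ds}{dd}$ and $-\frac{d^{2}s}{dd^{2}}$, matching the grad and hess computed in the code.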

(4) Focal loss

论文:https://arxiv.org/pdf/1708.02002.pdf

Mathematical background:

a. cross entropy

$$ \operatorname{CE}(p, y)=\left\{\begin{array}{ll}{-\log (p)} & {\text { if } y=1} \\ {-\log (1-p)} & {\text { otherwise }}\end{array}\right. $$

$$ p_{\mathrm{t}}=\left\{\begin{array}{ll}{p} & {\text { if } y=1} \\ {1-p} & {\text { otherwise }}\end{array}\right. $$

$$ \Rightarrow \mathrm{CE}(p, y)=\mathrm{CE}\left(p_{\mathrm{t}}\right)=-\log \left(p_{\mathrm{t}}\right) $$

In the formulas above, CE is the cross entropy and p is the predicted probability that y equals 1.

With severely imbalanced data, cross entropy has a problem: the majority of samples are easy to classify correctly, and although each contributes only a small loss, their sum can be large and dominate the loss of the minority class, so the minority samples end up poorly predicted.

A common remedy is to weight the minority class more heavily, as in the α-balanced CE loss below. This is in fact the approach used in (2).

$$ \mathrm{CE}\left(p_{\mathrm{t}}\right)=-\alpha_{\mathrm{t}} \log \left(p_{\mathrm{t}}\right) $$

$$ \alpha_{\mathrm{t}}=\left\{\begin{array}{ll}{\alpha} & {\text { if } y=1} \\ {1-\alpha} & {\text { otherwise }}\end{array}\right. $$

where $\alpha \in [0,1]$ is usually set according to the positive/negative class ratio, or treated as a hyper-parameter tuned by cross-validation.

b. focal loss

Focal loss handles class imbalance well: by design it shrinks the loss of the correctly classified majority-class (easy) samples, so the model pays more "attention" to the minority class.

$$ \mathrm{FL}\left(p_{\mathrm{t}}\right)=-\left(1-p_{\mathrm{t}}\right)^{\gamma} \log \left(p_{\mathrm{t}}\right) $$

where $\gamma \geq 0$. With $\gamma = 2$, a sample with $p_t=0.9$ gets a loss 100 times lower than with CE, and a sample with $p_t\approx0.968$ gets a loss about 1000 times lower, which raises the relative importance of misclassified samples (for $p_t\leq0.5$ the loss is at most 4 times lower than CE). In practice, the α-balanced focal loss is more commonly used:

$$ \mathrm{FL}\left(p_{\mathrm{t}}\right)=-\alpha_{\mathrm{t}}\left(1-p_{\mathrm{t}}\right)^{\gamma} \log \left(p_{\mathrm{t}}\right) $$
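A quick numerical check of these ratios for $\gamma = 2$:

import numpy as np

gamma = 2.0
for p_t in (0.9, 0.968, 0.5):
    ce = -np.log(p_t)
    fl = (1 - p_t)**gamma * ce
    print(f"p_t={p_t}: CE/FL = {ce/fl:.0f}")  # ~100, ~1000 (977), 4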

Focal Loss for LightGBM:

from scipy.misc import derivative

def focal_loss_lgb(y_pred, dtrain, alpha, gamma):
    a, g = alpha, gamma
    y_true = dtrain.get_label()
    def fl(x, t):
        p = 1/(1+np.exp(-x))
        return -( a*t + (1-a)*(1-t) ) * (( 1 - ( t*p + (1-t)*(1-p) ) )**g) * ( t*np.log(p) + (1-t)*np.log(1-p) )
    partial_fl = lambda x: fl(x, y_true)
    # first and second derivatives computed numerically instead of by hand
    grad = derivative(partial_fl, y_pred, n=1, dx=1e-6)
    hess = derivative(partial_fl, y_pred, n=2, dx=1e-6)
    return grad, hess

def focal_loss_lgb_eval_error(y_pred, dtrain, alpha, gamma):
    a, g = alpha, gamma
    y_true = dtrain.get_label()
    p = 1/(1+np.exp(-y_pred))
    loss = -( a*y_true + (1-a)*(1-y_true) ) * (( 1 - ( y_true*p + (1-y_true)*(1-p) ) )**g) * ( y_true*np.log(p) + (1-y_true)*np.log(1-p) )
    # (eval_name, eval_result, is_higher_better)
    return 'focal_loss', np.mean(loss), False

# fix alpha and gamma via lambdas so the signatures match (preds, train_data)
focal_loss = lambda x, y: focal_loss_lgb(x, y, alpha=0.25, gamma=1.)
focal_loss_eval = lambda x, y: focal_loss_lgb_eval_error(x, y, alpha=0.25, gamma=1.)
model = lgb.train(params, lgb_train, fobj=focal_loss, feval=focal_loss_eval)  # params / lgb_train: your binary-classification params and training Dataset

# or with an f1 eval metric
from sklearn.metrics import f1_score

def focal_loss_lgb_f1_score(preds, lgbDataset):
    preds = 1/(1+np.exp(-preds))  # preds are raw scores when a custom objective is used
    binary_preds = [int(p > 0.5) for p in preds]
    y_true = lgbDataset.get_label()
    return 'f1', f1_score(y_true, binary_preds), True

focal_loss = lambda x,y: focal_loss_lgb(x, y, alpha=0.25, gamma=1.)
cv_result = lgb.cv(
	params,
	train,
	num_boost_round=num_boost_round,
	fobj = focal_loss,
	feval = focal_loss_lgb_f1_score,
	nfold=3,
	stratified=True,
	early_stopping_rounds=20)
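One practical note on using these custom objectives: when a custom fobj is passed, LightGBM's predict() returns raw scores rather than probabilities, so they need to be passed through a sigmoid before thresholding. A minimal sketch (model and X_valid are placeholder names from the earlier snippets):

raw_scores = model.predict(X_valid)                 # raw scores, not probabilities
probabilities = 1. / (1. + np.exp(-raw_scores))
predicted_labels = (probabilities > 0.5).astype(int)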

REFERENCE