超参数调优一直是机器学习里比较intractable的问题,繁多的超参数以及指数型爆炸的参数空间,往往让人无从下手。调参是一个很枯燥的过程,而且最后也不一定有很好的reward。很多的机器学习工程师也会戏称自己是”调参民工”,”炼丹师”……

超参数(Hyper-parameters):Hyper-parameters are parameters that are not directly learnt within estimators.

超参数的优化可以看做是这样一个方程: $$ x^{\star}=\arg \min _{x \in \mathcal{X}} f(x) $$ 其中$ f(x) $ 表示目标函数,用于衡量误差,可以是MSE, RMSE, MAE等(如果是accuracy等指标可以将其添加负号求最小值),$x^{\star}$是使得$f(x)$最小的参数组合,理论上我们的目标就是要找到这个$x^{\star}$,但是要找到这样一个全局最优解几乎是不可能的。首先这个$f(x)$是个黑盒子,我们没法直接进行优化求解。事实上,我们每次为了得到$f(x)$,需要经过模型训练和评估,非常耗时,尤其对于一些深度学习模型,这个过程会特别漫长。我们只能在有限的计算资源和时间内,得到一些相对的局部最优解。

一般的调参方法有下面几种:

1. 手动调参

对于手动调参,会对模型最重要的一些参数基于经验进行调节。比如lightgbm的叶num_leaves, learning_rate, feature_fraction, lambda_l1,lambda_l2, min_data_in_leaf等。

Grid Search会对定义的参数空间进行暴力搜索。网格搜索速度慢,但在搜索整个搜索空间方面效果很好。Randomized Search是从定义的参数空间中进行采样,然后训练。随机搜索很快,但可能会错过搜索空间中的重要点。

image-20190708220335011

事实上,调参的时候,不需要遍历模型的所有参数。事实上,影响效果的往往只有其中的几个参数,一般对这些参数进行Randomized Search或者Grid Search即可。具体可以查看模型文档,或相关文献。

# Comparing randomized search and grid search for hyperparameter estimation

import numpy as np
from time import time
from scipy.stats import randint as sp_randint

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

# get some data
digits = load_digits()
X, y = digits.data, digits.target

# build a classifier
clf = RandomForestClassifier(n_estimators=20)

# Utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")


# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# run randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search, cv=5)

start = time()
random_search.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time() - start), n_iter_search))
report(random_search.cv_results_)

# use a full grid over all parameters
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid, cv=5)
start = time()
grid_search.fit(X, y)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(grid_search.cv_results_['params'])))
report(grid_search.cv_results_)
Output:
RandomizedSearchCV took 5.55 seconds for 20 candidates parameter settings.
Model with rank: 1
Mean validation score: 0.939 (std: 0.024)
Parameters: {'bootstrap': False, 'criterion': 'entropy', 'max_depth': None, 'max_features': 7, 'min_samples_split': 3}

Model with rank: 2
Mean validation score: 0.933 (std: 0.022)
Parameters: {'bootstrap': False, 'criterion': 'gini', 'max_depth': None, 'max_features': 6, 'min_samples_split': 6}

Model with rank: 3
Mean validation score: 0.930 (std: 0.031)
Parameters: {'bootstrap': True, 'criterion': 'gini', 'max_depth': None, 'max_features': 6, 'min_samples_split': 6}

GridSearchCV took 16.95 seconds for 72 candidate parameter settings.
Model with rank: 1
Mean validation score: 0.937 (std: 0.019)
Parameters: {'bootstrap': False, 'criterion': 'entropy', 'max_depth': None, 'max_features': 10, 'min_samples_split': 2}

Model with rank: 2
Mean validation score: 0.935 (std: 0.020)
Parameters: {'bootstrap': False, 'criterion': 'gini', 'max_depth': None, 'max_features': 10, 'min_samples_split': 2}

Model with rank: 3
Mean validation score: 0.930 (std: 0.029)
Parameters: {'bootstrap': False, 'criterion': 'entropy', 'max_depth': None, 'max_features': 10, 'min_samples_split': 3}
from sklearn.model_selection import GridSearchCV

def func_grid_search(model, X_train, y_train, param_grid, cv=5, n_jobs=-1):
    """
    model: model instance
    para_grid: dict type, params searching grid.
    para_grid = {'n_estimators':[100,200,500],
                 'max_depth': [5, 8, 10, 15],
                 'max_features': [0.80, 0.90, 0.95],
                 'min_samples_split':[2, 5, 8],
                 'min_samples_leaf': [1, 3, 5]}
    """
    gs = GridSearchCV(model, param_grid, cv=cv, n_jobs=n_jobs)
    gs.fit(X_train, y_train.ravel())
    print(gs.best_params_) 
    print(gs.best_score_)
    # print(gs.cv_results_)
    bst_model = gs.best_estimator_
    return bst_model
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint

def func_random_search(model, X_train, y_train, param_dist, n_iter=20, cv=5, n_jobs=-1):
    """
    # parameters for GridSearchCV
    # specify parameters and distributions to sample from
    param_dist = {"max_depth": [3, 5],
                  "max_features": sp_randint(1, 11),
                  "min_samples_split": sp_randint(2, 11),
                  "min_samples_leaf": sp_randint(1, 11),
                  "bootstrap": [True, False]
                 }
    """
    rs = RandomizedSearchCV(model, param_dist, n_iter=n_iter, cv=cv, n_jobs=n_jobs)
    rs.fit(X_train, y_train.ravel())
    print(rs.best_params_) 
    print(rs.best_score_)
    # print(rs.cv_results_)
    bst_model = rs.best_estimator_
    return bst_model
    

Grid Search太慢了,并不是很实用,一般会对参数空间先调一个粗粒度的格点搜索,然后根据结果进行细粒度的调整。而Randomized Search在迭代一定的次之后也可能实现较好的效果,值得更多的尝试。

3. Bayesian Optimization

Grid Search和Randomized Search虽然可以让整个调参过程自动化,但它们无法从之前的调参结果中获取信息,可能会尝试很多无效的参数空间。而贝叶斯优化,会对上一次的评估结果进行追踪,建立一个概率模型,反应超参数在目标函数上表现的概率分布,即$P(\text {score} | \text {hyperparameters})$,可用于指导下一次的参数选择。

贝叶斯优化可以更好的trade off Exploration&Exploitation,而且适用于随机、非凸、不连续方程的优化。具体过程可用一句话概括为:对目标函数建立概率模型,通过这个概率模型得到最promising的参数组合,用于最终目标函数的评估。

Sequential Model-Based Optimization(SMBO) 是贝叶斯优化更具体的表现形式,可认为它们是等价的。一般会有以下几个过程:

  1. 给定要搜索的超参数空间
  2. 定义一个目标函数用于评估优化
  3. 建立目标函数的surrogate model
  4. 建立一个评估surrogate model,作为选择超参数的标准(选择方程)
  5. 获取score和hyperparameter的样本用于更新 surrogate model

贝叶斯优化模型主要的区分是代理函数(surrogate function)的差异。surrogate model一般有 Gaussian Process, Random Forest和Tree Parzen Estimator (TPE) 这几种。常见的框架有Spearmint, Hyperopt, SMAC, MOE, BayesianOptimization, skopt等,它们的对比如下表:

library surrogate function
Spearmint Gaussian Process
Hyperopt Tree Parzen Estimator (TPE)
SMAC Random Forest

Optunity包括多种超参数调优的方法:

下面以Hyperopt为例说明贝叶斯优化的具体应用。

对于一个优化问题我们可以分为4个部分:

  1. 优化的目标函数
  2. 优化的参数空间
  3. 超参数优化方程,建立代理函数(surrogate function),并用其决定下一次尝试的参数组合
  4. Trials: 记录每次尝试的loss、参数及更多额外信息(可DIY),可以记录整个迭代的过程,用于回测。

空间定义:

from hyperopt import STATUS_OK, fmin, tpe, Trials, hp
import xgboost as xgb
import logging
from timeit import default_timer as timer
import os
from functools import partial
import pandas as pd
import numpy as np

MAX_EVALS = 100 # 迭代次数
NFOLDS = 5 # K-FOLD 
FOLDS = None # 自定义的FOLDS,优先级高于NFOLDS
BASE_DIR = os.path.dirname(os.path.abspath(__file__))

XGB_SPACE = {
    'booster': 'gbtree',
    'random_state': 2019,
    'eval_metric': 'rmse',
    'n_jobs': -1,
    'learning_rate': 0.05,
    'subsample': hp.uniform('subsample', 0.1, 1.0),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.1, 1.0),
    'max_depth': hp.quniform('max_depth', 5, 30, 1),
    'gamma': hp.uniform('gamma', 0.0, 2.0),
    'min_child_weight': hp.uniform('min_child_weight', 0.0, 5.0),
    'reg_alpha': hp.uniform('reg_alpha', 0.0, 3.0),
    'reg_lambda': hp.uniform('reg_lambda', 0.0, 3.0)
}

定义优化的目标函数:

def objective_base(params,
                   train_set,
                   folds=None,
                   nfold=5,
                   writetoFile=True):
    """
    Objective function for Gradient Boosting Machine Hyperparameter Optimization
    Args:
        folds: This argument has highest priority over other data split arguments.
    Return:
    """
    # Keep track of evals
    global _ITERATION
    _ITERATION += 1
    # Make sure parameters that need to be integers are integers
    for parameter_name in [
            'num_leaves', 'max_depth', 'bagging_freq', 'min_data_in_leaf',
            'min_samples_split', 'min_samples_leaf'
    ]:
        if parameter_name in params:
            params[parameter_name] = int(params[parameter_name])
    start = timer()
    logging.info(f"{_ITERATION} ITERATION")
    logging.info(f"params:\n{params}")
    cv_dict = xgb.cv(params,
                     train_set,
                     num_boost_round=5000,
                     nfold=nfold,
                     stratified=False,
                     folds=folds,
                     early_stopping_rounds=100,
                     as_pandas=False,
                     verbose_eval=10,
                     seed=0,
                     shuffle=False)
    # Extract the min rmse, Loss must be minimized
    loss = np.min(cv_dict['test-rmse-mean'])
    # Boosting rounds that returned the lowest cv rmse
    n_estimators = int(np.argmin(cv_dict['test-rmse-mean'])+1)
    run_time = timer() - start
    # Write to the csv file ('a' means append)
    if writetoFile:
        random_datetime = str(int(time.time()))
        hyper_base_path = os.path.join(BASE_DIR, 'hyperopt_output')
        trial_file = os.path.join(hyper_base_path, 'trials.csv')
        trial_file_rename = os.path.join(hyper_base_path,
                                         'trials_%s.csv' % random_datetime)
        if not os.path.exists(hyper_base_path):
            os.makedirs(hyper_base_path)
            print(
                "No trial file directory <hyperopt_output> exists, will be created..."
            )
        if os.path.exists(trial_file) and _ITERATION == 1:
            print("Trial file exists, will be renamed...")
            os.rename(trial_file, trial_file_rename)
            assert os.path.exists(
                trial_file
            ) == False, "Trial file still exists, rename failed..."
            # File to save first results
            of_connection = open(trial_file, 'w')
            writer = csv.writer(of_connection)
            # Write the headers to the file
            writer.writerow(
                ['loss', 'params', 'iteration', 'estimators', 'train_time'])
            of_connection.close()
        of_connection = open(trial_file, 'a')
        writer = csv.writer(of_connection)
        writer.writerow([loss, params, _ITERATION, n_estimators, run_time])
    # Dictionary with information for evaluation
    return {
        'loss': loss,
        'params': params,
        'iteration': _ITERATION,
        'estimators': n_estimators,
        'train_time': run_time,
        'status': STATUS_OK
    }

定义前处理和后处理模块:

def build_train_set(X_train, y_train):
    isX_df = isinstance(X_train, pd.DataFrame)
    isY_sr = isinstance(y_train, pd.Series)
    isY_df = isinstance(y_train, pd.DataFrame)
    if isY_df:
        raise TypeError(
            f"y_train is df, with the shape {y_train.shape}, which is not supportable now."
        )
    if isX_df ^ isY_sr:
        raise TypeError(f"X_train and y_train have different types!")
    if isX_df:
        train_set = xgb.DMatrix(X_train.values, y_train.values)
    else:
        train_set = xgb.DMatrix(X_train, y_train)
    return train_set
 
def post_hyperopt(bayes_trials, train_set, folds=None, nfold=5):
    # get best params
    bayes_results = pd.DataFrame(bayes_trials.results)
    bayes_results = bayes_results.sort_values(by='loss')
    bayes_results.reset_index(drop=True, inplace=True)
    best_params = bayes_results.loc[0, 'params']
    # get best loss and trees
    best_params['learning_rate'] = 0.01
    # Perform n_folds cross validation
    cv_dict = xgb.cv(best_params,
                     train_set,
                     num_boost_round=5000,
                     folds=folds,
                     nfold=nfold,
                     stratified=False,
                     shuffle=False,
                     early_stopping_rounds=100,
                     as_pandas=False,
                     verbose_eval=10,
                     seed=2019)
    # Extract the min rmse, Loss must be minimized
    loss = np.min(cv_dict['test-rmse-mean'])
    # Boosting rounds that returned the lowest cv rmse
    n_estimators = int(np.argmin(cv_dict['test-rmse-mean']) + 1)
    best_params['n_estimators'] = n_estimators
    logging.info(f"best loss: {loss}, best n_estimators: {n_estimators}")
    logging.info(f"best params: {best_params}")
    return best_params, loss

定义主函数:

  
def main_tuning_with_bo(X_train,
                        y_train,
                        max_evals=MAX_EVALS,
                        folds=FOLDS,
                        nfold=NFOLD):
    # Keep track of results
    bayes_trials = Trials()
    # Global variable
    global _ITERATION
    _ITERATION = 0
    TRAIN_SET = build_train_set(X_train, y_train)
    SPACE = XGB_SPACE
    func_objective = partial(objective_base,
                             train_set=TRAIN_SET,
                             folds=folds,
                             nfold=nfold,
                             writetoFile=True)
    # Run optimization
    best = fmin(fn=func_objective,
                space=SPACE,
                algo=tpe.suggest,
                max_evals=max_evals,
                trials=bayes_trials,
                rstate=np.random.RandomState(2019))
    best_params, loss = post_hyperopt(bayes_trials,
                                      train_set=TRAIN_SET,
                                      folds=folds,
                                      nfold=nfold)
    return best_params, loss

Reference