CatBoost: An Introduction with Examples

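Anyone who has done machine learning with sklearn knows that categorical features must be preprocessed first, for example with label encoding or one-hot encoding, because sklearn estimators generally cannot consume raw categorical features and raise an error.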
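
As a rough illustration of the two preprocessing steps named above, here is a minimal pure-Python sketch. The column values are made up for the example, and real projects would normally use a library encoder instead.

```python
# Toy categorical column (hypothetical values, not taken from any dataset in this post).
colors = ["red", "green", "blue", "green", "red"]

# Label encoding: map each distinct category to an integer code.
categories = sorted(set(colors))
code_of = {cat: i for i, cat in enumerate(categories)}
label_encoded = [code_of[c] for c in colors]

# One-hot encoding: one 0/1 column per distinct category.
one_hot_encoded = [[1 if c == cat else 0 for cat in categories] for c in colors]

print(label_encoded)
print(one_hot_encoded[0])
```

Label encoding keeps a single column but imposes an arbitrary order on the categories, while one-hot encoding avoids the order at the cost of one column per category.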

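CatBoost, open-sourced by the Russian company Yandex, handles categorical features directly and performs very well on many public datasets. As its name suggests (CatBoost = Category and Boosting), its strength is the handling of categorical features. Its results are also more robust: you can get very good results without laborious parameter tuning. For tuning advice, see the CatBoost parameter tuning documentation.

The CatBoost documentation warns: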

Attention. Do not use one-hot encoding during preprocessing. This affects both the training speed and the resulting quality.

1. Install

First, install the required tools:

# with pip
pip install catboost
# or with conda
conda install -c conda-forge catboost

# install the Jupyter notebook widgets used for interactive plotting
pip install ipywidgets
jupyter nbextension enable --py widgetsnbextension

2. Preprocessing

Pool

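Pool is CatBoost's own container for organizing data. NumPy arrays and pandas DataFrames can also be used, but Pool is recommended because it is better in both memory use and speed.

The Pool signature and a usage example: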

class Pool(data, 
           label=None,
           cat_features=None,
           column_description=None,
           pairs=None,
           delimiter='\t',
           has_header=False,
           weight=None, 
           group_id=None,
           group_weight=None,
           subgroup_id=None,
           pairs_weight=None,
           baseline=None,
           feature_names=None,
           thread_count=-1)

from catboost import CatBoostClassifier, Pool

train_data = Pool(data=[[1, 4, 5, 6],
                        [4, 5, 6, 7],
                        [30, 40, 50, 60]],
                  label=[1, 1, -1],
                  weight=[0.1, 0.2, 0.3])
train_data 
# <catboost.core.Pool at 0x1a22af06d0>

model = CatBoostClassifier(iterations=10)
model.fit(train_data)
preds_class = model.predict(train_data)

FeaturesData

A Pool can be created in several ways; building it from a FeaturesData object is the more efficient one.

class FeaturesData(num_feature_data=None,
                   cat_feature_data=None,
                   num_feature_names=None,
                   cat_feature_names=None)

CatBoostClassifier with FeaturesData:

import numpy as np
from catboost import CatBoostClassifier, FeaturesData
# Initialize data
train_data = FeaturesData(
    num_feature_data=np.array([[1, 4, 5, 6], [4, 5, 6, 7], [30, 40, 50, 60]], dtype=np.float32),
    cat_feature_data=np.array([["a", "b"], ["a", "b"], ["c", "d"]], dtype=object)
)
train_labels = [1,1,-1]
test_data = FeaturesData(
    num_feature_data=np.array([[2, 4, 6, 8], [1, 4, 50, 60]], dtype=np.float32),
    cat_feature_data=np.array([["a", "b"], ["a", "d"]], dtype=object))

# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations=2, learning_rate=1, depth=2, loss_function='Logloss')
# Fit model
model.fit(train_data, train_labels)
# Get predicted classes
preds_class = model.predict(test_data)
# Get predicted probabilities for each class
preds_proba = model.predict_proba(test_data)
# Get predicted RawFormulaVal
preds_raw = model.predict(test_data, prediction_type='RawFormulaVal')

CatBoostClassifier with Pool and FeaturesData:

import numpy as np
from catboost import CatBoostClassifier, FeaturesData, Pool
# Initialize data
train_data = Pool(
    data=FeaturesData(
        num_feature_data=np.array([[1, 4, 5, 6], 
                                   [4, 5, 6, 7], 
                                   [30, 40, 50, 60]], 
                                   dtype=np.float32),
        cat_feature_data=np.array([["a", "b"], 
                                   ["a", "b"], 
                                   ["c", "d"]], 
                                   dtype=object)
    ),
    label=[1, 1, -1]
)
test_data = Pool(
    data=FeaturesData(
        num_feature_data=np.array([[2, 4, 6, 8], 
                                   [1, 4, 50, 60]], 
                                   dtype=np.float32),
        cat_feature_data=np.array([["a", "b"], 
                                   ["a", "d"]], 
                                   dtype=object)
    )
)
# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations = 2, 
                           learning_rate = 1,
                           depth = 2, 
                           loss_function = 'Logloss')
# Fit model
model.fit(train_data)
# Get predicted classes
preds_class = model.predict(test_data)
# Get predicted probabilities for each class
preds_proba = model.predict_proba(test_data)
# Get predicted RawFormulaVal
preds_raw = model.predict(test_data, prediction_type='RawFormulaVal')
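
The three prediction calls above are related. For the binary Logloss objective, the raw value is a log-odds score and the probability of the second class is its sigmoid. The following sketch shows that relationship in plain Python; the raw scores are hypothetical and stand in for preds_raw.

```python
import math

def sigmoid(raw):
    """Map a raw log-odds score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-raw))

# Hypothetical raw scores, standing in for preds_raw above.
raw_scores = [-2.0, 0.0, 1.5]

# Each row mimics the layout of predict_proba: [P(first class), P(second class)].
probabilities = [[1.0 - sigmoid(r), sigmoid(r)] for r in raw_scores]

# The second class is predicted when its probability exceeds 0.5,
# which is the same as the raw score being positive.
predicted_second_class = [p[1] > 0.5 for p in probabilities]

print(probabilities)
print(predicted_second_class)
```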

3. Case

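The following demo uses the Titanic dataset that ships with CatBoost.

Libraries and dataset preparation

First import the necessary libraries and prepare the data. Feature engineering, the most important part in practice, is skipped here because this is only a demo: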

from catboost.datasets import titanic
import numpy as np
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier, Pool, cv
from sklearn.metrics import accuracy_score

# Load the data
train_df, test_df = titanic()

# Inspect missing values:
null_value_stats = train_df.isnull().sum(axis=0)
null_value_stats[null_value_stats != 0]

# Fill missing values:
train_df.fillna(-999, inplace=True)
test_df.fillna(-999, inplace=True)

# Split features and label
X = train_df.drop('Survived', axis=1)
y = train_df.Survived

# train test split
X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.75, random_state=42)
X_test = test_df

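# indices of categorical features: every column whose dtype is not float
# (np.float was removed in NumPy 1.24; the builtin float behaves the same here)
categorical_features_indices = np.where(X.dtypes != float)[0]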

Training the model

CatBoost's default parameters give a very good baseline, so it makes sense to start from them.

model = CatBoostClassifier(
    custom_metric=['Accuracy'],
    random_seed=666,
    logging_level='Silent'
)
# custom_metric <==> custom_loss

model.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    eval_set=(X_validation, y_validation),
    logging_level='Verbose',  # you can comment this for no text output
    plot=True
);

# OUTPUT:
"""
...
...
...
bestTest = 0.3792389991
bestIteration = 342

Shrink model to first 343 iterations.
"""


Making predictions with the model

predictions = model.predict(X_test)
predictions_probs = model.predict_proba(X_test)
print(predictions[:10])
print(predictions_probs[:10])
# OUTPUT:
"""
[0. 0. 0. 0. 1. 0. 1. 0. 1. 0.]
[[0.90866781 0.09133219]
 [0.63668717 0.36331283]
 [0.95333247 0.04666753]
 [0.91051481 0.08948519]
 [0.28010084 0.71989916]
 [0.94618962 0.05381038]
 [0.35536101 0.64463899]
 [0.81843278 0.18156722]
 [0.32829247 0.67170753]
 [0.92653732 0.07346268]]
"""

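Keeping the best model (use_best_model)

When training with a validation set, it is best to leave use_best_model at its default value of True. The final model is then shrunk to the best iteration on the validation set, and model.tree_count_ gives the resulting number of trees. If use_best_model is set to False, model.tree_count_ equals iterations. For example:

# For data preparation, see the dataset preparation section above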
params = {
    'iterations': 500,
    'learning_rate': 0.1,
    'eval_metric': 'Accuracy',
    'random_seed': 666,
    'logging_level': 'Silent',
    'use_best_model': False
}
# train
train_pool = Pool(X_train, y_train, cat_features=categorical_features_indices)
# validation
validate_pool = Pool(X_validation, y_validation, cat_features=categorical_features_indices)

# train with 'use_best_model': False
model = CatBoostClassifier(**params)
model.fit(train_pool, eval_set=validate_pool)

# train with 'use_best_model': True
best_model_params = params.copy()
best_model_params.update({'use_best_model': True})
best_model = CatBoostClassifier(**best_model_params)
best_model.fit(train_pool, eval_set=validate_pool);

# show result
print('Simple model validation accuracy: {:.4}, and the number of trees: {}'.format(
    accuracy_score(y_validation, model.predict(X_validation)), model.tree_count_))
print('')
print('Best model validation accuracy: {:.4}, and the number of trees: {}'.format(
    accuracy_score(y_validation, best_model.predict(X_validation)),best_model.tree_count_))
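
To make the shrinking step concrete, here is a minimal sketch of the idea in plain Python. It is not CatBoost's implementation: the metric history is hypothetical, the trees are placeholder strings, and resolving ties in favour of the earliest iteration is an assumption.

```python
# Hypothetical validation accuracy after each boosting iteration (index 0 = first tree).
validation_accuracy = [0.78, 0.80, 0.83, 0.82, 0.83, 0.81]

# One tree is added per iteration, so the full model has as many trees as iterations.
all_trees = ["tree_{}".format(i) for i in range(len(validation_accuracy))]

# Best iteration: the first index with the highest validation metric (assumed tie rule).
best_iteration = max(range(len(validation_accuracy)),
                     key=lambda i: validation_accuracy[i])

# Keeping the best model means keeping the trees up to and including that iteration.
best_model_trees = all_trees[:best_iteration + 1]

print(best_iteration, len(best_model_trees), len(all_trees))
```

This mirrors the training log shown earlier, where bestIteration = 342 is followed by shrinking the model to its first 343 iterations.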

Using early stopping to prevent overfitting and save training time

Early stopping is a common way to prevent overfitting, and it can also cut training time substantially.

params.update({'iterations':1000})
params
# OUTPUT:
"""
{'iterations': 1000,
 'learning_rate': 0.1,
 'eval_metric': 'Accuracy',
 'random_seed': 42,
 'logging_level': 'Silent',
 'use_best_model': False}
"""
%%time
model = CatBoostClassifier(**params)
model.fit(train_pool, eval_set=validate_pool)
"""
CPU times: user 2min 11s, sys: 52.1 s, total: 3min 3s
Wall time: 27.8 s
"""
%%time
earlystop_model_1 = CatBoostClassifier(**params)
earlystop_model_1.fit(train_pool, eval_set=validate_pool, early_stopping_rounds=200, verbose=20)
"""
CPU times: user 46.6 s, sys: 15.6 s, total: 1min 2s
Wall time: 9.2 s
"""
%%time
earlystop_params = params.copy()
earlystop_params.update({
    'od_type': 'Iter',
    'od_wait': 200,
    'logging_level': 'Verbose'    
})
earlystop_model_2 = CatBoostClassifier(**earlystop_params)
earlystop_model_2.fit(train_pool, eval_set=validate_pool);
"""
CPU times: user 49.6 s, sys: 19.9 s, total: 1min 9s
Wall time: 10.3 s
"""

You can also set the early_stopping_rounds parameter directly:

early_stopping_rounds:

Set the overfitting detector type to ‘Iter’ ( ‘od_type’: ‘Iter’) and stop the training after the specified number of iterations since the iteration with the optimal metric value.

earlystop_params = params.copy()
earlystop_params.update({
    'early_stopping_rounds': 200,
    'logging_level': 'Verbose'    
})

Results:

print('Simple model tree count: {}'.format(model.tree_count_))
print('Simple model validation accuracy: {:.4}'.format(
    accuracy_score(y_validation, model.predict(X_validation))
))
print('')
print('Early-stopped model 1 tree count: {}'.format(earlystop_model_1.tree_count_))
print('Early-stopped model 1 validation accuracy: {:.4}'.format(
    accuracy_score(y_validation, earlystop_model_1.predict(X_validation))
))
print('')
print('Early-stopped model 2 tree count: {}'.format(earlystop_model_2.tree_count_))
print('Early-stopped model 2 validation accuracy: {:.4}'.format(
    accuracy_score(y_validation, earlystop_model_2.predict(X_validation))
))

"""
Simple model tree count: 1000
Simple model validation accuracy: 0.8206

Early-stopped model 1 tree count: 393
Early-stopped model 1 validation accuracy: 0.8296

Early-stopped model 2 tree count: 393
Early-stopped model 2 validation accuracy: 0.8296
"""

With early stopping, training takes less time, overfitting is reduced, and in this run the resulting model reaches a higher validation accuracy.
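
The stopping rule described above (stop once a given number of iterations has passed since the best metric value) can be sketched in plain Python. This only illustrates the rule: the function name and the metric history are hypothetical, and the exact iteration at which CatBoost stops may differ by one from this sketch.

```python
def train_with_early_stopping(metric_per_iteration, wait):
    """Return (iterations_run, best_iteration) for a metric where higher is better.

    Training stops once `wait` iterations have passed since the best value.
    """
    best_iteration = 0
    best_value = float("-inf")
    iterations_run = 0
    for i, value in enumerate(metric_per_iteration):
        iterations_run = i + 1
        if value > best_value:
            best_value, best_iteration = value, i
        elif i - best_iteration >= wait:
            break
    return iterations_run, best_iteration

# Hypothetical validation accuracy per iteration.
history = [0.70, 0.75, 0.80, 0.79, 0.78, 0.80, 0.77, 0.76, 0.75, 0.74]

iterations_run, best_iteration = train_with_early_stopping(history, wait=3)
print(iterations_run, best_iteration)
```

A small wait saves more time but may stop before a late improvement; the wait of 200 used above is conservative for 1000 iterations.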

Feature Importance

Displaying feature importance:

model = CatBoostClassifier(iterations=50, random_seed=42, logging_level='Silent').fit(train_pool)

feature_importances = model.get_feature_importance(train_pool)
feature_names = X_train.columns
for score, name in sorted(zip(feature_importances, feature_names), reverse=True):
    print('{}: {}'.format(name, score))
    
"""
Sex: 48.21061102095765
Pclass: 17.045040317206695
Age: 7.611166250335819
Parch: 5.220861205417323
SibSp: 5.16579933751564
Embarked: 4.968165121183137
Fare: 4.858908301370388
Cabin: 4.140024994004162
Ticket: 2.7794234520091585
PassengerId: 0.0
Name: 0.0
"""
# Set prettified=True to get the scores paired with their feature names
importances = model.get_feature_importance(prettified=True)
print(importances)

Wrap this in helper functions for a nicer display.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=2)
%matplotlib inline

def func_plot_importance(df_imp):

    sns.set(font_scale=1)
    fig = plt.figure(figsize=(3, 3), dpi=100)
    ax = sns.barplot(
        x="Importance", y="Features", data=df_imp, label="Total", color="b")
    ax.tick_params(labelcolor='k', labelsize='10', width=3)
    plt.show()

def display_importance(model_out, columns, printing=True, plotting=True):
    importances = model_out.feature_importances_
    indices = np.argsort(importances)[::-1]
    importance_list = []
    for f in range(len(columns)):
        importance_list.append((columns[indices[f]], importances[indices[f]]))
        if printing:
            print("%2d) %-*s %f" % (f + 1, 30, columns[indices[f]],
                                    importances[indices[f]]))
    if plotting:
        df_imp = pd.DataFrame(
            importance_list, columns=['Features', 'Importance'])
        func_plot_importance(df_imp)
        

display_importance(model_out=model, columns=X_train.columns)

Cross Validation

cv(pool=None, 
   params=None, 
   dtrain=None, 
   iterations=None, 
   num_boost_round=None,
   fold_count=3, 
   nfold=None,
   inverted=False,
   partition_random_seed=0,
   seed=None, 
   shuffle=True, 
   logging_level=None, 
   stratified=None,
   as_pandas=True,
   metric_period=None,
   verbose=None,
   verbose_eval=None,
   plot=False,
   early_stopping_rounds=None,
   folds=None)

The data must first be wrapped in a Pool before running cross-validation.

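cv_params = model.get_params()
cv_params.update({
    'loss_function': 'Logloss',
    # request Accuracy explicitly so that cv() returns the 'test-Accuracy-*' columns used below
    'custom_metric': ['Accuracy']
})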
cv_data = cv(
    Pool(X, y, cat_features=categorical_features_indices),
    cv_params,
    plot=True
)

print('Best validation accuracy score: {:.3f}±{:.3f} on step {}'.format(
    np.max(cv_data['test-Accuracy-mean']),
    cv_data['test-Accuracy-std'][np.argmax(cv_data['test-Accuracy-mean'])],
    np.argmax(cv_data['test-Accuracy-mean'])))
# Best validation accuracy score: 0.833±0.007 on step 286
best_value = np.min(np.array(cv_data['test-Logloss-mean']))
best_iter_idx = np.argmin(np.array(cv_data['test-Logloss-mean']))

print('Best validation Logloss score, not stratified: {:.4f}±{:.4f} on step {}'.format(
    best_value,
    cv_data['test-Logloss-std'][best_iter_idx],
    best_iter_idx+1))

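Note: iteration = index + 1

Validation on a single holdout set can easily underestimate or overestimate the model's prediction error; cross-validation is a better approach.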
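
The mean and std columns read from cv_data above summarize the per-fold metric at each iteration. The sketch below reproduces that bookkeeping in plain Python with hypothetical per-fold values; using the population standard deviation is an assumption of the sketch, not a statement about what cv computes.

```python
import statistics

# Hypothetical validation Logloss: one inner list per iteration, one value per fold.
fold_logloss = [
    [0.60, 0.62, 0.58],
    [0.50, 0.52, 0.48],
    [0.45, 0.44, 0.46],
    [0.47, 0.49, 0.45],
]

# Per-iteration mean and spread across folds (population std is an assumption).
logloss_mean = [statistics.mean(values) for values in fold_logloss]
logloss_std = [statistics.pstdev(values) for values in fold_logloss]

# Logloss is minimized, so the best step is the index of the smallest mean.
best_iter_idx = min(range(len(logloss_mean)), key=lambda i: logloss_mean[i])
best_iteration = best_iter_idx + 1  # iterations are counted from 1, indices from 0

print("Best Logloss: {:.4f}±{:.4f} on step {}".format(
    logloss_mean[best_iter_idx], logloss_std[best_iter_idx], best_iteration))
```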

Using Baseline

A baseline lets you continue training on top of a previously trained model.

params = {'iterations': 200,
          'learning_rate': 0.1,
          'eval_metric': 'Accuracy',
          'random_seed': 42,
          'logging_level': 'Verbose',
          'use_best_model': False}

current_params = params.copy()
current_params.update({
    'iterations': 10
})
model = CatBoostClassifier(**current_params).fit(X_train, y_train, categorical_features_indices)
# Get baseline (only with prediction_type='RawFormulaVal')
baseline = model.predict(X_train, prediction_type='RawFormulaVal')
# Fit new model
model.fit(X_train, y_train, categorical_features_indices, baseline=baseline);

Snapshot

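Snapshots can restore the training state after an interruption and continue training from where it stopped. If training runs for a long time, saving snapshots protects the work done so far from being lost when the computer or server restarts or fails midway.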

params_with_snapshot = params.copy()
params_with_snapshot.update({
    'iterations': 5,
    'learning_rate': 0.5,
    'logging_level': 'Verbose'
})
model = CatBoostClassifier(**params_with_snapshot).fit(train_pool, eval_set=validate_pool, save_snapshot=True)

params_with_snapshot.update({
    'iterations': 10,
    'learning_rate': 0.1,
})
model = CatBoostClassifier(**params_with_snapshot).fit(train_pool, eval_set=validate_pool, save_snapshot=True)

Intermediate training information is saved in the catboost_info/ directory by default; use the train_dir parameter to change it.

#!rm 'catboost_info/snapshot.bkp'
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=40,
    random_seed=43
)
model.fit(
    train_pool,
    eval_set=validate_pool,
    save_snapshot=True,
    snapshot_file='snapshot.bkp',
    logging_level='Verbose'
)

DIY Loss AND Metric Function

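Distinguish the following three parameters:

(1) loss_function, Alias: objective.

The objective function optimized during training.

(2) custom_metric, Alias: custom_loss

Metrics reported during training. They only serve as a reference for the training state and are not the optimization objective.

(3) eval_metric

The metric used for overfitting detection and for selecting the best model. (loss_function and eval_metric may differ: for example, train with Logloss and use AUC to pick the best model or the best iteration.)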

model = CatBoostClassifier(
    iterations=500,
    loss_function= 'Logloss',
    custom_metric=['Accuracy','AUC'],
    eval_metric='F1',
    random_seed=666
)

# custom_metric <==> custom_loss
# used only as evaluation references, not as the optimization objective

model.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    eval_set=(X_validation, y_validation),
    verbose=50,
    plot=True
);


Results with different settings:

# custom_metric=['Accuracy','AUC'], eval_metric='F1',
model.best_iteration_, model.best_score_, model.tree_count_
"""
(219,
 {'learn': {'Accuracy': 0.9491017964071856,
   'Logloss': 0.1747009677350333,
   'F1': 0.9294605809128631},
  'validation': {'Accuracy': 0.8385650224215246,
   'Logloss': 0.39249638575985446,
   'F1': 0.7906976744186046,
   'AUC': 0.9018111688747275}},
 220)
"""

# custom_metric=['Accuracy','AUC'], eval_metric='Logloss',
model.best_iteration_, model.best_score_, model.tree_count_    
"""
(152,
 {'learn': {'Accuracy': 0.9491017964071856, 'Logloss': 0.1747009677350333},
  'validation': {'Accuracy': 0.8385650224215246,
   'Logloss': 0.39249638575985446,
   'AUC': 0.9018111688747275}},
 153)
"""

# custom_metric=['Accuracy','AUC'], eval_metric='Accuracy',
model.best_iteration_, model.best_score_, model.tree_count_    
"""
(219,
 {'learn': {'Accuracy': 0.9491017964071856, 'Logloss': 0.1747009677350333},
  'validation': {'Accuracy': 0.8385650224215246,
   'Logloss': 0.39249638575985446,
   'AUC': 0.9018111688747275}},
 220)
"""

1. User Defined Objective Function

class LoglossObjective(object):
    def calc_ders_range(self, approxes, targets, weights):
        """
        approxes, targets, weights are indexed containers of floats
        (containers which have only __len__ and __getitem__ defined).
        weights parameter can be None.
        
        To understand what these parameters mean, assume that there is
        a subset of your dataset that is currently being processed.
        approxes contains current predictions for this subset,
        targets contains target values you provided with the dataset.
        
        This function should return a list of pairs (der1, der2), where
        der1 is the first derivative of the loss function with respect
        to the predicted value, and der2 is the second derivative.
        
        In our case, logloss is defined by the following formula
        (written without the leading minus sign, matching the signs of der1 and der2 below):
        target * log(sigmoid(approx)) + (1 - target) * log(1 - sigmoid(approx))
        where sigmoid(x) = 1 / (1 + e^(-x)).
        """
        assert len(approxes) == len(targets)
        if weights is not None:
            assert len(weights) == len(approxes)
        result = []
        for index in range(len(targets)):
            e = np.exp(approxes[index])
            p = e / (1 + e)
            der1 = (1 - p) if targets[index] > 0.0 else -p
            der2 = -p * (1 - p)
            if weights is not None:
                der1 *= weights[index]
                der2 *= weights[index]
            result.append((der1, der2))
        return result

model = CatBoostClassifier(
    iterations=10,
    random_seed=42, 
    loss_function=LoglossObjective(), 
    eval_metric="Logloss"
)
# Fit model
model.fit(train_pool)
# Only prediction_type='RawFormulaVal' is allowed with custom `loss_function`
preds_raw = model.predict(X_test, prediction_type='RawFormulaVal')
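
A quick way to gain confidence in hand-written derivatives like those in calc_ders_range is a finite-difference check. The sketch below compares the analytic der1 and der2 with central differences of target * log(p) + (1 - target) * log(1 - p), where p = sigmoid(approx). The helper names and test points are made up for the illustration, and the check is independent of CatBoost.

```python
import math

def log_likelihood(approx, target):
    """target * log(p) + (1 - target) * log(1 - p), with p = sigmoid(approx)."""
    p = 1.0 / (1.0 + math.exp(-approx))
    return target * math.log(p) + (1.0 - target) * math.log(1.0 - p)

def analytic_ders(approx, target):
    """First and second derivatives, written as in calc_ders_range above."""
    p = 1.0 / (1.0 + math.exp(-approx))
    der1 = (1.0 - p) if target > 0.0 else -p
    der2 = -p * (1.0 - p)
    return der1, der2

def numeric_ders(approx, target, h=1e-4):
    """Central finite differences of the log-likelihood."""
    f_plus = log_likelihood(approx + h, target)
    f_zero = log_likelihood(approx, target)
    f_minus = log_likelihood(approx - h, target)
    return (f_plus - f_minus) / (2 * h), (f_plus - 2 * f_zero + f_minus) / (h * h)

# Hypothetical (approx, target) test points.
for approx, target in [(0.3, 1.0), (-1.2, 0.0), (2.0, 1.0)]:
    a1, a2 = analytic_ders(approx, target)
    n1, n2 = numeric_ders(approx, target)
    print(approx, target, abs(a1 - n1) < 1e-6, abs(a2 - n2) < 1e-5)
```

The agreement also shows that the derivatives belong to the log-likelihood rather than to its negative, which is why der2 is negative.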

2. User Defined Metric Function

class LoglossMetric(object):
    def get_final_error(self, error, weight):
        return error / (weight + 1e-38)

    def is_max_optimal(self):
        return False

    def evaluate(self, approxes, target, weight):
        """        
        approxes is a list of indexed containers
        (containers with only __len__ and __getitem__ defined),
        one container per approx dimension.
        Each container contains floats.
        weight is a one dimensional indexed container.
        target is a one dimensional indexed container of floats.
        
        weight parameter can be None.
        Returns pair (error, weights sum)
        """
        assert len(approxes) == 1
        assert len(target) == len(approxes[0])
        approx = approxes[0]
        error_sum = 0.0
        weight_sum = 0.0
        for i in range(len(approx)):
            w = 1.0 if weight is None else weight[i]
            weight_sum += w
            error_sum += -w * (target[i] * approx[i] - np.log(1 + np.exp(approx[i])))

        return error_sum, weight_sum

model = CatBoostClassifier(
    iterations=10,
    random_seed=42, 
    loss_function="Logloss",
    eval_metric=LoglossMetric()
)
# Fit model
model.fit(train_pool)
# A custom eval_metric does not restrict prediction types; raw scores are requested here as an example
preds_raw = model.predict(X_test, prediction_type='RawFormulaVal')
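
The per-object term in evaluate, -(target * approx - log(1 + exp(approx))), is the usual logloss written in terms of the raw score. The sketch below checks that identity numerically and then forms the weighted average in the same way as evaluate followed by get_final_error. The sample values and weights are hypothetical.

```python
import math

def metric_term(target, approx):
    """Per-object error as written in LoglossMetric.evaluate above."""
    return -(target * approx - math.log(1.0 + math.exp(approx)))

def textbook_logloss(target, approx):
    """-[t * log(p) + (1 - t) * log(1 - p)] with p = sigmoid(approx)."""
    p = 1.0 / (1.0 + math.exp(-approx))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

# Hypothetical (target, raw score) pairs.
samples = [(1.0, 0.8), (0.0, 0.8), (1.0, -2.5), (0.0, -0.1)]
differences = [abs(metric_term(t, a) - textbook_logloss(t, a)) for t, a in samples]

# Weighted average, mirroring evaluate() followed by get_final_error().
weights = [1.0, 2.0, 1.0, 0.5]
error_sum = sum(w * metric_term(t, a) for w, (t, a) in zip(weights, samples))
final_error = error_sum / (sum(weights) + 1e-38)

print(max(differences), final_error)
```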

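Evaluating a trained model on a new dataset (Eval Metrics)

CatBoost provides an eval_metrics method that computes the specified metrics of a trained model at every iteration and can also visualize them. It is useful for evaluating a trained model on a new dataset.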

model = CatBoostClassifier(iterations=50, random_seed=42, logging_level='Silent').fit(train_pool)
eval_metrics = model.eval_metrics(validate_pool, ['AUC','F1','Logloss'], plot=True)
# Returns a dict with the keys 'AUC', 'F1' and 'Logloss'


Comparing the learning curves of models with different parameters

from catboost import MetricVisualizer

model1 = CatBoostClassifier(iterations=100, depth=5, train_dir='model_depth_5/', logging_level='Silent')
model1.fit(train_pool, eval_set=validate_pool)

model2 = CatBoostClassifier(iterations=100, depth=8, train_dir='model_depth_8/', logging_level='Silent')
model2.fit(train_pool, eval_set=validate_pool);

widget = MetricVisualizer(['model_depth_5', 'model_depth_8'])
widget.start()


Saving and loading a model

Save the model as a binary file.

model = CatBoostClassifier(iterations=10, random_seed=42, logging_level='Silent').fit(train_pool)
model.save_model('catboost_model.dump')
model = CatBoostClassifier()
model.load_model('catboost_model.dump');

print(model.get_params())
print(model.random_seed_)
print(model.learning_rate_)

Model analysis and interpretation

shap

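Parameter tuning

Cross-validation and learning curves give the best iterations (boosting steps), but some other important parameters also need tuning, notably l2_leaf_reg and learning_rate. See the official documentation for the full list of parameters. The following demo tunes them with hyperopt: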

import hyperopt
from catboost import CatBoostClassifier, Pool, cv

def hyperopt_objective(params):
    
    model = CatBoostClassifier(
        l2_leaf_reg=int(params['l2_leaf_reg']),
        learning_rate=params['learning_rate'],
        iterations=100,
        eval_metric='Accuracy',
        loss_function= 'Logloss',
        random_seed=42,
        logging_level='Silent'
    )
    
    cv_data = cv(
        Pool(X, y, cat_features=categorical_features_indices),
        model.get_params()
    )
    best_accuracy = np.max(cv_data['test-Accuracy-mean'])
    
    return 1 - best_accuracy  # hyperopt minimizes, so return the error rate

from numpy.random import RandomState

params_space = {
    'l2_leaf_reg': hyperopt.hp.qloguniform('l2_leaf_reg', 0, 2, 1),
    'learning_rate': hyperopt.hp.uniform('learning_rate', 1e-3, 5e-1),
}

trials = hyperopt.Trials()

best = hyperopt.fmin(
    hyperopt_objective,
    space=params_space,
    algo=hyperopt.tpe.suggest,
    max_evals=10,
    trials=trials,
    rstate=RandomState(123)  # newer hyperopt releases expect np.random.default_rng(123) here
)

print(best)

"""
100%|██████████| 10/10 [01:02<00:00,  6.69s/it, best loss: 0.1728395061728395]
{'l2_leaf_reg': 3.0, 'learning_rate': 0.36395429572850696}
"""
model = CatBoostClassifier(
    l2_leaf_reg=int(best['l2_leaf_reg']),
    learning_rate=best['learning_rate'],
    iterations=100,
    eval_metric='Accuracy',
    loss_function= 'Logloss',
    random_seed=42,
    logging_level='Silent'
)
cv_data = cv(Pool(X, y, cat_features=categorical_features_indices), model.get_params())

print('Precise validation accuracy score: {}'.format(np.max(cv_data['test-Accuracy-mean'])))
print(f"Best iteration: {int(np.argmax(cv_data['test-Accuracy-mean'])+1)}")

"""
Precise validation accuracy score: 0.8271604938271605
Best iteration: 49
"""
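
The search space above draws l2_leaf_reg with hp.qloguniform('l2_leaf_reg', 0, 2, 1). The hyperopt documentation describes that draw as round(exp(uniform(low, high)) / q) * q. The sketch below reproduces the rule with the standard library only, to show which values the search can propose; the helper name and the seed are made up, and hyperopt itself is not used.

```python
import math
import random

def sample_qloguniform(rng, low, high, q):
    """round(exp(uniform(low, high)) / q) * q, the rule documented for hp.qloguniform."""
    return round(math.exp(rng.uniform(low, high)) / q) * q

rng = random.Random(123)  # arbitrary seed for this sketch
l2_samples = [sample_qloguniform(rng, 0, 2, 1) for _ in range(1000)]

# exp(0) = 1 and exp(2) is about 7.39, so after rounding every sample is an integer from 1 to 7.
print(sorted(set(l2_samples)))
```

Small values are proposed more often than large ones because the draw is uniform on the log scale, which suits a regularization coefficient.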

Notes on some commonly used parameters follow; for the rest, see the official documentation page Python Training Parameters.

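  1. iterations + learning_rate

    By default training runs 1000 iterations, and learning_rate is chosen automatically from the dataset and the iterations value. If you reduce iterations, it is best to increase learning_rate accordingly so that training converges.

    If training has not converged, consider raising learning_rate; if you see overfitting, lower learning_rate.

  2. boosting_type

    The default is Ordered (in recent releases the default depends on the dataset size and the processing unit). Ordered works well and is recommended for small datasets, but it is slower than Plain mode.

  3. bootstrap_type

  4. one_hot_max_size

    When categorical features are converted, features whose number of distinct values is less than or equal to one_hot_max_size are one-hot encoded; the other categorical features are encoded with more elaborate statistics. One-hot encoding is usually faster, whereas computing the statistics takes more time, so setting this parameter to a larger value can speed up training.

  5. rsm: Alias: colsample_bylevel, float (0, 1]

    The fraction of features considered at each split selection. With several hundred features or more, this parameter is very effective: it speeds up training considerably while keeping good results. With few features it can be left unset.

    Suppose you have many features and set rsm=0.1. You typically need about 20% more iterations for the model to converge, but each iteration is about 10 times faster.

  6. max_ctr_complexity

    The maximum number of features that can be combined. CatBoost builds combinations of categorical features with a greedy algorithm, which is very time-consuming. Set max_ctr_complexity = 1 to disable feature combinations, or max_ctr_complexity = 2 to combine only pairs of features.

  7. depth

    Tree depth. In most cases the best value lies between 4 and 10; values between 6 and 10 are worth trying.

  8. l2_leaf_reg

    The L2 regularization coefficient. Try several different values.

  9. random_strength

    Helps prevent overfitting. It is the random factor added when the score of each candidate split is computed. The score itself is deterministic; a noise term with mean 0 and variance 1*random_strength (the variance decreases as iterations proceed) is added to introduce randomness and prevent overfitting.

  10. bagging_temperature: [0, inf)

    Effective when bootstrap_type is Bayesian; it sets the parameter of the Bayesian bootstrap. When the value is 1, weights are sampled from an exponential distribution; when it is 0, all weights are 1. The larger the value, the more aggressive the bootstrap.

  11. has_time

    If the dataset is a time series and the order of the samples matters, set this parameter. During the Transforming categorical features to numerical features and Choosing the tree structure stages, the data then keeps its original order, or is ordered by the Timestamp column (if one is declared in the input data), and no shuffling (random permutations) is performed.

  12. grow_policy: possible values are [SymmetricTree, Depthwise, Lossguide]

    How the decision trees are grown. The default is level-wise symmetric trees.

    min_data_in_leaf (Alias: min_child_samples): supported for Depthwise and Lossguide

    max_leaves (Alias: num_leaves): supported for Lossguide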
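
The rsm trade-off quoted in the list above (about 20% more iterations, each about 10 times faster) can be turned into a rough overall estimate. The iteration count and the per-iteration time below are hypothetical; only the two ratios come from the text.

```python
# Hypothetical baseline: training with all features.
baseline_iterations = 1000
baseline_seconds_per_iteration = 0.05

# Ratios quoted for rsm=0.1: about 20% more iterations, each about 10x faster.
iterations_with_rsm = baseline_iterations * 1.2
seconds_per_iteration_with_rsm = baseline_seconds_per_iteration / 10

baseline_total = baseline_iterations * baseline_seconds_per_iteration
rsm_total = iterations_with_rsm * seconds_per_iteration_with_rsm

# Overall speedup is 10 / 1.2, independent of the hypothetical numbers above.
speedup = baseline_total / rsm_total

print("total time: {:.1f}s -> {:.1f}s, speedup {:.1f}x".format(
    baseline_total, rsm_total, speedup))
```

Under these ratios the total training time drops to roughly 12% of the original, so the extra iterations cost far less than the per-iteration saving.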

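In a GPU environment, you can set task_type="GPU".

border_count (Alias: max_bin): the number of splits for numerical features. The default is 254 on CPU and 128 on GPU. On CPU this parameter does not significantly affect training speed; on GPU it does. Set it to 254 for the best training quality, or lower it for faster training.

For example:

A faster model:

from catboost import CatBoostClassifier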
fast_model = CatBoostClassifier(
    random_seed=63,
    iterations=150,
    learning_rate=0.01,
    boosting_type='Plain',
    bootstrap_type='Bernoulli',
    subsample=0.5,
    one_hot_max_size=20,
    rsm=0.5,
    leaf_estimation_iterations=5,
    max_ctr_complexity=1,
    border_count=32)

fast_model.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    logging_level='Silent',
    plot=True
)

A more accurate model:

tuned_model = CatBoostClassifier(
    random_seed=63,
    iterations=1000,
    learning_rate=0.03,
    l2_leaf_reg=3,
    bagging_temperature=1,
    random_strength=1,
    one_hot_max_size=2,
    leaf_estimation_method='Newton',
    depth=6
)
tuned_model.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    logging_level='Silent',
    eval_set=(X_validation, y_validation),
    plot=True
)
