A proposal I wrote for ESoWC 2019 this year: An AutoML Framework for Predicting Extreme Weather Hazards.
Selected challenge: Challenge #12 - Machine learning for predicting extreme weather hazards
1.1 Summary and Motivation
Extreme weather hazards often cause significant disruption because they are hard to predict precisely, and with global warming, meteorological disasters appear to be growing more and more frequent [1]. The weather systems that lead to these severe events are mostly at the meso- or micro-scale, so they can hardly be captured by low-resolution global numerical weather prediction, which makes them a real headache for meteorologists. Building on the big data from ECMWF, we propose to predict these extreme weather hazards with machine learning. Although a few applications of machine learning or deep learning to severe weather prediction can be found in both academia and industry [2-5], there is still no general, open-source framework for this class of problems.
1.2 Our Plans
Machine learning is powerful but tricky for newcomers, and searching for a well-performing candidate model is knowledge- and labor-intensive. To make machine learning easier to apply to real-world problems, AutoML (Automated Machine Learning) has emerged and become a genuinely hot topic.
We are going to build a general AutoML framework specifically for extreme weather hazards prediction, named EWH-AutoML (Extreme Weather Hazards AutoML). At first, we will focus on predicting extreme precipitation in the coming 3 hours; this task will also serve as a testbed for polishing the AutoML framework. Finally, we will transfer the trained models and the framework to other extreme weather hazards prediction problems in order to verify the effectiveness of EWH-AutoML.
2. Solution to the Challenge
The EWH-AutoML framework mainly consists of four modules: Data Preprocessing, Feature Engineering, Features and Models Searching, and Model Ensemble. The overall structure of the framework and the module interactions are shown separately in the two figures below, and the modules are introduced in detail in Sections 2.1–2.5.
Figure 1. Workflow of EWH-AutoML framework
Figure 2. Structure and the module interactions of EWH-AutoML framework
2.1 Data Preprocessing
First of all, we will extract the raw meteorological and climatic data needed for generating sufficient features from the original GRIB or NetCDF files and transform them into a unified DataFrame format through Pandas. Although many existing Python packages can help with the tedious work of extracting and formatting data, the process is still time-consuming and error-prone, especially for those unfamiliar with the GRIB or NetCDF formats.
In our data preprocessing module, we will build a well-crafted wrapper on top of some of these existing packages and offer a simple API that extracts and formats the raw data into DataFrames through parallel computing. In contrast to other existing packages, processing big files and multiple files concurrently will be well supported in our module. For convenience, the raw data will be stored in Amazon S3.
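A minimal sketch of what the wrapper's API could look like, assuming a thread pool for concurrent reads; the function names and the placeholder reader are illustrative, not the final implementation (a real `read_one` would open the file with a package such as xarray with the cfgrib engine):

```python
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def read_one(path):
    """Placeholder reader: a real implementation would open the GRIB or
    NetCDF file and flatten it into tidy (time, lat, lon, variable) rows."""
    # hypothetical stand-in so the sketch is self-contained
    return pd.DataFrame({"file": [path], "value": [0.0]})

def extract_to_dataframe(paths, max_workers=4):
    """Read many files concurrently and concatenate into one DataFrame."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        frames = list(pool.map(read_one, paths))
    return pd.concat(frames, ignore_index=True)
```

A thread pool suits this I/O-bound task; for CPU-heavy decoding a process pool could be swapped in.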
Extreme weather hazards are quite infrequent in the real world, so the data is extremely imbalanced, and we should handle the imbalanced dataset by resampling first. In contrast to simple oversampling or undersampling, we will use the SMOTE algorithm [6], an oversampling method that creates samples by synthesizing new points from the minority class rather than simply duplicating existing ones.
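To illustrate the idea, here is a toy NumPy sketch of SMOTE's core step, interpolating between a minority sample and one of its nearest neighbours; in practice we would rely on an existing implementation such as the one in the imbalanced-learn package:

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, rng=None):
    """Synthesize n_new minority-class samples by interpolating between
    each sample and a randomly chosen one of its k nearest neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # exclude self-matches
    neighbours = np.argsort(d, axis=1)[:, :min(k, n - 1)]
    new = []
    for _ in range(n_new):
        i = rng.integers(n)                      # random minority sample
        j = rng.choice(neighbours[i])            # one of its neighbours
        gap = rng.random()                       # interpolation factor
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(new)
```

Each synthetic point lies on the segment between two real minority samples, so the oversampled class keeps its original support.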
2.2 Feature Engineering
Feature engineering can be tricky but is a crucial part of machine learning: most winners of data mining competitions spend the majority of their time on it. Manual feature engineering is obviously tedious and inefficient, so we are going to build an automated feature engineering module that generates a large set of features and their high-order combinations.
All features will be categorized and saved in different tables of the same database; this serves as our feature pool for building models. After feature engineering, the time series prediction problem will finally be transformed into a supervised machine learning problem.
Our features mainly include the following types:
(1) Basic meteorological or climatic variables at different levels, such as temperature, humidity, pressure, and the U and V wind components. According to our previous experience, both ECMWF's historical and forecast data are important, so if possible we hope to obtain both.
(2) Features derived from (1). We will process the basic variables further to derive more features, such as polynomial features, statistical features, binning features, etc.
(3) Temporal features such as Year, Season, Month, Day of the Year, Week of the Year, etc.
(4) Spatial features, mainly the features of (1) from neighboring areas as shown in Figure 3. The orange point in the center represents the target location for the prediction, and the basic meteorological data of the surrounding 24 sectors will be extracted as spatial features. Location information such as latitude, longitude, altitude, province/state, district, etc. will also be included.
Figure 3. The neighboring 24 sectors for generating spatial features
(5) Lag features and 1-D convolutional features. We will add forecast or observation data (if available) from the past few hours and extract 1-D convolutional features from the time series.
(6) Trend features and gradient features, which represent the temporal and spatial variations of the features of (1) and are potentially of great importance for rapidly changing weather systems.
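The temporal, lag, and trend feature types above can be sketched with Pandas as follows; the column name `"tp"` (total precipitation) and the function name are assumptions for illustration, and the real module would generate many more combinations automatically:

```python
import pandas as pd

def add_time_and_lag_features(df, col="tp", lags=(1, 2, 3)):
    """Derive temporal features (type 3), lag features (type 5) and a
    simple trend feature (type 6) from an hourly time series.
    `df` is expected to have a DatetimeIndex."""
    out = df.copy()
    out["month"] = out.index.month           # temporal features
    out["dayofyear"] = out.index.dayofyear
    for lag in lags:                         # lag features
        out[f"{col}_lag{lag}h"] = out[col].shift(lag)
    # temporal trend: change over the last 3 hours
    out[f"{col}_trend_3h"] = out[col] - out[col].shift(3)
    return out
```

The shifted columns contain NaN for the first few rows, which would be dropped or imputed downstream.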
Although some models such as LightGBM and CatBoost can handle categorical features directly, to keep the features usable by all of our models we will consistently apply one-hot encoding, label encoding, or mean encoding to categorical features.
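For example, one-hot encoding a categorical spatial feature is a one-liner with Pandas; the `"province"` column here is illustrative, not from a real dataset:

```python
import pandas as pd

# A tiny example frame with one categorical and one numeric column.
df = pd.DataFrame({"province": ["A", "B", "A"], "tp": [0.1, 0.0, 2.3]})

# Replace the categorical column with one indicator column per category.
encoded = pd.get_dummies(df, columns=["province"])
```

Label encoding and mean encoding would instead map each category to a single integer or to the target mean for that category, keeping the column count fixed.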
2.3 Features and Models Searching
After automated feature engineering, feature selection will take place before building models. It helps filter out unimportant features to save computing time and avoid overfitting. There are many ways to do feature selection, such as forward feature selection, recursive feature elimination, etc. To perform it, we will first load a subset of features from our feature pool and transform the feature dataset into NumPy ndarray format for model training.
Considering the huge data size, we will do fast feature selection based on LightGBM's built-in feature importance, which measures how often a feature is used for splits or how much gain it contributes, and choose the top 10% most important features for feature combination. Besides selecting by feature importance, we will tune the L1 and L2 regularization parameters to filter out unimportant features or assign them smaller weights.
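The top-10% selection step is straightforward once an importance array is available, e.g. from LightGBM's `Booster.feature_importance()`; this small helper (an illustrative name) shows the intent:

```python
import numpy as np

def select_top_features(names, importances, frac=0.10):
    """Keep the top `frac` of features ranked by importance, e.g. the
    split counts or gains returned by LightGBM's feature_importance()."""
    k = max(1, int(len(names) * frac))       # at least one feature survives
    order = np.argsort(importances)[::-1][:k]  # indices, most important first
    return [names[i] for i in order]
```

The surviving feature names are then used to slice the feature pool before the combination stage.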
After feature selection, we will build various models and search for suitable hyper-parameters for each. In contrast to the exhaustive grid search method, we prefer the Bayesian optimization algorithm [8]. Finally, we will evaluate our models on the same test set; qualified models whose evaluation scores exceed a preset threshold will be stored in Amazon S3 for the final ensemble.
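To make the Bayesian optimization idea concrete, here is a toy 1-D sketch with a Gaussian-process surrogate and the expected-improvement acquisition function; all names are illustrative, and in practice we would use an established library such as scikit-optimize or hyperopt:

```python
import math
import numpy as np

def rbf(a, b, ls=0.2):
    """Squared-exponential kernel between two sets of 1-D points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def bayes_opt_1d(f, bounds=(0.0, 1.0), n_init=3, n_iter=10, seed=0):
    """Minimize f on an interval: fit a GP to observations, then pick
    the point maximizing expected improvement, and repeat."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(bounds[0], bounds[1], n_init)     # initial random probes
    y = np.array([f(x) for x in X])
    grid = np.linspace(bounds[0], bounds[1], 200)     # candidate points
    phi = lambda z: np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
    Phi = np.vectorize(lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2))))
    for _ in range(n_iter):
        K = rbf(X, X) + 1e-6 * np.eye(len(X))         # jitter for stability
        Kinv = np.linalg.inv(K)
        k_star = rbf(grid, X)
        mu = k_star @ Kinv @ y                        # GP posterior mean
        var = np.clip(1.0 - np.sum((k_star @ Kinv) * k_star, axis=1), 1e-9, None)
        sd = np.sqrt(var)                             # GP posterior std
        z = (y.min() - mu) / sd
        ei = sd * (z * Phi(z) + phi(z))               # expected improvement
        x_next = grid[np.argmax(ei)]                  # most promising point
        X = np.append(X, x_next)
        y = np.append(y, f(x_next))
    best = np.argmin(y)
    return X[best], y[best]
```

Unlike grid search, each new evaluation is placed where the surrogate predicts the best trade-off between a low mean and high uncertainty.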
2.4 Model Ensemble
Ensemble learning, a kind of meta-algorithm that combines several machine learning models, will probably outperform each single model by decreasing variance and bias [9]. Bagging and boosting are two common approaches, and blending and stacking are also used frequently. We will use stacking to generate our final model, as shown in Figure 4. First, the whole dataset will be split into 5 folds to produce cross-validated predictions with our model candidates stored in Amazon S3; these out-of-fold predictions will then be combined for meta-learning. The reason for generating 5-fold cross-validation predictions rather than predicting directly is to prevent overfitting. With this set of predictions, we can train the stacking model for the final prediction.
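The out-of-fold prediction step at the heart of stacking can be sketched as follows; the `(X_train, y_train) -> predict_fn` interface for base models is an assumption made to keep the sketch self-contained:

```python
import numpy as np

def oof_predictions(models, X, y, n_folds=5, seed=0):
    """Out-of-fold predictions for stacking: each base model predicts
    only on the fold it was NOT trained on, preventing leakage into
    the meta-learner's training data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    oof = np.zeros((len(X), len(models)))
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)          # everything except this fold
        for m, fit in enumerate(models):
            predict = fit(X[train], y[train])    # train on the other folds
            oof[fold, m] = predict(X[fold])      # predict the held-out fold
    return oof  # the meta-learner is then trained on these columns vs. y

def mean_model(X_train, y_train):
    """Trivial base model for demonstration: predicts the training mean."""
    mean = y_train.mean()
    return lambda X_new: np.full(len(X_new), mean)
```

Each column of the returned matrix is one base model's cross-validated prediction, which becomes a feature for the stacker.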
Figure 4. The structure of stacker
2.5 Transfer Learning Testing
Finally, we will apply transfer learning [10] with the trained extreme precipitation prediction model and our EWH-AutoML framework. Transfer learning here means not only using our trained model as a base model to fine-tune and generate prediction models for other extreme weather hazards, which is the commonly known sense of "transfer learning", but also trying to predict other extreme weather hazards with the EWH-AutoML framework itself.
Just as the classic pre-trained models in computer vision, such as VGG, Inception, and ResNet, can be used for transfer learning to save time and solve problems elegantly with limited data, the main advantage for us is saving time and possibly obtaining better results. We can use the trained model's parameters to initialize new models as a warm start, and then use the EWH-AutoML framework to retrain the final model.
We plan to spend nearly two months building the EWH-AutoML framework, and the remaining two months validating and optimizing it.
- In May, we'll focus on building the first two of the four major modules of the EWH-AutoML framework: Data Preprocessing and Feature Engineering. Data and derived features are of great importance in almost every machine learning algorithm, so we are going to spend a whole month on them.
- In June, we will finish the remaining two modules: Features and Models Searching and Model Ensemble. A baseline model built with the EWH-AutoML framework is expected by the end of this stage.
- July will mainly be devoted to building models, tuning parameters, and stacking candidate models into a more robust one. Several candidate models and the final stacking model will be ready before the end of the month, with framework optimization continuing in parallel.
- In August, the last month, we're going to refactor the EWH-AutoML framework to make it more robust, and meanwhile try transfer learning with the models trained in the previous stage.
Figure 5. Our proposed milestones
References
[1] Rahmstorf S, Coumou D. Increase of extreme events in a warming world[J]. Proceedings of the National Academy of Sciences, 2011, 108(44): 17905-17909.
[2] McGovern A, Elmore K L, Gagne D J, et al. Using artificial intelligence to improve real-time decision-making for high-impact weather[J]. Bulletin of the American Meteorological Society, 2017, 98(10): 2073-2090.
[3] Nayak M A, Ghosh S. Prediction of extreme rainfall event using weather pattern recognition and support vector machine classifier[J]. Theoretical and Applied Climatology, 2013, 114(3-4): 583-603.
[4] Gagne D J, McGovern A, Haupt S E, et al. Storm-based probabilistic hail forecasting with machine learning applied to convection-allowing ensembles[J]. Weather and Forecasting, 2017, 32(5): 1819-1840.
[5] McGovern A, Gagne D J, Williams J K, et al. Enhancing understanding and improving prediction of severe weather through spatiotemporal relational learning[J]. Machine Learning, 2014, 95(1): 27-50.
[6] Chawla N V, Bowyer K W, Hall L O, et al. SMOTE: synthetic minority over-sampling technique[J]. Journal of Artificial Intelligence Research, 2002, 16: 321-357.
[7] Domingos P M. A few useful things to know about machine learning[J]. Communications of the ACM, 2012, 55(10): 78-87.
[8] Snoek J, Larochelle H, Adams R P. Practical Bayesian optimization of machine learning algorithms[C]//Advances in Neural Information Processing Systems. 2012: 2951-2959.
[9] Zhou Z H. Ensemble methods: foundations and algorithms[M]. Chapman and Hall/CRC, 2012.
[10] Torrey L, Shavlik J. Transfer learning[M]//Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques. IGI Global, 2010: 242-264.
[11] Shi X, Gao Z, Lausen L, et al. Deep learning for precipitation nowcasting: A benchmark and a new model[C]//Advances in Neural Information Processing Systems. 2017: 5617-5627.