Advanced Guide

For a basic introduction to GAMA, read the User Guide first. This section covers more advanced usage of GAMA, in particular:

  • Ways to configure GAMA, such as:
    • A description of non-default AutoML steps and how to configure them.

    • Configuring the search space.

  • Interfacing with GAMA:
    • An introduction to optimization traces and visualizing them.

    • GAMA’s Events.

  • Developers notes:
    • A project overview.

    • How to add a search or post processing step.


AutoML Pipeline

An AutoML system performs several operations in its search for a model, and each of them may have several options and hyperparameters. An important decision is picking the search algorithm, which performs search over machine learning pipelines for your data. Another choice would be how to construct a model after search, e.g. by training the best pipeline or constructing an ensemble. Similarly to how data processing algorithms can form a machine learning pipeline, we will refer to a configuration of these AutoML components as an AutoML Pipeline. In GAMA we currently support flexibility in the AutoML pipeline in two stages: search and post-processing. See Adding Your Own Search or Postprocessing for more information on how to add your own.

Search Algorithms

The following search algorithms are available in GAMA:

  • Random Search: Randomly pick machine learning pipelines from the search space and evaluate them.

  • Asynchronous Evolutionary Algorithm: Evolve a population of machine learning pipelines, drawing new machine learning pipelines from the best of the population.

  • Asynchronous Successive Halving Algorithm: A bandit-based approach where many machine learning pipelines iteratively get evaluated and eliminated on bigger fractions of the data.
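To make the last idea more concrete, the following is a minimal, synchronous sketch of successive halving. It is not GAMA's implementation, which is asynchronous and evaluates real machine learning pipelines. The function name, its parameters and the `evaluate` callable are hypothetical stand-ins chosen for illustration.

```python
import math
from typing import Callable, List, Sequence


def successive_halving(
    candidates: Sequence[str],
    evaluate: Callable[[str, float], float],
    min_fraction: float = 0.125,
    reduction_factor: int = 2,
) -> List[str]:
    """Return the candidates that survive until the full data set is used.

    Each round scores every remaining candidate on `fraction` of the data,
    keeps the best 1/reduction_factor of them, and multiplies the fraction
    by reduction_factor (capped at 1.0).
    """
    remaining = list(candidates)
    fraction = min_fraction
    while True:
        # Higher score is better, so sort best first.
        scored = sorted(remaining, key=lambda c: evaluate(c, fraction), reverse=True)
        if fraction >= 1.0 or len(scored) == 1:
            return scored
        keep = max(1, math.ceil(len(scored) / reduction_factor))
        remaining = scored[:keep]
        fraction = min(1.0, fraction * reduction_factor)


# Toy scores standing in for real pipeline evaluations (hypothetical values).
toy_scores = {"a": 0.6, "b": 0.9, "c": 0.7, "d": 0.5}
survivors = successive_halving(
    list(toy_scores),
    evaluate=lambda name, fraction: toy_scores[name] * fraction,
    min_fraction=0.25,
)
```

Cheap evaluations on small fractions of the data weed out weak candidates early, so the expensive evaluations on the full data are spent on the most promising ones.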

Post-processing

The following post-processing steps are available:

  • None: no post-processing will be done. This means no final pipeline will be trained, so predict and predict_proba will be unavailable. This can be useful if you are only interested in the search procedure.

  • FitBest: fit the single best machine learning pipeline found during search.

  • Ensemble: create an ensemble out of evaluated machine learning pipelines. This requires more time but can lead to better results.

Configuring the AutoML pipeline

By default ‘prepend pipeline’, ‘Asynchronous EA’ and ‘FitBest’ are chosen for pre-processing, search and post-processing, respectively. However, it is easy to change this, or to change the hyperparameters with which each component is used. For example, searching with ‘Asynchronous Successive Halving’ and creating an ensemble during post-processing:

from gama import GamaClassifier
from gama.search_methods import AsynchronousSuccessiveHalving
from gama.postprocessing import EnsemblePostProcessing

custom_pipeline_gama = GamaClassifier(search=AsynchronousSuccessiveHalving(), post_processing=EnsemblePostProcessing())

or using ‘Asynchronous EA’ but with custom hyperparameters:

from gama import GamaClassifier
from gama.search_methods import AsyncEA

custom_pipeline_gama = GamaClassifier(search=AsyncEA(population_size=30))

GAMA Search Space Configuration

By default GAMA will build pipelines out of scikit-learn algorithms, both for preprocessing and learning models. It is possible to modify this search space, changing the algorithms or hyperparameter ranges to consider.

The search space is determined by the search_space dictionary passed upon initialization. The defaults are found in classification.py and regression.py for the GamaClassifier and GamaRegressor, respectively.

A sample of algorithms that GAMA uses by default:

The search space configuration is defined in a python dictionary. For reference, a minimal example search space configuration can look like this:

from sklearn.naive_bayes import BernoulliNB
search_space = {
    'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
    BernoulliNB: {
        'alpha': [],
        'fit_prior': [True, False]
    }
}

At the top level, allowed key types are:

  • string, with a list as value.

It specifies the name of a hyperparameter with its possible values. By defining a hyperparameter at the top level, you can reference it as a hyperparameter for any specific algorithm. To do so, identify it with the same name and set its possible values to an empty list (see alpha in the example). The benefit of doing so is that multiple algorithms can share a hyperparameter space that is defined only once. Additionally, in evolution this makes it possible to know which hyperparameter values can be crossed over between different algorithms.

  • class, with a dictionary as value.

The key specifies the algorithm; calling it should instantiate the algorithm. The dictionary specifies the hyperparameters by name and their possible values as a list. All hyperparameters specified should be accepted as arguments by the algorithm’s initialization. A hyperparameter specified at the top level of the dictionary can share a name with a hyperparameter of the algorithm. To use the values provided by the shared hyperparameter, set the possible values to an empty list. If a list of values is provided instead, the shared hyperparameter values will not be used.
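The rules above can be illustrated with a small sketch that expands empty value lists into the shared top-level values. This only mirrors the semantics described here and is not GAMA's internal code. `ToyClassifier` and `resolve_shared_hyperparameters` are hypothetical names, with `ToyClassifier` standing in for a real algorithm class such as BernoulliNB.

```python
from typing import Any, Dict, List


class ToyClassifier:
    """Hypothetical stand-in for an algorithm class such as BernoulliNB."""

    def __init__(self, alpha: float = 1.0, fit_prior: bool = True):
        self.alpha = alpha
        self.fit_prior = fit_prior


def resolve_shared_hyperparameters(
    search_space: Dict[Any, Any]
) -> Dict[type, Dict[str, List[Any]]]:
    """Replace empty value lists with the shared top-level values."""
    # Top-level string keys define shared hyperparameters.
    shared = {key: values for key, values in search_space.items() if isinstance(key, str)}
    resolved = {}
    for key, hyperparameters in search_space.items():
        if isinstance(key, str):
            continue
        resolved[key] = {
            # An empty list means "use the shared definition with the same name".
            name: list(values) if values else list(shared[name])
            for name, values in hyperparameters.items()
        }
    return resolved


search_space = {
    "alpha": [1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0],
    ToyClassifier: {"alpha": [], "fit_prior": [True, False]},
}
resolved = resolve_shared_hyperparameters(search_space)
```

In this sketch an empty list without a matching top-level definition raises a KeyError, which makes a misspelled shared name easy to spot.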


Logging

GAMA makes use of the default Python logging module. This means logs can be captured at different levels, and handled by one of several handlers (such as a StreamHandler or a FileHandler).

The most common logging use cases are to write a comprehensive log to file, as well as print important messages to stdout. Writing log messages to stdout is directly supported by GAMA through the verbosity hyperparameter (which defaults to logging.WARNING).

By default GAMA will also save several different logs. This can be turned off by the store hyperparameter. The store hyperparameter allows you to store the logs, as well as models and predictions. By default logs are kept (which includes evaluation data), but models and predictions are discarded.

The output_directory hyperparameter determines where this data is stored. By default a unique name is generated. In the output directory you will find three files and a subdirectory:

  • ‘evaluations.log’: a csv file (with ‘;’ as separator) in which each evaluation is stored.

  • ‘gama.log’: A loosely structured file with general (human readable) information of the GAMA run.

  • ‘resources.log’: A record of the memory usage for each of GAMA’s processes over time.

  • cache directory: contains evaluated models and predictions, only if store is ‘all’ or ‘models’
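Because ‘evaluations.log’ is a ‘;’-separated csv file, it can be read with the standard csv module. The sketch below assumes that the first line of the file is a header row. The column names in the sample are made up for illustration and may differ from what GAMA actually writes.

```python
import csv
import io
from typing import Dict, Iterable, List


def read_evaluations(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse ';'-separated evaluation records into one dict per row.

    Assumes the first line is a header row that names the columns.
    """
    return list(csv.DictReader(lines, delimiter=";"))


# Hypothetical content; the real column names may differ.
sample = io.StringIO(
    "id;pipeline;score\n"
    "1;GaussianNB(data);0.91\n"
    "2;DecisionTreeClassifier(data);0.88\n"
)
rows = read_evaluations(sample)
```

For a real run, the same function can be given an open file object, for example one obtained with open(path, newline="") where path points to ‘evaluations.log’ in the output directory. All values are returned as strings, so numeric columns still need to be converted.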

If you want other behavior, the logging module offers you great flexibility in making your own variations. The following script writes any log messages of level logging.DEBUG or higher to both file and console:

import logging
from gama import GamaClassifier

# Capture everything from DEBUG upwards on the 'gama' logger.
gama_log = logging.getLogger('gama')
gama_log.setLevel(logging.DEBUG)

# Write all captured messages to a file.
fh_log = logging.FileHandler('logfile.txt')
fh_log.setLevel(logging.DEBUG)
gama_log.addHandler(fh_log)

# The verbosity hyperparameter sets up a StreamHandler to `stdout`.
automl = GamaClassifier(max_total_time=180, verbosity=logging.DEBUG, store="nothing")

Running the above script will create the ‘logfile.txt’ file with all log messages that could also be seen in the console. An overview of the log levels:

  • DEBUG: Messages for developers.

  • INFO: General information about the optimization process.

  • WARNING: Serious errors that do not prohibit GAMA from running to completion (but results could be suboptimal).

  • ERROR: Errors which prevent GAMA from running to completion.

As described in Dashboard, the files in the output directory can be used to generate visualizations of the optimization process.


Events

It is also possible to programmatically receive updates on the optimization process through events:

from gama import GamaClassifier

def print_evaluation(evaluation):
    print(f'{evaluation.individual.pipeline_str()} was evaluated. Fitness is {evaluation.score}.')

automl = GamaClassifier()
automl.evaluation_completed(print_evaluation)
automl.fit(X, y)  # X, y: your training features and labels

The function passed to evaluation_completed should take a gama.genetic_programming.utilities.evaluation_library.Evaluation as its single argument. Any exception raised but not handled in the callback is ignored, but logged at the logging.WARNING level.

During the callback a stopit.utils.TimeoutException may be raised. This signal normally tells GAMA to move on to the next step in the AutoML pipeline. If the callback catches it, GAMA may exceed its allotted time. For this reason, keep any work done after catching a stopit.utils.TimeoutException short. If the stopit.utils.TimeoutException is not caught, GAMA will correctly terminate its current step in the AutoML pipeline and continue as normal.
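One way to follow this advice is to have the callback do only cheap, in-memory work and catch nothing, so that a timeout signal can propagate. The sketch below is a hypothetical example: `ScoreRecorder` is not part of GAMA, and the stand-in objects only mimic the `score` attribute used in the example above.

```python
from types import SimpleNamespace
from typing import Any, List


class ScoreRecorder:
    """Callback that records every score it receives."""

    def __init__(self) -> None:
        self.scores: List[Any] = []

    def __call__(self, evaluation: Any) -> None:
        # Do only cheap, in-memory work and catch nothing here, so that a
        # timeout signal raised during the callback can propagate.
        self.scores.append(evaluation.score)


recorder = ScoreRecorder()
# Stand-in objects that only mimic the `score` attribute (made-up values).
for fake_score in (0.71, 0.84, 0.79):
    recorder(SimpleNamespace(score=fake_score))
```

With GAMA, such a recorder would be registered in the same way as print_evaluation above, by passing it to evaluation_completed. More expensive processing of the recorded scores can then happen after fit has returned.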


Developers Notes

Adding Your Own Search or Postprocessing

Note

This is not set in stone. As more AutoML pipeline steps are added by more people, we expect to identify parts of the interface to be improved. We can’t do this without your feedback! Feel free to get in touch, preferably in the form of a public discussion on a Github issue, and let us know what difficulties you encounter, or what works well!

This section contains information about implementing your own Search or Postprocessing procedures. To keep interfaces uniform across the different search or postprocessing implementations, each should derive from its respective base class (BaseSearch or BasePostProcessing). Each has its own processing method (search for BaseSearch and post_process for BasePostProcessing) which should be implemented. We will show example implementations further down.

Your Search or Postprocessing algorithm may feature hyperparameters, and care should be taken to provide good default values. For some algorithms, hyperparameter default values are best specified based on characteristics of the dataset. For instance, with big datasets it might be useful to perform (some of) the workload on a subset of the data. We refer to these data-dependent non-static defaults as ‘dynamic defaults’. Both BaseSearch and BasePostProcessing feature a dynamic_defaults method which is called before search and ..., respectively. This allows you to overwrite default hyperparameter values based on the dataset properties. The hyperparameter values with which your search or postprocessing will be called are determined in the following order:

  • User specified values are used if specified (e.g. EnsemblePostProcessing(n=25))

  • Otherwise the values determined by dynamic_defaults are used

  • If neither is specified, the static default values are used.
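The precedence order above can be sketched as follows. `HyperparameterResolver`, the `sample_size` and `n` hyperparameters and the 10,000 row threshold are all hypothetical and only illustrate the ordering. This is not the code of BaseSearch or BasePostProcessing.

```python
from typing import Any, Dict


class HyperparameterResolver:
    """Hypothetical helper: user value > dynamic default > static default."""

    def __init__(self, static_defaults: Dict[str, Any], **user_values: Any) -> None:
        self._static = dict(static_defaults)
        # None is treated as "not specified by the user".
        self._user = {k: v for k, v in user_values.items() if v is not None}
        self._dynamic: Dict[str, Any] = {}

    def dynamic_defaults(self, n_rows: int) -> None:
        # Illustrative rule only: subsample data sets with over 10,000 rows.
        self._dynamic["sample_size"] = 10_000 if n_rows > 10_000 else n_rows

    @property
    def hyperparameters(self) -> Dict[str, Any]:
        # Later dictionaries take precedence over earlier ones.
        return {**self._static, **self._dynamic, **self._user}


defaults = {"sample_size": None, "n": 10}

# No user value: the dynamic default is used once it has been computed.
auto = HyperparameterResolver(defaults)
auto.dynamic_defaults(n_rows=50_000)

# A user value always wins, even after dynamic_defaults has been called.
manual = HyperparameterResolver(defaults, sample_size=500)
manual.dynamic_defaults(n_rows=50_000)
```

Until dynamic_defaults has been called, only user values and static defaults are considered in this sketch.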

PostProcessing

PostProcessing follows a similar pattern, where the class should allow initialization with its hyperparameters, an implementation of dynamic defaults (optional), and a post_process function.

class gama.postprocessing.BasePostProcessing(time_fraction: float)

All post-processing methods should be derived from this class. This class should not be directly used to configure GAMA.

Parameters

time_fraction (float) – Fraction of total time to be reserved for this post-processing step.

property hyperparameters

Hyperparameter (name, value) pairs.

Value determined by user > dynamic default > static default. Dynamic default values only considered if dynamic_defaults has been called.

post_process(x: pandas.core.frame.DataFrame, y: Union[pandas.core.frame.DataFrame, pandas.core.series.Series], timeout: float, selection: List[gama.genetic_programming.components.individual.Individual]) → object
Parameters
  • x (pd.DataFrame) – all training features

  • y (Union[pd.DataFrame, pd.Series]) – all training labels

  • timeout (float) – allowed time in seconds for post-processing

  • selection (List[Individual]) – individuals selected during search, ordered best first

Returns

A model with predict and optionally predict_proba.

Return type

Any

to_code(preprocessing: Optional[Sequence[Tuple[str, sklearn.base.TransformerMixin]]] = None) → str

Generate Python code to reconstruct a pipeline that constructs the model.

Parameters

preprocessing (Sequence[TransformerMixin], optional (default=None)) – Preprocessing steps that need to be executed before the model.

Returns

A string of Python code that sets a ‘pipeline’ variable to the pipeline that defines the final pipeline generated by post-processing.

Return type

str

Unlike the search methods, which are not required to have any hyperparameters, a post-processing method is required to have a default value for time_fraction. time_fraction is the fraction of the total time (as set on initialization through max_total_time) that should be reserved for the post-processing method. For instance, when a post-processing object’s time_fraction is 0.3 and GAMA is initialized with max_total_time=3600, then 3600*0.3=1080 seconds are reserved for the post-processing phase.
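The arithmetic can be written as a small helper. `split_time_budget` is a hypothetical function, not part of GAMA, and it makes the simplifying assumption that all time not reserved for post-processing is available to search.

```python
from typing import Tuple


def split_time_budget(max_total_time: float, time_fraction: float) -> Tuple[float, float]:
    """Return (search_seconds, post_processing_seconds).

    Simplifying assumption: whatever is not reserved for post-processing
    is available to search.
    """
    if not 0 <= time_fraction <= 1:
        raise ValueError("time_fraction must be between 0 and 1")
    post_processing_seconds = max_total_time * time_fraction
    return max_total_time - post_processing_seconds, post_processing_seconds


# The example from the text: 30% of one hour is reserved for post-processing.
search_seconds, post_seconds = split_time_budget(max_total_time=3600, time_fraction=0.3)
```

Under this assumption, the example from the text leaves 2520 seconds for search and 1080 seconds for post-processing.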

Note

While hard, it is important to provide an accurate estimate for time_fraction. If you reserve too much time, the search procedure will be cut off unnecessarily early. If too little time is reserved, GAMA will interrupt the post-processing step and return control to the user. It is generally hard to know how much time to reserve, as this likely depends on the dataset and the number of pipelines evaluated in search. We would like to implement ways in which post-processing methods have access to these statistics and allow them to update their time estimate, so that less time is wasted on too long or too short post-processing phases.