API
GAMA
GamaClassifier
class gama.GamaClassifier(config=None, scoring='neg_log_loss', *args, **kwargs)
Gama with adaptations for (multi-class) classification.
- Parameters
scoring (str, Metric or Tuple) – Specifies the/all metric(s) to optimize towards. A string will be converted to Metric. A tuple must specify each metric with the same type (e.g. all str). See Metrics for built-in metrics.
regularize_length (bool (default=True)) – If True, add pipeline length as an optimization metric. Short pipelines should then be preferred over long ones.
max_pipeline_length (int, optional (default=None)) – If set, limit the maximum number of steps in any evaluated pipeline. Encoding and imputation are excluded.
config (Dict) – Specifies available components and their valid hyperparameter settings. For more information, see GAMA Search Space Configuration.
random_state (int, optional (default=None)) – Seed for the random number generators used in the process. However, with n_jobs > 1, there will be randomization introduced by multi-processing. For reproducible results, set this and use n_jobs=1.
max_total_time (positive int (default=3600)) – Time in seconds that can be used for the fit call.
max_eval_time (positive int, optional (default=None)) – Time in seconds that can be used to evaluate any one single individual. If None, set to 0.1 * max_total_time.
n_jobs (int, optional (default=None)) – The number of parallel processes that may be created to speed up fit. Accepted values are positive integers, -1 or None. If -1 is specified, multiprocessing.cpu_count() processes are created. If None is specified, multiprocessing.cpu_count() / 2 processes are created.
max_memory_mb (int, optional (default=None)) – Sets the total amount of memory GAMA is allowed to use (in megabytes). If not set, GAMA will use as much as it needs. GAMA is not guaranteed to respect this limit at all times, but it should never violate it for too long.
verbosity (int (default=logging.WARNING)) – Sets the level of log messages to be automatically output to terminal.
search (BaseSearch (default=AsyncEA())) – Search method to use to find good pipelines. Should be instantiated.
post_processing (BasePostProcessing (default=BestFitPostProcessing())) – Post-processing method to create a model after the search phase. Should be an instantiated subclass of BasePostProcessing.
output_directory (str, optional (default=None)) – Directory to use to save GAMA output. This includes both intermediate results during search and logs. If set to None, generate a unique name (“gama_HEXCODE”).
store (str (default='logs')) – Determines which data is stored after each run:
- 'nothing': keep nothing from this run
- 'models': keep only cache with models and predictions
- 'logs': keep only the logs
- 'all': keep logs and cache with models and predictions
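The default rules described for n_jobs and max_eval_time can be sketched in plain Python. This is a minimal illustration of the documented behaviour, not GAMA's own code: the helper names resolve_n_jobs and resolve_max_eval_time are hypothetical, and rounding the halved core count down (with a floor of one process) is an assumption.

```python
import multiprocessing
from typing import Optional


def resolve_n_jobs(n_jobs: Optional[int]) -> int:
    """Map the documented n_jobs values to a concrete process count."""
    cpus = multiprocessing.cpu_count()
    if n_jobs is None:
        # Documented default: half the cores. Rounding down and the
        # floor of one process are assumptions of this sketch.
        return max(1, cpus // 2)
    if n_jobs == -1:
        return cpus
    if n_jobs > 0:
        return n_jobs
    raise ValueError("n_jobs must be a positive integer, -1 or None")


def resolve_max_eval_time(max_total_time: int, max_eval_time: Optional[int]) -> float:
    """Fall back to 10% of the total budget when no per-pipeline limit is set."""
    if max_total_time <= 0:
        raise ValueError("max_total_time must be positive")
    if max_eval_time is None:
        return 0.1 * max_total_time
    return float(max_eval_time)
```

With the default max_total_time of 3600 seconds, a single pipeline evaluation would therefore be limited to roughly 360 seconds.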
Search Methods
AsynchronousSuccessiveHalving
class gama.search_methods.AsynchronousSuccessiveHalving(reduction_factor: Optional[int] = None, minimum_resource: Optional[int] = None, maximum_resource: Optional[int] = None, minimum_early_stopping_rate: Optional[int] = None)
Asynchronous Successive Halving Algorithm (ASHA) by Li et al.
Paper: https://arxiv.org/abs/1810.05934
- Parameters
reduction_factor (int, optional (default=3)) – Reduction factor of candidates between each rung.
minimum_resource (int, optional (default=100)) – Number of samples to use in the lowest rung.
maximum_resource (int, optional (default=number of samples in the dataset)) – Number of samples to use in the top rung. This should not exceed the number of samples in the data.
minimum_early_stopping_rate (int (default=1)) – Number of lowest rungs to skip.
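To make the interaction of these four parameters concrete, the sketch below computes how many samples each rung would use. It follows the general successive-halving scheme (resources grow by reduction_factor per rung and are capped at maximum_resource); the function name rung_resources and the exact capping rule are assumptions, not GAMA's implementation. The maximum of 10,000 samples is an arbitrary example value.

```python
from typing import List


def rung_resources(
    minimum_resource: int = 100,
    maximum_resource: int = 10_000,
    reduction_factor: int = 3,
    minimum_early_stopping_rate: int = 1,
) -> List[int]:
    """Return the number of samples used at each rung, lowest rung first."""
    if reduction_factor < 2:
        raise ValueError("reduction_factor must be at least 2")
    if not 0 < minimum_resource <= maximum_resource:
        raise ValueError("need 0 < minimum_resource <= maximum_resource")
    resources = []
    resource = minimum_resource
    while resource < maximum_resource:
        resources.append(resource)
        resource *= reduction_factor
    resources.append(maximum_resource)  # the top rung is capped at the maximum
    # Skip the lowest rungs, as minimum_early_stopping_rate describes.
    return resources[minimum_early_stopping_rate:]


def n_promotable(n_in_rung: int, reduction_factor: int = 3) -> int:
    """Only the best 1/reduction_factor of a rung moves up to the next one."""
    return n_in_rung // reduction_factor
```

Under these assumptions the default settings skip the 100-sample rung and evaluate candidates on 300, 900, 2700 and 8100 samples before the full 10,000.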
AsyncEA
class gama.search_methods.AsyncEA(population_size: Optional[int] = None, max_n_evaluations: Optional[int] = None, restart_callback: Optional[Callable[[], bool]] = None)
Perform asynchronous evolutionary optimization.
- Parameters
population_size (int, optional (default=50)) – Maximum number of individuals in the population at any time.
max_n_evaluations (int, optional (default=None)) – If specified, only a maximum of max_n_evaluations individuals are evaluated. If None, the algorithm will be run until interrupted by the user or a timeout.
restart_callback (Callable[[], bool], optional (default=None)) – Function which takes no arguments and returns True if the search should be restarted.
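The roles of the three parameters can be illustrated with a toy loop. The sketch below is deliberately sequential and optimizes a single float, so it is not the asynchronous algorithm itself and not GAMA's code; toy_evolution, its binary tournament and its Gaussian mutation are all illustrative assumptions.

```python
import random
from typing import Callable, List, Optional, Tuple


def toy_evolution(
    evaluate: Callable[[float], float],
    population_size: int = 50,
    max_n_evaluations: Optional[int] = None,
    restart_callback: Optional[Callable[[], bool]] = None,
    seed: int = 0,
) -> Tuple[List[Tuple[float, float]], int]:
    """Maximize `evaluate`; return the (score, candidate) population and the evaluation count."""
    if max_n_evaluations is None:
        raise ValueError("this sketch has no timeout, so it needs max_n_evaluations")
    rng = random.Random(seed)
    population: List[Tuple[float, float]] = []
    n_evaluations = 0
    while n_evaluations < max_n_evaluations:
        if restart_callback is not None and restart_callback():
            population = []  # restart: discard the current population
        if len(population) < 2:
            candidate = rng.uniform(-10, 10)  # random initial candidate
        else:
            parent = max(rng.sample(population, 2))[1]  # binary tournament
            candidate = parent + rng.gauss(0, 1)  # mutation
        population.append((evaluate(candidate), candidate))
        n_evaluations += 1
        if len(population) > population_size:
            population.remove(min(population))  # drop the worst individual
    return population, n_evaluations
```

The population never grows beyond population_size, the loop stops after max_n_evaluations evaluations, and a restart_callback that returns True wipes the population before the next candidate is created.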
Post-Processing
NoPostProcessing
BestFitPostProcessing
EnsemblePostProcessing
class gama.postprocessing.EnsemblePostProcessing(time_fraction: float = 0.3, ensemble_size: Optional[int] = 25, hillclimb_size: Optional[int] = 10000, max_models: Optional[int] = 200)
Ensemble construction per Caruana et al.
- Parameters
time_fraction (float (default=0.3)) – Fraction of total time reserved for Ensemble building.
ensemble_size (int, optional (default=25)) – Total number of models in the ensemble. When a single model is chosen more than once, its weight in the ensemble increases, and each repeated selection counts towards this maximum.
hillclimb_size (int, optional (default=10_000)) – Number of predictions that are used to determine the ensemble score during hillclimbing. If None, use all.
max_models (int, optional (default=200)) – Only consider the best max_models number of models. If None, use all. Consequently also sets the max number of unique models in the ensemble.
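The greedy selection with replacement that ensemble_size refers to can be sketched as follows. This is a simplified illustration in the spirit of Caruana et al., assuming regression-style predictions scored by mean squared error; select_ensemble is a hypothetical name, and the time_fraction, hillclimb_size and max_models limits are not modelled.

```python
from typing import Dict, Sequence


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def select_ensemble(
    predictions: Dict[str, Sequence[float]],
    y_true: Sequence[float],
    ensemble_size: int = 25,
) -> Dict[str, int]:
    """Greedy forward selection with replacement; returns model name -> weight."""
    weights: Dict[str, int] = {}
    running_sum = [0.0] * len(y_true)
    for n_selected in range(1, ensemble_size + 1):

        def loss_if_added(name: str) -> float:
            # Loss of the averaged ensemble if `name` were added once more.
            averaged = [
                (s + p) / n_selected
                for s, p in zip(running_sum, predictions[name])
            ]
            return mean_squared_error(y_true, averaged)

        best = min(predictions, key=loss_if_added)
        weights[best] = weights.get(best, 0) + 1  # repeated picks raise the weight
        running_sum = [s + p for s, p in zip(running_sum, predictions[best])]
    return weights
```

Because every pick counts towards ensemble_size, the returned weights always sum to ensemble_size, even when only a few distinct models are selected.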
Genetic Programming
Components
Defines the building blocks for Individuals.
Individuals represent machine learning pipelines in a back-end agnostic way.
An Individual can be converted to its back-end specific representation (e.g. a scikit-learn Pipeline) by calling its pipeline property, as long as a function has been provided to convert the individual to it.
Individuals are built with:
- Terminals. Definition of a specific value for a specific hyperparameter. Immutable.
- Primitives. Definition of a specific algorithm. Immutable. Defined by Terminal input, output type and operation.
- PrimitiveNodes. An instantiated Primitive with specific Terminals. Mutable for easy operations (e.g. mutation).
- Fitness. Stores information about the evaluation of the individual.
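The relationship between these building blocks can be sketched with plain dataclasses. This is a simplified model for illustration, not GAMA's class definitions: the field names, the primitives helper and the ValueError raised by pipeline are assumptions, and Fitness is left out.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Tuple, Union


@dataclass(frozen=True)
class Terminal:
    """A specific value for a specific hyperparameter (immutable)."""
    name: str
    value: Any


@dataclass(frozen=True)
class Primitive:
    """A specific algorithm, defined by its input, output type and operation."""
    input: Tuple[str, ...]  # names of the hyperparameters it expects
    output: str             # e.g. "data" or "prediction"
    operation: Callable     # callable that implements the algorithm


@dataclass
class PrimitiveNode:
    """A Primitive instantiated with specific Terminals (mutable)."""
    primitive: Primitive
    data_node: Union["PrimitiveNode", str]  # preceding step, or a marker for raw data
    terminals: List[Terminal] = field(default_factory=list)


@dataclass
class Individual:
    """A chain of PrimitiveNodes, starting at the estimator node."""
    main_node: PrimitiveNode
    to_pipeline: Optional[Callable[["Individual"], Any]] = None

    @property
    def primitives(self) -> List[PrimitiveNode]:
        """Nodes from the estimator back to the first preprocessing step."""
        nodes, node = [], self.main_node
        while isinstance(node, PrimitiveNode):
            nodes.append(node)
            node = node.data_node
        return nodes

    @property
    def pipeline(self) -> Any:
        if self.to_pipeline is None:
            raise ValueError("no to_pipeline function was provided")
        return self.to_pipeline(self)
```

Keeping Terminal and Primitive frozen while PrimitiveNode stays mutable mirrors the split described above: mutation only ever rewires or re-parameterizes nodes, never the definitions they point to.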
Individual
class gama.genetic_programming.components.Individual(main_node: gama.genetic_programming.components.primitive_node.PrimitiveNode, to_pipeline: Optional[Callable] = None)
Collection of PrimitiveNodes which together specify a machine learning pipeline.
- Parameters
main_node (PrimitiveNode) – The first node of the individual (the estimator node).
to_pipeline (Callable, optional (default=None)) – A function which can convert this individual into a machine learning pipeline. If not provided, the pipeline property will be unavailable.
Primitive
PrimitiveNode
class gama.genetic_programming.components.PrimitiveNode(primitive: gama.genetic_programming.components.primitive.Primitive, data_node: Union[gama.genetic_programming.components.primitive_node.PrimitiveNode, str], terminals: List[gama.genetic_programming.components.terminal.Terminal])
An instantiation of a Primitive with specific Terminals.
- Parameters
primitive (Primitive) – The Primitive type of this PrimitiveNode.
data_node (PrimitiveNode or str) – The PrimitiveNode that specifies all preprocessing before this PrimitiveNode.
terminals (List[Terminal]) – A list of terminals matching the primitive.
Mutation
Contains mutation functions for genetic programming. Each mutation function takes an individual and modifies it in-place.
gama.genetic_programming.mutation.mut_insert(individual: gama.genetic_programming.components.individual.Individual, primitive_set: dict) → None
Mutate an Individual in-place by inserting a PrimitiveNode at a random location.
The new PrimitiveNode will not be inserted as root node.
- Parameters
individual (Individual) – Individual to mutate in-place.
primitive_set (dict) –
gama.genetic_programming.mutation.mut_replace_primitive(individual: gama.genetic_programming.components.individual.Individual, primitive_set: dict) → None
Mutates an Individual in-place by replacing one of its Primitives.
- Parameters
individual (Individual) – Individual to mutate in-place.
primitive_set (dict) –
gama.genetic_programming.mutation.mut_replace_terminal(individual: gama.genetic_programming.components.individual.Individual, primitive_set: dict) → None
Mutates an Individual in-place by replacing one of its Terminals.
- Parameters
individual (Individual) – Individual to mutate in-place.
primitive_set (dict) –
gama.genetic_programming.mutation.mut_shrink(individual: gama.genetic_programming.components.individual.Individual, primitive_set: Optional[dict] = None, shrink_by: Optional[int] = None) → None
Mutates an Individual in-place by removing any number of primitive nodes.
Primitive nodes are removed from the preprocessing end.
- Parameters
individual (Individual) – Individual to mutate in-place.
primitive_set (dict, optional) – Not used. Present to create a matching function signature with other mutations.
shrink_by (int, optional (default=None)) – Number of primitives to remove. Must be at least one less than the number of primitives in individual, so that at least one primitive remains. If None, a random number of primitives is removed.
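The constraint on shrink_by is easiest to see on a flat list of step names. The sketch below assumes a pipeline is represented as a list whose first entry is the estimator, followed by preprocessing steps ordered towards the raw data; shrink_in_place is a hypothetical stand-in, not the library function.

```python
import random
from typing import List, Optional


def shrink_in_place(
    steps: List[str],
    shrink_by: Optional[int] = None,
    rng: Optional[random.Random] = None,
) -> None:
    """Remove `shrink_by` steps from the preprocessing end; steps[0] is the estimator."""
    n_steps = len(steps)
    if n_steps < 2:
        raise ValueError("nothing to remove: only the estimator is left")
    if shrink_by is None:
        rng = rng or random.Random()
        shrink_by = rng.randint(1, n_steps - 1)  # always keep the estimator
    if not 1 <= shrink_by < n_steps:
        raise ValueError(f"cannot shrink {n_steps} steps by {shrink_by}")
    del steps[n_steps - shrink_by:]  # drop the steps closest to the raw data
```

For a three-step pipeline, shrink_by may be 1 or 2; a value of 3 would also remove the estimator and is rejected.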
gama.genetic_programming.mutation.random_valid_mutation_in_place(individual: gama.genetic_programming.components.individual.Individual, primitive_set: dict, max_length: Optional[int] = None) → Callable
Apply a random valid mutation in place.
The random mutation can be one of:
- mut_replace_primitive
- mut_replace_terminal, if the individual has at least one Terminal
- mut_shrink, if the individual has at least two primitives
- mut_insert, if it would not exceed max_length when specified.
- Parameters
individual (Individual) – An individual to be mutated in-place.
primitive_set (dict) – A dictionary defining the set of primitives and terminals.
max_length (int, optional (default=None)) – If specified, impose a maximum length on the new individual.
- Returns
The mutation function used.
- Return type
Callable
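The selection rule above amounts to building a list of applicable mutations and drawing one at random. The sketch below works on counts rather than on real Individuals and returns function names as strings; valid_mutation_names and pick_mutation_name are hypothetical helpers, and the exact preconditions used by the library may differ.

```python
import random
from typing import List, Optional


def valid_mutation_names(
    n_primitives: int,
    n_terminals: int,
    max_length: Optional[int] = None,
) -> List[str]:
    """List the mutations whose preconditions hold for a pipeline of this shape."""
    names = ["mut_replace_primitive"]  # listed without a precondition
    if n_terminals >= 1:
        names.append("mut_replace_terminal")
    if n_primitives >= 2:
        names.append("mut_shrink")
    if max_length is None or n_primitives < max_length:
        names.append("mut_insert")  # inserting adds exactly one primitive
    return names


def pick_mutation_name(
    n_primitives: int,
    n_terminals: int,
    max_length: Optional[int] = None,
    rng: Optional[random.Random] = None,
) -> str:
    """Draw one applicable mutation uniformly at random."""
    rng = rng or random.Random()
    return rng.choice(valid_mutation_names(n_primitives, n_terminals, max_length))
```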
Crossover
Functions which take two Individuals and produce at least one new Individual.
gama.genetic_programming.crossover.crossover_primitives(ind1: gama.genetic_programming.components.individual.Individual, ind2: gama.genetic_programming.components.individual.Individual) → Tuple[gama.genetic_programming.components.individual.Individual, gama.genetic_programming.components.individual.Individual]
Crossover two individuals by exchanging any number of preprocessing steps.
- Parameters
ind1 (Individual) – The individual to crossover with ind2.
ind2 (Individual) – The individual to crossover with ind1.
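Exchanging preprocessing steps can be pictured as swapping the tails of two pipelines while each keeps its own estimator. The sketch assumes each pipeline is a list with the estimator first and preprocessing steps after it; crossover_steps and its random cut points are illustrative assumptions rather than the library's implementation.

```python
import random
from typing import List, Tuple


def crossover_steps(
    steps1: List[str],
    steps2: List[str],
    rng: random.Random,
) -> Tuple[List[str], List[str]]:
    """Swap preprocessing tails; index 0 (the estimator) is never exchanged."""
    if len(steps1) < 2 or len(steps2) < 2:
        raise ValueError("both pipelines need at least one preprocessing step")
    cut1 = rng.randint(1, len(steps1) - 1)  # first step that moves to child 2
    cut2 = rng.randint(1, len(steps2) - 1)  # first step that moves to child 1
    child1 = steps1[:cut1] + steps2[cut2:]
    child2 = steps2[:cut2] + steps1[cut1:]
    return child1, child2
```

Because the cut points are drawn independently, the children can differ in length from their parents, which is why a length limit has to be checked separately.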
gama.genetic_programming.crossover.crossover_terminals(ind1: gama.genetic_programming.components.individual.Individual, ind2: gama.genetic_programming.components.individual.Individual) → Tuple[gama.genetic_programming.components.individual.Individual, gama.genetic_programming.components.individual.Individual]
Crossover two individuals in-place by exchanging two Terminals.
Terminals must share output type but have different values.
- Parameters
ind1 (Individual) – The individual to crossover with ind2.
ind2 (Individual) – The individual to crossover with ind1.
gama.genetic_programming.crossover.random_crossover(ind1: gama.genetic_programming.components.individual.Individual, ind2: gama.genetic_programming.components.individual.Individual, max_length: Optional[int] = None) → Tuple[gama.genetic_programming.components.individual.Individual, gama.genetic_programming.components.individual.Individual]
Random valid crossover between two individuals in-place, if it can be done.
- Parameters
ind1 (Individual) – The individual to crossover with ind2.
ind2 (Individual) – The individual to crossover with ind1.
max_length (int, optional (default=None)) – The first individual in the returned tuple has at most max_length primitives. Requires both provided individuals to contain at most max_length primitives.
- Raises
If there is no valid crossover function for the two individuals.
If max_length is set and either ind1 or ind2 contains more primitives than max_length.
Utilities
Generic
Collection of generic components.
Pareto Front
class gama.utilities.generic.paretofront.ParetoFront(start_list: Optional[List[Any]] = None, get_values_fn: Optional[Callable[[Any], Tuple[Any, …]]] = None)
A list of tuples in which no one tuple is dominated by another.
- Parameters
start_list (list, optional (default=None)) – List of items of which to calculate the Pareto front.
get_values_fn (Callable, optional (default=None)) – Function that takes an item and returns a tuple of values, such that each should be maximized. If left None, it is assumed that items are already such tuples.
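The dominance rule behind a Pareto front, with every value maximized, can be sketched in a few lines. The functions dominates and pareto_front are illustrative helpers written for this example, not the ParetoFront class itself.

```python
from typing import Any, Callable, Iterable, List, Optional, Sequence


def dominates(a: Sequence[Any], b: Sequence[Any]) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(
    items: Iterable[Any],
    get_values_fn: Optional[Callable[[Any], Sequence[Any]]] = None,
) -> List[Any]:
    """Return the items that are not dominated by any other item."""
    if get_values_fn is None:
        get_values_fn = lambda item: item  # items are already value tuples
    front: List[Any] = []
    for item in items:
        values = get_values_fn(item)
        if any(dominates(get_values_fn(kept), values) for kept in front):
            continue  # an existing member is better or equal everywhere
        # The new item may push out members it dominates.
        front = [kept for kept in front if not dominates(values, get_values_fn(kept))]
        front.append(item)
    return front
```

For metrics that should be minimized, such as pipeline length, get_values_fn can negate the value so that larger is still better.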
Stopwatch
Timekeeper
class gama.utilities.generic.timekeeper.TimeKeeper(total_time: Optional[int] = None)
Simple object that helps keep track of time over multiple activities.
- Parameters
total_time (int, optional (default=None)) – The total time available across activities. If set to None, the total_time_remaining property will be unavailable.
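A minimal sketch of such a time tracker is shown below. The class SimpleTimeKeeper, its start_activity context manager and the RuntimeError raised when no budget is set are assumptions made for illustration; only total_time and total_time_remaining are taken from the description above.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, Optional


class SimpleTimeKeeper:
    """Track elapsed time per named activity against an optional total budget."""

    def __init__(self, total_time: Optional[int] = None):
        self.total_time = total_time
        self.elapsed_per_activity: Dict[str, float] = {}

    @contextmanager
    def start_activity(self, name: str) -> Iterator[None]:
        start = time.monotonic()
        try:
            yield
        finally:
            # Record the time even if the activity raised an exception.
            elapsed = time.monotonic() - start
            self.elapsed_per_activity[name] = (
                self.elapsed_per_activity.get(name, 0.0) + elapsed
            )

    @property
    def total_time_remaining(self) -> float:
        if self.total_time is None:
            raise RuntimeError("total_time_remaining needs total_time to be set")
        return self.total_time - sum(self.elapsed_per_activity.values())
```

A search phase could then be wrapped as `with keeper.start_activity("search"): ...`, after which total_time_remaining reports the budget left for post-processing.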
AsyncEvaluator
Warning
I’m sure there are better tools out there, but I have yet to find a minimal, easy multi-processing tool. I tried using the built-in ProcessPoolExecutor, but it had shortcomings such as not being able to cancel jobs while they were running.
class gama.utilities.generic.async_evaluator.AsyncEvaluator(n_workers: Optional[int] = None, memory_limit_mb: Optional[int] = None, logfile: Optional[str] = None, wait_time_before_forced_shutdown: int = 10)
Manages subprocesses on which arbitrary functions can be evaluated.
The function and all its arguments must be picklable. Using the same AsyncEvaluator in two different contexts raises a RuntimeError.
- defaults: Dict, optional (default=None)
Default parameter values shared between all submit calls. This allows these defaults to be transferred only once per process, instead of twice per call (to and from the subprocess). Only supports keyword arguments.
- Parameters
n_workers (int, optional (default=None)) – Maximum number of subprocesses to run for parallel evaluations. Defaults to AsyncEvaluator.n_jobs, using all cores unless overwritten.
memory_limit_mb (int, optional (default=None)) – The maximum number of megabytes that this process and its subprocesses may use in total. If None, no limit is enforced. There is no guarantee the limit is not violated.
logfile (str, optional (default=None)) – If set, recorded resource usage will be written to this file.
wait_time_before_forced_shutdown (int (default=10)) – Number of seconds to wait between asking the worker processes to shut down and terminating them forcefully if they failed to do so.