compactem.model_builder package

Package for the model builder base class, as well as implementations.

Submodules

compactem.model_builder.DecisionTreeModelBuilder module

class DecisionTree(complexity_param, *args, **kwargs)

Bases: compactem.model_builder.base_model.ModelBuilderBase

scikit-learn's decision tree is used.

The complexity parameter is the max_depth.

Parameters
  • complexity_param – max_depth of the tree, scalar int

  • args

  • kwargs
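
A minimal instantiation sketch (the max_depth value is made up for illustration):

    from compactem.model_builder.DecisionTreeModelBuilder import DecisionTree

    # build trees whose max_depth is fixed at 4
    builder = DecisionTree(complexity_param=4)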

fit_and_select_model(X, y, params, inside_optimizer_iteration=False, *args, **kwargs)

If inside_optimizer_iteration is set to True, a held-out set is used for model selection in the params search space; this is repeated a few times per param. If False, i.e., this function call occurs outside of the optimization step, we perform a CV-based grid search.

Parameters
  • X – 2D array to perform model selection on

  • y – corresponding labels

  • params – model parameter search space

  • inside_optimizer_iteration – boolean to indicate if function is called inside optimizer

  • args

  • kwargs

Returns

best model across params search space, parameters for this model
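
Continuing the sketch above, a call might look as follows; the exact format of params is whatever get_baseline_fit_params()/get_iteration_fit_params() return (per this page, min_impurity_decrease and max_depth), so the dict below is only an assumption, and X_train, y_train are illustrative names:

    # hypothetical search space over min_impurity_decrease for a fixed max_depth
    params = {"min_impurity_decrease": [0.0, 1e-4, 1e-3], "max_depth": [4]}
    best_model, best_params = builder.fit_and_select_model(X_train, y_train, params)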

get_avg_complexity(list_of_estimators, *args, **kwargs)

Average is defined as median depth cast to int.

Parameters
  • list_of_estimators – list of decision tree models

  • args

  • kwargs

Returns

median tree depth

get_baseline_fit_params(*args, **kwargs)
Parameters
  • args

  • kwargs

Returns

param search space with min_impurity_decrease and max_depth

get_complexity(estimator, *args, **kwargs)
Parameters
  • estimator – decision tree object

  • args

  • kwargs

Returns

depth of tree

static get_complexity_param_range(X, y, *args, **kwargs)

Performs a CV-based grid search up to a maximum depth.

Parameters
  • X – 2D array to perform model selection on

  • y – corresponding labels

  • args

  • kwargs

Returns

list of max_depths from 1…max_depth discovered

get_iteration_fit_params(*args, **kwargs)
Parameters
  • args

  • kwargs

Returns

param search space with min_impurity_decrease and max_depth

compactem.model_builder.GradientBoostingClassifier module

class GradientBoostingModel(complexity_param, *args, **kwargs)

Bases: compactem.model_builder.base_model.ModelBuilderBase

Note

Unlike Random Forest, the complexity in terms of actual tree depths cannot be computed since LightGBM does not expose that information: see here. The max_depth itself is returned as one of the complexity dimensions (the other being number of boosting rounds or trees).

We define a tuple as the complexity (max_depth, num_boosting_rounds). Additional keyword arguments may be provided. Currently supported:

  • categorical indices: LightGBM can treat dimensions as categorical; a list of categorical indices may be passed in.

  • learning_rate

Parameters
  • complexity_param – tuple (max_depth, num_boosting_rounds)

  • args

  • kwargs

    the following additional keyword arguments are supported:

    • categorical_idxs: LightGBM can treat dimensions as categorical; a list of categorical indices may be passed in. This maps to LightGBM's categorical_feature parameter. Default: 'auto'.

    • learning_rate: LightGBM’s parameter learning_rate. Default: 0.1.
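
A minimal instantiation sketch (the tuple and keyword values are made up for illustration):

    from compactem.model_builder.GradientBoostingClassifier import GradientBoostingModel

    # complexity is the tuple (max_depth, num_boosting_rounds)
    builder = GradientBoostingModel(complexity_param=(3, 200),
                                    categorical_idxs=[0, 5],
                                    learning_rate=0.1)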

fit_and_select_model(X, y, params, inside_optimizer_iteration=False, *args, **kwargs)

Note

num_threads should be set to 1 because of this issue.

Parameters
  • X – 2D array to perform model selection on

  • y – corresponding labels

  • params – model parameter search space

  • inside_optimizer_iteration – denotes if this call is from within the optimizer iterations

  • args

  • kwargs

Returns

LightGBMWrapper object, best_parameters in params search space

get_avg_complexity(list_of_estimators, *args, **kwargs)

median of max_depth, boosting rounds

Parameters
  • list_of_estimators – list of LightGBMWrapper objects

  • args

  • kwargs

Returns

median of max_depth, boosting rounds

get_baseline_fit_params()
Returns

dict of params, see code.

get_complexity(clf, *args, **kwargs)

Complexity is the max_depth and the best boosting iteration. The actual per-tree depths are not made available in the LightGBM API (see here), so the max_depth parameter (the same value supplied at initialization) is returned; max_depth is an upper bound on what we know about the depth complexity.

Both properties can be acquired from the object init properties, so this function just returns them.

Parameters
  • clf – LightGBMWrapper object

  • args

  • kwargs

Returns

max_depth, num_boosting_rounds

static get_complexity_param_range(X, y, *args, **kwargs)

We go over a range of max_depths with a fixed (very high) number of boosting rounds and early stopping. This avoids a combinatorial search of the space, since we get the “natural” number of rounds for a given max_depth.

Since this leads to volatility in the number of rounds discovered, i.e., the same max_depth might lead to (very) different numbers of boosting rounds, we train with the same max_depth multiple times and use the median number of rounds.

TODO: there is still some amount of volatility that needs to be revisited. Maybe we should group results that are close enough and pick the smallest complexity in the group with the highest representative accuracy?
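
The repeat-and-take-the-median idea can be sketched as follows; train_with_early_stopping is a hypothetical stand-in for the actual LightGBM training call with a very high round budget and early stopping:

    import numpy as np

    def natural_boosting_rounds(X, y, max_depth, n_repeats=5):
        # train_with_early_stopping is a hypothetical helper that returns the
        # number of boosting rounds early stopping settled on for one run
        rounds = [train_with_early_stopping(X, y, max_depth) for _ in range(n_repeats)]
        # the median damps the run-to-run volatility described above
        return int(np.median(rounds))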

Parameters
  • X – 2D array to perform model selection on

  • y – corresponding labels

  • args

  • kwargs

Returns

get_iteration_fit_params()
Returns

dict of params, see code.

class LightGBMWrapper

Bases: object

Wrapper around LightGBM. This is eventually used by the ModelBuilder.

fit(X, y, params, categorical_idxs='auto')
predict(X)
predict_proba(X)

Parameters

X

Returns

per-label probability values as a dict
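
A small usage sketch of the wrapper (the parameter dict, data names, and the no-argument constructor are assumptions):

    wrapper = LightGBMWrapper()
    wrapper.fit(X_train, y_train, params={"max_depth": 3}, categorical_idxs="auto")
    probs = wrapper.predict_proba(X_test)  # dict of per-label probabilities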

get_balanced_sample_weights(y)

This function computes sample weights such that all classes end up with the same total weight. We need this step since LightGBM doesn’t allow class weights.

Parameters

y – list of labels (we don’t need instances per se to calculate weights)

Returns

sample weights, list of floats with same length as y
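
One way to implement such balancing (a sketch, not necessarily the wrapper's exact code): give each class a total weight of len(y) / n_classes, spread uniformly over its members:

    import numpy as np

    def balanced_sample_weights(y):
        y = np.asarray(y)
        classes, counts = np.unique(y, return_counts=True)
        # every class ends up with the same total weight: len(y) / len(classes)
        per_class = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
        return np.array([per_class[label] for label in y])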

try_basic_functionality()

compactem.model_builder.LarsAndRidge module

class LarsAndRidgeBinary(non_zero_terms=1, alphas=None, balance=False, fit_ridge=False)

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

Objects of this class would be used by scikit's OneVsRest classifier.

Warning

I have stopped development of the fit_ridge option. Consider it not tested and marked for deprecation.

Parameters
  • non_zero_terms

  • alphas – ignored when fit_ridge=False

  • balance

  • fit_ridge – when False, LARS alone is used; this is obviously faster since there aren’t any Ridge fits

decision_function(X)

Needed by the one-vs-rest classifier

fit(X, y)
get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

mapping of string to any

predict(X)
set_params(**parameters)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

object

check_decisions_multiclass(decisons, y)

Performs some basic checks on the decision values returned.

Parameters
  • decisons

  • y

Returns

confidence_from_scaled_decisions(scaled_decisions, strategy='top_two')
test_LarsAndRidgeBinary_2class(use_ovr=False)

These are ‘visual’ tests. Manually look at the output.

Parameters

use_ovr – you can optionally do the binary test with the OVR too, just to check how the OVR works with binary.

Returns

test_LarsAndRidgeBinary_multiclass()

‘visual’ tests

Returns

compactem.model_builder.LinearProbabilityModel module

class LinearProbabilityModel(complexity_param, *args, **kwargs)

Bases: compactem.model_builder.base_model.ModelBuilderBase

Wrapper methods for building a Linear Probability Model (LPM), which has some advantages over logistic regression in terms of interpretability. General idea:

  • select given number of features using LARS

  • fit on features using Ridge (note: Ridge might be deprecated)

The complexity param here is the number of terms with non-zero coefficients in a LPM; if the dataset has n classes, a one-vs-all estimator is constructed, and the complexity param applies to each component classifier.

Parameters
  • complexity_param – number of terms with non-zero coefficients

  • args

  • kwargs
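
A minimal instantiation sketch (the complexity value is made up for illustration):

    from compactem.model_builder.LinearProbabilityModel import LinearProbabilityModel

    # allow at most 5 non-zero coefficients per one-vs-all component
    builder = LinearProbabilityModel(complexity_param=5)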

fit_and_select_model(X, y, params, inside_optimizer_iteration=False, *args, **kwargs)

No parameter search here, since the complexity param and the complexity are both the number of terms with non-zero coefficients, and it is expected that the complexity passed in is <= the unbounded complexity found via CV.

Parameters
  • X – 2D array to perform model selection on

  • y – corresponding labels

  • params – model parameter search space

  • inside_optimizer_iteration – boolean to indicate if function is called inside optimizer

  • args

  • kwargs

Returns

best model across params search space, parameters for this model

get_avg_complexity(list_of_estimators, *args, **kwargs)

Calculate “average” complexity across a bunch of ova LPMs: here the “average” is essentially a check that all ova LPMs have the same complexity: if not, this is an error.

Parameters
  • list_of_estimators – list of ova LPMs

  • args

  • kwargs

Returns

number of non-zero coefficients in the ova LPMs if they are identical; otherwise an error is raised

get_baseline_fit_params(*args, **kwargs)

Returns a search space for the ridge regression; doesn’t affect number of terms with non-zero coefficients. NOTE: support for ridge might be deprecated soon.

Parameters
  • args

  • kwargs

Returns

get_complexity(ova_estimator, *args, **kwargs)
Parameters
  • ova_estimator – one-vs-all LPM classifier

  • args

  • kwargs

Returns

complexity of ova estimator - this is computed as the number of non-zero coefficients per LPM in the one-vs-all classifier, and raises ValueError if this number is not identical across them

static get_complexity_param_range(X, y, *args, **kwargs)

Grid search via CV is performed to find the best number of non-zero coefficients in the range 1…np.shape(X)[1].

Parameters
  • X – 2D array to perform model selection on

  • y – corresponding labels

  • args

  • kwargs

Returns

list of non-zero coefficient counts, from 1 up to whatever grid search discovered

get_iteration_fit_params(*args, **kwargs)

Returns a search space for the ridge regression; doesn’t affect number of terms with non-zero coefficients. NOTE: support for ridge might be deprecated soon.

Parameters
  • args

  • kwargs

Returns

compactem.model_builder.RandomForestClassifier module

class RandomForest(complexity_param, *args, **kwargs)

Bases: compactem.model_builder.base_model.ModelBuilderBase

We define a tuple as the complexity (max_depth, n_estimators)

Parameters

complexity_param – tuple (max_depth, n_estimators)
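
A minimal instantiation sketch (the tuple values are made up for illustration):

    from compactem.model_builder.RandomForestClassifier import RandomForest

    # complexity is the tuple (max_depth, n_estimators)
    builder = RandomForest(complexity_param=(5, 100))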

fit_and_select_model(X, y, params, inside_optimizer_iteration=False, *args, **kwargs)

Only CV is performed, and based on whether the fit is inside the optimizer or outside it, we change the number of folds. Also note: in addition to the params passed in, this adds an additional search-space dimension, “max_features”. This is done here because we use Breiman’s suggestion, \(\log_2(\text{num_features}) + 1\), as a lower bound, which is data dependent; it is determined here based on the input X.

Parameters
  • X – 2D array to perform model selection on

  • y – corresponding labels

  • params – model parameter search space

  • inside_optimizer_iteration – denotes if this call is from within the optimizer iterations

Returns

scikit’s RandomForestClassifier object, best_parameters in params search space
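
The data-dependent lower bound mentioned above can be computed as follows (whether the implementation rounds exactly this way is an assumption):

    import numpy as np

    # Breiman's suggestion, used as the lower bound of the max_features search dimension;
    # X is the 2D training array passed to fit_and_select_model
    max_features_lower_bound = int(np.log2(X.shape[1])) + 1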

get_avg_complexity(list_of_estimators, *args, **kwargs)

We compute the median per complexity dimension separately.

Parameters

list_of_estimators – list of scikit’s RandomForestClassifier objects

Returns

median of the complexities i.e. median of Random Forest depths, and number of trees, as calculated by get_complexity()

get_baseline_fit_params(*args, **kwargs)
Parameters
  • args

  • kwargs

Returns

search space has exactly one value of max_depth and n_estimators - the complexity param defined for this object

get_complexity(estimator, *args, **kwargs)
Parameters
  • estimator – scikit’s RandomForestClassifier object

  • args

  • kwargs

Returns

median depth (can be float), number of trees

static get_complexity_param_range(X, y, hold_fixed=None, *args, **kwargs)

Provides the complexity range for a RF in terms of max_depth and n_estimators. For experiments where both parameters need not vary, one may specify which parameter to hold constant and at what value.
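
For example, to vary only n_estimators while pinning max_depth at 5 (the data array names are illustrative):

    complexity_range = RandomForest.get_complexity_param_range(
        X_train, y_train, hold_fixed={"max_depth": 5})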

Parameters
  • X – 2D array to perform model selection on

  • y – corresponding labels

  • hold_fixed – dict to specify which parameter to hold fixed. For ex, if we want to hold max_depth at 5, this argument should be {‘max_depth’: 5}

  • args

  • kwargs

Returns

get_iteration_fit_params(*args, **kwargs)
Parameters
  • args

  • kwargs

Returns

search space has exactly one value of max_depth and n_estimators - the complexity param defined for this object

calculate_RandomForest_complexity(estimator)

We calculate the complexity of a Random Forest as the (median of tree depths, number of trees). This logic doesn’t depend on the dataset, and defining it at the module level means we can reuse it in different places.

Parameters

estimator – scikit’s RandomForestClassifier object

Returns

median depth (can be float), number of trees
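
A sketch of this computation using scikit-learn's tree API (fitted decision trees expose get_depth()); whether it mirrors the module's exact code is an assumption:

    import numpy as np

    def random_forest_complexity(estimator):
        # estimator: a fitted sklearn.ensemble.RandomForestClassifier
        depths = [tree.get_depth() for tree in estimator.estimators_]
        return np.median(depths), len(estimator.estimators_)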

compactem.model_builder.base_model module

class ModelBuilderBase(complexity_param, *args, **kwargs)

Bases: object

Python’s enforcement of abstract classes is weak in the sense that derived classes don’t need to match the function signatures. The signatures provided herein should be used as “documentation” if you don’t want stuff to break.

Guiding principles:

  • abstract methods to be implemented in subclass.

  • methods that mention “Do not override in subclass.” in their docstring typically shouldn’t be overridden, unless you want to change some fundamental behavior.

  • everything else is optional, and whether to implement it should be decided based on the docstring.

The complexity parameter is fixed for an object of this class. In other words, an object of this class can build models only for a fixed complexity, controlled by the complexity_param.

Parameters

complexity_param – the param that decides the complexity of the model to be learned.
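
A bare-bones subclass sketch covering the abstract methods listed below; the wrapped estimator and the search-space format are illustrative choices, not something the base class prescribes:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from compactem.model_builder.base_model import ModelBuilderBase

    class MyTreeBuilder(ModelBuilderBase):
        def __init__(self, complexity_param, *args, **kwargs):
            super().__init__(complexity_param, *args, **kwargs)
            self.max_depth = complexity_param  # keep our own copy of the complexity

        def fit_and_select_model(self, X, y, params, inside_optimizer_iteration=False,
                                 *args, **kwargs):
            # params is whatever get_baseline_fit_params()/get_iteration_fit_params() return
            model = DecisionTreeClassifier(max_depth=self.max_depth, **params).fit(X, y)
            return model, params

        def get_complexity(self, estimator, *args, **kwargs):
            return estimator.get_depth()

        def get_avg_complexity(self, list_of_estimators, *args, **kwargs):
            return int(np.median([e.get_depth() for e in list_of_estimators]))

        def get_baseline_fit_params(self, *args, **kwargs):
            return {"min_impurity_decrease": 0.0}

        def get_iteration_fit_params(self, *args, **kwargs):
            return {"min_impurity_decrease": 0.0}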

data_split_resolver(dataset_identifier)

Do not override in subclass.

A convenience function that allows referring to splits by name; a helper for __resolve_datasets_for_fit_and_eval__(). If a tuple (X, y) of data is passed, it is transparently returned with no processing.

Parameters

dataset_identifier – can be a string (‘train’, ‘val’, ‘train_val’, ‘test’) or a tuple

Returns

data X, y

fit_and_evaluate(fit_on, eval_on, params, inside_optimizer_iteration=False, num_train_points=None, **kwargs)

Do not override in subclass.

This is a wrapper around fit_and_evaluate_on_data(), which the subclass must implement.

Parameters
  • fit_on – data to train on, can be a string (‘train’, ‘val’, ‘train_val’, ‘test’) or a tuple

  • eval_on – validation data, can be a string (‘train’, ‘val’, ‘train_val’, ‘test’) or a tuple

  • params – parameters for model selection

  • inside_optimizer_iteration – denotes if this call is from within the optimizer iterations; it might make sense to define the function’s behavior based on where it is called from

  • num_train_points – number of training points (stratified) to use from fit_on. If None use all points.

Returns

score on validation data, best model learned on fit_on data, best parameters from params
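
A sketch of a typical call, assuming load_data_splits() has already been invoked on the builder object (split names are the strings accepted by data_split_resolver() above):

    score, best_model, best_params = builder.fit_and_evaluate(
        fit_on="train", eval_on="val", params=builder.get_baseline_fit_params())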

abstract fit_and_select_model(X, y, params, inside_optimizer_iteration=False, *args, **kwargs)

This is the key model training function: it implements how a model is trained on a dataset, given a parameter range to search. Other functions in this class rely on this method.

Parameters
  • X – 2D array to perform model selection on

  • y – labels

  • params – param range to search; no fixed format, since the subclass decides how “params” is produced by other functions like get_baseline_fit_params(), which the subclass also implements. The two must be consistent.

  • inside_optimizer_iteration – denotes if this call is from within the optimizer iterations; it might make sense to define the function’s behavior based on where it is called from

Returns

best_model (must support predict()), best_params

Attention

Remember to handle cases where the data passed in might not be “proper”, e.g., the sample has points of only one label. This might happen when the optimizer is exploring the search space. Handle such cases by returning an accuracy of 0, so that the optimizer learns to avoid them.
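
Detecting the degenerate case is straightforward; how to make it surface as an accuracy of 0 is up to the subclass (the snippet below only shows the check):

    import numpy as np

    def is_degenerate_sample(y):
        # True when the sample contains fewer than two distinct labels
        return len(np.unique(y)) < 2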

fit_baseline_model(all_baselines=False, num_train_points=None)

Do not override in subclass.

Fit the baseline model on different splits. Reuse fit_and_evaluate() here.

Parameters
  • all_baselines – do we need all combinations of baseline models? Probably not, for practical use. Combinations here mean train on train and report the score on train, train on train and report on validation, etc. This option was used initially for research.

  • num_train_points – number of points to use from the training split

Returns

return various scores and models (consult source)

fit_model_within_iteration(X, y, **kwargs)

Do not override in subclass.

This fit method is invoked within the optimization loop. We reuse fit_and_evaluate() here.

Parameters
  • X – data generated with current density params

  • y – labels for X

  • **kwargs

    any other params to be passed on to fit_and_evaluate

Returns

scores on train, val and test; and the model fit on train.

abstract get_avg_complexity(list_of_estimators, *args, **kwargs)

Define how you would calculate the average complexity of estimators. For example, in the case of decision trees this could be the median decision tree depth.

Parameters

list_of_estimators

Returns

average of the complexities of the estimators

abstract get_baseline_fit_params(*args, **kwargs)

This function should return a range of parameters across which a model is to be selected as the baseline model. The format for this range is up to the user, since it is handled by fit_and_evaluate_on_data(), which also needs to be implemented in the subclass.

Returns

parameter range across which the best baseline model is to be picked

abstract get_complexity(estimator, *args, **kwargs)

Get the complexity of the model passed in.

Parameters

estimator – model whose complexity is to be computed

Returns

complexity value

static get_complexity_param_range(X, y, *args, **kwargs)

An implementation should return the range of sizes, as an iterable, that models need to be built for. This is helpful only if the user wants this range to be derived from some data (X, y), e.g., if they want to build models with complexity less than the natural optimal complexity of the model found via cross-validation. It is grouped here to keep things related to model building in one place.

This is not an object method because it provides the sizes/complexities that would be required to instantiate an object. It is intended as a convenience method and is optional to implement.

Parameters
  • X – 2D array of data based on which parameter range must be determined

  • y – corresponding labels

Returns

parameter range

abstract get_iteration_fit_params(*args, **kwargs)

This function should return the range of parameters across which a model is to be selected within an optimizer iteration. Ideally, this should be cheap to compute since it runs within an iteration. The format for this range is up to the user, since it is handled by fit_and_evaluate_on_data(), which also needs to be implemented in the subclass.

Returns

parameter range across which the best model within an optimizer iteration is to be picked

load_data_splits(X_train, y_train, X_train_val, y_train_val, X_val, y_val, X_test, y_test)

Do not override in subclass.

The data splits are assigned in one place so that the overhead of passing them into different function calls is avoided.

Parameters
  • X_train – 2D array denoting train data

  • y_train – corresponding train labels

  • X_train_val – 2D array denoting train+val data

  • y_train_val – corresponding train+val labels

  • X_val – 2D array denoting val data

  • y_val – corresponding val labels

  • X_test – 2D array denoting test data

  • y_test – corresponding test labels

Returns

None

static save_model(model, file_path_no_ext, *args, **kwargs)

If this is not implemented by a subclass, an attempt is made to save the model with pickle. This is not an object function, and is grouped here to keep all things related to model building in one place.

Parameters
  • model – model object to save

  • file_path_no_ext – the path where the model is to be saved, without the extension. The final path must be returned so it can be added to the result files. If a subclass implements this, it’s good practice to add an extension for usability.

Returns

path where file is saved
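
A subclass override might look like the sketch below (per the signature above it is static, it appends an extension, and it returns the final path); the .pkl extension is an illustrative choice:

    import pickle

    def save_model(model, file_path_no_ext, *args, **kwargs):
        # add an extension for usability, persist with pickle, return the final path
        path = file_path_no_ext + ".pkl"
        with open(path, "wb") as f:
            pickle.dump(model, f)
        return path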