compactem.model_builder package

Package for the model builder base class, as well as implementations.

Submodules

compactem.model_builder.DecisionTreeModelBuilder module

class DecisionTree(complexity_param, *args, **kwargs)

Bases: compactem.model_builder.base_model.ModelBuilderBase

scikit-learn's decision tree is used.

The complexity parameter is the max_depth.

Parameters
  • complexity_param – max_depth of the tree, scalar int

  • args

  • kwargs
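
A minimal instantiation sketch (the max_depth value is made up for illustration):

    from compactem.model_builder.DecisionTreeModelBuilder import DecisionTree

    # build trees whose max_depth is fixed at 4
    builder = DecisionTree(complexity_param=4)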

fit_and_select_model(X, y, params, inside_optimizer_iteration=False, *args, **kwargs)

If inside_optimizer_iteration is set to True, a held-out set is used for model selection in the params search space; this is repeated a few times per param. If False, i.e., this function call occurs outside of the optimization step, we perform a CV-based grid search.

Parameters
  • X – 2D array to perform model selection on

  • y – corresponding labels

  • params – model parameter search space

  • inside_optimizer_iteration – boolean to indicate if function is called inside optimizer

  • args

  • kwargs

Returns

best model across params search space, parameters for this model
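
Continuing the sketch above, a call might look as follows; the exact format of params is whatever get_baseline_fit_params()/get_iteration_fit_params() return (per this page, min_impurity_decrease and max_depth), so the dict below is only an assumption, and X_train, y_train are illustrative names:

    # hypothetical search space over min_impurity_decrease for a fixed max_depth
    params = {"min_impurity_decrease": [0.0, 1e-4, 1e-3], "max_depth": [4]}
    best_model, best_params = builder.fit_and_select_model(X_train, y_train, params)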

get_avg_complexity(list_of_estimators, *args, **kwargs)

Average is defined as median depth cast to int.

Parameters
  • list_of_estimators – list of decision tree models

  • args

  • kwargs

Returns

median tree depth

get_baseline_fit_params(*args, **kwargs)
Parameters
  • args

  • kwargs

Returns

param search space with min_impurity_decrease and max_depth

get_complexity(estimator, *args, **kwargs)
Parameters
  • estimator – decision tree object

  • args

  • kwargs

Returns

depth of tree

static get_complexity_param_range(X, y, *args, **kwargs)

Performs a CV-based grid search up to a maximum depth.

Parameters
  • X – 2D array to perform model selection on

  • y – corresponding labels

  • args

  • kwargs

Returns

list of max_depths from 1…max_depth discovered

get_iteration_fit_params(*args, **kwargs)
Parameters
  • args

  • kwargs

Returns

param search space with min_impurity_decrease and max_depth

compactem.model_builder.GradientBoostingClassifier module

class GradientBoostingModel(complexity_param, *args, **kwargs)

Bases: compactem.model_builder.base_model.ModelBuilderBase

Note

Unlike Random Forest, the complexity in terms of actual tree depths cannot be computed since LightGBM does not expose that information: see here. The max_depth itself is returned as one of the complexity dimensions (the other being number of boosting rounds or trees).

We define a tuple as the complexity (max_depth, num_boosting_rounds). Additional keyword arguments may be provided. Currently supported:

  • categorical indices: LightGBM can treat dimensions as categorical; a list of categorical indices may be passed in.

  • learning_rate

Parameters
  • complexity_param – tuple (max_depth, num_boosting_rounds)

  • args

  • kwargs

    the following additional keyword arguments are supported:

    • categorical_idxs: LightGBM can treat dimensions as categorical; a list of categorical indices may be passed in. This maps to LightGBM's categorical_feature parameter. Default: 'auto'.

    • learning_rate: LightGBM’s parameter learning_rate. Default: 0.1.
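
A minimal instantiation sketch (the tuple and keyword values are made up for illustration):

    from compactem.model_builder.GradientBoostingClassifier import GradientBoostingModel

    # complexity is the tuple (max_depth, num_boosting_rounds)
    builder = GradientBoostingModel(complexity_param=(3, 200),
                                    categorical_idxs=[0, 5],
                                    learning_rate=0.1)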

fit_and_select_model(X, y, params, inside_optimizer_iteration=False, *args, **kwargs)

Note

num_threads should be set to 1 because of this issue.

Parameters
  • X – 2D array to perform model selection on

  • y – corresponding labels

  • params – model parameter search space

  • inside_optimizer_iteration – denotes if this call is from within the optimizer iterations

  • args

  • kwargs

Returns

LightGBMWrapper object, best_parameters in params search space

get_avg_complexity(list_of_estimators, *args, **kwargs)

median of max_depth, boosting rounds

Parameters
  • list_of_estimators – list of LightGBMWrapper objects

  • args

  • kwargs

Returns

median of max_depth, boosting rounds

get_baseline_fit_params()
Returns

dict of params, see code.

get_complexity(clf, *args, **kwargs)

Complexity is the max_depth and the best boosting iteration. The actual per-tree depths are not made available in the LightGBM API (see here), so the max_depth parameter (the same value supplied at initialization) is returned; max_depth is an upper bound on what we know about the depth complexity.

Both properties can be acquired from the object init properties, so this function just returns them.

Parameters
  • clf – LightGBMWrapper object

  • args

  • kwargs

Returns

max_depth, num_boosting_rounds

static get_complexity_param_range(X, y, *args, **kwargs)

We go over a range of max_depths with a fixed (very high) number of boosting rounds and early stopping. This avoids a combinatorial search of the space, since we get the “natural” number of rounds for a given max_depth.

Since this leads to volatility in the number of rounds discovered, i.e., the same max_depth might lead to (very) different numbers of boosting rounds, we train with the same max_depth multiple times and use the median number of rounds.

TODO: there is still some amount of volatility that needs to be revisited. Maybe we should group results that are close enough and pick the smallest complexity in the group with the highest representative accuracy?
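
The repeat-and-take-the-median idea can be sketched as follows; train_with_early_stopping is a hypothetical stand-in for the actual LightGBM training call with a very high round budget and early stopping:

    import numpy as np

    def natural_boosting_rounds(X, y, max_depth, n_repeats=5):
        # train_with_early_stopping is a hypothetical helper that returns the
        # number of boosting rounds early stopping settled on for one run
        rounds = [train_with_early_stopping(X, y, max_depth) for _ in range(n_repeats)]
        # the median damps the run-to-run volatility described above
        return int(np.median(rounds))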

Parameters
  • X – 2D array to perform model selection on

  • y – corresponding labels

  • args

  • kwargs

Returns

get_iteration_fit_params()
Returns

dict of params, see code.

class LightGBMWrapper

Bases: object

Wrapper around LightGBM. This is eventually used by the ModelBuilder.

fit(X, y, params, categorical_idxs='auto')
predict(X)
predict_proba(X)

Parameters

X

Returns

per-label probability values as a dict
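
A small usage sketch of the wrapper (the parameter dict, data names, and the no-argument constructor are assumptions):

    wrapper = LightGBMWrapper()
    wrapper.fit(X_train, y_train, params={"max_depth": 3}, categorical_idxs="auto")
    probs = wrapper.predict_proba(X_test)  # dict of per-label probabilities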

get_balanced_sample_weights(y)

This function computes sample weights such that all classes end up with the same total weight. We need this step since LightGBM doesn’t allow class weights.

Parameters

y – list of labels (we don’t need instances per se to calculate weights)

Returns

sample weights, list of floats with same length as y
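
One way to implement such balancing (a sketch, not necessarily the wrapper's exact code): give each class a total weight of len(y) / n_classes, spread uniformly over its members:

    import numpy as np

    def balanced_sample_weights(y):
        y = np.asarray(y)
        classes, counts = np.unique(y, return_counts=True)
        # every class ends up with the same total weight: len(y) / len(classes)
        per_class = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
        return np.array([per_class[label] for label in y])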

try_basic_functionality()

compactem.model_builder.LarsAndRidge module

class LarsAndRidgeBinary(non_zero_terms=1, alphas=None, balance=False, fit_ridge=False)

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

Objects of this class would be used by scikit's OneVsRest classifier.

Warning

I have stopped development of the fit_ridge option. Consider it not tested and marked for deprecation.

Parameters
  • non_zero_terms

  • alphas – ignored when fit_ridge=False

  • balance

  • fit_ridge – when False, LARS alone is used; this is obviously faster since there aren’t any Ridge fits

decision_function(X)

Needed by the one-vs-rest classifier

fit(X, y)
get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

mapping of string to any

predict(X)
set_params(**parameters)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

object

check_decisions_multiclass(decisons, y)

Performs some basic checks on the decision values returned.

Parameters
  • decisons

  • y

Returns

confidence_from_scaled_decisions(scaled_decisions, strategy='top_two')
test_LarsAndRidgeBinary_2class(use_ovr=False)

These are ‘visual’ tests. Manually look at the output.

Parameters

use_ovr – you can optionally do the binary test with the OVR too, just to check how the OVR works with binary.

Returns

test_LarsAndRidgeBinary_multiclass()

‘visual’ tests

Returns

compactem.model_builder.LinearProbabilityModel module

class LinearProbabilityModel(complexity_param, *args, **kwargs)

Bases: compactem.model_builder.base_model.ModelBuilderBase

Wrapper methods for building a Linear Probability Model (LPM), which has some advantages over logistic regression in terms of interpretability. General idea:

  • select given number of features using LARS

  • fit on features using Ridge (note: Ridge might be deprecated)

The complexity param here is the number of terms with non-zero coefficients in a LPM; if the dataset has n classes, a one-vs-all estimator is constructed, and the complexity param applies to each component classifier.

Parameters
  • complexity_param – number of terms with non-zero coefficients

  • args

  • kwargs
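
A minimal instantiation sketch (the complexity value is made up for illustration):

    from compactem.model_builder.LinearProbabilityModel import LinearProbabilityModel

    # allow at most 5 non-zero coefficients per one-vs-all component
    builder = LinearProbabilityModel(complexity_param=5)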

fit_and_select_model(X, y, params, inside_optimizer_iteration=False, *args, **kwargs)

No parameter search here, since the complexity param and the complexity are both the number of terms with non-zero coefficients, and it is expected that the complexity passed in is <= the unbounded complexity found via CV.

Parameters
  • X – 2D array to perform model selection on

  • y – corresponding labels

  • params – model parameter search space

  • inside_optimizer_iteration – boolean to indicate if function is called inside optimizer

  • args

  • kwargs

Returns

best model across params search space, parameters for this model

get_avg_complexity(list_of_estimators, *args, **kwargs)

Calculate “average” complexity across a bunch of ova LPMs: here the “average” is essentially a check that all ova LPMs have the same complexity: if not, this is an error.

Parameters
  • list_of_estimators – list of ova LPMs

  • args

  • kwargs

Returns

number of non-zero coefficients in the ova LPMs if they are identical; otherwise an error is raised

get_baseline_fit_params(*args, **kwargs)

Returns a search space for the ridge regression; doesn’t affect number of terms with non-zero coefficients. NOTE: support for ridge might be deprecated soon.

Parameters
  • args

  • kwargs

Returns

get_complexity(ova_estimator, *args, **kwargs)
Parameters
  • ova_estimator – one-vs-all LPM classifier

  • args

  • kwargs

Returns

complexity of ova estimator - this is computed as the number of non-zero coefficients per LPM in the one-vs-all classifier, and raises ValueError if this number is not identical across them

static get_complexity_param_range(X, y, *args, **kwargs)

Grid search via CV is performed to find the best number of non-zero coefficients in the range 1…np.shape(X)[1].

Parameters
  • X – 2D array to perform model selection on

  • y – corresponding labels

  • args

  • kwargs

Returns

list of non-zero coefficient counts, from 1 up to whatever grid search discovered

get_iteration_fit_params(*args, **kwargs)

Returns a search space for the ridge regression; doesn’t affect number of terms with non-zero coefficients. NOTE: support for ridge might be deprecated soon.

Parameters
  • args

  • kwargs

Returns

compactem.model_builder.RandomForestClassifier module

class RandomForest(complexity_param, *args, **kwargs)

Bases: compactem.model_builder.base_model.ModelBuilderBase

We define a tuple as the complexity (max_depth, n_estimators)

Parameters

complexity_param – tuple (max_depth, n_estimators)
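
A minimal instantiation sketch (the tuple values are made up for illustration):

    from compactem.model_builder.RandomForestClassifier import RandomForest

    # complexity is the tuple (max_depth, n_estimators)
    builder = RandomForest(complexity_param=(5, 100))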

fit_and_select_model(X, y, params, inside_optimizer_iteration=False, *args, **kwargs)

Only CV is performed, and based on whether the fit is inside the optimizer or outside it, we change the number of folds. Also note: in addition to the params passed in, this adds an additional search-space dimension, “max_features”. This is done here because we use Breiman’s suggestion, \(\log_2(\text{num_features}) + 1\), as a lower bound, which is data dependent; it is determined here based on the input X.

Parameters
  • X – 2D array to perform model selection on

  • y – corresponding labels

  • params – model parameter search space

  • inside_optimizer_iteration – denotes if this call is from within the optimizer iterations

Returns

scikit’s RandomForestClassifier object, best_parameters in params search space
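
The data-dependent lower bound mentioned above can be computed as follows (whether the implementation rounds exactly this way is an assumption):

    import numpy as np

    # Breiman's suggestion, used as the lower bound of the max_features search dimension;
    # X is the 2D training array passed to fit_and_select_model
    max_features_lower_bound = int(np.log2(X.shape[1])) + 1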

get_avg_complexity(list_of_estimators, *args, **kwargs)

We compute the median per complexity dimension separately.

Parameters

list_of_estimators – list of scikit’s RandomForestClassifier objects

Returns

median of the complexities i.e. median of Random Forest depths, and number of trees, as calculated by get_complexity()

get_baseline_fit_params(*args, **kwargs)
Parameters
  • args

  • kwargs

Returns

search space has exactly one value of max_depth and n_estimators - the complexity param defined for this object

get_complexity(estimator, *args, **kwargs)
Parameters
  • estimator – scikit’s RandomForestClassifier object

  • args

  • kwargs

Returns

median depth (can be float), number of trees

static get_complexity_param_range(X, y, hold_fixed=None, *args, **kwargs)

Provides the complexity range for a RF in terms of max_depth and n_estimators. For experiments where both parameters need not vary, one may specify which parameter to hold constant and at what value.
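
For example, to vary only n_estimators while pinning max_depth at 5 (the data array names are illustrative):

    complexity_range = RandomForest.get_complexity_param_range(
        X_train, y_train, hold_fixed={"max_depth": 5})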

Parameters
  • X – 2D array to perform model selection on

  • y – corresponding labels

  • hold_fixed – dict to specify which parameter to hold fixed. For ex, if we want to hold max_depth at 5, this argument should be {‘max_depth’: 5}

  • args

  • kwargs

Returns

get_iteration_fit_params(*args, **kwargs)
Parameters
  • args

  • kwargs

Returns

search space has exactly one value of max_depth and n_estimators - the complexity param defined for this object

calculate_RandomForest_complexity(estimator)

We calculate the complexity of a Random Forest as the (median of tree depths, number of trees). This logic doesn’t depend on the dataset, and defining it at the module level means we can reuse it in different places.

Parameters

estimator – scikit’s RandomForestClassifier object

Returns

median depth (can be float), number of trees
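
A sketch of this computation using scikit-learn's tree API (fitted decision trees expose get_depth()); whether it mirrors the module's exact code is an assumption:

    import numpy as np

    def random_forest_complexity(estimator):
        # estimator: a fitted sklearn.ensemble.RandomForestClassifier
        depths = [tree.get_depth() for tree in estimator.estimators_]
        return np.median(depths), len(estimator.estimators_)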

compactem.model_builder.base_model module

class ModelBuilderBase(complexity_param, *args, **kwargs)

Bases: object

Python’s enforcement of abstract classes is weak in the sense that derived classes don’t need to match the function signatures. The signatures provided herein should be used as “documentation” if you don’t want stuff to break.

Guiding principles:

  • abstract methods to be implemented in subclass.

  • methods that mention “Do not override in subclass.” in their docstring typically shouldn’t be overridden, unless you want to change some fundamental behavior.

  • everything else is optional, and whether to implement it should be decided based on the docstring.

The complexity parameter is fixed for an object of this class. In other words, an object of this class can build models only for a fixed complexity, controlled by the complexity_param.

Parameters

complexity_param – the param that decides the complexity of the model to be learned.
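
A bare-bones subclass sketch covering the abstract methods listed below; the wrapped estimator and the search-space format are illustrative choices, not something the base class prescribes:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from compactem.model_builder.base_model import ModelBuilderBase

    class MyTreeBuilder(ModelBuilderBase):
        def __init__(self, complexity_param, *args, **kwargs):
            super().__init__(complexity_param, *args, **kwargs)
            self.max_depth = complexity_param  # keep our own copy of the complexity

        def fit_and_select_model(self, X, y, params, inside_optimizer_iteration=False,
                                 *args, **kwargs):
            # params is whatever get_baseline_fit_params()/get_iteration_fit_params() return
            model = DecisionTreeClassifier(max_depth=self.max_depth, **params).fit(X, y)
            return model, params

        def get_complexity(self, estimator, *args, **kwargs):
            return estimator.get_depth()

        def get_avg_complexity(self, list_of_estimators, *args, **kwargs):
            return int(np.median([e.get_depth() for e in list_of_estimators]))

        def get_baseline_fit_params(self, *args, **kwargs):
            return {"min_impurity_decrease": 0.0}

        def get_iteration_fit_params(self, *args, **kwargs):
            return {"min_impurity_decrease": 0.0}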

data_split_resolver(dataset_identifier)

Do not override in subclass.

A convenience function that allows referring to splits by name; a helper for __resolve_datasets_for_fit_and_eval__(). If a tuple (X, y) of data is passed, it is transparently returned with no processing.

Parameters

dataset_identifier – can be a string (‘train’, ‘val’, ‘train_val’, ‘test’) or a tuple

Returns

data X, y

fit_and_evaluate(fit_on, eval_on, params, inside_optimizer_iteration=False, num_train_points=None, **kwargs)

Do not override in subclass.

This is a wrapper around fit_and_evaluate_on_data(), which the subclass must implement.

Parameters
  • fit_on – data to train on, can be a string (‘train’, ‘val’, ‘train_val’, ‘test’) or a tuple

  • eval_on – validation data, can be a string (‘train’, ‘val’, ‘train_val’, ‘test’) or a tuple

  • params – parameters for model selection

  • inside_optimizer_iteration – denotes if this call is from within the optimizer iterations; it might make sense to define the function’s behavior based on where it is called from

  • num_train_points – number of training points (stratified) to use from fit_on. If None use all points.

Returns

score on validation data, best model learned on fit_on data, best parameters from params
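
A sketch of a typical call, assuming load_data_splits() has already been invoked on the builder object (split names are the strings accepted by data_split_resolver() above):

    score, best_model, best_params = builder.fit_and_evaluate(
        fit_on="train", eval_on="val", params=builder.get_baseline_fit_params())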

abstract fit_and_select_model(X, y, params, inside_optimizer_iteration=False, *args, **kwargs)

This is the key model training function: it implements how a model is trained on a dataset, given a parameter range to search. Other functions in this class rely on this method.

Parameters
  • X – 2D array to perform model selection on

  • y – labels

  • params – param range to search; no fixed format, since the subclass decides how “params” is produced by other functions like get_baseline_fit_params(), which the subclass also implements. The two must be consistent.

  • inside_optimizer_iteration – denotes if this call is from within the optimizer iterations; it might make sense to define the function’s behavior based on where it is called from

Returns

best_model (must support predict()), best_params

Attention

Remember to handle cases where the data passed in might not be “proper”, e.g., the sample has points of only one label. This might happen when the optimizer is exploring the search space. Handle such cases by returning an accuracy of 0, so that the optimizer learns to avoid them.
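
Detecting the degenerate case is straightforward; how to make it surface as an accuracy of 0 is up to the subclass (the snippet below only shows the check):

    import numpy as np

    def is_degenerate_sample(y):
        # True when the sample contains fewer than two distinct labels
        return len(np.unique(y)) < 2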

fit_baseline_model(all_baselines=False, num_train_points=None)

Do not override in subclass.

Fit the baseline model on different splits. Reuse fit_and_evaluate() here.

Parameters
  • all_baselines – do we need all combinations of baseline models? Probably not, for practical use. Combinations here mean train on train and report the score on train, train on train and report on validation, etc. This option was used initially for research.

  • num_train_points – number of points to use from the training split

Returns

return various scores and models (consult source)

fit_model_within_iteration(X, y, **kwargs)

Do not override in subclass.

This fit method is invoked within the optimization loop. We reuse fit_and_evaluate() here.

Parameters
  • X – data generated with current density params

  • y – labels for X

  • **kwargs

    any other params to be passed on to fit_and_evaluate

Returns

scores on train, val and test; and the model fit on train.

abstract get_avg_complexity(list_of_estimators, *args, **kwargs)

Define how you would calculate the average complexity of estimators. For example, in the case of decision trees this could be the median decision tree depth.

Parameters

list_of_estimators

Returns

average of the complexities of the estimators

abstract get_baseline_fit_params(*args, **kwargs)

This function should return a range of parameters across which a model is to be selected as the baseline model. The format for this range is up to the user, since it is handled by fit_and_evaluate_on_data(), which also needs to be implemented in the subclass.

Returns

parameter range across which the best baseline model is to be picked

abstract get_complexity(estimator, *args, **kwargs)

Get the complexity of the model passed in.

Parameters

estimator – model whose complexity is to be computed

Returns

complexity value

static get_complexity_param_range(X, y, *args, **kwargs)

An implementation should return the range of sizes, as an iterable, that models need to be built for. This is helpful only if the user wants this range to be derived from some data (X, y), e.g., if they want to build models with complexity less than the natural optimal complexity of the model found via cross-validation. It is grouped here to keep things related to model building in one place.

This is not an object method because it provides the sizes/complexities that would be required to instantiate an object. It is intended as a convenience method and is optional to implement.

Parameters
  • X – 2D array of data based on which parameter range must be determined

  • y – corresponding labels

Returns

parameter range

abstract get_iteration_fit_params(*args, **kwargs)

This function should return the range of parameters across which a model is to be selected within an optimizer iteration. Ideally, this should be cheap to compute since it runs within an iteration. The format for this range is up to the user, since it is handled by fit_and_evaluate_on_data(), which also needs to be implemented in the subclass.

Returns

parameter range across which the best model within an optimizer iteration is to be picked

load_data_splits(X_train, y_train, X_train_val, y_train_val, X_val, y_val, X_test, y_test)

Do not override in subclass.

The data splits are assigned in one place so that the overhead of passing them into different function calls is avoided.

Parameters
  • X_train – 2D array denoting train data

  • y_train – corresponding train labels

  • X_train_val – 2D array denoting train+val data

  • y_train_val – corresponding train+val labels

  • X_val – 2D array denoting val data

  • y_val – corresponding val labels

  • X_test – 2D array denoting test data

  • y_test – corresponding test labels

Returns

None

static save_model(model, file_path_no_ext, *args, **kwargs)

If this is not implemented by a subclass, an attempt is made to save the model with pickle. This is not an object function, and is grouped here to keep all things related to model building in one place.

Parameters
  • model – model object to save

  • file_path_no_ext – the path where the model is to be saved, without the extension. The final path must be returned so it can be added to the result files. If a subclass implements this, it’s good practice to add an extension for usability.

Returns

path where file is saved
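
A subclass override might look like the sketch below (per the signature above it is static, it appends an extension, and it returns the final path); the .pkl extension is an illustrative choice:

    import pickle

    def save_model(model, file_path_no_ext, *args, **kwargs):
        # add an extension for usability, persist with pickle, return the final path
        path = file_path_no_ext + ".pkl"
        with open(path, "wb") as f:
            pickle.dump(model, f)
        return path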