compactem.model_builder package
Package for the model builder base class, as well as implementations.
Submodules
compactem.model_builder.DecisionTreeModelBuilder module
- class DecisionTree(complexity_param, *args, **kwargs)
Bases:
compactem.model_builder.base_model.ModelBuilderBase
scikit-learn's decision tree is used.
The complexity parameter is the max_depth.
- Parameters
complexity_param – max_depth of the tree, scalar int
args –
kwargs –
- fit_and_select_model(X, y, params, inside_optimizer_iteration=False, *args, **kwargs)
If inside_optimizer_iteration is set to True, a held-out set is used for model selection over the params search space; this is repeated a few times per param. If False, i.e., this function call occurs outside of the optimization step, we perform a CV-based grid search.
- Parameters
X – 2D array to perform model selection on
y – corresponding labels
params – model parameter search space
inside_optimizer_iteration – boolean to indicate if function is called inside optimizer
args –
kwargs –
- Returns
best model across params search space, parameters for this model
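For orientation, a minimal usage sketch based on the signatures documented in this module (the module layout follows the headings above; the synthetic data and the printed values are illustrative assumptions, not output from the library):

    # Hedged sketch: build and select a depth-bounded tree via the builder API.
    from sklearn.datasets import make_classification
    from compactem.model_builder.DecisionTreeModelBuilder import DecisionTree

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    builder = DecisionTree(complexity_param=4)   # complexity = max_depth
    params = builder.get_baseline_fit_params()   # min_impurity_decrease/max_depth space
    # outside the optimizer, so a CV-based grid search is performed
    best_model, best_params = builder.fit_and_select_model(X, y, params)
    print(builder.get_complexity(best_model))    # tree depth, expected <= 4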
- get_avg_complexity(list_of_estimators, *args, **kwargs)
Average is defined as median depth cast to int.
- Parameters
list_of_estimators – list of decision tree models
args –
kwargs –
- Returns
median tree depth
- get_baseline_fit_params(*args, **kwargs)
- Parameters
args –
kwargs –
- Returns
param search space with min_impurity_decrease and max_depth
- get_complexity(estimator, *args, **kwargs)
- Parameters
estimator – decision tree object
args –
kwargs –
- Returns
depth of tree
- static get_complexity_param_range(X, y, *args, **kwargs)
Performs a CV-based grid search up to a maximum depth.
- Parameters
X – 2D array to perform model selection on
y – corresponding labels
args –
kwargs –
- Returns
list of max_depths from 1 up to the discovered maximum depth
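The discovered range can then seed one builder per complexity, along the lines of this sketch:

    from sklearn.datasets import make_classification
    from compactem.model_builder.DecisionTreeModelBuilder import DecisionTree

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    # one builder per discovered max_depth, e.g. 1..d_max
    builders = [DecisionTree(complexity_param=d)
                for d in DecisionTree.get_complexity_param_range(X, y)]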
- get_iteration_fit_params(*args, **kwargs)
- Parameters
args –
kwargs –
- Returns
param search space with min_impurity_decrease and max_depth
compactem.model_builder.GradientBoostingClassifier module
- class GradientBoostingModel(complexity_param, *args, **kwargs)
Bases:
compactem.model_builder.base_model.ModelBuilderBase
Note
Unlike Random Forest, the complexity in terms of actual tree depths cannot be computed since LightGBM does not expose that information: see here. The max_depth itself is returned as one of the complexity dimensions (the other being number of boosting rounds or trees).
We define the complexity as a tuple (max_depth, num_boosting_rounds). Additional keyword arguments may be provided. Currently supported:
categorical indices: LightGBM can treat dimensions as categorical; a list of categorical indices may be passed in.
learning_rate
- Parameters
complexity_param – tuple (max_depth, num_boosting_rounds)
args –
kwargs –
the following additional keyword arguments are supported:
categorical_idxs: LightGBM can treat dimensions as categorical; a list of categorical indices may be passed in. This is its categorical_feature parameter. Default: 'auto'.
learning_rate: LightGBM's learning_rate parameter. Default: 0.1.
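An instantiation sketch following the signature above (the tuple values, column indices, and learning rate are illustrative):

    from compactem.model_builder.GradientBoostingClassifier import GradientBoostingModel

    builder = GradientBoostingModel(
        complexity_param=(4, 100),   # (max_depth, num_boosting_rounds)
        categorical_idxs=[0, 3],     # passed to LightGBM's categorical_feature
        learning_rate=0.1,           # LightGBM's learning_rate (default 0.1)
    )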
- fit_and_select_model(X, y, params, inside_optimizer_iteration=False, *args, **kwargs)
Note
num_threads should be set to 1 because of this issue.
- Parameters
X – 2D array to perform model selection on
y – corresponding labels
params – model parameter search space
inside_optimizer_iteration – denotes whether this call is from within the optimizer iterations
args –
kwargs –
- Returns
LightGBMWrapper object, best_parameters in params search space
- get_avg_complexity(list_of_estimators, *args, **kwargs)
median of max_depth, boosting rounds
- Parameters
list_of_estimators – list of LightGBMWrapper objects
args –
kwargs –
- Returns
median of max_depth, boosting rounds
- get_baseline_fit_params()
- Returns
dict of params, see code.
- get_complexity(clf, *args, **kwargs)
Complexity is the max_depth and best boosting iteration. The actual per-tree depths are not made available in the LightGBM API (see here), so the max_depth parameter (the same as supplied at initialization) is returned. max_depth is the upper bound on what we know about the depth complexity.
Both properties can be acquired from the object init properties, so this function just returns them.
- Parameters
clf – LightGBMWrapper object
args –
kwargs –
- Returns
max_depth, num_boosting_rounds
- static get_complexity_param_range(X, y, *args, **kwargs)
We go over a range of max_depths with a fixed (very high) number of boosting rounds and early stopping. This avoids a combinatorial search of the space, since we get the “natural” number of rounds for a given max_depth.
Since this leads to volatility in the number of rounds discovered, i.e., the same max_depth might lead to (very) different boosting rounds, we train with the same max_depth multiple times and use the median number of rounds.
TODO: there is still some volatility that needs to be revisited. Maybe we should group results that are close enough and pick the smallest complexity among the group with the highest representative accuracy?
- Parameters
X – 2D array to perform model selection on
y – corresponding labels
args –
kwargs –
- Returns
- get_iteration_fit_params()
- Returns
dict of params, see code.
- class LightGBMWrapper
Bases:
object
Wrapper around LightGBM. This is eventually used by the ModelBuilder.
- fit(X, y, params, categorical_idxs='auto')
- predict(X)
- predict_proba(X)
Returns per-label probability values as a dict.
- Parameters
X –
- Returns
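A usage sketch for the wrapper; the exact contents of params are an assumption (LightGBM-style key/value pairs seem intended), and num_threads is set to 1 per the note in fit_and_select_model():

    from sklearn.datasets import make_classification
    from compactem.model_builder.GradientBoostingClassifier import LightGBMWrapper

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    wrapper = LightGBMWrapper()
    wrapper.fit(X, y, params={"max_depth": 4, "num_threads": 1}, categorical_idxs="auto")
    labels = wrapper.predict(X)
    probs = wrapper.predict_proba(X)   # per-label probability values as a dict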
- get_balanced_sample_weights(y)
This function computes sample weights in a way that all classes end up with the same total weight. We need to perform this step since LightGBM doesn't allow class weights. (One way to compute such weights is sketched below.)
- Parameters
y – list of labels (we don’t need instances per se to calculate weights)
- Returns
sample weights, list of floats with same length as y
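One way to compute such weights (a sketch of the stated behavior, not necessarily the library's exact code): give each instance of class c the weight len(y) / (n_classes * count_c), so every class's weights sum to len(y) / n_classes:

    import numpy as np

    def balanced_sample_weights(y):
        """Per-instance weights such that each class's weights have the same total."""
        y = np.asarray(y)
        classes, counts = np.unique(y, return_counts=True)
        per_class = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
        return [per_class[label] for label in y]

    # e.g. y = [0, 0, 0, 1] -> [2/3, 2/3, 2/3, 2.0]; both classes total 2.0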
- try_basic_functionality()
compactem.model_builder.LarsAndRidge module
- class LarsAndRidgeBinary(non_zero_terms=1, alphas=None, balance=False, fit_ridge=False)
Bases:
sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin
Objects of this class would be used by scikit-learn's OneVsRest classifier.
Warning
I have stopped development of the fit_ridge option. Consider it not tested and marked for deprecation.
- Parameters
non_zero_terms –
alphas – ignored when fit_ridge=False
balance –
fit_ridge – when LARS alone is used (fit_ridge=False), fitting is obviously faster since there are no Ridge fits
- decision_function(X)
Needed by the one-vs-rest classifier
- fit(X, y)
- get_params(deep=True)
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
mapping of string to any
- predict(X)
- set_params(**parameters)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
object
- check_decisions_multiclass(decisons, y)
Does some basic checks on the decision values returned.
- Parameters
decisons –
y –
- Returns
- confidence_from_scaled_decisions(scaled_decisions, strategy='top_two')
- test_LarsAndRidgeBinary_2class(use_ovr=False)
These are ‘visual’ tests. Manually look at the output.
- Parameters
use_ovr – you can optionally do the binary test with the OVR too, just to check how the OVR works with binary.
- Returns
- test_LarsAndRidgeBinary_multiclass()
‘visual’ tests
- Returns
compactem.model_builder.LinearProbabilityModel module
- class LinearProbabilityModel(complexity_param, *args, **kwargs)
Bases:
compactem.model_builder.base_model.ModelBuilderBase
Wrapper methods for building a Linear Probability Model (LPM), which has some advantages over logistic regression in terms of interpretability. General idea:
select given number of features using LARS
fit on features using Ridge (note: Ridge might be deprecated)
The complexity param here is the number of terms with non-zero coefficients in an LPM; if the dataset has n classes, a one-vs-all estimator is constructed, and the complexity param applies to each component classifier.
- Parameters
complexity_param – number of terms with non-zero coefficients
args –
kwargs –
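An instantiation sketch (the complexity value is illustrative):

    from compactem.model_builder.LinearProbabilityModel import LinearProbabilityModel

    # each one-vs-all component LPM will have exactly 5 non-zero coefficients
    builder = LinearProbabilityModel(complexity_param=5)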
- fit_and_select_model(X, y, params, inside_optimizer_iteration=False, *args, **kwargs)
No parameter search here, since the complexity param and the complexity are both the number of terms with non-zero coefficients, and it is expected that the complexity passed in is <= the unbounded complexity found via CV.
- Parameters
X – 2D array to perform model selection on
y – corresponding labels
params – model parameter search space
inside_optimizer_iteration – boolean to indicate if function is called inside optimizer
args –
kwargs –
- Returns
best model across params search space, parameters for this model
- get_avg_complexity(list_of_estimators, *args, **kwargs)
Calculate the “average” complexity across a set of ova LPMs. Here the “average” is essentially a check that all ova LPMs have the same complexity; if not, this is an error.
- Parameters
list_of_estimators – list of ova LPMs
args –
kwargs –
- Returns
number of non-zero coefficients in the ova LPMs if identical across all of them, else raise an error
- get_baseline_fit_params(*args, **kwargs)
Returns a search space for the ridge regression; doesn't affect the number of terms with non-zero coefficients. NOTE: support for ridge might be deprecated soon.
- Parameters
args –
kwargs –
- Returns
- get_complexity(ova_estimator, *args, **kwargs)
- Parameters
ova_estimator – one-vs-all LPM classifier
args –
kwargs –
- Returns
complexity of ova estimator - this is computed as the number of non-zero coefficients per LPM in the one-vs-all classifier, and raises ValueError if this number is not identical across them
- static get_complexity_param_range(X, y, *args, **kwargs)
Grid search via CV is performed to find the best number of non-zero coefficients in the range 1…np.shape(X)[1].
- Parameters
X – 2D array to perform model selection on
y – corresponding labels
args –
kwargs –
- Returns
list of non-zero coefficient counts from 1 up to the number discovered by grid search
- get_iteration_fit_params(*args, **kwargs)
Returns a search space for the ridge regression; doesn't affect the number of terms with non-zero coefficients. NOTE: support for ridge might be deprecated soon.
- Parameters
args –
kwargs –
- Returns
compactem.model_builder.RandomForestClassifier module
- class RandomForest(complexity_param, *args, **kwargs)
Bases:
compactem.model_builder.base_model.ModelBuilderBase
We define the complexity as a tuple (max_depth, n_estimators).
- Parameters
complexity_param – tuple (max_depth, n_estimators)
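An instantiation sketch following the signature above (values illustrative):

    from compactem.model_builder.RandomForestClassifier import RandomForest

    builder = RandomForest(complexity_param=(5, 50))   # (max_depth, n_estimators)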
- fit_and_select_model(X, y, params, inside_optimizer_iteration=False, *args, **kwargs)
Only CV is performed, and based on whether the fit is inside the optimizer or outside it, we change the number of folds. Also note: in addition to the params passed in, this adds a further search-space dimension, max_features. This is done here since we use Breiman's suggestion, \(\log_2(\text{num_features}) + 1\), as a lower bound, which is data dependent; it is determined based on the input X (see the sketch after this method).
- Parameters
X – 2D array to perform model selection on
y – corresponding labels
params – model parameter search space
inside_optimizer_iteration – denotes whether this call is from within the optimizer iterations
- Returns
scikit’s RandomForestClassifier object, best_parameters in params search space
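The data-dependent lower bound mentioned above is easy to reproduce; a one-line sketch (the integer rounding is an assumption):

    import math

    num_features = 64   # i.e. X.shape[1]
    max_features_lower_bound = int(math.log2(num_features) + 1)   # -> 7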
- get_avg_complexity(list_of_estimators, *args, **kwargs)
We compute the median per complexity dimension separately.
- Parameters
list_of_estimators – list of scikit’s RandomForestClassifier objects
- Returns
median of the complexities i.e. median of Random Forest depths, and number of trees, as calculated by
get_complexity()
- get_baseline_fit_params(*args, **kwargs)
- Parameters
args –
kwargs –
- Returns
search space has exactly one value of max_depth and n_estimators - the complexity param defined for this object
- get_complexity(estimator, *args, **kwargs)
- Parameters
estimator – scikit’s RandomForestClassifier object
args –
kwargs –
- Returns
median depth (can be float), number of trees
- static get_complexity_param_range(X, y, hold_fixed=None, *args, **kwargs)
Provides the complexity range for a RF in terms of max_depth and n_estimators. For experiments where both parameters don't need to change, one may specify which parameter to hold constant and at what value.
- Parameters
X – 2D array to perform model selection on
y – corresponding labels
hold_fixed – dict specifying which parameter to hold fixed. For example, to hold max_depth at 5, this argument should be {‘max_depth’: 5}
args –
kwargs –
- Returns
- get_iteration_fit_params(*args, **kwargs)
- Parameters
args –
kwargs –
- Returns
search space has exactly one value of max_depth and n_estimators - the complexity param defined for this object
- calculate_RandomForest_complexity(estimator)
We calculate the complexity of a Random Forest as the (median of tree depths, number of trees). This logic doesn't depend on the dataset, and defining it at the module level means we can reuse it in different places.
- Parameters
estimator – scikit’s RandomForestClassifier object
- Returns
median depth (can be float), number of trees
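A plausible implementation of this logic using scikit-learn's public API (a sketch consistent with the description above; the actual source may differ):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    def rf_complexity(estimator):
        """(median of tree depths, number of trees) for a fitted RandomForestClassifier."""
        depths = [tree.get_depth() for tree in estimator.estimators_]
        return float(np.median(depths)), len(estimator.estimators_)

    X, y = make_classification(n_samples=300, n_features=10, random_state=0)
    rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
    print(rf_complexity(rf))   # e.g. (9.0, 20) -- the median depth may be a float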
compactem.model_builder.base_model module
- class ModelBuilderBase(complexity_param, *args, **kwargs)
Bases:
object
Python's enforcement of abstract classes is weak in the sense that deriving classes don't need to match the function signature. The signatures provided herein should be used as “documentation” if you don't want things to break.
Guiding principles:
abstract methods to be implemented in subclass.
methods that mention “Do not override in subclass.” in their docstring typically shouldn't be overridden, unless you want to change some fundamental behavior.
everything else is optional; whether to implement should be decided based on the docstring.
The complexity parameter is fixed for an object of this class. In other words, an object of this class can build models only for a fixed complexity, controlled by the complexity_param.
- Parameters
complexity_param – the param that decides the complexity of the model to be learned.
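A skeleton subclass showing the contract (the abstract methods are stubbed per the signatures documented below; the body logic and the param-space format are entirely illustrative, and the attribute used to stash complexity_param is our own, not necessarily the base class's):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from compactem.model_builder.base_model import ModelBuilderBase

    class MyTreeBuilder(ModelBuilderBase):
        """Illustrative subclass where complexity_param is a max_depth."""

        def __init__(self, complexity_param, *args, **kwargs):
            super().__init__(complexity_param, *args, **kwargs)
            self._max_depth = complexity_param   # our own attribute

        def fit_and_select_model(self, X, y, params, inside_optimizer_iteration=False,
                                 *args, **kwargs):
            model = DecisionTreeClassifier(max_depth=self._max_depth).fit(X, y)
            return model, {"max_depth": self._max_depth}

        def get_complexity(self, estimator, *args, **kwargs):
            return estimator.get_depth()

        def get_avg_complexity(self, list_of_estimators, *args, **kwargs):
            return int(np.median([e.get_depth() for e in list_of_estimators]))

        def get_baseline_fit_params(self, *args, **kwargs):
            return {"max_depth": [self._max_depth]}   # format is up to the subclass

        def get_iteration_fit_params(self, *args, **kwargs):
            return {"max_depth": [self._max_depth]}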
- data_split_resolver(dataset_identifier)
Do not override in subclass.
A convenience function that allows referring to splits by name; a helper for __resolve_datasets_for_fit_and_eval__(). If a tuple (X, y) of data is passed, this transparently returns it with no processing.
- Parameters
dataset_identifier – can be a string (‘train’, ‘val’, ‘train_val’, ‘test’) or a tuple
- Returns
data X, y
- fit_and_evaluate(fit_on, eval_on, params, inside_optimizer_iteration=False, num_train_points=None, **kwargs)
Do not override in subclass.
This is a wrapper around fit_and_evaluate_on_data(), which the subclass must implement.
- Parameters
fit_on – data to train on, can be a string (‘train’, ‘val’, ‘train_val’, ‘test’) or a tuple
eval_on – validation data, can be a string (‘train’, ‘val’, ‘train_val’, ‘test’) or a tuple
params – parameters for model selection
inside_optimizer_iteration – denotes whether this call is from within the optimizer iterations; it might make sense to define the function's behavior based on the call site
num_train_points – number of training points (stratified) to use from fit_on. If None, use all points.
- Returns
score on validation data; best model learned on fit_on data; best parameters from params
- abstract fit_and_select_model(X, y, params, inside_optimizer_iteration=False, *args, **kwargs)
This is the key model training function: it implements how a model is trained on a dataset, given a parameter range to search. Other functions in this class rely on this method.
- Parameters
X – 2D array to perform model selection on
y – labels
params – param range to search; no fixed format, since the subclass decides how params is produced by other functions like get_baseline_fit_params(), which the subclass also implements. These must be consistent.
inside_optimizer_iteration – denotes whether this call is from within the optimizer iterations; it might make sense to define the function's behavior based on the call site
- Returns
best_model (must support predict()), best_params
Attention
Remember to handle cases where the data passed in might not be “proper”, e.g., a sample has only points of one label. This might happen when the optimizer is exploring the search space. Handle such cases by returning an accuracy of 0, so that the optimizer learns to avoid them; a guard is sketched below.
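One way such a guard might look inside an implementation (how the zero accuracy is propagated depends on the surrounding machinery, so the return convention here is an assumption):

    import numpy as np

    def fit_and_select_model_guard(X, y):
        """Illustrative guard for degenerate samples inside fit_and_select_model()."""
        if len(np.unique(y)) < 2:
            # single-label sample, e.g. proposed by the optimizer while exploring;
            # signal an unusable fit so the caller can score it as 0
            return None, None
        # ... proceed with normal model selection ...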
- fit_baseline_model(all_baselines=False, num_train_points=None)
Do not override in subclass.
Fits the baseline model on different splits; reuses fit_and_evaluate() internally.
- Parameters
all_baselines – whether to fit all combinations of baseline models. Combinations here mean train on train and report the score on train, train on train and report on validation, etc. Probably not needed for practical use; this option was initially used for research.
num_train_points – number of points to use from the training split
- Returns
return various scores and models (consult source)
- fit_model_within_iteration(X, y, **kwargs)
Do not override in subclass.
This fit method is invoked within the optimization loop. We reuse fit_and_evaluate() here.
- Parameters
X – data generated with current density params
y – labels for X
**kwargs –
any other params to be passed on to fit_and_evaluate
- Returns
scores on train, val and test; and the model fit on train.
- abstract get_avg_complexity(list_of_estimators, *args, **kwargs)
Define how the average complexity of estimators should be calculated. For example, in the case of decision trees this could be the median decision tree depth.
- Parameters
list_of_estimators –
- Returns
average of the complexities of the estimators
- abstract get_baseline_fit_params(*args, **kwargs)
This function should return a range of parameters across which a model is to be selected as the baseline model. The format for this range is up to the user, since it is handled by fit_and_evaluate_on_data(), which also needs to be implemented in the subclass.
- Returns
parameter range across which the best baseline model is to be picked
- abstract get_complexity(estimator, *args, **kwargs)
Get the complexity of the model passed in.
- Parameters
estimator – model whose complexity is to be computed
- Returns
complexity value
- static get_complexity_param_range(X, y, *args, **kwargs)
An implementation should return the range of sizes, as an iterable, that models need to be built for. This is helpful only if the user wants this range to be derived from some data (X, y), e.g., if they want to build models with complexity less than the natural optimal complexity of the model found via cross-validation. This is grouped here to keep things related to model building in one place.
This is not an object method because it provides the sizes/complexities that would be required to instantiate an object. It is intended as a convenience method and is optional to implement.
- Parameters
X – 2D array of data based on which parameter range must be determined
y – corresponding labels
- Returns
parameter range
- abstract get_iteration_fit_params(*args, **kwargs)
This function should return the range of parameters across which a model is to be selected within an optimizer iteration. Ideally, this should be cheap to compute since it runs within an iteration. The format for this range is up to the user, since it is handled by fit_and_evaluate_on_data(), which also needs to be implemented in the subclass.
- Returns
parameter range across which the best model within an optimizer iteration is to be picked
- load_data_splits(X_train, y_train, X_train_val, y_train_val, X_val, y_val, X_test, y_test)
Do not override in subclass.
The data splits are assigned in one place so that the overhead of passing them around in different function calls is avoided.
- Parameters
X_train – 2D array denoting train data
y_train – corresponding train labels
X_train_val – 2D array denoting train+val data
y_train_val – corresponding train+val labels
X_val – 2D array denoting val data
y_val – corresponding val labels
X_test – 2D array denoting test data
y_test – corresponding test labels
- Returns
None
- static save_model(model, file_path_no_ext, *args, **kwargs)
If this is not implemented by a subclass, an attempt is made to save the model with pickle. This is not an object function, and is grouped here to keep all things related to model building in one place.
- Parameters
model – model object to save
file_path_no_ext – the path where the model is to be saved, without the extension. The final path must be returned so it can be added to the result files. If a subclass implements this, it's good practice to add an extension for usability.
- Returns
path where file is saved
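The pickle fallback described above might look like this (a sketch; the '.pkl' extension is an assumption):

    import pickle

    def save_model(model, file_path_no_ext, *args, **kwargs):
        """Pickle the model and return the final path, extension included."""
        path = file_path_no_ext + ".pkl"
        with open(path, "wb") as f:
            pickle.dump(model, f)
        return path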