compactem.utils package

Various utility functions are grouped here, e.g., data format checks.

Submodules

compactem.utils.cv_utils module

balance_data(X, y)

Returns a class-balanced dataset of the same size as the original, i.e., numpy.shape(X)[0] points.

Parameters
  • X

  • y

Returns

a class-balanced version of (X, y)
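
For illustration, a minimal sketch of one way such balancing could work (a hypothetical re-implementation, not the package's actual code; it assumes oversampling with replacement):

    import numpy as np

    def balance_data_sketch(X, y, seed=0):
        # Hypothetical sketch: split the total size into near-equal parts,
        # one per class, then resample each class with replacement so that
        # minority classes can be oversampled.
        rng = np.random.default_rng(seed)
        classes = np.unique(y)
        n_total, n_classes = X.shape[0], len(classes)
        base, rem = divmod(n_total, n_classes)
        sizes = [base + 1 if i < rem else base for i in range(n_classes)]
        idx = np.concatenate([
            rng.choice(np.where(y == cls)[0], size=size, replace=True)
            for cls, size in zip(classes, sizes)
        ])
        return X[idx], y[idx]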

generate_cv_indices(X, y, folds, balance=True)

scikit-learn's grid search CV relies on the class_weight/sample_weight parameters of the underlying classifier/regressor to handle imbalance. Unfortunately, not all predictors support these parameters, e.g., LAR. This method generates indices that should be used in CV in the case of class imbalance; the indices take care of balancing the dataset for the training data.

Parameters
  • X

  • y

  • folds

  • balance – if classes need to be balanced

Returns

list of tuples, one per fold (so the list has folds entries); each tuple contains (1) a list of training indices and (2) a list of test indices for that fold
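
For illustration, a hypothetical usage sketch: scikit-learn's GridSearchCV accepts an iterable of (train, test) index tuples as its cv argument, so the output of this method can be passed to it directly (the toy dataset and parameter grid below are made up for the example):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier
    from compactem.utils.cv_utils import generate_cv_indices

    # Imbalanced toy dataset: roughly a 90%/10% class split.
    X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
    cv_indices = generate_cv_indices(X, y, folds=5, balance=True)
    search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                          param_grid={"max_depth": [2, 4, 8]},
                          cv=cv_indices)
    search.fit(X, y)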

get_train_and_hold_out_splits(X, y, hold_out_pct=0.2, balance=True)

This method returns the dataset itself, since we'd be materializing one version of this data anyway even if indices were returned; this is unlike cross-validation, where <folds> versions of the dataset would need to be stored in memory. The held-out set must represent the original distribution, but the training set must be balanced to facilitate the learner. The balanced training dataset has the same size as the unbalanced training dataset.

Parameters
  • X

  • y

  • hold_out_pct

  • balance – if classes are to be balanced

Returns

the training split (balanced, if requested) and the held-out split, as actual data rather than indices
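
For illustration, a minimal sketch of the behavior described above, reusing balance_data_sketch from the balance_data() entry (hypothetical; the real function's return order is not documented here):

    from sklearn.model_selection import train_test_split

    def train_and_hold_out_sketch(X, y, hold_out_pct=0.2, seed=0):
        # The hold-out split keeps the original class distribution
        # (stratified); the training split is then re-balanced to the
        # same size as the raw training split.
        X_tr, X_hold, y_tr, y_hold = train_test_split(
            X, y, test_size=hold_out_pct, stratify=y, random_state=seed)
        X_tr_bal, y_tr_bal = balance_data_sketch(X_tr, y_tr, seed=seed)
        return X_tr_bal, y_tr_bal, X_hold, y_hold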

robust_train_test_split(*arrays, **options)

A robust version of train_test_split() that returns something even when some classes have only one instance, a condition on which sklearn errors out. This is needed while exploring distributions, the objective being "stratify when you can; for what remains, assign it to the test split". All iterable arguments are expected to be in numpy array format, and the stratify argument must necessarily be specified, both unlike train_test_split().

The residual assignments are made to the test split, since the test set is typically used to compute accuracy metrics and it is important to show a score of 0 for these left-over labels.

Parameters
  • arrays – same as the corresponding argument of train_test_split()

  • options – same as the corresponding argument of train_test_split()

Returns

train and test splits of the input arrays, in the same form as train_test_split()
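
For illustration, a minimal sketch of the "stratify when you can" logic, simplified to a single (X, y) pair rather than the *arrays form, and assuming numpy arrays as inputs (hypothetical, not the package's actual code):

    import numpy as np
    from sklearn.model_selection import train_test_split

    def robust_split_sketch(X, y, test_size=0.3, seed=0):
        # Classes with a single instance make sklearn's stratified split
        # error out; set them aside and append them to the test split.
        y = np.asarray(y)
        classes, counts = np.unique(y, return_counts=True)
        singletons = classes[counts < 2]
        mask = ~np.isin(y, singletons)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X[mask], y[mask], test_size=test_size, stratify=y[mask],
            random_state=seed)
        # Left-over labels go to test so they show up (with 0 scores)
        # in accuracy metrics.
        X_te = np.vstack([X_te, X[~mask]])
        y_te = np.concatenate([y_te, y[~mask]])
        return X_tr, X_te, y_tr, y_te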

sample_with_conservative_replacement(X, sample_size)

Sample points from a given 2D array X with "conservative replacement", to have maximal variety of points in the sample: if sample_size <= the number of points in X, sample without replacement; if sample_size > the number of points in X, repeat all of X as many times as possible and sample the residual quantity without replacement.

Only the indices of the sample are returned, because this is likely to be used as an internal routine by other functions and we want it to be fast.

Parameters
  • X – 2D array of points

  • sample_size – number of points to sample

Returns

indices of points from X that are in the sample
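
For illustration, a minimal sketch of the logic described above (hypothetical, not the package's actual code):

    import numpy as np

    def conservative_sample_indices_sketch(X, sample_size, seed=0):
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        # Repeat all of X as many whole times as possible, then draw the
        # residual quantity without replacement.
        full_repeats, residual = divmod(sample_size, n)
        idx = np.tile(np.arange(n), full_repeats)
        extra = rng.choice(n, size=residual, replace=False)
        return np.concatenate([idx, extra]).astype(int)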

split_integer_into_parts(total, parts)

Divides an integer into a given number of integer parts.

Parameters
  • total

  • parts

Returns

list of parts integers that sum to total
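
For illustration, a minimal sketch assuming near-equal parts (the exact partitioning scheme is an assumption):

    def split_integer_into_parts_sketch(total, parts):
        # Near-equal integer parts that sum to total, e.g.,
        # split_integer_into_parts_sketch(10, 3) -> [4, 3, 3]
        base, rem = divmod(total, parts)
        return [base + 1 if i < rem else base for i in range(parts)]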

stratified_conservative_sample(X, y, sample_size)

Get a stratified sample from X based on y: create per-class samples using sample_with_conservative_replacement() and stack them together.

Parameters
  • X – 2D array of points

  • y – labels, which will be used for stratification

  • sample_size – number of points to sample

Returns

sample of points from X, corresponding labels
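
For illustration, a minimal sketch combining the conservative-sampling sketch above with a proportional per-class size allocation (proportional allocation is an assumption; the real routine may allocate sizes differently):

    import numpy as np

    def stratified_conservative_sample_sketch(X, y, sample_size, seed=0):
        classes, counts = np.unique(y, return_counts=True)
        # Assumed: allocate per-class sizes proportionally to class frequency.
        sizes = np.floor(counts / counts.sum() * sample_size).astype(int)
        sizes[: sample_size - sizes.sum()] += 1  # distribute rounding remainder
        Xs, ys = [], []
        for cls, size in zip(classes, sizes):
            cls_idx = np.where(y == cls)[0]
            local = conservative_sample_indices_sketch(X[cls_idx], size, seed)
            Xs.append(X[cls_idx[local]])
            ys.append(np.full(size, cls))
        return np.vstack(Xs), np.concatenate(ys)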

compactem.utils.data_format module

class DataInfo(dataset_name, data, complexity_params, evals, splits=None, additional_info=None)

Bases: object

Parameters
  • dataset_name – this name is used to create result files, to associate with returned results, etc.

  • data – (X, y), tuple with a 2D array of feature vectors and labels. This can be None if the splits parameter contains data splits.

  • complexity_params – list of complexity params for which a model must be built using this dataset.

  • evals – number of optimizer iterations

  • splits – dict with keys ‘train’, ‘val’ and ‘test’, and the values being either percentages of these splits, or tuples with the actual data splits, e.g., {'test': (X_t, y_t), 'train': (X_tr, y_tr), 'val': (X_val, y_val)}. If data splits are provided, explicitly set the data parameter to None, since data takes precedence.

  • additional_info (Optional[Dict]) – additional info that is transparently passed on to the Model builder init. This must be a dict.
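
For illustration, a hypothetical construction (the dataset and values are made up for the example; the complexity_params entries are assumed here to be, e.g., tree depths):

    from sklearn.datasets import load_iris
    from compactem.utils.data_format import DataInfo

    X, y = load_iris(return_X_y=True)
    info = DataInfo(dataset_name="iris",
                    data=(X, y),
                    complexity_params=[1, 2, 3],  # illustrative, e.g., tree depths
                    evals=50,                     # optimizer iterations
                    splits=None)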

class UncertaintyInfo(oracle_accuracy, uncertainty_scores, oracle_name=None)

Bases: object

Parameters
  • oracle_accuracy – a representative accuracy score for the oracle; this is for display/result collation purposes, but it is recommended for creating helpful analyses.

  • uncertainty_scores – uncertainty scores for a dataset; list of scalar values in [0, 1].

  • oracle_name – a name for display/result collation purposes.
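
For illustration, a hypothetical construction (all values are made up):

    from compactem.utils.data_format import UncertaintyInfo

    u_info = UncertaintyInfo(
        oracle_accuracy=0.93,                 # illustrative score
        uncertainty_scores=[0.1, 0.7, 0.4],   # one score in [0, 1] per point
        oracle_name="gbm_oracle")             # illustrative display name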

validate_dataset_info(datasets_info)

Check if the dataset info supplied by the user is in the correct format. Since this is user-facing, we want to validate it closely and notify early rather than fail downstream. TODO: it might make sense to move some of the checks here to the DataInfo class

Parameters

datasets_info (Iterable[DataInfo]) – list of compactem.utils.data_format.DataInfo objects.

Return type

bool

Returns

True or False, indicating if the datasets passed in clear certain checks.
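
For illustration, validating the DataInfo object constructed earlier before running anything downstream:

    from compactem.utils.data_format import validate_dataset_info

    assert validate_dataset_info([info])  # `info` from the DataInfo sketch above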

compactem.utils.data_load module

entropy(labels, full_label_set=None)
load_data(*args)

stub

compactem.utils.output_processors module

compactem.utils.utils module

Helper functions to process scikit-learn decision trees.

class MyLabelBinarizer

Bases: object

This is different from the original LabelBinarizer in how it handles the two-class case: we don't want to return a vector of one-dimensional labels in that case, so we flatten it.

fit(y)
inverse_transform(Y)
transform(y)
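
For illustration, one plausible reading of the two-class behavior, sketched as a wrapper around scikit-learn's LabelBinarizer (hypothetical; for two classes LabelBinarizer.transform() returns an (n, 1) column, which is flattened here):

    import numpy as np
    from sklearn.preprocessing import LabelBinarizer

    class MyLabelBinarizerSketch:
        def fit(self, y):
            self._lb = LabelBinarizer().fit(y)
            return self

        def transform(self, y):
            Y = self._lb.transform(y)
            # Flatten the (n, 1) two-class output to a plain 1-D array.
            return Y.ravel() if Y.shape[1] == 1 else Y

        def inverse_transform(self, Y):
            Y = np.asarray(Y)
            if Y.ndim == 1:              # undo the flattening
                Y = Y.reshape(-1, 1)
            return self._lb.inverse_transform(Y)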
entropy(labels)
get_label_colormap(labels)

Get a colormap dict for labels.

Parameters

labels

Returns

dict with key = label and value as color name. None if there are more labels than can be supported.

get_label_colors(y, label_colormap)

Get colors corresponding to labels

Parameters
  • y – labels

  • label_colormap – obtained from get_label_colormap()

Returns

list of colors, length of list is the same as y
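
For illustration, a hypothetical plotting use of the two functions together (random data, and assuming matplotlib for display):

    import matplotlib.pyplot as plt
    import numpy as np
    from compactem.utils.utils import get_label_colormap, get_label_colors

    X = np.random.rand(100, 2)
    y = np.random.randint(0, 3, size=100)
    cmap = get_label_colormap(y)
    if cmap is not None:  # None if there are more labels than supported
        plt.scatter(X[:, 0], X[:, 1], c=get_label_colors(y, cmap))
        plt.show()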

is_iterable(obj)
isclose(a, b, rel_tol=1e-09, abs_tol=0.0)
pick_best_k(values, k)

This is a quick way to select a set of parameters that are “close” together on the real line. The idea is that if we have a one-vs-all classifier and each of the per-class classifiers returns a parameter space, then deciding on a global parameter space across all classes can be difficult.

One solution: if we need p params and we have c classes, get p params per class (a total of p*c params) and then perform a k-means with k=p.

Parameters
  • values

  • k

Returns

k representative values selected from values
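
For illustration, a minimal sketch of the k-means idea described above (hypothetical; returning cluster centers, rather than actual members of values, is an assumption):

    import numpy as np
    from sklearn.cluster import KMeans

    def pick_best_k_sketch(values, k, seed=0):
        # Cluster the pooled 1-D values and use the k cluster centers
        # as the "close together" representatives.
        values = np.asarray(values, dtype=float).reshape(-1, 1)
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(values)
        return sorted(float(c[0]) for c in km.cluster_centers_)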