compactem.utils package
Various utility functions are grouped here, e.g., data format checks.
Submodules
compactem.utils.cv_utils module
- balance_data(X, y)
Returns a class-balanced dataset with the same number of rows as the input, i.e., numpy.shape(X)[0]
- Parameters
X –
y –
- Returns
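A minimal sketch of how such per-class balancing might be implemented; the function name balance_data_sketch and the rng parameter are hypothetical, and the packaged balance_data may differ in details:

```python
import numpy as np

def balance_data_sketch(X, y, rng=None):
    """Resample so every class appears equally often, keeping roughly
    len(X) rows (when the row count isn't divisible by the number of
    classes, this sketch returns slightly fewer rows)."""
    r = np.random.default_rng(rng)
    X, y = np.asarray(X), np.asarray(y)
    classes = np.unique(y)
    per_class = X.shape[0] // len(classes)
    idx = []
    for c in classes:
        members = np.flatnonzero(y == c)
        # Sample with replacement only when a class is smaller than its quota.
        idx.append(r.choice(members, size=per_class,
                            replace=len(members) < per_class))
    idx = np.concatenate(idx)
    return X[idx], y[idx]
```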
- generate_cv_indices(X, y, folds, balance=True)
scikit-learn’s grid search CV relies on the class_weight/sample_weight parameter of the underlying classifier/regressor to handle imbalances. Unfortunately, not all predictors support this parameter, e.g., LAR. This method generates indices that should be used in CV in the case of class imbalance; the indices take care of balancing the dataset for the training data.
- Parameters
X –
y –
folds –
balance – if classes need to be balanced
- Returns
list of tuples, one per fold (size of the list = folds); each tuple contains (1) a list of training indices and (2) a list of test indices for that fold
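A sketch of how such fold indices might be produced: test indices keep the original class distribution while training indices are rebalanced by resampling. The function name balanced_cv_indices and its rng parameter are hypothetical; the packaged generate_cv_indices may differ.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def balanced_cv_indices(X, y, folds, rng=0):
    """Return a list of (train_indices, test_indices) tuples, one per fold,
    where the training side of each fold is class-balanced."""
    r = np.random.default_rng(rng)
    y = np.asarray(y)
    out = []
    for train, test in StratifiedKFold(n_splits=folds).split(X, y):
        classes = np.unique(y[train])
        per_class = len(train) // len(classes)
        balanced = np.concatenate([
            # Resample each class within the training fold to a common quota.
            r.choice(train[y[train] == c], size=per_class, replace=True)
            for c in classes])
        out.append((balanced, test))
    return out
```

Index lists of this shape can be passed directly as the `cv` argument of scikit-learn's GridSearchCV, which sidesteps the need for class_weight support in the estimator.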
- get_train_and_hold_out_splits(X, y, hold_out_pct=0.2, balance=True)
This method returns the dataset itself rather than indices, since one version of this data would be materialized anyway (unlike cross-validation, where <folds> versions of the dataset would need to be stored in memory). The held-out set must represent the original distribution, but the training set must be balanced to facilitate the learner. The balanced training dataset has the same size as the unbalanced training dataset.
- Parameters
X –
y –
hold_out_pct –
balance – if classes are to be balanced
- Returns
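The split-then-balance behavior described above might be sketched as follows; the name train_holdout_sketch and the rng parameter are hypothetical, and the actual implementation may differ:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def train_holdout_sketch(X, y, hold_out_pct=0.2, rng=0):
    """The hold-out set keeps the original class distribution (via
    stratification); the training set is then rebalanced by per-class
    resampling to roughly its original size."""
    r = np.random.default_rng(rng)
    X, y = np.asarray(X), np.asarray(y)
    X_tr, X_ho, y_tr, y_ho = train_test_split(
        X, y, test_size=hold_out_pct, stratify=y, random_state=0)
    classes = np.unique(y_tr)
    per_class = len(y_tr) // len(classes)
    idx = np.concatenate([
        r.choice(np.flatnonzero(y_tr == c), size=per_class,
                 replace=(y_tr == c).sum() < per_class)
        for c in classes])
    return X_tr[idx], y_tr[idx], X_ho, y_ho
```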
- robust_train_test_split(*arrays, **options)
A robust version of train_test_split() that returns something even when some classes have only one instance, a condition on which sklearn errors out. This is needed while exploring distributions, the objective being “stratify when you can; for what remains, assign it to the test set”. All iterable arguments are expected to be numpy arrays and the “stratify” argument must necessarily be specified, both unlike train_test_split().
The residual assignments are made to test since the test is typically used to calculate accuracy metrics and it is important to show a 0 score for these left-over labels.
- Parameters
arrays – namesake argument wrt train_test_split()
options – namesake argument wrt train_test_split()
- Returns
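The “stratify when you can, send the rest to test” policy might look like this; robust_split_sketch and its keyword arguments are hypothetical names, not the library's actual signature:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def robust_split_sketch(X, y, test_size=0.3, random_state=0):
    """Classes with a single instance go straight to the test split;
    the remainder is stratified normally with train_test_split."""
    X, y = np.asarray(X), np.asarray(y)
    labels, counts = np.unique(y, return_counts=True)
    singleton = np.isin(y, labels[counts < 2])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[~singleton], y[~singleton], test_size=test_size,
        stratify=y[~singleton], random_state=random_state)
    # Residual single-instance classes are appended to test so that
    # accuracy metrics register a (typically zero) score for them.
    return (X_tr, np.concatenate([X_te, X[singleton]]),
            y_tr, np.concatenate([y_te, y[singleton]]))
```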
- sample_with_conservative_replacement(X, sample_size)
Sample points from a given 2D array X so as to have maximal variety of points in the sample. If sample_size <= the number of points in X, sample without replacement. If sample_size > the number of points in X, repeat all of X as many times as possible, then sample the residual quantity without replacement.
Only the indices of the sample are returned, because this is likely to be used as an internal routine by other functions and we want it to be fast.
- Parameters
X – 2D array of points
sample_size – number of points to sample
- Returns
indices of points from X that are in the sample
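The repeat-then-sample-residual logic above can be sketched in a few lines; conservative_sample_indices and its rng parameter are hypothetical names:

```python
import numpy as np

def conservative_sample_indices(X, sample_size, rng=0):
    """Return indices into X: whole copies of X as many times as fit,
    plus a without-replacement draw for the residual."""
    r = np.random.default_rng(rng)
    n = np.shape(X)[0]
    full_repeats, residual = divmod(sample_size, n)
    idx = np.tile(np.arange(n), full_repeats)
    return np.concatenate([idx, r.choice(n, size=residual, replace=False)])
```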
- split_integer_into_parts(total, parts)
Divides an integer into a given number of integer parts.
- Parameters
total –
parts –
- Returns
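One common way to do this is a divmod-based split into parts that differ by at most one; split_integer_sketch is a hypothetical re-creation, not necessarily the packaged implementation:

```python
def split_integer_sketch(total, parts):
    """Divide `total` into `parts` integers that sum back to `total`
    and differ from each other by at most 1."""
    q, r = divmod(total, parts)
    # The first r parts absorb the remainder.
    return [q + 1] * r + [q] * (parts - r)
```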
- stratified_conservative_sample(X, y, sample_size)
Get a stratified sample from X based on y. Create per class samples based on sample_with_conservative_replacement() and stack them together.
- Parameters
X – 2D array of points
y – labels, which will be used for stratification
sample_size – number of points to sample
- Returns
sample of points from X, corresponding labels
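Combining the two ideas above — per-class quotas plus conservative sampling — might look like this; stratified_conservative_sketch and its rng parameter are hypothetical:

```python
import numpy as np

def stratified_conservative_sketch(X, y, sample_size, rng=0):
    """Allocate sample_size across classes proportionally, conservatively
    sample each class, and stack the per-class samples together."""
    r = np.random.default_rng(rng)
    X, y = np.asarray(X), np.asarray(y)
    Xs, ys = [], []
    for c in np.unique(y):
        members = np.flatnonzero(y == c)
        quota = round(sample_size * len(members) / len(y))
        reps, residual = divmod(quota, len(members))
        # Whole copies of the class, plus a without-replacement residual.
        idx = np.concatenate([np.tile(members, reps),
                              r.choice(members, size=residual, replace=False)])
        Xs.append(X[idx])
        ys.append(y[idx])
    return np.vstack(Xs), np.concatenate(ys)
```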
compactem.utils.data_format module
- class DataInfo(dataset_name, data, complexity_params, evals, splits=None, additional_info=None)
Bases: object
- Parameters
dataset_name – this name is used to create result files, associate with returned results, etc.
data – (X, y), tuple with 2D array of features vectors and labels. This can be None if the splits parameter contains data splits.
complexity_params – list of complexity params for which a model must be built using this dataset.
evals – number of optimizer iterations
splits – dict with keys ‘train’, ‘val’ and ‘test’, and the values being either percentages of these splits, or tuples with the actual data splits, e.g., {'test': (X_t, y_t), 'train': (X_tr, y_tr), 'val': (X_val, y_val)}. If data splits are provided, explicitly set the data parameter to None, since the data parameter takes precedence.
additional_info (Optional[Dict]) – additional info that is transparently passed on to the Model builder init. This must be a dict.
- class UncertaintyInfo(oracle_accuracy, uncertainty_scores, oracle_name=None)
Bases: object
- Parameters
oracle_accuracy – a representative accuracy score for the oracle; this is for display/result collation purposes, but providing it is recommended for creating helpful analyses.
uncertainty_scores – uncertainty scores for a dataset; list of scalar values in [0, 1].
oracle_name – a name for display/result collation purposes.
- validate_dataset_info(datasets_info)
Check if the dataset info supplied by user is in the correct format. Since this is user-facing, we want to validate this closely and notify early rather than downstream. TODO: it might make sense to move some of the checks here to the DataInfo class
- Parameters
datasets_info (Iterable[DataInfo]) – list of compactem.utils.data_format.DataInfo objects.
- Return type
bool
- Returns
True or False, indicating if the datasets passed in clear certain checks.
compactem.utils.data_load module
- entropy(labels, full_label_set=None)
- load_data(*args)
stub
compactem.utils.output_processors module
compactem.utils.utils module
Helper functions to process scikit decision trees.
- class MyLabelBinarizer
Bases: object
This differs from the original LabelBinarizer in how it handles the two-class case: we don’t want a two-dimensional column vector of labels in that case, so the output is flattened to one dimension.
- fit(y)
- inverse_transform(Y)
- transform(y)
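The flattening behavior can be sketched as a thin wrapper over scikit-learn's LabelBinarizer (which returns an (n, 1) column for two classes); FlatteningBinarizer is a hypothetical name for illustration:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

class FlatteningBinarizer:
    """Like LabelBinarizer, but returns a 1-D vector in the 2-class case."""

    def fit(self, y):
        self._lb = LabelBinarizer().fit(y)
        return self

    def transform(self, y):
        Y = self._lb.transform(y)
        # Flatten the binary case: (n, 1) column -> (n,) vector.
        return Y.ravel() if Y.shape[1] == 1 else Y

    def inverse_transform(self, Y):
        Y = np.asarray(Y)
        if Y.ndim == 1:
            # Restore the column shape LabelBinarizer expects.
            Y = Y.reshape(-1, 1)
        return self._lb.inverse_transform(Y)
```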
- entropy(labels)
- get_label_colormap(labels)
Get a colormap dict for labels.
- Parameters
labels –
- Returns
dict with key = label and value as color name. None if there are more labels than can be supported.
- get_label_colors(y, label_colormap)
Get colors corresponding to labels
- Parameters
y – labels
label_colormap – obtained from get_label_colormap()
- Returns
list of colors, length of list is the same as y
- is_iterable(obj)
- isclose(a, b, rel_tol=1e-09, abs_tol=0.0)
- pick_best_k(values, k)
This is a quick way to select a bunch of parameters that are “close” together on the real line. The idea: if we have a one-vs-all classifier and each of the classifiers returns a parameter space, deciding on a global param space across all classes can be difficult.
One solution: if we need p params and we have c classes, get p params per class (for a total of p*c params), then perform a k-means with k=p.
- Parameters
values –
k –
- Returns
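The clustering idea above might be realized as a 1-D k-means over the pooled values, keeping the cluster centers as the k representatives; pick_best_k_sketch is a hypothetical re-creation, not necessarily what pick_best_k does internally:

```python
import numpy as np
from sklearn.cluster import KMeans

def pick_best_k_sketch(values, k, random_state=0):
    """Cluster the pooled 1-D parameter values into k groups and
    return the sorted cluster centers as the chosen parameters."""
    v = np.asarray(values, dtype=float).reshape(-1, 1)
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(v)
    return sorted(km.cluster_centers_.ravel().tolist())
```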