Quickstart

Install the Python package (Python >=3.6 supported):

pip install compactem

Note

LightGBM, which our library depends on, might run into installation issues on Mac. See here and here.

Let’s get started with this short example:

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
from compactem.oracles import get_calibrated_gbm
from compactem.main import compact_using_oracle
from compactem.utils.data_format import DataInfo
from compactem.model_builder import DecisionTree
import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format

# use small N, T for quick results
N, T = 1000, 50
X, y = load_digits(return_X_y=True)

X, _, y, _ = train_test_split(X, y, train_size=N, stratify=y, random_state=0)
dataset_info = DataInfo("digits", (X, y), [3, 4, 5], evals=T)

# if you run this a second time on the same task_dir you might want to set "overwrite=True"
aggr_results_df = compact_using_oracle(datasets_info=dataset_info,
                                       model_builder_class=DecisionTree,
                                       oracle=get_calibrated_gbm,
                                       task_dir=r'output/quickstart')
print("Result summary:")
print(aggr_results_df[['dataset_name', 'complexity', 'avg_original_score',
                       'avg_new_score', 'pct_improvement']])

Here’s the output:

Output - rounded to 2 decimal places. Scores are `F1-macro` scores.
	dataset_name	complexity	avg_original_score	avg_new_score	pct_improvement
0	digits	3	0.39	0.46	18.25
1	digits	4	0.55	0.58	6.87
2	digits	5	0.70	0.71	1.28

You will likely not see these exact numbers, but if a table like this is displayed on your console, congratulations, it’s alive!

Here’s what happened in the above example:

We wanted compact decision trees at a few model complexities (the list [3, 4, 5] passed to DataInfo) …
… using Gradient Boosted Decision Trees (get_calibrated_gbm) as the oracle.
Since our algorithm is iterative, we also provided a budget of iterations (evals=T).

The pct_improvement column shows how much the oracle-guided scores (avg_new_score) improve over the original scores (avg_original_score) for a given model complexity. You can also obtain the instances the model selectively trained on - see Additional Stuff.
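To make the meaning of pct_improvement concrete, here is a minimal sketch that treats it as a relative percentage improvement of the new score over the original score. This formula is an assumption for illustration, not taken from the library's source, and the helper name pct_improvement below is hypothetical. The scores are the rounded values from the table above.

```python
def pct_improvement(original_score, new_score):
    """Relative improvement of new_score over original_score, in percent.

    Assumed definition for illustration; the library may aggregate
    per-run improvements differently.
    """
    if original_score == 0:
        raise ValueError("original_score must be non-zero")
    return 100.0 * (new_score - original_score) / original_score


# (complexity, avg_original_score, avg_new_score), copied from the rounded table
rows = [(3, 0.39, 0.46), (4, 0.55, 0.58), (5, 0.70, 0.71)]

for complexity, original, new in rows:
    print(f"complexity={complexity}: {pct_improvement(original, new):.2f}%")
```

This gives roughly 17.9, 5.5 and 1.4, which is close to the pct_improvement column but not identical to it. Some gap is expected because the displayed scores are rounded to 2 decimal places; the library may also average improvements across runs rather than compute them from the averaged scores.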