Quickstart

Install the Python package (Python 3.6 or later is supported):

pip install compactem

Note

Installing LightGBM, which our library depends on, may cause issues on Mac. See here and here.
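Before running the example, it can help to confirm that both packages import cleanly. The check below is a minimal sketch using only the standard library; the helper name can_import is hypothetical and not part of compactem. It assumes a broken LightGBM install shows up as an ImportError or OSError at import time.

```python
import importlib


def can_import(module_name):
    """Try to import a module; return (ok, error_message)."""
    try:
        importlib.import_module(module_name)
    except (ImportError, OSError) as exc:
        # OSError is included on the assumption that a missing native
        # library surfaces this way when the package is loaded.
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""


for name in ("lightgbm", "compactem"):
    ok, error = can_import(name)
    print(f"{name}: {'OK' if ok else 'FAILED - ' + error}")
```

If lightgbm fails here, fixing that installation first is likely to be quicker than debugging the quickstart itself.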

Let’s get started with this short example:

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
from compactem.oracles import get_calibrated_gbm
from compactem.main import compact_using_oracle
from compactem.utils.data_format import DataInfo
from compactem.model_builder import DecisionTree
import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format

# use small N, T for quick results
N, T = 1000, 50
X, y = load_digits(return_X_y=True)

X, _, y, _ = train_test_split(X, y, train_size=N, stratify=y, random_state=0)
dataset_info = DataInfo("digits", (X, y), [3, 4, 5], evals=T)

# if you run this a second time on the same task_dir you might want to set "overwrite=True"
aggr_results_df = compact_using_oracle(datasets_info=dataset_info,
                                       model_builder_class=DecisionTree,
                                       oracle=get_calibrated_gbm,
                                       task_dir=r'output/quickstart')
print("Result summary:")
print(aggr_results_df[['dataset_name', 'complexity', 'avg_original_score',
                       'avg_new_score', 'pct_improvement']])

Here’s the output:

Output - truncated to 2 decimal places. Scores are F1-macro scores.

  dataset_name  complexity  avg_original_score  avg_new_score  pct_improvement
0       digits           3                0.39           0.46            18.25
1       digits           4                0.55           0.58             6.87
2       digits           5                0.70           0.71             1.28

You will likely not see these exact numbers, but if a table is displayed on the console, congratulations, it’s alive!
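The score columns in the output are F1-macro scores, that is, the unweighted mean of the per-class F1 scores. The following is a minimal pure-Python sketch of that metric for reference; it is not the library's own implementation, and the function name f1_macro is hypothetical.

```python
from collections import Counter


def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gets a false positive
            fn[truth] += 1  # true class gets a false negative
    per_class = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_class.append(2 * tp[label] / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


print(f1_macro([0, 0, 1, 1], [0, 1, 1, 1]))
```

Because every class counts equally, a model that ignores a rare class is penalised more heavily than it would be under plain accuracy.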

Here’s what happened in the above example:

  • We wanted to compact decision trees of certain sizes (complexities 3, 4 and 5, the [3, 4, 5] passed to DataInfo) …

  • … using Gradient Boosted Decision Trees as the oracle.

  • Since our algorithm is iterative, we have also provided a budget of iterations (evals=T, here 50).

The pct_improvement column shows how much the oracle-guided scores, avg_new_score, improve over the original scores, avg_original_score, for a given model complexity. You can also obtain the instances the model selectively trained on; see Additional Stuff.
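As an illustration of how such a figure can be read, the sketch below computes a relative percentage change between two scores. It assumes pct_improvement is a relative change with respect to the original score; the exact formula and any averaging across runs used by the library are not confirmed here, and the function name is hypothetical.

```python
def pct_improvement(original_score, new_score):
    """Relative change of new_score over original_score, in percent."""
    if original_score == 0:
        raise ValueError("original_score must be non-zero")
    return 100.0 * (new_score - original_score) / original_score


# Using the rounded scores from the first table row.
print(round(pct_improvement(0.39, 0.46), 2))
```

Applying this to the rounded table values will not reproduce the pct_improvement column exactly, since the displayed scores are shown to 2 decimal places and the library may aggregate over several runs.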