For each ensemble model, the cross-validation procedure outputs a table of confidences. Rows in the table are examples, columns are labels (from a hierarchical class) and values are probabilities that the labels are associated with the examples. The probabilities indicate how confident the model is in the established associations. The table aggregates the examples from all n test sets.
Suppose that we have a simple data set with a tree-shaped hierarchical class of five labels connected in the following manner:
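One tree consistent with the confidence table below and the hierarchy constraint discussed later in this section places l1 and l2 directly under the root, with l3, l4 and l5 as children of l2:

```
      root
     /    \
   l1      l2
          / | \
        l3  l4  l5
```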
The data set has ten examples. Two-fold cross-validation randomly divides the examples into two groups of five: (e2, e3, e7, e9, e10) and (e1, e4, e5, e6, e8). The procedure first constructs a model with the baseline algorithm from the first set of examples and uses the model to make predictions for the second. Then it constructs a model from the second set of examples and makes predictions for the first. The resulting table with confidences is:
| Example ID | l1 | l2 | l3 | l4 | l5 |
| --- | --- | --- | --- | --- | --- |
| e1 | 0.12 | 0.87 | 0.05 | 0.61 | 0.79 |
| e2 | 0.98 | 0.05 | 0 | 0 | 0.01 |
| e3 | 0.02 | 0.59 | 0.05 | 0.24 | 0.59 |
| e4 | 0 | 0.99 | 0.81 | 0.33 | 0.4 |
| e5 | 0.31 | 0.55 | 0.12 | 0.05 | 0.01 |
| e6 | 0.19 | 0.91 | 0.88 | 0.02 | 0 |
| e7 | 0.84 | 0.12 | 0.01 | 0 | 0 |
| e8 | 0.14 | 0.74 | 0.09 | 0.71 | 0.73 |
| e9 | 0.31 | 0.89 | 0.27 | 0.88 | 0.84 |
| e10 | 0.92 | 0.05 | 0 | 0 | 0.01 |
The table shows which paths from the hierarchy are more or less likely associated with each of the examples. For example, the baseline model is strongly confident in the association between example e6 and the path l2 → l3 in the hierarchy (both confidences are at least 0.88).
Note that the confidence values for an individual example (within a row) satisfy the hierarchy constraint: the confidence of a label never exceeds the confidence of its parent label.
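To make the constraint concrete, here is a minimal Python sketch that checks it for one row of the table, assuming the reconstructed hierarchy above (the `PARENT` map and function name are illustrative, not part of the pipeline):

```python
# Check the hierarchy constraint for one example (one row of the table):
# a child label's confidence must not exceed its parent's confidence.
PARENT = {"l3": "l2", "l4": "l2", "l5": "l2"}  # l1 and l2 hang off the root

def satisfies_hierarchy_constraint(confidences):
    """confidences: dict mapping each label to its confidence for one example."""
    return all(confidences[child] <= confidences[parent]
               for child, parent in PARENT.items())

row_e6 = {"l1": 0.19, "l2": 0.91, "l3": 0.88, "l4": 0.02, "l5": 0.0}
print(satisfies_hierarchy_constraint(row_e6))  # True
```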
The pipeline has a separate task that divides the examples from an input data set (named the baseline data set) into n folds. Once the cross-validation folds are created, the pipeline saves the information about which examples belong to each fold. If you want to compare the algorithms on the same cross-validation folds, run this task only once; you can then run any combination of the algorithms and all of them will be evaluated on the same folds.
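A minimal sketch of this idea, assuming scikit-learn's `KFold` and a hypothetical `folds.csv` output file (the pipeline's actual task and file format may differ):

```python
# Create n cross-validation folds once and persist the assignment, so that
# every algorithm can later be evaluated on the same folds.
import csv
from sklearn.model_selection import KFold

example_ids = [f"e{i}" for i in range(1, 11)]  # the ten examples from above
kf = KFold(n_splits=2, shuffle=True, random_state=42)

with open("folds.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["example_id", "fold"])
    for fold, (_, test_idx) in enumerate(kf.split(example_ids)):
        writer.writerows([example_ids[i], fold] for i in test_idx)
```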
From the table with confidences the pipeline computes two types of measures of the model's predictive performance: threshold-dependent measures and threshold-independent measures.
Suppose that a model predicts a label l for a set of examples. For each example, it outputs a confidence. To compute the threshold-dependent measures from the confidences, the pipeline follows these steps:

1. Choose a confidence threshold t.
2. For each example, predict l if its confidence is greater than or equal to t, and not-l otherwise.
3. Compare the predictions with the actual labels and count the outcomes described below.
The model's decision to associate l or not-l with an example may or may not be correct. By making such observations over a set of examples, four outcomes can be recorded for the model and the label l:

- true positive (TP): the model predicts l and the actual label is l,
- false positive (FP): the model predicts l but the actual label is not-l,
- false negative (FN): the model predicts not-l but the actual label is l,
- true negative (TN): the model predicts not-l and the actual label is not-l.
The outcomes are summarized in a table called a confusion matrix:
| Predicted \ Actual label | l | not-l |
| --- | --- | --- |
| l | TP | FP |
| not-l | FN | TN |
From the confusion matrix, the pipeline computes four measures:

\[\text{accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{FN} + \text{TN}}\]
\[\text{precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}\]
\[\text{recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}\]
\[\text{F-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}\]
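A direct translation of these formulas into Python (a sketch, not the pipeline's implementation; zero denominators are mapped to 0.0 by convention):

```python
# The four threshold-dependent measures from confusion-matrix counts.
def threshold_dependent_measures(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return accuracy, precision, recall, f_measure
```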
For illustration, the four threshold-dependent measures are computed below for the label l5 and threshold t = 0.5 by following the steps above:
| Example ID | e1 | e2 | e3 | e4 | e5 | e6 | e7 | e8 | e9 | e10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Confidences for l5 | 0.79 | 0.01 | 0.59 | 0.4 | 0.01 | 0 | 0 | 0.73 | 0.84 | 0.01 |
| Predicted at t = 0.5 | l | not-l | l | not-l | not-l | not-l | not-l | l | l | not-l |

Comparing these predictions with the actual labels gives the confusion matrix:
| Predicted \ Actual label | l | not-l |
| --- | --- | --- |
| l | 3 | 1 |
| not-l | 2 | 4 |
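Plugging these counts into the formulas above gives:

\[\text{accuracy} = \frac{3 + 4}{10} = 0.7,\qquad \text{precision} = \frac{3}{3 + 1} = 0.75\]
\[\text{recall} = \frac{3}{3 + 2} = 0.6,\qquad \text{F-measure} = \frac{2 \cdot 0.75 \cdot 0.6}{0.75 + 0.6} \approx 0.67\]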
The pipeline outputs a confusion matrix and the threshold-dependent measures for each label in a data set. The measures can be output for multiple thresholds, which you define in a settings file.
Accuracy is not a good measure of a model's performance in the case of highly unbalanced labels. An example of a highly unbalanced label is when, out of 100 examples, five are labeled with l and 95 with not-l. In such a case, a model that always returns not-l as an answer would have an accuracy of 0.95, yet zero precision, recall and F-measure. Be aware of unbalanced labels in data sets with a large hierarchy, where the number of positive examples decreases with the depth of the hierarchy.
Threshold-independent measures indicate a model's predictive performance over a range of confidence thresholds. The pipeline implements two such measures: AUPRC (area under the precision-recall curve) and AUC (area under the ROC curve). They are computed for each label and as averages over all labels.
Suppose that a model predicts a label l for a set of examples. AUPRC for l is computed in the following manner (a sketch in code follows the steps):

1. Vary the confidence threshold t over the range from 0 to 1.
2. For each value of t, predict the labels as described above and compute precision and recall.
3. Plot the resulting (recall, precision) pairs with recall on the x-axis and precision on the y-axis; this traces out the precision-recall curve.
4. Compute the area under the curve.
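A minimal sketch of these steps using scikit-learn, with the l5 confidences from the example above and a hypothetical ground-truth vector chosen to be consistent with the confusion matrix for l5 (the pipeline's own implementation may differ):

```python
# AUPRC for a single label from ground truth and confidences.
from sklearn.metrics import auc, precision_recall_curve

y_true = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]  # hypothetical actual labels for e1..e10
y_score = [0.79, 0.01, 0.59, 0.4, 0.01, 0, 0, 0.73, 0.84, 0.01]  # confidences for l5

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
auprc = auc(recall, precision)  # recall on the x-axis, precision on the y-axis
print(auprc)
```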
AUC is computed by following the same steps as for AUPRC, with the exception that recall on the x-axis is replaced with the false positive rate, and precision on the y-axis with the true positive rate.
The false positive rate is computed as

\[\text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}}\]

while the true positive rate is a synonym for recall.
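The corresponding sketch for AUC, reusing the same hypothetical ground truth (scikit-learn's `roc_curve` returns the false positive rates and true positive rates directly):

```python
# AUC (area under the ROC curve) for the same label.
from sklearn.metrics import auc, roc_curve

y_true = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]  # hypothetical actual labels for e1..e10
y_score = [0.79, 0.01, 0.59, 0.4, 0.01, 0, 0, 0.73, 0.84, 0.01]  # confidences for l5

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # FPR on the x-axis, TPR on the y-axis
print(auc(fpr, tpr))
```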
A final group of measures summarizes the model's performance over a set of most specific labels that is common to all five algorithms. By averaging over the most specific labels, the pipeline ensures a fair performance-based comparison between the five algorithms. These measures include the area under the average precision-recall curve, described next.
Suppose that a model predicts a set of most specific labels M for a set of examples. The area under the average precision-recall curve for M is computed in the same manner as the per-label AUPRC, except that a single average curve is constructed for the whole set: at each threshold, precision and recall are computed from confusion-matrix counts pooled over all labels in M (micro-averaging), and the area under this average curve is then measured.
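Assuming the pooled (micro-averaged) construction described above, the average curve can be traced by flattening the per-label ground truths and confidences into a single binary problem (the data below are hypothetical):

```python
# Area under the average (pooled) precision-recall curve for a label set M.
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

# Rows are examples, columns are the most specific labels in M (hypothetical data).
y_true = np.array([[1, 0], [0, 1], [1, 1]])
y_score = np.array([[0.9, 0.2], [0.1, 0.8], [0.7, 0.6]])

# Pool all (example, label) pairs and proceed exactly as for a single label.
precision, recall, _ = precision_recall_curve(y_true.ravel(), y_score.ravel())
print(auc(recall, precision))
```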
When you run an algorithm in cross-validation mode, the pipeline computes the described measures and outputs them to a file named “Evaluation_report.csv”.