SHREC 2006

- 3D shape retrieval contest -

Home Classification & Relevance assessment Performance measures used
results:
Runfile(s) per participant per Query all Runfiles

Performance measures used

Introduction

For each query there exist a class of highly relevant items and a class of marginally relevant items. The relevance assessment per query can be found here . Because of this subdivision, most of the evaluation measures have been split up as well. The following names are given: evaluation measures(highly relevant) and evaluation measures(relevant). The evaluation measures(higly relevant) are based on the highly relevant items only, while the evaluation measures(relevant) are based on all relevant items (highly relevant items + marginally relevant items).

The uploaded ranked lists are turned into a gain vector by replacing items IDs by their relevance scores. A highly relevant retrieved item corresponds to relevance score 2, a marginally relevant retrieved item corresponds to relevance score 1 and a non-relevant retrieved item corresponds to relevance score 0.

Abbrevations

Throughout this document the following abbrevations are used:

Example

On the basis of the following example all the evaluation measures will be explained.

Evaluation measures of ranked list

Evaluation measures(highly relevant)

* True Positives(highly relevant) = Vh
* False Positives(highly relevant) = Va - Vh
* True Negatives(highly relevant) = Ds + (Vh - Va) - Ch
* False Negatives(highly relevant) = Ch - Vh
* First Tier(highly relevant) = (# visible highly relevant items within the first d1 items of the rankedlist/ d1 ) * 100% (if Va < Ch then d1 = Va else d1 = Ch)
* Second Tier(highly relevant) = (# visible highly relevant items within the first d2 items of the rankedlist/ d2 ) * 100% (if Va < 2 * Ch then d2 = Va else d2 = 2*Ch)
* Precision(highly relevant) = Vh / Va
* Recall(highly relevant) = Vh / Ch
* Average Precision(highly relevant) = average of the Precision(highly relevant) scores after each highly relevant retrieved item
applied to example
* True Positives(highly relevant) = 5
* False Positives(highly relevant) = 14 - 5 = 9
* True Negatives(highly relevant) = 1814 + 5 - 14 - 6 = 1799
* False Negatives(highly relevant) = 6 - 5 = 1
* First Tier(highly relevant) = (4/6) * 100 % = 66.667%
* Second Tier(highly relevant) = (5/12) * 100% = 41.667%
* Precision(highly relevant) = 5 / 14 = 0.35714
* Recall(highly relevant) = 5 / 6 = 0.83333
* Average Precision(highly relevant) = (1/1 + 2/2 + 3/4 + 4/5 + 5/11) / 5 = 0.80090

Evaluation measures(relevant)

* True Positives(relevant) = Vr
* False Positives(relevant) = Va - Vr
* True Negatives(relevant) = Ds + Vr - Va - Cr
* False Negatives(relevant) = Cr - Vr
* First Tier(relevant) = (# visible relevant items within the first d1 items of the rankedlist/ d1 ) * 100% (if Va < Cr then d1 = Va else d1 = Ch)
* Second Tier(relevant) = (# visible relevant items within the first d2 items of the rankedlist/ d2 ) * 100% (if Va < 2 * Cr then d2 = Va else d2 = 2*Ch)
* Precision(relevant) = Vr / Va
* Recall(relevant) = Vr / Cr
* Average Precision(relevant) = average of the Precision(relevant) scores after each relevant retrieved item
applied to example
* True Positives(relevant) = 9
* False Positives(relevant) = 14 - 9 = 5
* True Negatives(relevant) = 1814 + 9 - 14 - 11 = 1798
* False Negatives(relevant) = 11 - 9 = 2
* First Tier(relevant) = (9/11) * 100 % = 81.818%
* Second Tier(relevant) = (9 / 14) * 100% = 64.285%
* Precision(relevant) = 9 / 14 = 0.6428
* Recall(relevant) = 9 / 11 = 0.8181
* Average Precision(relevant) = (1/1 + 2/2 + 3/3 + 4/4 + 5/5 + 6/6 + 7/8 + 8/10 + 9/11) / 9 = 0.94368

Average dynamic recall

The Average dynamic recall (ADR)is defined as follows:[1]
ADR_definition
where,
ADR2_definition
ADR3_definition
ADR4_definition
The collection of visible highly relevant items within the first i items of the rankedlist, if i <= Ch
The collection of visible relevant items within the first i items of the rankedlist, otherwise
applied to example
i encountered CARD(found) ri
1 [2] 1 1
2 [2,2] 2 1
3 [2,2,1] 2 0.667
4 [2,2,1,2] 3 0.75
5 [2,2,1,2,2] 4 0.80
6 [2,2,1,2,2,1] 4 0.667
7 [2,2,1,2,2,1,0] 6 0.857
8 [2,2,1,2,2,1,0,1] 7 0.875
9 [2,2,1,2,2,1,0,1,0] 7 0.777
10 [2,2,1,2,2,1,0,1,0,1] 8 0.8
11 [2,2,1,2,2,1,0,1,0,1,2] 9 0.818
ADR = (1+1+0.667+0.75+0.8+0.667+0.857+0.875+0.777+0.8+0.818)/11 = 0.819

Cumulated gain

Let us denote position i in the gain vector by G[i]. Now the cumulated gain vector CG is defined recursively as [2]:
CG_definition
applied to example
CG' = [2, 4, 5, 7, 9, 10, 10, 11, 11, 12, 14, 14, 14, 14]

Discounted cumulated gain

The discounted cumulated gain vector DCG is defined recursively as [2]:
CG_definition
applied to example
DCG' = [2.0, 4.0, 4.63093, 5.63093, 6.492283, 6.8791356, 6.8791356, 7.212469, 7.212469, 7.5134993, 8.091629, 8.091629, 8.091629, 8.091629]

Normalized cumulated gain

The normalized cumulated gain vector NCG is obtained by dividing CG by the ideal cumulated gain vector ICG [2].
applied to example
ICG' = [2, 4, 6, 8, 10, 12, 13, 14, 15, 16, 17]
NCG' = [2/2, 4/4, 5/6, 7/8, 9/10, 10/12, 10/13, 11/13, 11/14, 12/15, 14/16, 14/17, 14/17, 14/17]
NCG' = [1.0, 1.0, 0.8333333, 0.875, 0.9, 0.8333333, 0.7692308, 0.78571427, 0.73333335, 0.75, 0.8235294, 0.8235294, 0.8235294, 0.8235294]

Normalized Discounted Cumulated Gain

The normalized discounted cumulated gain vector NDCG is obtained by dividing DCG by the ideal discounted cumulated gain vector IDCG [2].
applied to example
IDCG' = [2.0, 4.0, 5.2618594, 6.2618594, 7.123213, 7.8969183, 8.253125, 8.586458, 8.901923, 9.202953, 9.492018]
NCG' = [2.0/2.0, 4.0/4,0, 5.2618594/5.2618594, ...., 8.091629/9.492018, 8.091629/9.492018]
NDCG' = [1.0, 1.0, 0.8800938, 0.89924246, 0.9114262, 0.87111646, 0.83351886, 0.83998185, 0.8102147, 0.81642264 0.8524667, 0.8524667, 0.8524667, 0.8524667]

Average (discounted) cumulated gain versus % recall (graph only)

At each relevant item in the ranked list the recall percentage and the average (discounted) cumulated gain is plotted.

Evaluation measures of runfile

Mean Average Precision(highly relevant/relevant)

The mean of the Average Precision(highly relevant/relevant) scores over all 30 queries.

Mean First Tier(highly relevant/relevant)

The mean of the First Tier(highly relevant/relevant) scores over all 30 queries.

Mean Second Tier(highly relevant/relevant)

The mean of the Second Tier(highly relevant/relevant) scores over all 30 queries.

Mean Average Dynamic Recall

The mean of the Average Dynamic Recall scores over all 30 queries.

Mean (Normalized) Cumulated (Discounted) Gain @ x

The mean of (Normalized) Cumulated (Discounted) Gain @ x (absolute rank) scores over all 30 queries.

References

[1] Rainer Typke, Remco C. Veltkamp, Frans Wiering. Evaluating retrieval techniques based on partially ordered ground truth lists. In: Proceedings International Conference on Multimedia & Expo (ICME) 2006.
[2] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. In: ACM Transactions on Informations Systems, 20(4):422-446, 2002.