Evaluation

Cross validation

Recommendation.cross_validation (Function)
cross_validation(
    n_fold::Integer,
    metric::Type{<:RankingMetric},
    k::Integer,
    recommender_type::Type{<:Recommender},
    data::DataAccessor,
    recommender_args...
)

Conduct n_fold cross validation for a combination of recommender recommender_type and ranking metric metric. A recommender is initialized with recommender_args and runs top-k recommendation.

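A minimal usage sketch of the ranking-metric method, assuming a DataAccessor built from an implicit-feedback user-item matrix and the package's MostPopular recommender; exact constructors may vary between versions.

using Recommendation

# Hypothetical 8x6 user-item matrix; zeros denote unobserved entries.
data = DataAccessor(rand(0:1, 8, 6))

# 3-fold cross validation of top-2 recommendation with MostPopular,
# evaluated by Recall@2. Any extra arguments after data are passed
# through to the recommender constructor as recommender_args.
recall = cross_validation(3, Recall, 2, MostPopular, data)
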
cross_validation(
    n_fold::Integer,
    metric::Type{<:AccuracyMetric},
    recommender_type::Type{<:Recommender},
    data::DataAccessor,
    recommender_args...
)

Conduct n_fold cross validation for a combination of recommender recommender_type and accuracy metric metric. A recommender is initialized with recommender_args.

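Similarly, a sketch for the accuracy-metric method, reusing data from the example above and assuming UserMean is one of the package's baseline recommenders:

# 3-fold cross validation of rating prediction with the UserMean baseline,
# evaluated by RMSE.
rmse = cross_validation(3, RMSE, UserMean, data)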

Rating metric

Recommendation.RMSE (Type)
RMSE

Root Mean Squared Error.

measure(
    metric::RMSE,
    truth::AbstractVector,
    pred::AbstractVector
)

truth and pred are expected to be the same size.
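
With $n$ denoting the common length of the two vectors, the score follows the standard definition:

\[\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{j=1}^{n} \left( \mathrm{truth}_j - \mathrm{pred}_j \right)^2}.\]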

Recommendation.MAE (Type)
MAE

Mean Absolute Error.

measure(
    metric::MAE,
    truth::AbstractVector,
    pred::AbstractVector
)

truth and pred are expected to be the same size.
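
With $n$ denoting the common length of the two vectors, the score follows the standard definition:

\[\mathrm{MAE} = \frac{1}{n} \sum_{j=1}^{n} \left| \mathrm{truth}_j - \mathrm{pred}_j \right|.\]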

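A minimal call sketch for both rating metrics on hypothetical data, assuming each metric type is constructed without arguments:

truth = [3.0, 4.0, 5.0]
pred  = [2.5, 4.0, 4.5]

measure(RMSE(), truth, pred)  # sqrt((0.25 + 0.0 + 0.25) / 3) ≈ 0.408
measure(MAE(), truth, pred)   # (0.5 + 0.0 + 0.5) / 3 ≈ 0.333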

Ranking metric

Let $u \in \mathcal{U}$ be a target user, $\mathcal{I}$ the set of all items, $I_N(u) \subset \mathcal{I}$ the ordered set of top-$N$ recommended items, and $\mathcal{I}^+_u$ the set of truth items.

Recommendation.Recall (Type)
Recall

Recall-at-$N$ (Recall@$N$) indicates coverage of truth samples as a result of top-$N$ recommendation. The value is computed by the following equation:

\[\mathrm{Recall@}N = \frac{|\mathcal{I}^+_u \cap I_N(u)|}{|\mathcal{I}^+_u|}.\]

Here, $|\mathcal{I}^+_u \cap I_N(u)|$ is the number of true positives.

measure(
    metric::Recall,
    truth::Array{T},
    pred::Array{T},
    k::Integer
)
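
A worked call on hypothetical data, assuming truth holds the user's relevant item IDs, pred is a ranked recommendation list, and k is the cutoff $N$:

truth = [1, 3, 5]        # relevant items for the user
pred  = [5, 2, 3, 1, 4]  # ranked recommendation list
measure(Recall(), truth, pred, 2)  # top-2 = [5, 2], one hit: 1/3 ≈ 0.333
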
Recommendation.Precision (Type)
Precision

Precision-at-$N$ (Precision@$N$) evaluates correctness of a top-$N$ recommendation list $I_N(u)$ according to the portion of true positives in the list as:

\[\mathrm{Precision@}N = \frac{|\mathcal{I}^+_u \cap I_N(u)|}{|I_N(u)|}.\]

measure(
    metric::Precision,
    truth::Array{T},
    pred::Array{T},
    k::Integer
)
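
Using the same hypothetical truth and pred as in the Recall example:

measure(Precision(), truth, pred, 2)  # top-2 = [5, 2], one hit: 1/2 = 0.5
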
Recommendation.MAP (Type)
MAP

While the original Precision@$N$ provides a score for a fixed-length recommendation list $I_N(u)$, mean average precision (MAP) computes an average of the scores over all recommendation sizes from 1 to $|\mathcal{I}|$. MAP is formulated with an indicator function for $i_n$, the $n$-th item of $I(u)$, as:

\[\mathrm{MAP} = \frac{1}{|\mathcal{I}^+_u|} \sum_{n = 1}^{|\mathcal{I}|} \mathrm{Precision@}n \cdot \mathbb{1}_{\mathcal{I}^+_u}(i_n).\]

Note that MAP is not a simple mean of Precision@$1$, Precision@$2$, $\dots$, Precision@$|\mathcal{I}|$; higher-ranked true positives lead to a better MAP.

measure(
    metric::MAP,
    truth::Array{T},
    pred::Array{T}
)
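
Continuing the same hypothetical example, the true positives sit at ranks 1, 3, and 4:

measure(MAP(), truth, pred)  # (1/1 + 2/3 + 3/4) / 3 ≈ 0.806
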
Recommendation.AUC (Type)
AUC

ROC curve and area under the ROC curve (AUC) are generally used in the evaluation of classification problems, but these concepts can also be interpreted in the context of a ranking problem. Basically, the AUC metric for ranking considers all possible pairs of truth and non-truth items, denoted respectively by $i^+ \in \mathcal{I}^+_u$ and $i^- \in \mathcal{I}^-_u$, and it expects that the "best" recommender ranks every $i^+$ higher than every $i^-$.


AUC calculation keeps track of the number of true positives at different ranks in $\mathcal{I}$. Whenever a non-truth sample is encountered in the ranked list, the number of true positives ranked above it is added to the accumulated count of correct pairs. Ultimately, an AUC score is computed as the portion of correctly ordered $(i^+, i^-)$ pairs among all $|\mathcal{I}^+_u| \times |\mathcal{I}^-_u|$ possible combinations.
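
A minimal sketch of this counting scheme, not necessarily the package's exact implementation, with the same argument semantics assumed in the earlier examples:

# Count correctly ordered (i+, i-) pairs with a single pass over the ranked list.
function auc_sketch(truth, pred)
    tp = 0       # true positives seen so far, i.e., ranked above the current item
    correct = 0  # correctly ordered (truth, non-truth) pairs
    for item in pred
        if item in truth
            tp += 1
        else
            correct += tp
        end
    end
    correct / (tp * (length(pred) - tp))  # normalize by |I+| * |I-| pairs
end

auc_sketch([1, 3, 5], [5, 2, 3, 1, 4])  # 4 correct pairs out of 3 * 2 = 6 ≈ 0.667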

measure(
    metric::AUC,
    truth::Array{T},
    pred::Array{T}
)
Recommendation.ReciprocalRank (Type)
ReciprocalRank

If we are only interested in the first true positive, reciprocal rank (RR) could be a reasonable choice to quantitatively assess recommendation lists. For $n_{\mathrm{tp}} \in \left[ 1, |\mathcal{I}| \right]$, the position of the first true positive in $I(u)$, RR simply returns its inverse:

\[ \mathrm{RR} = \frac{1}{n_{\mathrm{tp}}}.\]

RR can be zero if and only if $\mathcal{I}^+_u$ is empty.

measure(
    metric::ReciprocalRank,
    truth::Array{T},
    pred::Array{T}
)
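
For instance, with the same hypothetical argument semantics as in the earlier examples:

measure(ReciprocalRank(), [1, 3, 5], [2, 4, 3, 1, 5])  # first hit at rank 3: 1/3
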
Recommendation.MPR (Type)
MPR

Mean percentile rank (MPR) is a ranking metric based on $r_{i} \in [0, 100]$, the percentile-ranking of an item $i$ within the sorted list of all items for a user $u$. It can be formulated as:

\[\mathrm{MPR} = \frac{1}{|\mathcal{I}^+_u|} \sum_{i \in \mathcal{I}^+_u} r_{i}.\]

$r_{i} = 0\%$ is the best value, meaning that the truth item $i$ is ranked at the highest position in the recommendation list; $r_{i} = 100\%$ is the worst case, in which the item $i$ sits at the lowest rank.

MPR internally considers not only the top-$N$ recommended items but also all of the non-recommended items, and it accumulates the percentile ranks of all true positives, unlike (mean) reciprocal rank, which looks only at the first hit. The measure is therefore suitable for estimating users' overall satisfaction with a recommender. Intuitively, $\mathrm{MPR} > 50\%$ should be worse than random ranking from the users' point of view.
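
For illustration only, assume the linear convention $r_i = \frac{\mathrm{rank}(i) - 1}{|\mathcal{I}| - 1} \times 100\%$; the package's exact percentile convention may differ. If $\mathcal{I}^+_u$ contains two items ranked 1st and 4th out of five, then

\[\mathrm{MPR} = \frac{1}{2} \left( 0\% + 75\% \right) = 37.5\%.\]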

measure(
    metric::MPR,
    truth::Array{T},
    pred::Array{T}
)
Recommendation.NDCG (Type)
NDCG

Like MPR, normalized discounted cumulative gain (NDCG) computes a score for $I(u)$ with emphasis on higher-ranked true positives. Beyond being a more rigorously formulated measure, NDCG differs from MPR in that it allows us to specify an expected ranking within $\mathcal{I}^+_u$; that is, the metric can incorporate $\mathrm{rel}_n$, a relevance score indicating how strongly the $n$-th sample is expected to be ranked at the top of a recommendation list, which directly corresponds to an expected ranking of the truth samples.
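
A common formulation, which the package's implementation may or may not follow exactly, discounts each relevance score by its rank and normalizes by the ideal ordering:

\[\mathrm{DCG@}N = \sum_{n=1}^{N} \frac{2^{\mathrm{rel}_n} - 1}{\log_2 (n + 1)}, \qquad \mathrm{NDCG@}N = \frac{\mathrm{DCG@}N}{\mathrm{IDCG@}N},\]

where $\mathrm{IDCG@}N$ is the DCG score of the best possible ordering of the truth samples.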

measure(
    metric::NDCG,
    truth::Array{T},
    pred::Array{T},
    k::Integer
)