Preparing Data

Onehot encoding

Recommendation.onehot — Function

onehot(value, value_set::AbstractVector) -> Vector{Float64}

Encode a categorical value as a onehot-encoded vector. value must be an Integer or String that is one of the elements of value_set. missing and nothing are also accepted as value; both are converted into a zero vector.
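
The behavior described above can be illustrated with a minimal plain-Julia sketch. onehot_sketch is a hypothetical helper written for this page, not the package's implementation, and raising an ArgumentError for a value outside value_set is an assumption:

```julia
# Hypothetical helper for illustration; not the package's implementation.
function onehot_sketch(value, value_set::AbstractVector)
    vec = zeros(Float64, length(value_set))
    # missing / nothing map to the all-zero vector
    if value === missing || value === nothing
        return vec
    end
    idx = findfirst(isequal(value), value_set)
    # Assumption: an unknown value is treated as an error
    idx === nothing && throw(ArgumentError("value $(repr(value)) is not in value_set"))
    vec[idx] = 1.0
    return vec
end

onehot_sketch("Female", ["Male", "Female", "Others"])  # [0.0, 1.0, 0.0]
onehot_sketch(missing, ["Male", "Female", "Others"])   # [0.0, 0.0, 0.0]
```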

onehot(vec::AbstractVector) -> Matrix{Float64}

Encode a categorical vector as a onehot-encoded matrix. For example, ["Male", "Female", "Others"] is converted into [1. 0. 0.; 0. 1. 0.; 0. 0. 1.]. Indices are assigned to the distinct values in order of their first appearance in the input vector.

onehot(mat::AbstractMatrix) -> Matrix{Float64}

Each column of an input matrix represents a single categorical vector. Onehot-encode the individual columns and horizontally concatenate them as an output.
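
A rough plain-Julia sketch of the vector and matrix variants follows. Both function names are hypothetical, and treating missing or nothing entries as all-zero rows is an assumption carried over from the scalar variant rather than documented behavior:

```julia
# Hypothetical helpers for illustration; not the package's implementation.
function onehot_vec_sketch(vec::AbstractVector)
    # Distinct values in order of first appearance (missing/nothing excluded)
    value_set = unique(filter(v -> v !== missing && v !== nothing, vec))
    out = zeros(Float64, length(vec), length(value_set))
    for (row, v) in enumerate(vec)
        col = findfirst(isequal(v), value_set)
        # Assumption: missing/nothing entries stay as all-zero rows
        col === nothing || (out[row, col] = 1.0)
    end
    return out
end

function onehot_mat_sketch(mat::AbstractMatrix)
    # Encode every column independently, then concatenate horizontally
    return reduce(hcat, [onehot_vec_sketch(mat[:, j]) for j in 1:size(mat, 2)])
end

onehot_vec_sketch(["Male", "Female", "Others"])        # 3x3 identity matrix
onehot_mat_sketch(["Male" 10; "Female" 20; "Male" 10]) # 3x4 matrix
```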

Load public datasets

Recommendation.load_movielens_100k — Function

load_movielens_100k([path=nothing]) -> DataAccessor

path points to a locally saved MovieLens 100k. Read user-item-rating triples in the folder, and convert them into a DataAccessor instance.

If path is not given or the specified folder does not exist, download and decompress the corresponding zip file.

Recommendation.load_movielens_latest — Function

load_movielens_latest([path=nothing]) -> DataAccessor

path points to a locally saved MovieLens Latest (Small). Read user-item-rating triples in the folder, and convert them into a DataAccessor instance.

If path is not given or the specified folder does not exist, download and decompress the corresponding zip file.

Recommendation.load_amazon_review — Function

load_amazon_review([path=nothing; category="Electronics"]) -> DataAccessor

path points to a locally saved small set of Amazon Reviews Dataset for a particular category. Each row has a tuple of (item, user, rating, timestamp).

Recommendation.load_lastfm — Function

load_lastfm([path=nothing]) -> DataAccessor

path points to a locally saved HetRec 2011 Last.FM dataset. Each row has a tuple of (user, artist, number of listens).

Test a recommender with cross_validation:

using Recommendation

data = load_movielens_100k()
recall = cross_validation(
    3,            # N-fold
    Recall,       # Metric
    5,            # Top-k
    MostPopular,  # Recommender
    data          # Data Accessor
)
println(recall)

Generate synthetic data

Recommendation.SyntheticFeature — Type

SyntheticFeature(name::String, candidates::Union{UnitRange, AbstractVector})

Synthetic feature generator that allows us to sample a value from candidates.

features = SyntheticFeature[]

age = SyntheticFeature("Age", 1930:2010)
push!(features, age)

geo = SyntheticFeature("Geo", ["Arizona", "California", "Colorado", "Illinois", "Indiana", "Michigan", "New York", "Utah"])
push!(features, geo)

rand(geo, 3) # e.g., ["California", "New York", "Arizona"]

Recommendation.SyntheticRule — Type

SyntheticRule(probability::Float64[, item::Union{Nothing, Integer}, match::Function])

Matching rule for a cumulative "click-through rate". For a given item index, the probability of acceptance at sampling time is increased by probability whenever match returns true. match takes a dictionary of feature name => value.

rules = SyntheticRule[]

push!(rules, SyntheticRule(0.001))
push!(rules, SyntheticRule(0.01, 3))
push!(rules, SyntheticRule(0.30, 1, s -> s["Age"] >= 1980 && s["Age"] <= 1989 && s["Geo"] == "New York"))
push!(rules, SyntheticRule(0.30, 2, s -> s["Age"] >= 1950 && s["Age"] <= 1959 && s["Geo"] == "New York"))
push!(rules, SyntheticRule(0.30, 2, s -> s["Age"] >= 1980 && s["Age"] <= 1989 && s["Geo"] == "Arizona"))
push!(rules, SyntheticRule(0.30, 1, s -> s["Age"] >= 1950 && s["Age"] <= 1959 && s["Geo"] == "Arizona"))
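
To make the cumulative behavior concrete, here is a self-contained sketch of how matching rules could add up to an acceptance probability. RuleSketch and acceptance_probability are hypothetical names, and the logic is an assumption based on the description above: a rule without an item is taken to apply to every item, a rule without a match function is taken to always match, and the total is clamped to 1.0. The package's actual implementation may differ.

```julia
# Hypothetical stand-in for SyntheticRule; field names are assumptions.
struct RuleSketch
    probability::Float64
    item::Union{Nothing, Int}
    match::Function
end
RuleSketch(p) = RuleSketch(p, nothing, s -> true)        # assumed: applies to all items
RuleSketch(p, item) = RuleSketch(p, item, s -> true)     # assumed: always matches

# Sum the probabilities of every rule that applies to `item` and matches `sample`.
function acceptance_probability(rules, item::Integer, sample::AbstractDict)
    p = 0.0
    for rule in rules
        applies = rule.item === nothing || rule.item == item
        if applies && rule.match(sample)
            p += rule.probability
        end
    end
    return min(p, 1.0)  # assumption: clamp so the result stays a valid probability
end

sketch_rules = [
    RuleSketch(0.001),
    RuleSketch(0.01, 3),
    RuleSketch(0.30, 1, s -> 1980 <= s["Age"] <= 1989 && s["Geo"] == "New York"),
]
sample = Dict("Age" => 1985, "Geo" => "New York")

acceptance_probability(sketch_rules, 1, sample)  # ≈ 0.301 (base rate + matching rule)
acceptance_probability(sketch_rules, 3, sample)  # ≈ 0.011 (base rate + item-level rate)
acceptance_probability(sketch_rules, 2, sample)  # ≈ 0.001 (base rate only)
```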

Recommendation.generate — Function

generate(n_samples::Integer, n_items::Integer, features::AbstractVector{SyntheticFeature}, rules::AbstractVector{SyntheticRule}) -> DataAccessor

Generate a synthetic data accessor from randomly sampled implicit feedback. Each sample is considered as a different user, and user attributes are represented by a numeric onehot-encoded feature vector based on the values returned by SyntheticFeature.

The process is based on Section 7.3 of the following paper:

M. Aharon, et al. OFF-Set: One-pass Factorization of Feature Sets for Online Recommendation in Persistent Cold Start Settings. arXiv:1308.1792.

n_samples = 256
n_items = 5
data = generate(n_samples, n_items, features, rules)

Helper functions

Recommendation.get_data_home — Function

get_data_home([data_home=nothing]) -> String

Return an absolute path to the directory that contains datasets, creating the directory if it does not exist. With data_home=nothing, the path defaults to the value of the environment variable JULIA_RECOMMENDATION_DATA if it is set, or to ~/julia_recommendation_data otherwise.
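
The lookup order described above can be sketched in plain Julia as follows. data_home_sketch is a hypothetical name, and the precedence (explicit argument, then environment variable, then home directory) is an assumption drawn from the description, not the package's source:

```julia
# Hypothetical helper for illustration; not the package's implementation.
function data_home_sketch(data_home::Union{Nothing, AbstractString}=nothing)
    if data_home === nothing
        # Fall back to the environment variable, then to a folder under the home directory
        data_home = get(ENV, "JULIA_RECOMMENDATION_DATA",
                        joinpath(homedir(), "julia_recommendation_data"))
    end
    path = abspath(expanduser(data_home))
    mkpath(path)  # no-op if the directory already exists
    return path
end

# Use a temporary location here so the example does not touch the home directory
demo_dir = data_home_sketch(joinpath(mktempdir(), "datasets"))
isdir(demo_dir)  # true
```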

Recommendation.download_file — Function

download_file(url, path=nothing) -> path

Download a dataset from the URL to the path. Create folders if needed. path=nothing defaults to tempname() as a destination.

Recommendation.unzip — Function

unzip(path[, exdir=nothing]) -> exdir

Extract files in a zip file at path into a directory exdir. Extract into the same directory as the zip file if exdir=nothing.

Reference: https://github.com/fhs/ZipFile.jl/pull/16

Recommendation.load_libsvm_file — Function

load_libsvm_file(path::String; zero_based::Bool=false, n_features::Union{Integer, Nothing}=nothing) -> X, y

Read a sparse matrix dataset in libsvm format. Each row of X, together with the corresponding element of y, represents a single sample that has already been transformed into a feature vector. Since the data has no concept of user/item IDs, you may need to translate the resulting matrix and vector into a DataAccessor manually:

n_users, n_items = ...

X, y = load_libsvm_file("/path/to/data.libsvm")

events = Event[]
for i in 1:length(y)
    user, item = ... # find user/item ID per row based on your definition
    push!(events, Event(user, item, y[i]))
end

data = DataAccessor(events, n_users, n_items)

# declare which columns in X represent user and item features, respectively
user_feature_indices = [...]
item_feature_indices = [...]

for i in 1:size(X, 1)
    user, item = ... # find user/item ID per row based on your definition
    set_user_attributes(data, user, X[i, user_feature_indices])
    set_item_attributes(data, item, X[i, item_feature_indices])
end
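
For readers unfamiliar with the libsvm format, each line has the form `label index:value index:value ...`. The following self-contained sketch parses such lines into a sparse matrix and a label vector. parse_libsvm_sketch is a hypothetical function that reads from a vector of strings rather than a file; its keyword arguments mirror the signature above, but the parsing details and the Float64 label type are assumptions, not the package's implementation.

```julia
using SparseArrays

# Hypothetical parser for illustration; not the package's implementation.
function parse_libsvm_sketch(lines; zero_based::Bool=false,
                             n_features::Union{Integer, Nothing}=nothing)
    rows, cols, vals = Int[], Int[], Float64[]
    y = Float64[]
    for line in lines
        tokens = split(strip(line))
        isempty(tokens) && continue          # skip blank lines
        push!(y, parse(Float64, tokens[1]))  # first token is the label
        for token in tokens[2:end]
            idx, val = split(token, ':')
            push!(rows, length(y))
            # Shift zero-based feature indices to Julia's one-based columns
            push!(cols, parse(Int, idx) + (zero_based ? 1 : 0))
            push!(vals, parse(Float64, val))
        end
    end
    n_cols = n_features === nothing ? maximum(cols; init=0) : n_features
    return sparse(rows, cols, vals, length(y), n_cols), y
end

X, y = parse_libsvm_sketch(["1 1:0.5 3:2.0", "0 2:1.0"])
size(X)  # (2, 3)
```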
