Preparing Data
Onehot encoding
Recommendation.onehot — Function
onehot(value, value_set::AbstractVector) -> Vector{Float64}
Encode a categorical value to a onehot-encoded vector. The value must be an Integer or a String and must be one of the elements in value_set. missing or nothing is also accepted as a value, but it is converted into a zero vector.
onehot(vec::AbstractVector) -> Matrix{Float64}
Encode a categorical vector to a onehot-encoded matrix. For example, ["Male", "Female", "Others"] is converted into [1. 0. 0.; 0. 1. 0.; 0. 0. 1.]. Each possible value is assigned an index in the order of its first appearance in the input vector.
onehot(mat::AbstractMatrix) -> Matrix{Float64}
Each column of an input matrix represents a single categorical vector. Onehot-encode the individual columns and horizontally concatenate them as an output.
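The three behaviours described above can be illustrated with a minimal, self-contained sketch. The name sketch_onehot is hypothetical and this is not the package's implementation; it only mirrors the documented rules (zero vector for missing or nothing, indices assigned by first appearance, column-wise encoding followed by horizontal concatenation).

```julia
# Hypothetical re-implementation for illustration only; not Recommendation.onehot itself.

# Single value: a vector with 1.0 at the position of `value` in `value_set`.
function sketch_onehot(value, value_set::AbstractVector)
    encoded = zeros(Float64, length(value_set))
    # missing / nothing are accepted and map to a zero vector
    if value === missing || value === nothing
        return encoded
    end
    idx = findfirst(isequal(value), value_set)
    idx === nothing && throw(ArgumentError("value is not an element of value_set"))
    encoded[idx] = 1.0
    return encoded
end

# Vector: one row per element; `unique` keeps the order of first appearance.
function sketch_onehot(values::AbstractVector)
    value_set = unique(filter(v -> v !== missing && v !== nothing, values))
    rows = [sketch_onehot(v, value_set) for v in values]
    return permutedims(reduce(hcat, rows))
end

# Matrix: encode each column separately, then concatenate horizontally.
function sketch_onehot(mat::AbstractMatrix)
    return reduce(hcat, [sketch_onehot(collect(col)) for col in eachcol(mat)])
end
```

With this sketch, sketch_onehot(["Male", "Female", "Others"]) yields the 3-by-3 identity matrix from the example above, and a column with two distinct values contributes two output columns.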
Load public datasets
Recommendation.load_movielens_100k — Function
load_movielens_100k([path=nothing]) -> DataAccessor
path points to a locally saved MovieLens 100k. Read user-item-rating triples in the folder, and convert them into a DataAccessor instance.
Download and decompress the corresponding zip file if path is not given or the specified folder does not exist.
Recommendation.load_movielens_latest — Function
load_movielens_latest([path=nothing]) -> DataAccessor
path points to a locally saved MovieLens Latest (Small). Read user-item-rating triples in the folder, and convert them into a DataAccessor instance.
Download and decompress the corresponding zip file if path is not given or the specified folder does not exist.
Recommendation.load_amazon_review — Function
load_amazon_review([path=nothing; category="Electronics"]) -> DataAccessor
path points to a locally saved small subset of the Amazon Reviews Dataset for a particular category. Each row is a tuple of (item, user, rating, timestamp).
Recommendation.load_lastfm — Function
load_lastfm([path=nothing]) -> DataAccessor
path points to a locally saved HetRec 2011 Last.FM dataset. Each row is a tuple of (user, artist, # of listenings).
Test a recommender with cross_validation:
using Recommendation
data = load_movielens_100k()
recall = cross_validation(
    3,           # N-fold
    Recall,      # Metric
    5,           # Top-k
    MostPopular, # Recommender
    data         # Data Accessor
)
println(recall)
Generate synthetic data
Recommendation.SyntheticFeature — Type
SyntheticFeature(name::String, candidates::Union{UnitRange, AbstractVector})
Synthetic feature generator that allows us to sample a value from candidates.
features = SyntheticFeature[]
age = SyntheticFeature("Age", 1930:2010)
push!(features, age)
geo = SyntheticFeature("Geo", ["Arizona", "California", "Colorado", "Illinois", "Indiana", "Michigan", "New York", "Utah"])
push!(features, geo)
rand(geo, 3) # e.g., ["California", "New York", "Arizona"]
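To make the sampling idea concrete, here is a small stand-alone sketch. SketchFeature, draw, and sample_user are hypothetical names, and uniform sampling with replacement is an assumption rather than something the documentation above states.

```julia
# Hypothetical stand-in for SyntheticFeature; not the package's type.
struct SketchFeature
    name::String
    candidates::Union{UnitRange, AbstractVector}
end

# Draw n values from the candidates (assumed: uniformly, with replacement).
draw(feature::SketchFeature, n::Integer) = rand(feature.candidates, n)

# Describe one synthetic user as a dictionary of feature name => sampled value.
sample_user(features) = Dict(f.name => rand(f.candidates) for f in features)

sketch_features = [
    SketchFeature("Age", 1930:2010),
    SketchFeature("Geo", ["Arizona", "California", "New York"]),
]
user = sample_user(sketch_features)  # a Dict with one sampled "Age" and one sampled "Geo"
```

A dictionary of this shape is what a matching rule receives when it decides whether a sampled user fits a condition.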
Recommendation.SyntheticRule — Type
SyntheticRule(probability::Float64[, item::Union{Nothing, Integer}, match::Function])
Matching rule for cumulative "click through rate". match is a function that takes a dictionary of feature name => value. Given an item index, the probability of acceptance upon sampling is increased by probability when match returns true.
rules = SyntheticRule[]
push!(rules, SyntheticRule(0.001))
push!(rules, SyntheticRule(0.01, 3))
push!(rules, SyntheticRule(0.30, 1, s -> s["Age"] >= 1980 && s["Age"] <= 1989 && s["Geo"] == "New York"))
push!(rules, SyntheticRule(0.30, 2, s -> s["Age"] >= 1950 && s["Age"] <= 1959 && s["Geo"] == "New York"))
push!(rules, SyntheticRule(0.30, 2, s -> s["Age"] >= 1980 && s["Age"] <= 1989 && s["Geo"] == "Arizona"))
push!(rules, SyntheticRule(0.30, 1, s -> s["Age"] >= 1950 && s["Age"] <= 1959 && s["Geo"] == "Arizona"))
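One way to read "cumulative" here is that the probabilities of all rules matching a given item and sample are added up. The sketch below follows that reading. SketchRule and acceptance_probability are hypothetical names, and both the summation and the treatment of an omitted item as "applies to every item" are assumptions, not confirmed package behaviour.

```julia
# Hypothetical stand-in for SyntheticRule; not the package's type.
struct SketchRule
    probability::Float64
    item::Union{Nothing, Int}
    match::Function
end
# Assumed defaults: no item means "any item", no match function means "always matches".
SketchRule(probability) = SketchRule(probability, nothing, s -> true)
SketchRule(probability, item) = SketchRule(probability, item, s -> true)

# Sum the probabilities of every rule that applies to `item` and matches `sample`.
function acceptance_probability(rules, item::Integer, sample::AbstractDict)
    p = 0.0
    for rule in rules
        applies = rule.item === nothing || rule.item == item
        if applies && rule.match(sample)
            p += rule.probability
        end
    end
    return min(p, 1.0)  # keep the result a valid probability
end

sketch_rules = [
    SketchRule(0.001),
    SketchRule(0.01, 3),
    SketchRule(0.30, 1, s -> 1980 <= s["Age"] <= 1989 && s["Geo"] == "New York"),
]
sample = Dict("Age" => 1985, "Geo" => "New York")
p_item1 = acceptance_probability(sketch_rules, 1, sample)  # base rate plus the matching rule
p_item2 = acceptance_probability(sketch_rules, 2, sample)  # base rate only
```

Under these assumptions, the first rule acts as a base rate for every item, and item-specific rules raise the rate only for users whose attributes satisfy the condition.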
Recommendation.generate — Function
generate(n_samples::Integer, n_items::Integer, features::AbstractVector{SyntheticFeature}, rules::AbstractVector{SyntheticRule}) -> DataAccessor
Generate a synthetic data accessor from randomly sampled implicit feedback. Each sample is considered as a different user, and user attributes are represented by a numeric onehot-encoded feature vector based on the values returned by SyntheticFeature.
The process is based on Section 7.3 of the following paper:
- M. Aharon, et al. OFF-Set: One-pass Factorization of Feature Sets for Online Recommendation in Persistent Cold Start Settings. arXiv:1308.1792.
n_samples = 256
n_items = 5
data = generate(n_samples, n_items, features, rules)
Helper functions
Recommendation.get_data_home — Function
get_data_home([data_home=nothing]) -> String
Return an absolute path to a directory containing datasets. Create the directory if it does not exist. data_home=nothing defaults to either the environment variable JULIA_RECOMMENDATION_DATA or ~/julia_recommendation_data.
Reference: Similar function in scikit-learn.
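The resolution order described above can be sketched in a few lines of plain Julia. sketch_data_home is a hypothetical name, and details such as expanding a leading ~ are assumptions; the package's own function may differ.

```julia
# Hypothetical illustration of the documented fallback order; not the package's code.
function sketch_data_home(data_home::Union{AbstractString, Nothing}=nothing)
    if data_home === nothing
        # Prefer the environment variable, otherwise fall back to the home directory.
        default_dir = joinpath(homedir(), "julia_recommendation_data")
        data_home = get(ENV, "JULIA_RECOMMENDATION_DATA", default_dir)
    end
    data_home = abspath(expanduser(data_home))
    mkpath(data_home)  # create the directory (and parents) if it does not exist
    return data_home
end
```

Calling sketch_data_home(joinpath(mktempdir(), "datasets")) creates the directory and returns its absolute path, which makes the behaviour easy to try without touching the home directory.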
Recommendation.download_file — Function
download_file(url, path=nothing) -> path
Download a dataset from the URL to the path. Create folders if needed. path=nothing defaults to tempname() as a destination.
Recommendation.unzip — Function
unzip(path[, exdir=nothing]) -> exdir
Extract files in a zip file at path into a directory exdir. Extract into the same directory as the zip file if exdir=nothing.
Reference: https://github.com/fhs/ZipFile.jl/pull/16
Recommendation.load_libsvm_file — Function
load_libsvm_file(path::String; zero_based::Bool=false, n_features::Union{Integer, Nothing}=nothing) -> X, y
Read a sparse matrix dataset in libsvm format. Each row of X and the corresponding element of y form one sample that has already been transformed into a vector. Since there is no concept of user/item ID in the data, users may need to manually translate the resulting matrix/vector to a DataAccessor as:
n_users, n_items = ...
X, y = load_libsvm_file("/path/to/data.libsvm")
events = Event[]
for i in 1:length(y)
    user, item = ... # find user/item ID per row based on your definition
    push!(events, Event(user, item, y[i]))
end
data = DataAccessor(events, n_users, n_items)
# declare which columns in X represent user and item features, respectively
user_feature_indices = [...]
item_feature_indices = [...]
for i in 1:size(X, 1)
    user, item = ... # find user/item ID per row based on your definition
    set_user_attributes(data, user, X[i, user_feature_indices])
    set_item_attributes(data, item, X[i, item_feature_indices])
end
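For readers unfamiliar with the libsvm format, each line holds a label followed by index:value pairs for the non-zero features. The sketch below parses a single line. parse_libsvm_line is a hypothetical helper, not part of the package, and the example line is made up for illustration.

```julia
# Hypothetical helper for illustration; load_libsvm_file may be implemented differently.
# Parses "<label> <index>:<value> <index>:<value> ..." into (label, indices, values).
function parse_libsvm_line(line::AbstractString; zero_based::Bool=false)
    tokens = split(strip(line))
    label = parse(Float64, tokens[1])
    indices = Int[]
    values = Float64[]
    for token in tokens[2:end]
        idx, val = split(token, ':')
        # Julia arrays are 1-based, so shift indices when the file is 0-based.
        push!(indices, parse(Int, idx) + (zero_based ? 1 : 0))
        push!(values, parse(Float64, val))
    end
    return label, indices, values
end

label, indices, values = parse_libsvm_line("1 3:0.5 7:2")
```

In this reading, label plays the role of one element of y, while indices and values describe the non-zero entries of the matching row of X.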