Contents

This module implements "factor model" for representing an incomplete matrix using a low-rank approximation [1]. Mathematically, this model seeks to find matrices U and V (also referred as factors) that, for any given incomplete matrix A, minimizes:

$\|\boldsymbol A - \boldsymbol UV^{T} \|_2$

subject to $rank(\boldsymbol UV^{T}) \leq r$ , where $\|\cdot\|_2$ denotes the Frobenius norm. Let $A$ be a $m \times n$ matrix, then $U$ will be $m \times r$ and $V$ will be $n \times r$ , in dimension, and $1 \leq r \ll \min(m, n)$ . This model is not intended to do the full decomposition, or to be used as part of inverse procedure. This model has been widely used in recommendation systems (e.g., Netflix [2]) and feature selection (e.g., image processing [3]).

Function Syntax

Low-rank matrix factorization of an incomplete matrix into two factors.

lmf_igd_run( rel_output,
             rel_source,
             col_row,
             col_column,
             col_value,
             row_dim,
             column_dim,
             max_rank,
             stepsize,
             scale_factor,
             num_iterations,
             tolerance
           )

Arguments

rel_output

TEXT. The name of the table to receive the output.

Output factors matrix U and V are in a flattened format.

RESULT AS (
        matrix_u    DOUBLE PRECISION[],
        matrix_v    DOUBLE PRECISION[],
        rmse        DOUBLE PRECISION
);

Features correspond to row i is matrix_u[i:i][1:r]. Features correspond to column j is matrix_v[j:j][1:r].

rel_source

TEXT. The name of the table containing the input data.

The input matrix> is expected to be of the following form:

{TABLE|VIEW} input_table (
    row    INTEGER,
    col    INTEGER,
    value  DOUBLE PRECISION
)

Input is contained in a table that describes an incomplete matrix, with available entries specified as (row, column, value). The input matrix should be 1-based, which means row >= 1, and col >= 1. NULL values are not expected.

col_row

TEXT. The name of the column containing the row number.

col_column

TEXT. The name of the column containing the column number.

col_value

DOUBLE PRECISION. The value at (row, col).

row_dim (optional)

INTEGER, default: "SELECT max(col_row) FROM rel_source". The number of columns in the matrix.

column_dim (optional)

INTEGER, default: "SELECT max(col_col) FROM rel_source". The number of rows in the matrix.

max_rank

INTEGER, default: 20. The rank of desired approximation.

stepsize (optional)

DOUBLE PRECISION, default: 0.01. Hyper-parameter that decides how aggressive the gradient steps are.

scale_factor (optional)

DOUBLE PRECISION, default: 0.1. Hyper-parameter that decides scale of initial factors.

num_iterations (optional)

INTEGER, default: 10. Maximum number if iterations to perform regardless of convergence.

tolerance (optional)

DOUBLE PRECISION, default: 0.0001. Acceptable level of error in convergence.

Examples

Prepare an input table/view:

DROP TABLE IF EXISTS lmf_data;
CREATE TABLE lmf_data (
 row INT,
 col INT,
 val FLOAT8
);

Populate the input table with some data.

INSERT INTO lmf_data VALUES (1, 1, 5.0);
INSERT INTO lmf_data VALUES (3, 100, 1.0);
INSERT INTO lmf_data VALUES (999, 10000, 2.0);

Call the lmf_igd_run() stored procedure.

DROP TABLE IF EXISTS lmf_model;
SELECT madlib.lmf_igd_run( 'lmf_model',
                           'lmf_data',
                           'row',
                           'col',
                           'val',
                           999,
                           10000,
                           3,
                           0.1,
                           2,
                           10,
                           1e-9
                         );

Example result (the exact result may not be the same).

NOTICE:
Finished low-rank matrix factorization using incremental gradient
DETAIL:
   table : lmf_data (row, col, val)
Results:
   RMSE = 0.0145966345300041
Output:
   view : SELECT * FROM lmf_model WHERE id = 1
 lmf_igd_run
 -----------
           1
 (1 row)

Sanity check of the result. You may need a model id returned and also indicated by the function lmf_igd_run(), assuming 1 here:

SELECT array_dims(matrix_u) AS u_dims, array_dims(matrix_v) AS v_dims
FROM lmf_model
WHERE id = 1;

Result:

     u_dims    |     v_dims
 --------------+----------------
  [1:999][1:3] | [1:10000][1:3]
 (1 row)

Query the result value.

SELECT matrix_u[2:2][1:3] AS row_2_features
FROM lmf_model
WHERE id = 1;

Example output (the exact result may not be the same):

                       row_2_features
 ---------------------------------------------------------
  {{1.12030523084104,0.522217971272767,0.0264869043603539}}
 (1 row)

Make prediction of a missing entry (row=2, col=7654).

SELECT madlib.array_dot(
    matrix_u[2:2][1:3],
    matrix_v[7654:7654][1:3]
    ) AS row_2_col_7654
FROM lmf_model
WHERE id = 1;

Example output (the exact result may not be the same due the randomness of the algorithm):

   row_2_col_7654
 ------------------
  1.3201582940851
 (1 row)

Literature

[1] N. Srebro and T. Jaakkola. “Weighted Low-Rank Approximations.” In: ICML. Ed. by T. Fawcett and N. Mishra. AAAI Press, 2003, pp. 720–727. isbn: 1-57735-189-4.

[2] Simon Funk, Netflix Update: Try This at Home, December 11 2006, http://sifter.org/~simon/journal/20061211.html

[3] J. Wright, A. Ganesh, S. Rao, Y. Peng, and Y. Ma. “Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization.” In: NIPS. Ed. by Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta. Curran Associates, Inc., 2009, pp. 2080–2088. isbn: 9781615679119.