2.1.0
User Documentation for Apache MADlib
k-Nearest Neighbors

K-nearest neighbors is a method for finding the \(k\) closest points to a given data point in terms of a given metric. Its input consists of data points as features from testing examples and it looks for \(k\) closest points in the training set for each of the data points in test set. The output of KNN depends on the type of task. For classification, the output is the majority vote of the classes of the \(k\) nearest data points. For regression, the output is the average of the values of \(k\) nearest neighbors of the given test point.

For unsupervised nearest neighbors, set the training set to match the test set so the nearest neighbor of each point is the point itself, with zero distance.

Both exact and approximate methods are supported. The approximate methods can be used in the case that run-time is too long using the exact method.

Usage
knn( point_source,
     point_column_name,
     point_id,
     label_column_name,
     test_source,
     test_column_name,
     test_id,
     output_table,
     k,
     output_neighbors,
     fn_dist,
     weighted_avg,
     algorithm,
     algorithm_params
   )

Arguments

point_source

TEXT. Name of the table containing the training data points. Training data points are expected to be stored row-wise in a column of type DOUBLE PRECISION[].

point_column_name

TEXT. Name of the column with training data points or expression that evaluates to a numeric array

point_id

TEXT. Name of the column in 'point_source’ containing source data ids. The ids are of type INTEGER with no duplicates. They do not need to be contiguous. This parameter must be used if the list of nearest neighbors are to be output, i.e., if the parameter 'output_neighbors' below is TRUE or if 'label_column_name' is NULL.

label_column_name

TEXT. Name of the column with labels/values of training data points. If this column is a Boolean, integer or text, it will run KNN classification, else if it is double precision values will run KNN regression. If you set this to NULL, it will only return the set of neighbors without actually doing classification or regression.

test_source

TEXT. Name of the table containing the test data points. Testing data points are expected to be stored row-wise in a column of type DOUBLE PRECISION[].

test_column_name

TEXT. Name of the column with testing data points or expression that evaluates to a numeric array

Note
For unsupervised nearest neighbors, make the test dataset the same as the source dataset, so the nearest neighbor of each point is the point itself, with a zero distance.
test_id

TEXT. Name of the column having ids of data points in test data table.

output_table

TEXT. Name of the table to store final results.

k (optional)

INTEGER. default: 1. Number of nearest neighbors to consider. For classification, should be an odd number to break ties, otherwise the result may depend on ordering of the input data.

output_neighbors (optional)

BOOLEAN default: TRUE. Outputs the list of k-nearest neighbors (and their respective distances to the target point) that were used in the voting/averaging, sorted from closest to furthest.

fn_dist (optional)

TEXT, default: 'squared_dist_norm2'. The name of the function used to calculate the distance between data points.

The following distance functions can be used:

Note
Always qualify the distance function with the schema name. For example, if you install MADlib in a schema called 'madlib' then the 'fn_dist' parameter would be 'madlib.dist_norm2', etc.
weighted_avg (optional)

BOOLEAN, default: FALSE. Calculates classification or regression values using a weighted average. The idea is to weigh the contribution of each of the k neighbors according to their distance to the test point, giving greater influence to closer neighbors. The distance function 'fn_dist' specified above is used. For classification, majority voting weighs a neighbor according to inverse distance. For regression, the inverse distance weighting approach is used from Shepard [4].

algorithm (optional)

TEXT, default: 'brute_force'. The name of the algorithm used to compute nearest neighbors. The following options are supported:

  • brute_force: Produces an exact result by searching all points in the search space. You can also use a short form "b" or "brute" etc. to select brute force.
  • kd_tree: Produces an approximate result by searching a subset of the search space, that is, only certain leaf nodes in the kd-tree as specified by "algorithm_params" below. You can also use a short form "k" or "kd" etc. to select kd-tree.

algorithm_params (optional)

TEXT, default: 'depth=3, leaf_nodes=2'. These parameters apply to the kd-tree algorithm only.

  • depth: Depth of the kd-tree. Increasing this value will decrease run-time but reduce the accuracy.
  • leaf_nodes: Number of leaf nodes (regions) to search for each test point. Inceasing this value will improve the accuracy but increase run-time.
Note
Please note that the kd-tree accuracy will be lower for datasets with a high number of features. It is advised to use at least two leaf nodes. Refer to the Technical Background for more information on how the kd-tree is implemented.

Output Format

The output of the KNN module is a table with the following columns:

id INTEGER. The ids of test data points.
test_column_name DOUBLE PRECISION[]. The test data points.
prediction INTEGER. Label in case of classification, average value in case of regression.
k_nearest_neighbours INTEGER[]. List of nearest neighbors, sorted closest to furthest from the corresponding test point.
distance DOUBLE PRECISION[]. List of distance to nearest neighbors, sorted closest to furthest from the corresponding test point.

Examples
  1. Prepare some training data for classification:
    DROP TABLE IF EXISTS knn_train_data;
    CREATE TABLE knn_train_data (
                        id integer,
                        data integer[],
                        label integer  -- Integer label means for classification
                        );
    INSERT INTO knn_train_data VALUES
    (1, '{1,1}', 1),
    (2, '{2,2}', 1),
    (3, '{3,3}', 1),
    (4, '{4,4}', 1),
    (5, '{4,5}', 1),
    (6, '{20,50}', 0),
    (7, '{10,31}', 0),
    (8, '{81,13}', 0),
    (9, '{1,111}', 0);
    
  2. Prepare some training data for regression:
    DROP TABLE IF EXISTS knn_train_data_reg;
    CREATE TABLE knn_train_data_reg (
                        id integer,
                        data integer[],
                        label float  -- Float label means for regression
                        );
    INSERT INTO knn_train_data_reg VALUES
    (1, '{1,1}', 1.0),
    (2, '{2,2}', 1.0),
    (3, '{3,3}', 1.0),
    (4, '{4,4}', 1.0),
    (5, '{4,5}', 1.0),
    (6, '{20,50}', 0.0),
    (7, '{10,31}', 0.0),
    (8, '{81,13}', 0.0),
    (9, '{1,111}', 0.0);
    
  3. Prepare some testing data:
    DROP TABLE IF EXISTS knn_test_data CASCADE;
    CREATE TABLE knn_test_data (
                        id integer,
                        data integer[]
                        );
    INSERT INTO knn_test_data VALUES
    (1, '{2,1}'),
    (2, '{2,6}'),
    (3, '{15,40}'),
    (4, '{12,1}'),
    (5, '{2,90}'),
    (6, '{50,45}');
    
  4. Run KNN for classification. Prepend the distance function parameter with the schema where MADlib is installed (in this example 'madlib.squared_dist_norm2'):
    DROP TABLE IF EXISTS knn_result_classification;
    SELECT * FROM madlib.knn(
                    'knn_train_data',      -- Table of training data
                    'data',                -- Col name of training data
                    'id',                  -- Col name of id in train data
                    'label',               -- Training labels
                    'knn_test_data',       -- Table of test data
                    'data',                -- Col name of test data
                    'id',                  -- Col name of id in test data
                    'knn_result_classification',  -- Output table
                     3,                    -- Number of nearest neighbors
                     True,                 -- True to list nearest-neighbors by id
                    'madlib.squared_dist_norm2' -- Distance function
                    );
    SELECT * from knn_result_classification ORDER BY id;
    
    Result:
     id |  data   | prediction | k_nearest_neighbours | distance
    ----+---------+------------+----------------------+---------------------
      1 | {2,1}   |          1 | {1,2,3}              | {1,1,5}
      2 | {2,6}   |          1 | {5,4,3}              | {5,8,10}
      3 | {15,40} |          0 | {7,6,5}              | {106,125,1346}
      4 | {12,1}  |          1 | {4,5,3}              | {73,80,85}
      5 | {2,90}  |          0 | {9,6,7}              | {442,1924,3545}
      6 | {50,45} |          0 | {6,7,8}              | {925,1796,1985}
    (6 rows)
    
    Note that the nearest neighbors are sorted from closest to furthest from the corresponding test point.
  5. Run KNN for regression:
    DROP TABLE IF EXISTS knn_result_regression;
    SELECT * FROM madlib.knn(
                    'knn_train_data_reg',  -- Table of training data
                    'data',                -- Col name of training data
                    'id',                  -- Col Name of id in train data
                    'label',               -- Training labels
                    'knn_test_data',       -- Table of test data
                    'data',                -- Col name of test data
                    'id',                  -- Col name of id in test data
                    'knn_result_regression',  -- Output table
                     3,                    -- Number of nearest neighbors
                     True,                 -- True to list nearest-neighbors by id
                    'madlib.dist_norm2'    -- Distance function
                    );
    SELECT * FROM knn_result_regression ORDER BY id;
    
    Result:
     id |  data   |    prediction     | k_nearest_neighbours |                 distance
    ----+---------+-------------------+----------------------+------------------------------------------------------
      1 | {2,1}   |                 1 | {1,2,3}              | {1,1,2.23606797749979}
      2 | {2,6}   |                 1 | {5,4,3}              | {2.23606797749979,2.82842712474619,3.16227766016838}
      3 | {15,40} | 0.333333333333333 | {7,6,5}              | {10.295630140987,11.1803398874989,36.6878726556883}
      4 | {12,1}  |                 1 | {4,5,3}              | {8.54400374531753,8.94427190999916,9.21954445729289}
      5 | {2,90}  |                 0 | {9,6,7}              | {21.0237960416286,43.8634243989226,59.5399025864168}
      6 | {50,45} |                 0 | {6,7,8}              | {30.4138126514911,42.3792402008342,44.5533388198909}
    (6 rows)
    
  6. List nearest neighbors only, without doing classification or regression:
    DROP TABLE IF EXISTS knn_result_list_neighbors;
    SELECT * FROM madlib.knn(
                    'knn_train_data_reg',  -- Table of training data
                    'data',                -- Col name of training data
                    'id',                  -- Col Name of id in train data
                     NULL,                 -- NULL training labels means just list neighbors
                    'knn_test_data',       -- Table of test data
                    'data',                -- Col name of test data
                    'id',                  -- Col name of id in test data
                    'knn_result_list_neighbors', -- Output table
                     3                     -- Number of nearest neighbors
                    );
    SELECT * FROM knn_result_list_neighbors ORDER BY id;
    
    Result, with neighbors sorted from closest to furthest:
     id |  data   | k_nearest_neighbours | distance
    ----+---------+----------------------+---------------------
      1 | {2,1}   | {1,2,3}              | {1,1,5}
      2 | {2,6}   | {5,4,3}              | {5,8,10}
      3 | {15,40} | {7,6,5}              | {106,125,1346}
      4 | {12,1}  | {4,5,3}              | {73,80,85}
      5 | {2,90}  | {9,6,7}              | {442,1924,3545}
      6 | {50,45} | {6,7,8}              | {925,1796,1985}
    (6 rows)
    
  7. Run KNN for classification using the weighted average:
    DROP TABLE IF EXISTS knn_result_classification;
    SELECT * FROM madlib.knn(
                    'knn_train_data',      -- Table of training data
                    'data',                -- Col name of training data
                    'id',                  -- Col name of id in train data
                    'label',               -- Training labels
                    'knn_test_data',       -- Table of test data
                    'data',                -- Col name of test data
                    'id',                  -- Col name of id in test data
                    'knn_result_classification',  -- Output table
                     3,                    -- Number of nearest neighbors
                     True,                 -- True to list nearest-neighbors by id
                    'madlib.squared_dist_norm2', -- Distance function
                     True                 -- For weighted average
                    );
    SELECT * FROM knn_result_classification ORDER BY id;
    
     id |  data   | prediction | k_nearest_neighbours | distance
    ----+---------+------------+----------------------+---------------------
      1 | {2,1}   |          1 | {1,2,3}              | {1,1,5}
      2 | {2,6}   |          1 | {5,4,3}              | {5,8,10}
      3 | {15,40} |          0 | {7,6,5}              | {106,125,1346}
      4 | {12,1}  |          1 | {4,5,3}              | {73,80,85}
      5 | {2,90}  |          0 | {9,6,7}              | {442,1924,3545}
      6 | {50,45} |          0 | {6,7,8}              | {925,1796,1985}
    (6 rows)
    
  8. Use kd-tree option. First we build a kd-tree to depth 4 and search half (8) of the 16 leaf nodes (i.e., 2^4 total leaf nodes):
    DROP TABLE IF EXISTS knn_result_classification_kd;
    SELECT madlib.knn(
                    'knn_train_data',        -- Table of training data
                    'data',                  -- Col name of training data
                    'id',                    -- Col name of id in train data
                     NULL,                   -- Training labels
                    'knn_test_data',         -- Table of test data
                    'data',                  -- Col name of test data
                    'id',                    -- Col name of id in test data
                    'knn_result_classification_kd',  -- Output table
                     3,                      -- Number of nearest neighbors
                     True,                   -- True to list nearest-neighbors by id
                    'madlib.squared_dist_norm2', -- Distance function
                     False,                  -- For weighted average
                    'kd_tree',               -- Use kd-tree
                    'depth=4, leaf_nodes=8'  -- Kd-tree options
                     );
    SELECT * FROM knn_result_classification_kd ORDER BY id;
    
     id |  data   | k_nearest_neighbours | distance
    ----+---------+----------------------+---------------------
      1 | {2,1}   | {1,2,3}              | {1,1,5}
      2 | {2,6}   | {5,4,3}              | {5,8,10}
      3 | {15,40} | {7,6,5}              | {106,125,1346}
      4 | {12,1}  | {4,5,3}              | {73,80,85}
      5 | {2,90}  | {9,6,7}              | {442,1924,3545}
      6 | {50,45} | {6,7,8}              | {925,1796,1985}
    (6 rows)
    
    The result above is the same as brute force. If we search just 1 leaf node, run-time will be faster but accuracy will be lower. This shows up in this very small data set by not being able to find 3 nearest neighbors for all test points:
    DROP TABLE IF EXISTS knn_result_classification_kd;
    SELECT madlib.knn(
                    'knn_train_data',        -- Table of training data
                    'data',                  -- Col name of training data
                    'id',                    -- Col name of id in train data
                     NULL,                   -- Training labels
                    'knn_test_data',         -- Table of test data
                    'data',                  -- Col name of test data
                    'id',                    -- Col name of id in test data
                    'knn_result_classification_kd',  -- Output table
                     3,                      -- Number of nearest neighbors
                     True,                   -- True to list nearest-neighbors by id
                    'madlib.squared_dist_norm2', -- Distance function
                     False,                  -- For weighted average
                    'kd_tree',               -- Use kd-tree
                    'depth=4, leaf_nodes=1'  -- Kd-tree options
                     );
    SELECT * FROM knn_result_classification_kd ORDER BY id;
    
     id |  data   | k_nearest_neighbours | distance
    ----+---------+----------------------+---------------------
      1 | {2,1}   | {1}                  | {1}
      2 | {2,6}   | {3,2}                | {10,16}
      3 | {15,40} | {7}                  | {106}
      5 | {2,90}  | {3,2}                | {7570,7744}
      6 | {50,45} | {6,8}                | {925,1985}
    (5 rows)
    
  9. Unsupervised nearest neighbors. Here the training set matches the test set so the nearest neighbor of each point is the point itself, with a zero distance:
    DROP TABLE IF EXISTS knn_result_classification_unsup;
    SELECT * FROM madlib.knn(
                    'knn_train_data',      -- Table of training data
                    'data',                -- Col name of training data
                    'id',                  -- Col name of id in train data
                     NULL,                 -- NULL training labels means just list neighbors
                    'knn_train_data',      -- Table of test data (same as training data)
                    'data',                -- Col name of test data
                    'id',                  -- Col name of id in test data
                    'knn_result_classification_unsup',  -- Output table
                     3,                    -- Number of nearest neighbors
                     True,                 -- True to list nearest-neighbors by id
                    'madlib.squared_dist_norm2' -- Distance function
                    );
    SELECT * from knn_result_classification_unsup ORDER BY id;
    
    Result, with point and neighbors sorted from closest to furthest:
     id |  data   | k_nearest_neighbours |   distance
    ----+---------+----------------------+---------------
      1 | {1,1}   | {1,2,3}              | {0,2,8}
      2 | {2,2}   | {2,3,1}              | {0,2,2}
      3 | {3,3}   | {3,2,4}              | {0,2,2}
      4 | {4,4}   | {4,5,3}              | {0,1,2}
      5 | {4,5}   | {5,4,3}              | {0,1,5}
      6 | {20,50} | {6,7,5}              | {0,461,2281}
      7 | {10,31} | {7,6,5}              | {0,461,712}
      8 | {81,13} | {8,6,7}              | {0,5090,5365}
      9 | {1,111} | {9,6,7}              | {0,4082,6481}
    (9 rows)
    
  10. User defined distance function. There are several built-in distance functions, but you can create your own using a UDF if desired. For example, to create a Chebyshev distance function [6], first create the function:
    CREATE OR REPLACE FUNCTION chebychev_distance (x double precision[], y double precision[])
      RETURNS double precision
    AS $$
        from scipy.spatial import distance
        return distance.chebyshev(x, y)
    $$ LANGUAGE plpython3u;
    
    Then pass the function as an argument:
    DROP TABLE IF EXISTS knn_result_classification_udf;
    SELECT * FROM madlib.knn(
                    'knn_train_data',      -- Table of training data
                    'data',                -- Col name of training data
                    'id',                  -- Col name of id in train data
                    'label',               -- Training labels
                    'knn_test_data',       -- Table of test data
                    'data',                -- Col name of test data
                    'id',                  -- Col name of id in test data
                    'knn_result_classification_udf',  -- Output table
                     3,                    -- Number of nearest neighbors
                     True,                 -- True to list nearest-neighbors by id
                    'chebychev_distance'   -- Distance function
                    );
    SELECT * from knn_result_classification_udf ORDER BY id;
    
    Result, with point and neighbors sorted from closest to furthest:
     id |  data   | prediction | k_nearest_neighbours |  distance
    ----+---------+------------+----------------------+------------
      1 | {2,1}   |          1 | {1,2,3}              | {1,1,2}
      2 | {2,6}   |          1 | {5,4,3}              | {2,2,3}
      3 | {15,40} |          0 | {7,6,5}              | {9,10,35}
      4 | {12,1}  |          1 | {5,4,3}              | {8,8,9}
      5 | {2,90}  |          0 | {9,6,7}              | {21,40,59}
      6 | {50,45} |          0 | {6,8,7}              | {30,32,40}
    (6 rows)
    

Technical Background

The training data points are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training points.

In the prediction phase, \(k\) is a user-defined constant, and an unlabeled vector (a test point) is predicted by using the label from the the \(k\) training samples nearest to that test point.

Since distances between points are used to find the nearest neighbors, the data should be standardized across features. This ensures that all features are given equal weightage in the distance computation.

An approximation method can be used to speed the prediction phase by building appropriate data structures in the training phase. An example of such a data structure is kd-trees [5]. Using the kd-tree algorithm can improve the execution time of the \(k\)-NN operation, but at expense of sacrificing some accuracy. The kd-tree implementation divides the training dataset into multiple regions that correspond to the leaf nodes of a tree. For example, a tree of depth \(3\) will have a total of \(2^3 = 8\) regions. The algorithm will look for the nearest neighbors in a subset of all regions instead of searching the whole dataset. For a given test point, the first (home) region is found by traversing the tree and finding its associated node. If the user requests additional leaf nodes to be searched, we look at the distance between the point and the centroids of other regions and expand the search to the specified number of closest regions.

It's important to note that the nodes that each level of the kd-tree search over a single feature and the features are explored in the same order as that in the data.

The kd-tree accuracy might suffer on datasets with a high number of features (dimensions). For example, let's say we are using a dataset with 20 features and kd-tree depth of only 3. This means the kd-tree is constructed based on the first 3 features. Therefore, it is possible to miss nearest neighbors that are closer in those 17 dimensions because they got assigned to a further region (the distance computation would still uses all 20 features).

Literature

[1] Wikipedia, k-nearest neighbors algorithm, https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm

[2] N. S. Altman: An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression http://www.stat.washington.edu/courses/stat527/s13/readings/Altman_AmStat_1992.pdf

[3] Gongde Guo1, Hui Wang, David Bell, Yaxin Bi, Kieran Greer: KNN Model-Based Approach in Classification, https://ai2-s2-pdfs.s3.amazonaws.com/a7e2/814ec5db800d2f8c4313fd436e9cf8273821.pdf

[4] Shepard, Donald (1968). "A two-dimensional interpolation function for irregularly-spaced data". Proceedings of the 1968 ACM National Conference. pp. 517–524.

[5] Bentley, J. L. (1975). "Multidimensional binary search trees used for associative searching". Communications of the ACM. 18 (9): 509. doi:10.1145/361002.361007.

[6] https://en.wikipedia.org/wiki/Chebyshev_distance