Clustering refers to the problem of partitioning a set of objects according to some problem-dependent measure of similarity. In the k-means variant, given \( n \) points \( x_1, \dots, x_n \in \mathbb R^d \), the goal is to position \( k \) centroids \( c_1, \dots, c_k \in \mathbb R^d \) so that the sum of distances between each point and its closest centroid is minimized. Each centroid represents a cluster that consists of all points to which this centroid is closest.
The k-means algorithm can be invoked in four ways, depending on the source of the initial set of centroids:
kmeans_random( rel_source, expr_point, k, fn_dist, agg_centroid, max_num_iterations, min_frac_reassigned )
kmeanspp( rel_source, expr_point, k, fn_dist, agg_centroid, max_num_iterations, min_frac_reassigned, seeding_sample_ratio )
kmeans( rel_source, expr_point, rel_initial_centroids, expr_centroid, fn_dist, agg_centroid, max_num_iterations, min_frac_reassigned )
kmeans( rel_source, expr_point, initial_centroids, fn_dist, agg_centroid, max_num_iterations, min_frac_reassigned )

Arguments
rel_source
TEXT. The name of the table containing the input data points. Data points and predefined centroids (if used) are expected to be stored row-wise, in a column of type SVEC (or any type convertible to SVEC, such as FLOAT[] or INTEGER[]). Data points with non-finite values (NULL, NaN, infinity) in any component are skipped during the analysis.

expr_point
TEXT. The name of the column with point coordinates.

k
INTEGER. The number of centroids to calculate.

fn_dist
TEXT, default: 'squared_dist_norm2'. The name of the function used to calculate the distance from a data point to a centroid. Any function with the signature

DOUBLE PRECISION[] x, DOUBLE PRECISION[] y -> DOUBLE PRECISION

can be used; the built-in functions squared_dist_norm2 (the default) and dist_norm2 (used in the silhouette example below) are common choices. A sketch of a user-defined distance function follows this argument list.

agg_centroid
TEXT, default: 'avg'. The name of the aggregate function used to determine centroids. The default avg computes the element-wise mean of all points assigned to a centroid.

max_num_iterations
INTEGER, default: 20. The maximum number of iterations to perform.

min_frac_reassigned
DOUBLE PRECISION, default: 0.001. The minimum fraction of centroids reassigned to continue iterating. When fewer than this fraction of centroids are reassigned in an iteration, the calculation completes.

seeding_sample_ratio
DOUBLE PRECISION, default: 1.0. The proportion of the original data set to subsample for the kmeans++ centroid seeding. kmeans++ scans through the data sequentially k times and can be too slow for large data sets. When seeding_sample_ratio is greater than 0 (values are capped at 1.0), the seeding is run on a uniform random subsample of the data. Note that the final k-means iterations are still run on the complete data set; this parameter only subsamples the data for the seeding and is available only for kmeanspp.

rel_initial_centroids
TEXT. The name of the relation containing the set of initial centroids. The centroid relation is expected to be of the following form:

{TABLE|VIEW} rel_initial_centroids ( ... expr_centroid DOUBLE PRECISION[], ... )

where expr_centroid is the name of a column with centroid coordinates. An example that builds such a relation and passes it to kmeans() appears after the worked examples below.

expr_centroid
TEXT. The name of the column in the rel_initial_centroids relation that contains the centroid coordinates.
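To illustrate the distance-function interface, here is a minimal sketch of a user-defined L1 (Manhattan) distance passed to kmeanspp. The function name my_l1_dist and the table and column names my_points and coords are hypothetical and invented for this example; the implementation assumes a PostgreSQL backend with multi-argument unnest. The aggregate avg is used here only to keep the call simple.

-- Hypothetical user-defined distance function matching the required signature:
--   DOUBLE PRECISION[] x, DOUBLE PRECISION[] y -> DOUBLE PRECISION
CREATE OR REPLACE FUNCTION my_l1_dist(x DOUBLE PRECISION[], y DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION AS $$
    SELECT sum(abs(xi - yi))
    FROM unnest(x, y) AS t(xi, yi);
$$ LANGUAGE sql IMMUTABLE STRICT;

-- Pass the function name as fn_dist. The last argument (seeding_sample_ratio)
-- makes kmeans++ seeding scan a 50% uniform random subsample; the subsequent
-- k-means iterations still run on the full table.
SELECT * FROM madlib.kmeanspp( 'my_points',       -- rel_source (hypothetical table)
                               'coords',          -- expr_point (hypothetical column)
                               3,                 -- k
                               'my_l1_dist',      -- fn_dist
                               'madlib.avg',      -- agg_centroid
                               20,                -- max_num_iterations
                               0.001,             -- min_frac_reassigned
                               0.5 );             -- seeding_sample_ratio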
The output of the k-means module is a composite type with the following columns:
| Column | Description |
|---|---|
| centroids | DOUBLE PRECISION[][]. The final centroid positions. |
| objective_fn | DOUBLE PRECISION. The value of the objective function. |
| frac_reassigned | DOUBLE PRECISION. The fraction of points reassigned in the last iteration. |
| num_iterations | INTEGER. The total number of iterations executed. |
After training, the cluster assignment for each data point can be computed with the help of the following function:
closest_column( m, x )
Arguments

m
DOUBLE PRECISION[][]. The matrix of centroids, for example the centroids field returned by the k-means functions.

x
DOUBLE PRECISION[]. A single data point.
Output format
| Column | Description |
|---|---|
| column_id | INTEGER. The cluster assignment (zero-based). |
| distance | DOUBLE PRECISION. The distance to the cluster centroid. |
Examples

Create a sample table of data points and load it:

CREATE TABLE public.km_sample(pid int, points double precision[]);
COPY km_sample (pid, points) FROM stdin DELIMITER '|';
1 | {14.23, 1.71, 2.43, 15.6, 127, 2.8, 3.0600, 0.2800, 2.29, 5.64, 1.04, 3.92, 1065}
2 | {13.2, 1.78, 2.14, 11.2, 1, 2.65, 2.76, 0.26, 1.28, 4.38, 1.05, 3.49, 1050}
3 | {13.16, 2.36, 2.67, 18.6, 101, 2.8, 3.24, 0.3, 2.81, 5.6799, 1.03, 3.17, 1185}
4 | {14.37, 1.95, 2.5, 16.8, 113, 3.85, 3.49, 0.24, 2.18, 7.8, 0.86, 3.45, 1480}
5 | {13.24, 2.59, 2.87, 21, 118, 2.8, 2.69, 0.39, 1.82, 4.32, 1.04, 2.93, 735}
6 | {14.2, 1.76, 2.45, 15.2, 112, 3.27, 3.39, 0.34, 1.97, 6.75, 1.05, 2.85, 1450}
7 | {14.39, 1.87, 2.45, 14.6, 96, 2.5, 2.52, 0.3, 1.98, 5.25, 1.02, 3.58, 1290}
8 | {14.06, 2.15, 2.61, 17.6, 121, 2.6, 2.51, 0.31, 1.25, 5.05, 1.06, 3.58, 1295}
9 | {14.83, 1.64, 2.17, 14, 97, 2.8, 2.98, 0.29, 1.98, 5.2, 1.08, 2.85, 1045}
10 | {13.86, 1.35, 2.27, 16, 98, 2.98, 3.15, 0.22, 1.8500, 7.2199, 1.01, 3.55, 1045}
\.
Run kmeans++ clustering with two centroids:

\x on
SELECT * FROM madlib.kmeanspp( 'km_sample',
                               'points',
                               2,
                               'madlib.squared_dist_norm2',
                               'madlib.avg',
                               20,
                               0.001 );

Result:

centroids       | {{13.872,1.814,2.376,15.56,88.2,2.806,2.928,0.288,1.844,5.35198,1.044,3.348,988}, {14.036,2.018,2.536,16.56,108.6,3.004,3.03,0.298,2.038,6.10598,1.004,3.326,1340}}
objective_fn    | 151184.962672
frac_reassigned | 0
num_iterations  | 2
Assess the clustering with the simplified silhouette coefficient:

SELECT * FROM madlib.simple_silhouette( 'km_sample',
                                        'points',
                                        (SELECT centroids FROM
                                            madlib.kmeanspp('km_sample',
                                                            'points',
                                                            2,
                                                            'madlib.squared_dist_norm2',
                                                            'madlib.avg',
                                                            20,
                                                            0.001)),
                                        'madlib.dist_norm2' );

Result:

simple_silhouette | 0.68978804882941
Compute the cluster assignment of each data point:

\x off
SELECT data.*,
       (madlib.closest_column(centroids, points)).column_id AS cluster_id
FROM public.km_sample AS data,
     (SELECT centroids FROM
         madlib.kmeanspp('km_sample',
                         'points',
                         2,
                         'madlib.squared_dist_norm2',
                         'madlib.avg',
                         20,
                         0.001)) AS centroids
ORDER BY data.pid;

Result:

 pid |                              points                                | cluster_id
-----+--------------------------------------------------------------------+------------
   1 | {14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065}   |          0
   2 | {13.2,1.78,2.14,11.2,1,2.65,2.76,0.26,1.28,4.38,1.05,3.49,1050}     |          0
   3 | {13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.6799,1.03,3.17,1185}  |          1
   4 | {14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480}    |          1
   5 | {13.24,2.59,2.87,21,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735}      |          0
   6 | {14.2,1.76,2.45,15.2,112,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450}   |          1
   7 | {14.39,1.87,2.45,14.6,96,2.5,2.52,0.3,1.98,5.25,1.02,3.58,1290}     |          1
   8 | {14.06,2.15,2.61,17.6,121,2.6,2.51,0.31,1.25,5.05,1.06,3.58,1295}   |          1
   9 | {14.83,1.64,2.17,14,97,2.8,2.98,0.29,1.98,5.2,1.08,2.85,1045}       |          0
  10 | {13.86,1.35,2.27,16,98,2.98,3.15,0.22,1.85,7.2199,1.01,3.55,1045}   |          0
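The same sample data can also be clustered starting from a predefined set of centroids, using the kmeans() variant that takes rel_initial_centroids and expr_centroid. The following is a minimal sketch: the relation name initial_centroids, the column name centroid, and the choice of pids 1 and 4 as seed points are arbitrary assumptions for illustration.

-- Build a centroid relation of the expected form (one DOUBLE PRECISION[] column).
-- The table/column names and the seed rows (pid 1 and 4) are arbitrary for this sketch.
CREATE TABLE initial_centroids AS
    SELECT points AS centroid
    FROM public.km_sample
    WHERE pid IN (1, 4);

-- Run k-means starting from these two centroids.
SELECT * FROM madlib.kmeans( 'km_sample',           -- rel_source
                             'points',              -- expr_point
                             'initial_centroids',   -- rel_initial_centroids
                             'centroid',            -- expr_centroid
                             'madlib.squared_dist_norm2',
                             'madlib.avg',
                             20,
                             0.001 );

The result has the same composite form (centroids, objective_fn, frac_reassigned, num_iterations) as the kmeanspp call above.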
The algorithm stops when one of the following conditions is met:

- The fraction reassigned in an iteration falls below min_frac_reassigned (default: 0.001).
- The maximum number of allowed iterations, max_num_iterations (default: 20), is reached.
A popular method to assess the quality of the clustering is the silhouette coefficient, a simplified version of which is provided as part of the k-means module. Note that for large data sets, this computation is expensive.
The silhouette function has the following syntax:
simple_silhouette( rel_source, expr_point, centroids, fn_dist )
Arguments

rel_source
TEXT. The name of the table containing the data points.

expr_point
TEXT. The name of the column with point coordinates.

centroids
DOUBLE PRECISION[][]. The centroid positions, for example the centroids field returned by the k-means functions.

fn_dist
TEXT. The name of the distance function, as described for the k-means functions above.
Formally, we wish to minimize the following objective function:
\[ (c_1, \dots, c_k) \mapsto \sum_{i=1}^n \min_{j=1}^k \operatorname{dist}(x_i, c_j) \]
In the most common case, \( \operatorname{dist} \) is the square of the Euclidean distance.
This problem is computationally difficult (NP-hard), yet the local-search heuristic proposed by Lloyd [4] performs reasonably well in practice. In fact, it is so ubiquitous today that it is often referred to as the standard algorithm or even just the k-means algorithm [1]. It works as follows:

1. Seed the k centroids: at random, with the kmeans++ strategy [2], or from a user-supplied set, corresponding to the invocation variants above.
2. Assign each point to its closest centroid.
3. Move each centroid to the barycenter (e.g., the mean) of all points currently assigned to it.
4. Repeat steps 2 and 3 until no assignments change or another stopping condition is met.
Since the objective function decreases in every step, this algorithm is guaranteed to converge to a local optimum.
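For the squared Euclidean distance, the choice of the mean as barycenter is what makes this monotonicity argument work: for a fixed cluster \( C \), the point \( c \) minimizing the within-cluster sum of squared distances is the element-wise mean, so the re-centering step never increases the objective. This is also why the avg aggregate is the natural companion of squared_dist_norm2.

\[ \frac{\partial}{\partial c} \sum_{x_i \in C} \lVert x_i - c \rVert^2 = -2 \sum_{x_i \in C} (x_i - c) = 0 \quad\Longrightarrow\quad c = \frac{1}{\lvert C \rvert} \sum_{x_i \in C} x_i . \]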
[1] Wikipedia, K-means Clustering, http://en.wikipedia.org/wiki/K-means_clustering
[2] David Arthur, Sergei Vassilvitskii: k-means++: the advantages of careful seeding, Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'07), pp. 1027-1035, http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf
[3] E. R. Hruschka, L. N. C. Silva, R. J. G. B. Campello: Clustering Gene-Expression Data: A Hybrid Approach that Iterates Between k-Means and Evolutionary Search. In: Studies in Computational Intelligence - Hybrid Evolutionary Algorithms. pp. 313-335. Springer. 2007.
[4] Lloyd, Stuart: Least squares quantization in PCM. Technical Note, Bell Laboratories. Published much later in: IEEE Transactions on Information Theory 28(2), pp. 128-137. 1982.
[5] Leisch, Friedrich: A Toolbox for K-Centroids Cluster Analysis. In: Computational Statistics and Data Analysis, 51(2). pp. 526-544. 2006.