Given a table with a number of feature columns, this function will create an output table that contains the feature columns in an array. A summary table will also be created that contains the names of the features combined into array, so that this process can be reversed using the function vec2cols.
cols2vec( source_table, output_table, list_of_features, list_of_features_to_exclude, cols_to_output )
Arguments
TEXT. Name of the table containing the source data.
TEXT. Name of the generated table containing the output.
TEXT. Comma-separated string of column names or expressions to put into feature array. Can also be '*' implying all columns are to be put into feature array (except for the ones included in the next argument that lists exclusions). Type casting will be done as per the regular type casting rules of the underlying database. Array columns in the source table are not supported in the 'list_of_features' parameter. Also, all of the features to be included must be of the same type and must not have null values.
TEXT. Default NULL. Comma-separated string of column names to exclude from the feature array. Typically used when 'list_of_features' is set to '*'.
Output table
The output table produced by the cols2vec function contains the following columns:
<...> | Columns from source table, depending on which ones are kept (if any). |
---|---|
feature_vector | Column that contains the feature array. |
Output summary table
A summary table named <output_table>_summary is also created that contains:
source_table | Name of the table containing the source data. |
---|---|
list_of_features | List of features to put in vector. |
list_of_features_to_exclude | Features specified by the user to exclude from 'list_of_features'. |
feature_names | Names of the features that were nested (converted to a vector) in the output table. |
DROP TABLE IF EXISTS golf CASCADE; CREATE TABLE golf ( id integer NOT NULL, "OUTLOOK" text, temperature double precision, humidity double precision, "Temp_Humidity" double precision[], clouds_airquality text[], windy boolean, class text, observation_weight double precision ); INSERT INTO golf VALUES (1,'sunny', 85, 85, ARRAY[85, 85],ARRAY['none', 'unhealthy'], 'false','Don''t Play', 5.0), (2, 'sunny', 80, 90, ARRAY[80, 90], ARRAY['none', 'moderate'], 'true', 'Don''t Play', 5.0), (3, 'overcast', 83, 78, ARRAY[83, 78], ARRAY['low', 'moderate'], 'false', 'Play', 1.5), (4, 'rain', 70, 96, ARRAY[70, 96], ARRAY['low', 'moderate'], 'false', 'Play', 1.0), (5, 'rain', 68, 80, ARRAY[68, 80], ARRAY['medium', 'good'], 'false', 'Play', 1.0), (6, 'rain', 65, 70, ARRAY[65, 70], ARRAY['low', 'unhealthy'], 'true', 'Don''t Play', 1.0), (7, 'overcast', 64, 65, ARRAY[64, 65], ARRAY['medium', 'moderate'], 'true', 'Play', 1.5), (8, 'sunny', 72, 95, ARRAY[72, 95], ARRAY['high', 'unhealthy'], 'false', 'Don''t Play', 5.0), (9, 'sunny', 69, 70, ARRAY[69, 70], ARRAY['high', 'good'], 'false', 'Play', 5.0), (10, 'rain', 75, 80, ARRAY[75, 80], ARRAY['medium', 'good'], 'false', 'Play', 1.0), (11, 'sunny', 75, 70, ARRAY[75, 70], ARRAY['none', 'good'], 'true', 'Play', 5.0), (12, 'overcast', 72, 90, ARRAY[72, 90], ARRAY['medium', 'moderate'], 'true', 'Play', 1.5), (13, 'overcast', 81, 75, ARRAY[81, 75], ARRAY['medium', 'moderate'], 'false', 'Play', 1.5), (14, 'rain', 71, 80, ARRAY[71, 80], ARRAY['low', 'unhealthy'], 'true', 'Don''t Play', 1.0);
DROP TABLE IF EXISTS cols2vec_result, cols2vec_result_summary; SELECT madlib.cols2vec( 'golf', 'cols2vec_result', 'temperature, humidity' ); SELECT * FROM cols2vec_result;
feature_vector ----------------+ {85,85} {80,90} {83,78} {70,96} {68,80} {65,70} {64,65} {72,95} {69,70} {75,80} {75,70} {72,90} {81,75} {71,80} (14 rows)View the summary table:
\x on SELECT * FROM cols2vec_result_summary; \x off
-[ RECORD 1 ]---------------+---------------------------------------------------------------- source_table | golf list_of_features | temperature, humidity list_of_features_to_exclude | None feature_names | {temperature,humidity}
DROP TABLE IF EXISTS cols2vec_result, cols2vec_result_summary; SELECT madlib.cols2vec( 'golf', 'cols2vec_result', 'temperature, humidity', NULL, 'id, "OUTLOOK"' ); SELECT * FROM cols2vec_result ORDER BY id;
id | OUTLOOK | feature_vector ----+----------+---------------- 1 | sunny | {85,85} 2 | sunny | {80,90} 3 | overcast | {83,78} 4 | rain | {70,96} 5 | rain | {68,80} 6 | rain | {65,70} 7 | overcast | {64,65} 8 | sunny | {72,95} 9 | sunny | {69,70} 10 | rain | {75,80} 11 | sunny | {75,70} 12 | overcast | {72,90} 13 | overcast | {81,75} 14 | rain | {71,80} (14 rows)View the summary table:
\x on SELECT * FROM cols2vec_result_summary; \x off
-[ RECORD 1 ]---------------+---------------------------------------------------------------- source_table | golf list_of_features | temperature, humidity list_of_features_to_exclude | None feature_names | {temperature,humidity}
DROP TABLE IF EXISTS cols2vec_result, cols2vec_result_summary; SELECT madlib.cols2vec( 'golf', 'cols2vec_result', '*', '"OUTLOOK", "Temp_Humidity", clouds_airquality, windy, class, id', 'id, "OUTLOOK"' ); SELECT * FROM cols2vec_result ORDER BY id;
id | OUTLOOK | feature_vector ----+----------+---------------- 1 | sunny | {85,85,5} 2 | sunny | {80,90,5} 3 | overcast | {83,78,1.5} 4 | rain | {70,96,1} 5 | rain | {68,80,1} 6 | rain | {65,70,1} 7 | overcast | {64,65,1.5} 8 | sunny | {72,95,5} 9 | sunny | {69,70,5} 10 | rain | {75,80,1} 11 | sunny | {75,70,5} 12 | overcast | {72,90,1.5} 13 | overcast | {81,75,1.5} 14 | rain | {71,80,1} (14 rows)View summary table:
\x on SELECT * FROM cols2vec_result_summary; \x off
-[ RECORD 1 ]---------------+---------------------------------------------------------------- source_table | golf list_of_features | * list_of_features_to_exclude | "OUTLOOK", "Temp_Humidity", clouds_airquality, windy, class, id feature_names | {temperature,humidity,observation_weight}
DROP TABLE IF EXISTS cols2vec_result, cols2vec_result_summary; SELECT madlib.cols2vec( 'golf', 'cols2vec_result', 'windy, temperature, humidity', 'windy', '*' ); SELECT * FROM cols2vec_result ORDER BY id;
id | OUTLOOK | temperature | humidity | Temp_Humidity | clouds_airquality | windy | class | observation_weight | feature_vector ----+----------+-------------+----------+---------------+-------------------+-------+------------+--------------------+---------------- 1 | sunny | 85 | 85 | {85,85} | {none,unhealthy} | f | Don't Play | 5 | {85,85} 2 | sunny | 80 | 90 | {80,90} | {none,moderate} | t | Don't Play | 5 | {80,90} 3 | overcast | 83 | 78 | {83,78} | {low,moderate} | f | Play | 1.5 | {83,78} 4 | rain | 70 | 96 | {70,96} | {low,moderate} | f | Play | 1 | {70,96} 5 | rain | 68 | 80 | {68,80} | {medium,good} | f | Play | 1 | {68,80} 6 | rain | 65 | 70 | {65,70} | {low,unhealthy} | t | Don't Play | 1 | {65,70} 7 | overcast | 64 | 65 | {64,65} | {medium,moderate} | t | Play | 1.5 | {64,65} 8 | sunny | 72 | 95 | {72,95} | {high,unhealthy} | f | Don't Play | 5 | {72,95} 9 | sunny | 69 | 70 | {69,70} | {high,good} | f | Play | 5 | {69,70} 10 | rain | 75 | 80 | {75,80} | {medium,good} | f | Play | 1 | {75,80} 11 | sunny | 75 | 70 | {75,70} | {none,good} | t | Play | 5 | {75,70} 12 | overcast | 72 | 90 | {72,90} | {medium,moderate} | t | Play | 1.5 | {72,90} 13 | overcast | 81 | 75 | {81,75} | {medium,moderate} | f | Play | 1.5 | {81,75} 14 | rain | 71 | 80 | {71,80} | {low,unhealthy} | t | Don't Play | 1 | {71,80} (14 rows)View the summary table:
\x on SELECT * FROM cols2vec_result_summary; \x off
-[ RECORD 1 ]---------------+----------------------------- source_table | golf list_of_features | windy, temperature, humidity list_of_features_to_exclude | windy feature_names | {temperature,humidity}This also shows that you can exclude features in 'list_of_features_to_exclude' that are in the list of 'list_of_features'. This can be useful if the 'list_of_features' is generated from an expression or subquery.
DROP TABLE IF EXISTS cols2vec_result, cols2vec_result_summary; SELECT madlib.cols2vec( 'golf', 'cols2vec_result', 'windy::TEXT, class', NULL, 'id' ); SELECT * FROM cols2vec_result ORDER BY id;
id | feature_vector -—+-------------------— 1 | {false,"Don't Play"} 2 | {true,"Don't Play"} 3 | {false,Play} 4 | {false,Play} 5 | {false,Play} 6 | {true,"Don't Play"} 7 | {true,Play} 8 | {false,"Don't Play"} 9 | {false,Play} 10 | {false,Play} 11 | {true,Play} 12 | {true,Play} 13 | {false,Play} 14 | {true,"Don't Play"} (14 rows)