The primary reference documentation, with detailed information on the functions and algorithms in MADlib, along with background theory and references to the literature.
Information on initial installation and deployment of MADlib into a database instance.
Introduction to themes and concepts in MADlib. The guide walks the user through an initial data load, training a model, inspecting a model, and scoring a model.
For developers who are interested in contributing to MADlib. Includes instructions for an available Docker image with necessary dependencies to compile and test MADlib.
Additional material for individuals looking to contribute to the project is available on our community portal.
Videos from community events, meetups and conferences. Also includes step-by-step guides for commonly used algorithms.
Linear regression is used to model the linear relationship of a scalar dependent variable to one or more explanatory independent variables.
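As a rough illustration of the idea, the sketch below fits an ordinary least squares line with a single explanatory variable in plain Python. It is not MADlib's SQL interface; the function name `fit_simple_linear_regression` and the sample data are hypothetical and chosen only for this example.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data lying exactly on y = 1 + 2x.
intercept, slope = fit_simple_linear_regression(
    [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
)
print(intercept, slope)
```

With more than one explanatory variable the same least squares criterion applies, but the fit is usually computed with matrix methods rather than these closed-form sums.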
Latent Dirichlet Allocation is a topic modeling function used to identify recurring themes in a large document corpus.
The summary function provides summary statistics for any data table. These statistics include the number of distinct values, the number of missing values, mean, variance, min, max, most frequent values, and quantiles, among others.
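To make the list of statistics concrete, here is a minimal sketch that computes several of them for one column using only the Python standard library. It assumes missing values are represented as `None`; the helper name `summarize_column` and the sample column are hypothetical and do not reflect MADlib's own function signature or output format.

```python
import statistics
from collections import Counter


def summarize_column(values):
    """Summary statistics for one column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "distinct_values": len(set(present)),
        "missing_values": len(values) - len(present),
        "mean": statistics.mean(present),
        "variance": statistics.variance(present),  # sample variance
        "min": min(present),
        "max": max(present),
        "most_frequent": Counter(present).most_common(1)[0][0],
        "median": statistics.median(present),  # the 50% quantile
    }


# Hypothetical column with two missing entries.
summary = summarize_column([4, 8, None, 4, 6, None, 8, 4])
print(summary)
```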
Logistic regression is used to predict a binary outcome of a dependent variable from one or more explanatory independent variables.
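The following sketch shows the idea with one explanatory variable, fitted by plain gradient descent on the log loss. This is an illustrative choice of optimizer, not a description of how MADlib fits the model; the function names, learning rate, epoch count, and data are all assumptions made for the example.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit P(y = 1 | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Residuals between predicted probabilities and observed labels.
        residuals = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(residuals) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical data: the outcome switches from 0 to 1 between x = 2 and x = 3.
xs = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_logistic_regression(xs, ys)
prob_low = sigmoid(w * 1.0 + b)   # predicted probability at x = 1
prob_high = sigmoid(w * 4.0 + b)  # predicted probability at x = 4
print(prob_low, prob_high)
```

A predicted probability above 0.5 is typically read as the positive class, so `prob_low` should fall below that threshold and `prob_high` above it.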
Elastic Net regularization is a technique that can be applied to either linear or logistic regression to build a more robust model when there are large numbers of explanatory independent variables.
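For the linear regression case, one common way to write the elastic net objective is shown below. The symbols are assumptions introduced here for illustration: $n$ observations $(x_i, y_i)$, intercept $\beta_0$, coefficients $\beta$, overall penalty strength $\lambda \ge 0$, and mixing parameter $\alpha \in [0, 1]$. The exact scaling used by a given implementation may differ.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2
\;+\;
\lambda\left(\alpha\,\lVert\beta\rVert_1 + \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right)
```

Setting $\alpha = 1$ gives the lasso penalty and $\alpha = 0$ gives the ridge penalty; values in between blend the two.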
Principal Component Analysis is a dimensionality reduction technique that can be used to transform a high dimensional space into a lower dimensional space.
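As a small sketch of the technique, the code below finds the first principal component of a set of points by running power iteration on the sample covariance matrix, then projects each point onto that direction. Power iteration is a simplification chosen for brevity, not necessarily what MADlib uses; the function name and data are hypothetical.

```python
import math


def first_principal_component(points, iterations=100):
    """Return the leading principal direction and the 1-D projections."""
    n = len(points)
    dim = len(points[0])
    # Center the data so each dimension has zero mean.
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - means[j] for j in range(dim)] for p in points]
    # Sample covariance matrix.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]
    # Power iteration: repeated multiplication converges to the top
    # eigenvector, provided the start vector is not orthogonal to it.
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered point onto the component: 2-D becomes 1-D.
    projections = [sum(r[j] * v[j] for j in range(dim)) for r in centered]
    return v, projections


# Hypothetical 2-D points scattered around the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
component, projections = first_principal_component(points)
print(component, projections)
```

Because the sample points lie close to a line, a single component captures most of their variance, which is what makes the reduction from two dimensions to one reasonable here.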
Apriori is a technique for evaluating frequent item-sets, which allows analysis of what events tend to occur together. For example, which items do customers frequently purchase together in a single transaction?
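The purchase question above can be sketched with a compact level-wise search in plain Python. This is a simplified illustration of the Apriori idea rather than MADlib's implementation; the function name `frequent_itemsets`, the support threshold, and the baskets are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    result = {}
    k = 1
    while current:
        result.update(current)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        scored = {c: support(c) for c in candidates}
        current = {c: s for c, s in scored.items() if s >= min_support}
    return result


# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
frequent = frequent_itemsets(baskets, min_support=0.6)
print(frequent)
```

The prune step is what keeps the search tractable: an itemset can only be frequent if all of its subsets are, so most larger candidates are discarded without counting them.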
k-Means is a clustering method used to identify regions of similarity within a dataset. It can be used for many types of analysis including customer segmentation.
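A minimal sketch of the standard iterative procedure (often called Lloyd's algorithm) follows. The starting centroids are passed in explicitly so the example is deterministic; the function name `k_means`, the iteration count, and the data are assumptions made for illustration and do not mirror MADlib's interface.

```python
def k_means(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [
                sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids
            ]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster,
        # keeping the old centroid if the cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

In a customer segmentation setting, each point would be a customer's feature vector and each resulting cluster a candidate segment.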