Tecflix

Clustering in Depth

26 June 2013 · 3904 views

Segmentation, Outlier Detection, Categorisation & Exploration

Microsoft Clustering is a workhorse algorithm of data mining. It is used to explore data, to segment and categorise it, and to detect outliers (also called exceptions or anomalies). Each cluster represents a naturally occurring grouping of your data points, brought together by the similarities visible in their attributes. In this in-depth, 1-hour 50-minute video, Rafal explains clustering concepts, the entire process, and all of the algorithm parameters. The detailed 12-part demo, which forms the heart of this tutorial, shows you the iterative process of clustering, explaining how to segment your own data, such as customers or products. Before watching this video, make sure to review the preceding short, free video “Why Cluster and Segment Data?”

In essence, the key to understanding clustering is to imagine that all of your cases (i.e. data points, such as customers) are dots that exist in a vastly multidimensional space, where each dimension represents an attribute that you would like to study, such as age, gender, number of children, relationship to the head of household, or income. It is easy to imagine 3 dimensions: try it, or see the example in the video. However, the clustering algorithm can represent your data points in hundreds or even thousands of dimensions! If you have imagined your data points suspended somewhere in a 3D space, you should realise that they do not fill the entire space uniformly; rather, they group, or cluster, into natural constellations of points.

When you have found your clusters, you need to name them; this process is iterative. You start by trying to understand the natural meaning of each cluster by using the cluster visualisations: the high-level Cluster Diagram, which shows similarities between clusters; the dashboard-style Cluster Profiles, which shows everything at a glance; the individual Cluster Characteristics; and the Cluster Discrimination diagram, also known as the Tornado Chart, which is particularly useful for comparing the meaning of one cluster against another, or against all the other clusters combined (also referred to as the cluster’s complement). The extensive demo of the clustering process begins with a simpler, but still powerful, approach to clustering using Microsoft Excel and the free Data Mining Add-ins. Then we continue on to each individual step of the process, using SSDT (SQL Server Data Tools).
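
If you prefer to explore the clusters with a query rather than the viewers, DMX can also read the model content directly. The following is a minimal sketch, assuming a hypothetical model named [Customer Clusters]:

    // List each cluster with its size (NODE_SUPPORT) and characteristic
    // description, as a starting point for choosing meaningful cluster names.
    // NODE_TYPE = 5 identifies cluster nodes in Microsoft Clustering content.
    SELECT NODE_CAPTION, NODE_SUPPORT, NODE_DESCRIPTION
    FROM [Customer Clusters].CONTENT
    WHERE NODE_TYPE = 5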

To validate your clusters, or simply to use your clustering model, you can categorise a different data set. For example, you can take a set of customers that you have not used for building (training) your model, and apply the model to predict the cluster to which each of those customers is most likely to belong. The demo shows how to use the Cluster() and ClusterProbability() prediction functions to categorise data in Excel using a model that has been built earlier in SSDT. The number of clusters that you wish to build, which you specify as the value of the CLUSTER_COUNT parameter, has a great impact on the accuracy of your model; the demo shows how even a small change to that number significantly changes the model accuracy lift chart. If you are new to lift charts and mining model validation, make sure to review the relevant section of our “Data Mining Model Building, Testing and Predicting” video tutorial.
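
As a rough illustration of this categorisation step, the following DMX sketch applies a hypothetical [Customer Clusters] model to a table of new customers; the model name, the [HappyCars] data source, and the column list are illustrative assumptions, not the exact queries from the demo:

    // Predict the most likely cluster, and its probability, for each new customer.
    // [Customer Clusters], [HappyCars] and the column names are assumptions.
    SELECT
        t.CustomerKey,
        Cluster() AS PredictedSegment,
        ClusterProbability() AS SegmentProbability
    FROM [Customer Clusters]
    PREDICTION JOIN
        OPENQUERY([HappyCars],
            'SELECT CustomerKey, Age, Gender, Income FROM NewCustomers') AS t
        ON  [Customer Clusters].Age = t.Age
        AND [Customer Clusters].Gender = t.Gender
        AND [Customer Clusters].Income = t.Income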

Microsoft SQL Server Data Mining comes with two different clustering techniques: EM (Expectation Maximization) and the widely known K-Means algorithm. You decide which one is used by setting the value of the CLUSTERING_METHOD parameter, which takes one of four values (1–4), letting you also choose between a scalable and a non-scalable approach. The default value, 1, stands for scalable EM and is often the best choice, because it permits clusters to overlap each other, which is a realistic representation of many natural data sets. However, if you are dealing with cases that have very clear-cut differences, which are unlikely to overlap, you may want to use the K-Means technique (parameter values 3 and 4) instead of EM (values 1 and 2). You can use both techniques on continuous, discrete and discretised data, without even having to change your data set, which is a convenient feature of Microsoft clustering. You can further control the scalable approach with the SAMPLE_SIZE parameter, which you should set to as large a value as your memory and CPU permit, or, if you have a large enough computer, to 0, which tries to load the entire data set into memory.
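
Put together, a model definition along these lines might look as follows in DMX; the model name and columns are illustrative assumptions:

    // Scalable EM (CLUSTERING_METHOD = 1), asking for 10 clusters;
    // SAMPLE_SIZE = 0 attempts to load the entire training set into memory.
    // The model name and columns are illustrative assumptions.
    CREATE MINING MODEL [Customer Clusters]
    (
        CustomerKey LONG KEY,
        Age         LONG CONTINUOUS,
        Gender      TEXT DISCRETE,
        Income      DOUBLE CONTINUOUS
    )
    USING Microsoft_Clustering (CLUSTERING_METHOD = 1, CLUSTER_COUNT = 10, SAMPLE_SIZE = 0)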

The remaining parameters of the clustering algorithm can be used to balance the time it takes for clustering to complete its model training run against its accuracy. Because the algorithm is sensitive to the random starting values, seeded with the CLUSTER_SEED parameter, it normally builds not one but, by default, 10 candidate models, and selects the best. You can change the number of candidate models with the MODEL_CARDINALITY parameter. You decide how stable the model has to become before clustering stops working on it by changing the STOPPING_TOLERANCE parameter; its default value of 10 means that up to 10 cases may remain “undecided” about their cluster membership between the internal iterations of the algorithm. If you have a small data set, you may want to reduce this number to 1. In any case, with today’s machines that come with muscular CPUs and plenty of memory, it is easy to look for more accurate models by reducing this parameter.

A final group of parameters helps you fine-tune the algorithm to your data set: the unofficial NORMALIZATION parameter, which you may want to change for those less typical data sets that happen to have a non-normal distribution (e.g. a uniform one), and the parameters, familiar from other SQL Server algorithms, that control how significant attributes and their value states are selected from your data. All of those are explained in detail in this tutorial: MAXIMUM_INPUT_ATTRIBUTES (the feature selection parameter, default 255), MAXIMUM_STATES (default 100), and MINIMUM_SUPPORT (default 1), which is useful for pruning out clusters that would contain fewer cases than its value.
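
For example, a second, more thoroughly searched model could be added to the same mining structure along the following lines; the structure and model names, and the parameter values, are illustrative assumptions rather than recommendations:

    // Try 20 candidate models instead of the default 10, stop only when at most
    // one case remains undecided, and prune clusters of fewer than 25 cases.
    // The structure, model and column names are illustrative assumptions.
    ALTER MINING STRUCTURE [Customer Segmentation]
    ADD MINING MODEL [Customer Clusters Tuned]
    (
        CustomerKey,
        Age,
        Gender,
        Income
    )
    USING Microsoft_Clustering (
        MODEL_CARDINALITY = 20,
        STOPPING_TOLERANCE = 1,
        MINIMUM_SUPPORT = 25
    )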


  • Introduction to Data Mining with Microsoft SQL Server · 24 min · Watch with Free Subscription

  • Data Mining Concepts and Tools · 50 min

  • Data Mining Model Building, Testing and Predicting with Microsoft SQL Server and Excel · 1 hour 20 min

  • What Are Decision Trees? · 10 min · Free, Watch Now

  • Decision Trees in Depth · 1 hour 54 min

  • Why Cluster and Segment Data? · 9 min · Watch with Free Subscription

  • Clustering in Depth · 1 hour 50 min

  • What is Market Basket Analysis? · 10 min · Watch with Free Subscription

  • Association Rules in Depth · 1 hour 35 min

  • HappyCars Sample Data Set for Learning Data Mining

  • Additional Code and Data Samples (R, ML Services, SSAS) · Get with Free Subscription

In collaboration with Project Botticelli, Oxford Computer Training, SQLBI and Prodata.