Tutorial Example

Open Windows Explorer and navigate to the Samples directory, which is a sub-directory of the installation directory (probably Program Files/KlustaWin). Double-click the file normal random 50.txt to open it in Notepad and inspect its contents. The first 5 lines are shown below.

mean 0 sd 2       mean 6 sd 1     mean 20 sd 3
mean 0 sd 0.5     mean 15 sd 3    mean 12 sd 2
mean 3 sd 1       mean 12 sd 3    mean 0 sd 1
-6.046029739      3.005215846     11.07501935
0.320130766       7.23523705      17.67947884
.
.
.

The main body of the file contains 3 columns of data, representing 3 dimensions, so each row can be regarded as a point in 3-D space. There are 150 rows and, although this is not immediately apparent, they consist of 3 blocks of 50 rows each. The numbers were generated using the Random Number Generator facility in Excel, with each block of 50 rows drawn from the normal distributions described by one of the 3 lines of text header at the top of the file. Now close the file.
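For readers who want to build a comparable test file without Excel, the following is a minimal Python sketch of the generating mechanism described above. It is not the procedure used to create the sample file: the tab separator, the 9-decimal formatting, the seed, and the function names make_rows and format_file are all assumptions made for illustration. Only the means, standard deviations, and block size come from the file header.

```python
import random

# Generator parameters copied from the three header lines of the sample file:
# one entry per block of 50 rows, each holding (mean, sd) for dimensions 1-3.
BLOCKS = [
    [(0, 2), (6, 1), (20, 3)],
    [(0, 0.5), (15, 3), (12, 2)],
    [(3, 1), (12, 3), (0, 1)],
]
ROWS_PER_BLOCK = 50


def make_rows(seed=None):
    """Return 150 rows of 3 normally distributed values, block by block."""
    rng = random.Random(seed)  # the seed is an assumption, for repeatability only
    rows = []
    for block in BLOCKS:
        for _ in range(ROWS_PER_BLOCK):
            rows.append([rng.gauss(mean, sd) for mean, sd in block])
    return rows


def format_file(rows):
    """Build file text: 3 header lines, then tab-separated data (assumed layout)."""
    header = ["\t".join(f"mean {m} sd {s}" for m, s in block) for block in BLOCKS]
    body = ["\t".join(f"{value:.9f}" for value in row) for row in rows]
    return "\n".join(header + body) + "\n"


rows = make_rows(seed=1)
text = format_file(rows)
print(text.splitlines()[0])
print(len(rows), "data rows")
```

Because the rows are written block by block, the three groups sit in consecutive sections of 50 rows, just as in the sample file.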

Drag the file from Windows Explorer, and drop it onto the KlustaWin window. A “cloud” of red data points should appear, with little obvious structure. Now select the Auto Y option within the Rotate frame, and observe as the graph rotates in 3-D space that the data do actually occur in moderately well-separated groups.

Change the value in the Runs edit box to 50 and click the CLUSTER button. Observe the changing colours in the display, and the numerical values of the Num and Score parameters, as the run count climbs towards 50. When the analysis stops, the result of the last run is visible (labelled recent in the drop-down list), and the score and number of clusters detected in that run are displayed.

Click the drop-down list beside the Show: Best button, and you will see a list of cluster numbers. Clicking on one of these numbers will display the best result achieved for that number of clusters. Because the analysis is probabilistic, the exact numbers that you see may vary from run to run. The image below shows the results of one test run.

[Image: KlustaWin window showing the results of one test run]

The interpretation of the output is up to the user – but it is useful to remember that the clustering algorithm itself relies on Bayesian methodologies that incorporate information on the prior probabilities of outcomes. It is therefore legitimate to use knowledge derived from an understanding of the underlying biology (or, as in this case, the data-generating mechanism) to choose between the outcomes! However, the fact that the data were generated using a random number generator means that the larger cluster numbers might fortuitously reflect a better fit to the data.

In this case the 4-cluster outcome most closely reflects the “correct” clustering, in the sense of the mechanism used to generate the data. Select the 4 item from the drop-down list (assuming that you have one; you may not, given the probabilistic nature of the analysis). The results of the analysis that yielded the highest score for 4 clusters will be displayed. You will probably see 3 fairly large clusters in 3 arbitrary colours, and 3 individual data points in red that do not clearly fit into any of the three main clusters. KlustaWin has assigned these 3 data points to class 1, which in KlustaWin is always the “noise” class, i.e. it contains data that are assumed to come from a normal distribution with a standard deviation so large that it can absorb all data that do not fit into one of the more tightly defined clusters.
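The idea of a broad noise class can be illustrated with a small Python sketch. This is not KlustaWin's algorithm: it ignores mixing proportions and everything else the real analysis takes into account, and simply asks which of four independent normal distributions gives a point the highest log density. The three tight components use the generator parameters from the file header; the noise component's means and its standard deviation of 15 are hypothetical values chosen only for illustration, as are the names log_density and best_component.

```python
import math


def log_density(point, means, sds):
    """Log density of a point under independent normals, one per dimension."""
    return sum(
        -0.5 * math.log(2 * math.pi * sd * sd) - (x - m) ** 2 / (2 * sd * sd)
        for x, m, sd in zip(point, means, sds)
    )


# name -> (means, standard deviations) for dimensions 1-3
COMPONENTS = {
    # Hypothetical noise class: centre and sd are assumptions, not KlustaWin values.
    "noise": ((1.0, 11.0, 11.0), (15.0, 15.0, 15.0)),
    # Tight components taken from the three header lines of the sample file.
    "block 1": ((0, 6, 20), (2, 1, 3)),
    "block 2": ((0, 15, 12), (0.5, 3, 2)),
    "block 3": ((3, 12, 0), (1, 3, 1)),
}


def best_component(point):
    """Name of the component under which the point has the highest log density."""
    return max(COMPONENTS, key=lambda name: log_density(point, *COMPONENTS[name]))


# The first two data rows shown at the top of the sample file.
first_row = (-6.046029739, 3.005215846, 11.07501935)
second_row = (0.320130766, 7.23523705, 17.67947884)

print(best_component(first_row))   # noise
print(best_component(second_row))  # block 1
```

Under these assumed parameters the first data row lies about three standard deviations from the block 1 means in every dimension, so the broad noise distribution describes it better than any tight cluster, while the second row sits comfortably inside block 1.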

Now click on the Stats button. A dialog appears showing a summary of the results for each set of clusters. Part of this output is shown below:

Most recent run:

number clusters: 4, score: 247.5

cluster   n    mean d1    s.d. d1    mean d2    s.d. d2    mean d3    s.d. d3
cl 1      3    -2.4683    3.1948     4.1773     1.7458     4.7653     7.0642
cl 2      49   -0.052     0.4683     15.1448    3.0709     11.6963    1.868
cl 3      49   -0.1707    2.0498     6.1046     0.9375     20.1152    2.669
cl 4      49   2.966      1.0923     11.4317    2.9433     0.0527     0.8802

Best of set:

number clusters: 4, number occurrences 129, best score: 247.5, avg score = 247.5

cluster   n    mean d1    s.d. d1    mean d2    s.d. d2    mean d3    s.d. d3
cl 1      3    -2.4683    3.1948     4.1773     1.7458     4.7653     7.0642
cl 2      49   -0.052     0.4683     15.1448    3.0709     11.6963    1.868
cl 3      49   -0.1707    2.0498     6.1046     0.9375     20.1152    2.669
cl 4      49   2.966      1.0923     11.4317    2.9433     0.0527     0.8802

Best of set:

number clusters: 5, number occurrences 8, best score: 256.1, avg score = 237.7

cluster   n    mean d1    s.d. d1    mean d2    s.d. d2    mean d3    s.d. d3
cl 1      3    -2.4683    3.1948     4.1773     1.7458     4.7653     7.0642
cl 2      49   2.966      1.0923     11.4317    2.9433     0.0527     0.8802
cl 3      49   -0.052     0.4683     15.1448    3.0709     11.6963    1.868
cl 4      32   -0.4175    2.2246     6.0701     1.0796     19.8278    2.8227
cl 5      17   0.2939     1.6324     6.1696     0.6097     20.6561    2.336

Because we know how this file was generated, we know that the “correct” number of clusters is actually 3, or 4 if we include a noise cluster. If you look at the values of the parameters for the Best of set: number clusters: 4 part of the table, you will see that they match the parameters used in generating the numbers quite closely. The disparity is partly due to the three data items assigned to noise rather than to their real parent cluster and partly, of course, due to sampling variation: a finite set of random numbers will not exactly match the parameters of its generator.
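One rough way to judge "quite closely" is to compare each estimated mean with its generator mean in units of the standard error, sd / sqrt(n). For example, for a dimension generated with sd 2 and a cluster of 49 points, the standard error is 2 / 7, or about 0.29, so an estimated mean of -0.1707 against a true mean of 0 is well within one standard error. The Python sketch below applies this to all nine means of the 4-cluster result. The pairing of clusters with header lines is inferred here from the matching values and is not stated by KlustaWin, and the check is only approximate because each cluster has lost one point to the noise class.

```python
import math

# Generator (mean, sd) per dimension, from the 3 header lines of the sample file.
# The cluster-to-header pairing is an inference from the matching values.
GENERATOR = {
    "cl 3": [(0, 2), (6, 1), (20, 3)],     # header line 1
    "cl 2": [(0, 0.5), (15, 3), (12, 2)],  # header line 2
    "cl 4": [(3, 1), (12, 3), (0, 1)],     # header line 3
}

# Estimated (mean, s.d.) per dimension, from "Best of set: number clusters: 4".
ESTIMATED = {
    "cl 2": [(-0.052, 0.4683), (15.1448, 3.0709), (11.6963, 1.868)],
    "cl 3": [(-0.1707, 2.0498), (6.1046, 0.9375), (20.1152, 2.669)],
    "cl 4": [(2.966, 1.0923), (11.4317, 2.9433), (0.0527, 0.8802)],
}

N = 49  # points in each of the three main clusters


def z_scores():
    """Return (cluster, dimension, z) where z = (estimate - true mean) / std. error."""
    results = []
    for cluster, dims in ESTIMATED.items():
        pairs = zip(dims, GENERATOR[cluster])
        for d, ((est_mean, _est_sd), (true_mean, true_sd)) in enumerate(pairs, start=1):
            std_error = true_sd / math.sqrt(N)
            results.append((cluster, d, (est_mean - true_mean) / std_error))
    return results


for cluster, d, z in z_scores():
    print(f"{cluster} d{d}: z = {z:+.2f}")
```

All nine values come out smaller than 2 in magnitude, which is consistent with the estimates differing from the generator parameters by no more than ordinary sampling variation.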

You can perform additional analyses on the same data, and the files will be updated to reflect new results. To reset the results back to starting conditions, reload the original data file.