Click the CLUSTER button in the Analysis panel to start an analysis.
By default, a single analysis run is performed. At the end of the analysis, the Num parameter indicates the number of clusters detected, and the Score parameter indicates the relative probability of the data resulting from that number of clusters. The higher the score, the “better” the fit between the data and the cluster model.
Note that the numerical values of the score can only be compared within a particular dataset. Different datasets will produce very different scores for equally good fits.
The 3-D scatter graph shows the data points comprising each cluster in a different colour. You can check and adjust which colour represents which cluster class by selecting the class from the Clusters drop-down list. The coloured block below the list indicates the colour of that cluster class. You can change the colour by clicking on the block and selecting a new colour from the dialog.
More details of the display features are given in Observing 3-D data.
Cluster analysis works by initially assigning data items to random clusters, and then adjusting the clusters to maximise the likelihood of the cluster model given the data. The final cluster model is dependent on the initial random assignment. It is therefore a very good idea to perform multiple analysis runs, and to accept the run that produces the best apparent fit to the data. This will usually, although not always, be the run with the highest score.
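The effect of the random starting assignment can be illustrated with a minimal sketch. It does not reproduce the algorithm used by KlustaWin. As a stand-in, it uses a plain k-means-style procedure with a fixed number of clusters, and takes the negative within-cluster sum of squares as the score. The names sq_dist, centres_of, run_once and best_of are hypothetical and exist only for this illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centres_of(points, labels, k, rng):
    """Mean position of each cluster; an empty cluster is re-seeded on a random point."""
    dims = len(points[0])
    centres = []
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c] or [rng.choice(points)]
        centres.append([sum(p[d] for p in members) / len(members) for d in range(dims)])
    return centres


def run_once(points, k, rng, max_iter=50):
    """One run: random initial assignment, then repeated adjustment of the clusters."""
    labels = [rng.randrange(k) for _ in points]
    for _ in range(max_iter):
        centres = centres_of(points, labels, k, rng)
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centres[c])) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so this run has converged
        labels = new_labels
    # Stand-in score (an assumption of this sketch): higher means a tighter fit.
    centres = centres_of(points, labels, k, rng)
    score = -sum(sq_dist(p, centres[lab]) for p, lab in zip(points, labels))
    return score, labels


def best_of(points, k, runs, seed=0):
    """Perform several runs from different random starts and keep the highest score."""
    rng = random.Random(seed)
    results = [run_once(points, k, rng) for _ in range(runs)]
    return max(results, key=lambda r: r[0])
```

Calling best_of(points, 2, runs=50) on a list of coordinate tuples returns the score and class labels of the best of 50 runs. Because each run starts from a different random assignment, the best of many runs can never score lower than the first run alone.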
To automatically perform multiple analyses, enter the number of runs that you wish to perform in the editable Runs box. Typically, one should perform 50 or more runs. When you click the CLUSTER button, that number of runs will be carried out. The box just below the Runs box shows the count of runs in the current batch as it builds up to the number that you entered. The box to the right shows the total number of runs performed on that dataset; unlike the batch count, it does not reset to zero if you click the CLUSTER button again. To reset this dataset total, you must reload the data.
KlustaWin stores the results of the most recent analysis, plus the results with the best score for each number of clusters detected. There is a text label Show with a button Best to the right of it, and a drop-down list to the right of that. The top item in the list is labelled recent, and the other items in the list show the number of clusters detected, followed by the best score for that number of clusters. If you select an item from this list, that cluster set is displayed. Clicking the Best button automatically displays the cluster set with the highest score.
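The bookkeeping described above can be pictured with a small sketch. It is an illustration only, not the program's internal code, and the names update_results and overall_best are hypothetical. The store keeps the most recent result and, for each number of clusters, the result with the highest score seen so far.

```python
def update_results(store, num_clusters, score, labels):
    """Record one finished run in the store (a plain dict)."""
    # The most recent run always replaces the previous "recent" entry.
    store["recent"] = (num_clusters, score, labels)
    # Keep only the best-scoring run for each number of clusters.
    best = store.setdefault("best", {})
    if num_clusters not in best or score > best[num_clusters][0]:
        best[num_clusters] = (score, labels)


def overall_best(store):
    """Return (num_clusters, score, labels) for the highest score of all."""
    best = store["best"]
    n = max(best, key=lambda key: best[key][0])
    return (n,) + best[n]
```

In this sketch, store["best"] corresponds to the items of the drop-down list below recent, and overall_best corresponds to what the Best button selects.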
You can save the cluster analysis currently displayed by clicking the Save button. If the most recent analysis is displayed, the output is saved to a file named name_op1.ext, name_op2.ext etc., where the number (1, 2, etc.) is incremented with each save. If a specific cluster set selected from the drop-down list is displayed, then the file is named name_Best_n_classes.txt, where n is the number of classes detected in that analysis.
If you re-start an analysis session, either by reloading the data file or by restarting the program, both the best score and the output file numbering are reset. This means that you should rename any output files that you wish to keep before restarting an analysis on the same source file.
Output results can be read back into the program, so that a particular cluster analysis can be displayed. See Loading partition files for details.
The output file has the following format:
Klustawin output
Number of classes found = 5
Score = 3254
3
4
3
.
.
.
The three lines at the start identify the file type, state the number of clusters found, and give the score of that analysis run.
The numbers that follow these lines show to which class the data item on the equivalent row in the source file has been assigned. There should be as many numbers as there are data items.
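If you want to process an output file outside the program, the format described above is simple to read. The following is a minimal sketch, assuming that the file has exactly the three header lines shown, with the value after the = sign, followed by one class number per line. The function name parse_klustawin_output is hypothetical.

```python
def parse_klustawin_output(text):
    """Split an output file into its header values and its list of class labels."""
    # Ignore blank lines and surrounding whitespace.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < 3:
        raise ValueError("expected three header lines")
    file_type = lines[0]
    # Header lines are assumed to take the form "label = value".
    num_classes = int(lines[1].split("=", 1)[1])
    score = float(lines[2].split("=", 1)[1])
    # Every remaining line holds the class of one data item, in source-file order.
    labels = [int(ln) for ln in lines[3:]]
    return file_type, num_classes, score, labels
```

A useful check after parsing is that the number of labels equals the number of data rows in the source file.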
If you click the Copy button in the analysis box in KlustaWin, a column of numbers is placed on the clipboard in text format, showing to which class the data item on the equivalent row in the source file has been assigned. There will be as many numbers as there are data items. The cluster set copied is the one currently displayed.
Click the Stats button to see a summary analysis of the clusters. The columns show the cluster number, the count of data items within that cluster, and the mean and standard deviation of the data within the cluster in each dimension.
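The same kind of summary can be computed from the source data and a column of class numbers, for example one obtained with the Copy button. The sketch below is an illustration under stated assumptions: the function name cluster_stats is hypothetical, and it uses the sample standard deviation (n - 1 denominator), which may or may not match the figure that the program reports.

```python
import statistics


def cluster_stats(points, labels):
    """Return one row per cluster: (cluster, count, means, standard deviations)."""
    # Group the data items by the class they were assigned to.
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    rows = []
    for cluster in sorted(groups):
        members = groups[cluster]
        # Transpose so that each entry holds all values for one dimension.
        dims = list(zip(*members))
        means = [statistics.mean(d) for d in dims]
        # Assumption: sample standard deviation; a single-item cluster is given 0.0.
        sds = [statistics.stdev(d) if len(d) > 1 else 0.0 for d in dims]
        rows.append((cluster, len(members), means, sds))
    return rows
```

Each returned row mirrors one line of the Stats display: the cluster number, the count of data items, and the per-dimension mean and standard deviation.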