SPSS K-means & R
What would be the best function/package to use in R to try and replicate the K-means clustering method used in SPSS? Here is an example of the syntax I would use in SPSS:
QUICK CLUSTER VAR1 TO VAR10 /MISSING=LISTWISE /CRITERIA=CLUSTER(5) MXITER(50) CONVERGE(.02) /METHOD=KMEANS(NOUPDATE)
In SPSS, use the /PRINT INITIAL option. This will give you the initial cluster centers, which seem to be fixed in SPSS, but random in R (see ?kmeansfor parameter centers).
If you use the printed initial cluster centers from SPSS output and the argument="Lloyd" parameter in kmeans, you should get the same results (at least it worked for me, testing with several repetitions).
Example of an SPSS-output of the initial cluster centers:
Cluster Cl1 Cl2 Cl3 Cl4 Var A 1 1 4 3 Var B 4 1 4 1 Var C 1 1 1 4 Var D 1 4 4 1 Var E 1 4 1 2 Var F 1 4 4 3
This table, replicated as matrix in R, with kmeans computation:
mat <- matrix(c(1,1,4,3,4,1,4,1,1,1,1,4,1,4,4,1,1,4,1,2,1,4,4,3), nrow=4, ncol=6) kmeans(na.omit(data.frame), centers=mat, iter.max=20, algorithm="Lloyd")
Be sure to use the same amount of maximum iterations in SPSS and R-kemans, and use Lloyd-method in R-kmeans.
However, I don't know whether it's better to have a fixed or a random choice of initial centers. I personally like the random choice, and compute a linear discriminant analysis with the found cluster groups to assess the classification accuracy, and rerun the kmeans clustering until I have a statisfying group classification.
Edit: I found this posting where the SPSS procedure of selecting initial clusters is described. Perhaps somebody knows of an R implementation?