I want perform cluster analysis for the following data (sample):
ID CODE1 CODE2 CODE3 CODE4 CODE5 CODE6
------------------------------------------------------------------
00001 0 1 1 0 0 0
00002 1 0 0 0 1 1
00003 0 1 0 1 1 1
00004 1 1 1 0 1 0
...
Where 1 indicates the presence of that code for a person, and 0 the absence.. Is k-means or hierarchical clustering most appropriate for clustering the codes for this kind of data (for about a million distinct ids), and with which distance measure? If neither of these methods are appropriate, what do you think is most appropriate?
Thank you