I have points with binary features:
id, feature 1, feature 2, ....
1, 0, 1, 0, 1, ...
2, 1, 1, 0, 1, ...
and the size of matrix is about 20k * 200k but it is sparse. I am using Mahout for clustering data by kmeans algorithm and have the following questions:
- Is kmeans a good candidate for binary features?
- Is there any way to reduce dimensions while keeping the concept of Manhattan distance measure (I need manhattan instead of Cosine or Tanimoto)
- The memory usage of kmeans is high and needs 4GB memory for each Map/Reduce Task on (4Mb Blocks on 400Mb vector file for 3k clusterss). Considering that Vector object in Mahout uses double entries, is there any way to use just Boolean entries for points but double entries for centers?