Question:
I have a collection of texts that I ran through doc2vec from the gensim library. The result is good: it finds similar texts very well. How can the texts be clustered?
Here is what I tried: I got a vector for each text and fed them all into k-means. The result is not very good.
What other approaches can be used with a trained doc2vec model?
Answer:
Since k-means only works with Euclidean distance, I suggest looking at the similar k-medoids algorithm. Unlike k-means, k-medoids can use any distance measure (cosine distance is suitable in this case), because each cluster center is an actual data point rather than a computed mean. The only drawback is that k-medoids is more time-consuming than k-means.
A full comparison of algorithms is offered here: https://stackoverflow.com/questions/21619794/what-makes-the-distance-measure-in-k-medoid-better-than-k-means
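
To make the k-medoids suggestion concrete, here is a minimal sketch of the simple alternating variant (assign each point to its nearest medoid, then re-pick each medoid) using cosine distance. It is written in plain Python for illustration only. The function names cosine_distance and k_medoids, the parameters max_iter and seed, and the toy 2-D vectors are all assumptions of this sketch, not part of gensim or of any particular library; in practice the input would be the vectors produced by the trained doc2vec model.

```python
import math
import random


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def k_medoids(vectors, k, max_iter=100, seed=0):
    """Cluster vectors with a simple alternating k-medoids scheme.

    Returns (medoids, labels): medoids holds the indices of the chosen
    center points, and labels[i] is the cluster number of vectors[i].
    """
    n = len(vectors)
    # Precompute all pairwise distances. This O(n^2) step is the main
    # reason k-medoids costs more than k-means.
    dist = [[cosine_distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]

    def assign(medoids):
        # Each point joins the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][medoids[c]])
                for i in range(n)]

    rng = random.Random(seed)
    medoids = rng.sample(range(n), k)  # hypothetical choice: random start

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                # Keep the old medoid if a cluster ends up empty.
                new_medoids.append(medoids[c])
                continue
            # The new medoid is the member with the smallest total
            # distance to the other members of its cluster.
            best = min(members, key=lambda m: sum(dist[m][j] for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids

    return medoids, assign(medoids)


# Toy stand-ins for document vectors: two groups pointing in
# roughly perpendicular directions.
toy_vectors = [[1.0, 0.1], [1.0, 0.0], [0.9, 0.2],
               [0.1, 1.0], [0.0, 1.0], [0.2, 0.9]]
toy_medoids, toy_labels = k_medoids(toy_vectors, k=2)
print(toy_medoids, toy_labels)
```

Because the medoids are always real documents, each cluster also comes with a representative text that can be inspected directly. The random start means different seeds may converge to different clusterings on real data, so it may be worth running it several times and keeping the result with the lowest total distance.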