Cluster Results Discussion
For each of the 5000 clusters, I picked the top 5 prototypes, which gives 25,000 queries (vectors) in all.
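The post does not say how the prototypes were ranked, so the following is only a sketch under an assumption: a prototype is taken to be a query whose bag-of-words vector is most similar (by cosine) to its cluster's centroid. The names `cosine`, `centroid`, and `top_prototypes` are hypothetical, and raw term counts stand in for whatever weighting was actually used.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Component-wise mean of a list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)  # Counter.update adds the counts together
    n = len(vectors)
    return {t: w / n for t, w in total.items()}

def top_prototypes(queries, k=5):
    """Return the k queries closest to the cluster centroid (assumed ranking rule)."""
    vectors = [Counter(q.split()) for q in queries]
    c = centroid(vectors)
    ranked = sorted(queries,
                    key=lambda q: cosine(Counter(q.split()), c),
                    reverse=True)
    return ranked[:k]

# Toy cluster, not real data from the experiment.
toy_cluster = ["child abuse bands", "child abuse headlines",
               "child abuse", "weather today"]
best = top_prototypes(toy_cluster, k=1)

# 5000 clusters x 5 prototypes each
total_queries = 5000 * 5
```

With 5 prototypes per cluster, `total_queries` matches the 25,000 figure quoted above.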
I wrote a tool (a web application) that lets me manually categorize the 25,000 queries. While categorizing some of them, I noticed that the clustering results were "fairly" good.
Let's analyze cluster 54.
- Total queries: 116
- Internal similarity: +0.411
- Internal standard deviation: +0.118
- External similarity: +0.003
- External standard deviation: +0.001
Top 5 prototypes:
- child abuse bands
- child abuse adalah
- child abuse and education
- child abuse in amercia
- child abuse headlines
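The similarity figures reported for cluster 54 can be illustrated with a small sketch. The exact definitions used by the clustering tool are not given in this post, so this assumes that internal similarity is the average cosine similarity between a query and the other queries in its cluster, that external similarity is the average cosine similarity between a query and the queries outside the cluster, and that the standard deviations are taken over those per-query averages. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_stats(members, others):
    """Assumed definitions; needs at least 2 members and 1 outside query."""
    m = [Counter(q.split()) for q in members]
    o = [Counter(q.split()) for q in others]
    # Per-query average similarity to the rest of its own cluster.
    internal = [mean(cosine(v, w) for j, w in enumerate(m) if j != i)
                for i, v in enumerate(m)]
    # Per-query average similarity to everything outside the cluster.
    external = [mean(cosine(v, w) for w in o) for v in m]
    return {
        "internal_sim": mean(internal),
        "internal_std": pstdev(internal),
        "external_sim": mean(external),
        "external_std": pstdev(external),
    }

# Toy data, not the 116 queries of cluster 54.
stats = cluster_stats(
    ["child abuse bands", "child abuse headlines", "child abuse and education"],
    ["cheap flights", "weather today"],
)
```

A high internal similarity with an external similarity near zero, as in cluster 54, means the queries resemble each other far more than they resemble the rest of the collection.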
Manual classification of the queries (subjective):
(1) child abuse bands
- Living\Family & Kids
- Information\Other
- Living\Health & Fitness
(2) child abuse adalah
- Living\Family & Kids
- Local & Regional
- Living\Health & Fitness
(3) child abuse and education
- Living\Family & Kids
- Local & Regional
- Information\Education
(4) child abuse in amercia
- Living\Family & Kids
- Local & Regional
- Living\Health & Fitness
(5) child abuse headlines
- Living\Family & Kids
- Local & Regional
- Information\References & Libraries
Observations:
- The queries were grouped together because they share a large number of words [note: the order of the words is not used in clustering]
- Google searches using these query strings return quite different results
- None of the available categories was really apropos
- The categories I placed the queries in were quite similar
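The point about shared words and ignored word order can be made concrete with a bag-of-words sketch. It assumes simple whitespace tokenization, which may differ from the preprocessing actually used; `bag_of_words` is a hypothetical helper.

```python
from collections import Counter

def bag_of_words(query):
    """Map a query to its word counts; word order is discarded."""
    return Counter(query.lower().split())

prototypes = [
    "child abuse bands",
    "child abuse adalah",
    "child abuse and education",
    "child abuse in amercia",
    "child abuse headlines",
]

# Reordering the words yields exactly the same vector.
same_bag = bag_of_words("child abuse bands") == bag_of_words("bands abuse child")

# Words that occur in every one of the five prototypes.
shared = set.intersection(*(set(bag_of_words(q)) for q in prototypes))
print(same_bag, sorted(shared))
```

Every prototype contains both "child" and "abuse", so the vectors overlap heavily even when the remaining word points to a very different intent.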