Monday, May 29, 2006

Cluster Results Discussion

For each of the 5000 clusters, I picked the top 5 prototypes, so in all I have 25,000 queries/vectors.
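How the prototypes are chosen is not spelled out here. A minimal sketch of one common approach, assuming a prototype is a cluster member whose vector is closest (by cosine similarity) to the cluster centroid; the function names and the `(query, vector)` input layout are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def top_prototypes(members, k=5):
    """members: list of (query, vector) pairs for a single cluster.

    Assumption: 'prototype' means a member close to the centroid.
    Returns the k queries whose vectors are most similar to it.
    """
    c = centroid([vec for _, vec in members])
    ranked = sorted(members, key=lambda m: cosine(m[1], c), reverse=True)
    return [query for query, _ in ranked[:k]]
```

Running `top_prototypes` once per cluster with `k=5` over 5000 clusters yields the 25,000 queries mentioned above.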

I wrote a tool (a web application) that lets me manually categorize the 25,000 queries. While categorizing some of them, I noticed that the clustering results were "fairly" good.

Let's analyze cluster 54.
  • Total queries: 116
  • Internal similarity: +0.411
  • Internal standard deviation: +0.118
  • External similarity: +0.003
  • External std deviation: +0.001
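The post does not define these four figures. A rough sketch of one plausible reading, assuming that "internal" is each member's average cosine similarity to the other members of its cluster, "external" is its average similarity to queries outside the cluster, and the reported values are the mean and standard deviation of those per-query averages. The function names are hypothetical, and the clustering tool may compute them differently:

```python
import math
from statistics import mean, pstdev

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def cluster_stats(inside, outside):
    """inside: vectors in the cluster (at least 2); outside: all other vectors (at least 1)."""
    # Per-member average similarity to the *other* members of the cluster.
    internal = [
        mean(cosine(v, w) for j, w in enumerate(inside) if j != i)
        for i, v in enumerate(inside)
    ]
    # Per-member average similarity to everything outside the cluster.
    external = [mean(cosine(v, w) for w in outside) for v in inside]
    return {
        "internal_sim": mean(internal),
        "internal_std": pstdev(internal),
        "external_sim": mean(external),
        "external_std": pstdev(external),
    }
```

Under this reading, a high internal similarity next to a near-zero external similarity, as for cluster 54, means the members resemble each other far more than they resemble the rest of the collection.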
Top 5 Queries:
  • child abuse bands
  • child abuse adalah
  • child abuse and education
  • child abuse in amercia
  • child abuse headlines
Remember, the objective is to classify each of these queries into 3 of the 67 possible categories.

Manual classification of the queries (subjective):
(1) child abuse bands
  • Living\Family & Kids
  • Information\Other
  • Living\Health & Fitness
(2) child abuse adalah
  • Living\Family & Kids
  • Local & Regional
  • Living\Health & Fitness
(3) child abuse and education
  • Living\Family & Kids
  • Local & Regional
  • Information\Education
(4) child abuse in amercia
  • Living\Family & Kids
  • Local & Regional
  • Living\Health & Fitness
(5) child abuse headlines
  • Living\Family & Kids
  • Local & Regional
  • Information\References & Libraries
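To put a number on how much the five manual classifications overlap, here is a small sketch that tallies the labels listed above and computes the Jaccard overlap between each pair of queries. The labels are copied from the list; the choice of Jaccard as the overlap measure is an assumption made for illustration:

```python
from collections import Counter
from itertools import combinations

# Manual labels, copied from the classification above.
labels = {
    "child abuse bands": {
        "Living\\Family & Kids", "Information\\Other", "Living\\Health & Fitness"},
    "child abuse adalah": {
        "Living\\Family & Kids", "Local & Regional", "Living\\Health & Fitness"},
    "child abuse and education": {
        "Living\\Family & Kids", "Local & Regional", "Information\\Education"},
    "child abuse in amercia": {
        "Living\\Family & Kids", "Local & Regional", "Living\\Health & Fitness"},
    "child abuse headlines": {
        "Living\\Family & Kids", "Local & Regional",
        "Information\\References & Libraries"},
}

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two label sets."""
    return len(a & b) / len(a | b)

# How many of the five queries received each category.
category_counts = Counter(cat for cats in labels.values() for cat in cats)

# Overlap for every pair of queries, keyed by (query1, query2).
pairwise = {
    (q1, q2): jaccard(labels[q1], labels[q2])
    for q1, q2 in combinations(labels, 2)
}
```

Every query carries Living\Family & Kids and four of the five carry Local & Regional, so each pair shares at least one label.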
Some observations:
  • The queries were grouped together because they share a large number of words [note: the order of the words is not used in clustering]
  • Google searches using these query strings return quite different results
  • None of the categories were really apropos
  • The categories I placed them in were quite similar
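The first observation can be illustrated with a bag-of-words sketch. It assumes raw term counts with no weighting (the actual vectors may well be weighted differently, for example with tf-idf), and the function names are hypothetical:

```python
import math
from collections import Counter

def bag_of_words(query):
    """Term counts for a query; word order is discarded."""
    return Counter(query.lower().split())

def cosine(a, b):
    """Cosine similarity of two term-count Counters."""
    dot = sum(a[term] * b[term] for term in a)  # missing terms count as 0
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Reordering the words leaves the vector unchanged.
same = bag_of_words("child abuse bands") == bag_of_words("bands child abuse")

# Two of the three words are shared, so the similarity is high
# even though the queries ask for different things.
sim = cosine(bag_of_words("child abuse bands"),
             bag_of_words("child abuse headlines"))
```

Here `same` is `True` and `sim` is 2/3, which shows why queries that start with "child abuse" land in one cluster regardless of what the user was actually looking for.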
