Hamotzi's Data Mining Log: Cluster Results Discussion

For each of the 5,000 clusters, I picked the top 5 prototypes, so in all I have 25,000 queries/vectors (5,000 × 5).

I wrote a tool (a web application) that lets me manually categorize the 25,000 queries. While categorizing some of them, I noticed that the clustering results were "fairly" good.

Let's analyze cluster 54.

Total queries: 116
Internal similarity: +0.411
Internal standard deviation: +0.118
External similarity: +0.003
External standard deviation: +0.001
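
The log does not say how these four figures were computed. As a rough sketch of one common definition, assuming cosine similarity, internal similarity can be taken as each member's average similarity to the other members of its cluster, and external similarity as each member's average similarity to vectors outside the cluster. The function names and the toy vectors below are hypothetical and are not taken from the clustering tool used here.

```python
import math
from statistics import mean, pstdev

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_stats(members, others):
    """Return (internal mean, internal std, external mean, external std).

    Assumed definition (not confirmed by the log): each member gets one
    internal score (average similarity to the other members) and one
    external score (average similarity to the non-members).
    Needs at least two members and at least one non-member.
    """
    internal = [
        mean(cosine(m, n) for j, n in enumerate(members) if j != i)
        for i, m in enumerate(members)
    ]
    external = [mean(cosine(m, o) for o in others) for m in members]
    return mean(internal), pstdev(internal), mean(external), pstdev(external)

# Toy data, invented for illustration only.
cluster = [[1, 1, 0, 0], [1, 1, 1, 0], [1, 0, 1, 0]]
rest = [[0, 0, 0, 1], [0, 0, 1, 1]]
print(similarity_stats(cluster, rest))
```

Under this reading, a high internal similarity with a near-zero external similarity, as reported for cluster 54, would mean the members resemble each other far more than they resemble the rest of the collection.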

Top 5 Queries:

child abuse bands
child abuse adalah
child abuse and education
child abuse in amercia
child abuse headlines

Remember, the objective is to classify each query into 3 of the 67 possible categories.

Manual classification of the queries (subjective):
(1) child abuse bands

Living\Family & Kids
Information\Other
Living\Health & Fitness

(2) child abuse adalah

Living\Family & Kids
Local & Regional
Living\Health & Fitness

(3) child abuse and education

Living\Family & Kids
Local & Regional
Information\Education

(4) child abuse in amercia

Living\Family & Kids
Local & Regional
Living\Health & Fitness

(5) child abuse headlines

Living\Family & Kids
Local & Regional
Information\References & Libraries
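
To put a number on how much the manual labels above overlap, here is a small sketch that compares the category sets pairwise. The labels are copied from the list above; the Jaccard measure (shared categories divided by all categories used by either query) is my choice for illustration, not something the categorization tool computes.

```python
from itertools import combinations

# Manual labels copied from the classification above.
labels = {
    "child abuse bands": {
        r"Living\Family & Kids", r"Information\Other", r"Living\Health & Fitness"},
    "child abuse adalah": {
        r"Living\Family & Kids", "Local & Regional", r"Living\Health & Fitness"},
    "child abuse and education": {
        r"Living\Family & Kids", "Local & Regional", r"Information\Education"},
    "child abuse in amercia": {
        r"Living\Family & Kids", "Local & Regional", r"Living\Health & Fitness"},
    "child abuse headlines": {
        r"Living\Family & Kids", "Local & Regional",
        r"Information\References & Libraries"},
}

def jaccard(a, b):
    """Shared categories divided by all categories used by either query."""
    return len(a & b) / len(a | b)

# Print the overlap for every pair of queries.
for (q1, s1), (q2, s2) in combinations(labels.items(), 2):
    print(f"{q1!r} vs {q2!r}: {jaccard(s1, s2):.2f}")
```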

Some observations:

The queries were grouped together because they share a large number of words (note: word order is not used in clustering).
Google searches using these query strings return quite different results.
None of the categories was really apropos.
The categories I placed them in were quite similar.
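
The first observation can be illustrated with a minimal bag-of-words sketch. It assumes raw word counts and cosine similarity, which may differ from the weighting actually used for the clustering; the helper names are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(query):
    """Map a query to word counts; word order is discarded."""
    return Counter(query.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

queries = [
    "child abuse bands",
    "child abuse adalah",
    "child abuse and education",
    "child abuse in amercia",
    "child abuse headlines",
]
vectors = [bag_of_words(q) for q in queries]

# Compare every pair of queries; all of them share "child" and "abuse".
for i in range(len(queries)):
    for j in range(i + 1, len(queries)):
        sim = cosine_similarity(vectors[i], vectors[j])
        print(f"{queries[i]!r} vs {queries[j]!r}: {sim:.3f}")
```

Because only word counts are compared, "child abuse bands" and "bands abuse child" come out identical, and every pair above scores well simply for sharing "child" and "abuse", whatever the remaining word means.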

Monday, May 29, 2006