Monday, May 29, 2006

Categories Frequency Data

I tried posting a graph, but no cigar.
Here is the raw data:

Category Labeled Validation
1 6 23
2 10 59
3 4 7
4 5 51
5 4 25
6 4 20
7 5 11
8 5 60
9 9 85
10 6 71
11 3 46
12 6 60
13 6 69
14 5 75
15 4 55
16 3 5
17 4 34
18 38 241
19 41 399
20 10 116
21 10 126
22 20 252
23 6 77
24 9 190
25 15 141
26 4 91
27 4 61
28 6 40
29 5 25
30 9 125
31 5 80
32 4 68
33 8 44
34 10 70
35 8 66
36 4 99
37 4 13
38 2 29
39 3 28
40 4 32
41 3 25
42 4 48
43 12 100
44 3 7
45 4 135
46 9 365
47 2 11
48 5 40
49 3 10
50 3 16
51 7 51
52 24 301
53 3 11
54 2 63
55 36 332
56 4 6
57 4 9
58 3 5
59 4 9
60 4 2
61 4 11
62 3 2
63 5 52
64 3 37
65 10 9
66 3 0
67 3 3
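
Since the graph would not post, here is a rough sketch of one way to compare the two columns: the number of validation queries per labeled query for a category. The rows are a hand-picked subset of the table above, and the function name validation_per_labeled is just a name I made up for illustration.

```python
# Subset of the table above: (category id, labeled count, validation count).
rows = [(18, 38, 241), (19, 41, 399), (46, 9, 365), (65, 10, 9)]


def validation_per_labeled(labeled, validation):
    """Validation queries per labeled query for one category."""
    return validation / labeled


ratios = {cat: validation_per_labeled(lab, val) for cat, lab, val in rows}

# Print the categories from most to least under-represented in the labeled set.
for cat, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"category {cat}: {ratio:.1f} validation queries per labeled query")
```

A high ratio means the category has few labeled examples relative to how often it shows up in the validation data.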

Category Frequencies

The chart below compares the frequencies of the categories in the learning (labeled) dataset and the validation (testing) dataset.

Categories

The categories for the queries are: (format is id, category string)
1 Computers\Hardware
2 Computers\Internet & Intranet
3 Computers\Mobile Computing
4 Computers\Multimedia
5 Computers\Networks & Telecommunication
6 Computers\Other
7 Computers\Security
8 Computers\Software
9 Entertainment\Celebrities
10 Entertainment\Games & Toys
11 Entertainment\Humor & Fun
12 Entertainment\Movies
13 Entertainment\Music
14 Entertainment\Other
15 Entertainment\Pictures & Photos
16 Entertainment\Radio
17 Entertainment\TV
18 Information\Arts & Humanities
19 Information\Companies & Industries
20 Information\Education
21 Information\Law & Politics
22 Information\Local & Regional
23 Information\Other
24 Information\References & Libraries
25 Information\Science & Technology
26 Living\Book & Magazine
27 Living\Car & Garage
28 Living\Career & Jobs
29 Living\Dating & Relationships
30 Living\Family & Kids
31 Living\Fashion & Apparel
32 Living\Finance & Investment
33 Living\Food & Cooking
34 Living\Furnishing & Houseware
35 Living\Gifts & Collectables
36 Living\Health & Fitness
37 Living\Landscaping & Gardening
38 Living\Other
39 Living\Pets & Animals
40 Living\Real Estate
41 Living\Religion & Belief
42 Living\Tools & Hardware
43 Living\Travel & Vacation
44 Online Community\Chat & Instant Messaging
45 Online Community\Forums & Groups
46 Online Community\Homepages
47 Online Community\Other
48 Online Community\People Search
49 Online Community\Personal Services
50 Shopping\Auctions & Bids
51 Shopping\Bargains & Discounts
52 Shopping\Buying Guides & Researching
53 Shopping\Lease & Rent
54 Shopping\Other
55 Shopping\Stores & Products
56 Sports\American Football
57 Sports\Auto Racing
58 Sports\Baseball
59 Sports\Basketball
60 Sports\Hockey
61 Sports\News & Scores
62 Sports\Olympic Games
63 Sports\Other
64 Sports\Outdoor Recreations
65 Sports\Schedules & Tickets
66 Sports\Soccer
67 Sports\Tennis
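
Each line in the list above has the form "id, top-level group, backslash, subcategory", so it can be split mechanically. A minimal sketch, assuming the id is separated from the category string by a single space (parse_category is a hypothetical helper name):

```python
def parse_category(line):
    """Split 'id Top\\Sub' into (id, top-level group, subcategory)."""
    cat_id, name = line.split(" ", 1)   # the id is everything before the first space
    top, sub = name.split("\\", 1)      # a backslash separates group and subcategory
    return int(cat_id), top, sub


# Three lines copied from the list above.
lines = [
    r"1 Computers\Hardware",
    r"44 Online Community\Chat & Instant Messaging",
    r"67 Sports\Tennis",
]
parsed = [parse_category(line) for line in lines]

for cat_id, top, sub in parsed:
    print(cat_id, "|", top, "|", sub)
```

Splitting on the first space only matters for groups like "Online Community", whose name itself contains a space.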

Cluster Results Discussion

For each of the 5000 clusters, I picked the top 5 prototypes, so in all I have 25,000 queries/vectors.
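
Picking the prototypes can be sketched roughly as below. This assumes a "prototype" is a query with a high similarity score to its cluster (for example, to the cluster centroid); the scores and the helper name top_prototypes are made up for illustration and are not taken from my actual run.

```python
import heapq


def top_prototypes(members, k=5):
    """members: list of (query, similarity score). Return the k best-scoring queries."""
    best = heapq.nlargest(k, members, key=lambda m: m[1])
    return [query for query, _ in best]


# Hypothetical cluster with made-up similarity scores.
cluster = [
    ("a", 0.91), ("b", 0.40), ("c", 0.77), ("d", 0.85),
    ("e", 0.62), ("f", 0.58), ("g", 0.95),
]
prototypes = top_prototypes(cluster)

# 5000 clusters with 5 prototypes each gives the 25,000 queries.
total = 5000 * 5
print(prototypes, total)
```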

I wrote a tool (a web application) that lets me manually categorize the 25,000 queries. While categorizing some of them, I noticed that the clustering results were "fairly" good.

Let's analyze cluster 54.
  • Total queries: 116
  • Internal similarity: +0.411
  • Internal standard deviation: +0.118
  • External similarity: +0.003
  • External standard deviation: +0.001
Top 5 Queries:
  • child abuse bands
  • child abuse adalah
  • child abuse and education
  • child abuse in amercia
  • child abuse headlines
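
For readers wondering what a number like the internal similarity might mean, here is a minimal sketch. It assumes the statistic is the mean pairwise cosine similarity between the vectors of a cluster, with the standard deviation taken over the same pairs; the clustering tool may define it differently, and the three toy vectors are invented.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def internal_similarity(vectors):
    """Mean and population std deviation of all pairwise cosine similarities."""
    sims = [cosine(u, v) for u, v in combinations(vectors, 2)]
    return mean(sims), pstdev(sims)


# Toy term-count vectors (assumption: one dimension per word).
vectors = [[1, 1, 0], [1, 1, 0], [1, 0, 1]]
isim, isdev = internal_similarity(vectors)
print(f"internal similarity: {isim:+.3f}, std deviation: {isdev:+.3f}")
```

The external similarity would be the same idea, computed between the cluster's vectors and the vectors outside it.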
Remember, the objective is to classify each of these queries into 3 of the 67 possible categories.

Manual classification of the queries (subjective):
(1) child abuse bands
  • Living\Family & Kids
  • Information\Other
  • Living\Health & Fitness
(2) child abuse adalah
  • Living\Family & Kids
  • Information\Local & Regional
  • Living\Health & Fitness
(3) child abuse and education
  • Living\Family & Kids
  • Information\Local & Regional
  • Information\Education
(4) child abuse in amercia
  • Living\Family & Kids
  • Information\Local & Regional
  • Living\Health & Fitness
(5) child abuse headlines
  • Living\Family & Kids
  • Information\Local & Regional
  • Information\References & Libraries
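
One possible way to turn the five manual labelings above into three categories for the whole cluster is a simple vote count. This is only a sketch of that idea, not something I have settled on; "Local & Regional" is written with its full name from the category list.

```python
from collections import Counter

# Manual labels for the five prototypes of cluster 54, as listed above.
labels = {
    "child abuse bands": [
        r"Living\Family & Kids", r"Information\Other", r"Living\Health & Fitness"],
    "child abuse adalah": [
        r"Living\Family & Kids", r"Information\Local & Regional", r"Living\Health & Fitness"],
    "child abuse and education": [
        r"Living\Family & Kids", r"Information\Local & Regional", r"Information\Education"],
    "child abuse in amercia": [
        r"Living\Family & Kids", r"Information\Local & Regional", r"Living\Health & Fitness"],
    "child abuse headlines": [
        r"Living\Family & Kids", r"Information\Local & Regional",
        r"Information\References & Libraries"],
}

# Count how often each category was assigned across the prototypes.
votes = Counter(label for cats in labels.values() for label in cats)
top3 = [label for label, _ in votes.most_common(3)]
print(top3)
```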
Some observations:
  • The queries were grouped together because they share a large number of words [note: the order of the words is not used in clustering]
  • Google searches using these query strings return quite different results
  • None of the categories were really apropos
  • The categories I placed them in were quite similar
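
The first observation can be illustrated with a small bag-of-words sketch. It assumes plain word counts and cosine similarity, which may not match the exact weighting used in my clustering run; the point is only that word order drops out and that shared words drive the similarity.

```python
import math
from collections import Counter


def bag_of_words(query):
    """Word counts for a query; word order is discarded."""
    return Counter(query.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count bags."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


q1 = bag_of_words("child abuse bands")
q2 = bag_of_words("child abuse adalah")

# Reordering the words gives exactly the same bag.
same_bag = bag_of_words("bands child abuse") == q1

# Two of the three words are shared, so the similarity is high.
similarity = cosine(q1, q2)
print(same_bag, round(similarity, 3))
```

So "child abuse bands" and "child abuse adalah" look close to the clustering even though a search engine treats them as very different requests.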