Wednesday, March 22, 2006

Update: On Goal for this week

I generated the Berkeley datasets. This is definitely much faster than using MySQL. I now have the following datasets:
  1. Validation dataset
  2. Training dataset
Each of these datasets has the following tuples:
<query, <category>*, <ngram, frequency, tf-idf, normalized tf-idf>*>
i.e., for each query, I have the categories the query belongs to and its ngrams. For each ngram, I have its frequency in the query, its tf-idf (term frequency-inverse document frequency) value, and its normalized tf-idf value.
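To make the per-ngram values concrete, here is a minimal sketch of one common way to compute them. It is an assumption, not necessarily what my code does: tf is the raw count in the query, idf is log(N / df) over all queries, and the normalized value is the tf-idf divided by the L2 norm of the query's tf-idf vector. The function names and the sample queries are made up for illustration.

```python
import math
from collections import Counter

def ngrams(query, n=1):
    """Split a query on whitespace and return its word n-grams."""
    tokens = query.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_features(queries, n=1):
    """Return, per query, {ngram: (frequency, tf-idf, normalized tf-idf)}.

    Assumed variant: tf = raw count, idf = log(N / df), L2 normalization.
    """
    counts = [Counter(ngrams(q, n)) for q in queries]
    # Document frequency: number of queries that contain each n-gram.
    df = Counter()
    for c in counts:
        df.update(c.keys())
    total = len(queries)
    features = []
    for c in counts:
        raw = {g: f * math.log(total / df[g]) for g, f in c.items()}
        # L2 norm of this query's tf-idf vector (0 if every idf is 0).
        norm = math.sqrt(sum(v * v for v in raw.values()))
        features.append({
            g: (c[g], raw[g], raw[g] / norm if norm > 0 else 0.0)
            for g in c
        })
    return features

# Hypothetical queries, for illustration only.
feats = tfidf_features(["cheap laptop", "laptop memory", "garden tools"])
```

A quick sanity check for this variant is that the squared normalized values of each query sum to 1, and that an ngram appearing in fewer queries gets a higher tf-idf than a more common one with the same count.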

I picked category 1 (Computers/Hardware) and generated SVM training and validation datasets from the above Berkeley datasets. The format of the training and validation datasets is:

Data Statistics:
  • the training dataset had only 6+/172- (i.e. 6 instances of this category and 172 instances that did not belong to this category).
  • the validation dataset had 23+/777-
  • Since the data is unbalanced, the results were, as expected, horrible.
  • I tried the RBF kernel with several different parameters - including the ones generated by and - no cigar. All 23 positive instances were classified as negative, i.e. all 800 were classified as ~Computers/Hardware.
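
The numbers above can be checked with a small sketch of the confusion counts. It only uses the validation counts from this post (23+/777-) and the observed outcome (everything predicted negative); the `binary_metrics` helper is a hypothetical name, not part of any SVM package.

```python
def binary_metrics(y_true, y_pred):
    """Confusion counts plus accuracy, precision and recall for +1/-1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when nothing is predicted positive.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Validation set from this post: 23 positives, 777 negatives,
# and a classifier that labels all 800 instances as negative.
y_true = [1] * 23 + [-1] * 777
y_pred = [-1] * 800
metrics = binary_metrics(y_true, y_pred)
```

With these inputs the accuracy is 777/800 (about 97%) while recall on Computers/Hardware is 0, which is why accuracy alone says little on data this unbalanced.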
Possible Next steps:
  • Validate the tf-idf calculations are correct (I need to double-check the code)
  • Is it possible that my datasets are corrupt (incorrect?)?
  • Try another SVM software package - SVMlight
  • Try another category that has more examples
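
For the last step, a rough sketch of how I could pick a category with more examples: count how many queries carry each category label and sort by that count. It assumes the records can be read as (query, categories) pairs; the `positives_per_category` name, the sample queries, and every category other than Computers/Hardware are made up for illustration.

```python
from collections import Counter

def positives_per_category(records):
    """Count queries per category, most frequent category first.

    `records` is assumed to be an iterable of (query, categories) pairs.
    """
    counts = Counter()
    for _query, categories in records:
        # Use a set so a category repeated on one query counts once.
        counts.update(set(categories))
    return counts.most_common()

# Hypothetical records, for illustration only.
records = [
    ("cheap laptop", ["Computers/Hardware", "Shopping"]),
    ("garden tools", ["Shopping"]),
    ("flower seeds", ["Shopping", "Shopping"]),
]
ranking = positives_per_category(records)
```

The first entry of `ranking` is the category with the most positive examples, which would be the natural candidate for the next experiment.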

