Update: On Goal for this week
I generated the Berkeley datasets. Using Berkeley DB is definitely much faster than using MySQL. So I have the following datasets:
<query, categories, ngrams (frequency, tf-idf, normalized tf-idf)>
i.e. for each query, I have the categories the query belongs to and the query's ngrams. For each ngram, I have its frequency in the query, its tf-idf (term frequency, inverse document frequency) value, and its normalized tf-idf value.
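To make the tf-idf part concrete, here is a minimal sketch of how those per-ngram values could be computed. It is only an illustration: the function and variable names are made up, the input is assumed to be a plain map from each query to its list of ngrams, and L2 (unit-length) normalization is an assumption about what "normalized" means here.

```python
import math
from collections import Counter

def tfidf_per_query(query_ngrams):
    """For each query, compute (frequency, tf-idf, normalized tf-idf) per ngram.

    query_ngrams: dict mapping a query id to the list of ngrams extracted
    from that query (hypothetical structure, for illustration only).
    """
    n_queries = len(query_ngrams)

    # Document frequency: in how many queries does each ngram occur?
    df = Counter()
    for ngrams in query_ngrams.values():
        df.update(set(ngrams))

    results = {}
    for query, ngrams in query_ngrams.items():
        tf = Counter(ngrams)                     # raw frequency in this query
        tfidf = {g: f * math.log(n_queries / df[g]) for g, f in tf.items()}

        # Assumed normalization: scale the query's tf-idf vector to unit length.
        norm = math.sqrt(sum(v * v for v in tfidf.values())) or 1.0
        results[query] = {g: (tf[g], v, v / norm) for g, v in tfidf.items()}
    return results
```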
I picked category 1 (Computers/Hardware) and generated a training and a validation SVM dataset from the above Berkeley datasets. The format of the training and validation datasets is:
<label (+1/-1)> <ngram-id>:<value> <ngram-id>:<value> ...
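For reference, here is a rough sketch of how one record from the Berkeley datasets could be written out in that sparse format. The helper name, the ngram-to-integer-id mapping, and the choice of normalized tf-idf as the feature value are assumptions for illustration, not necessarily what my generator does.

```python
def to_libsvm_line(categories, ngram_values, ngram_ids, target_category=1):
    """Format one query as a libsvm line: '<label> <id>:<value> ...'.

    categories   : set of category ids the query belongs to
    ngram_values : dict ngram -> feature value (e.g. normalized tf-idf)
    ngram_ids    : dict ngram -> integer feature id (libsvm needs numeric ids)
    All names here are hypothetical.
    """
    label = "+1" if target_category in categories else "-1"
    # libsvm expects feature indices in increasing order
    feats = sorted((ngram_ids[g], v) for g, v in ngram_values.items())
    return label + " " + " ".join(f"{i}:{v:.6f}" for i, v in feats)

# Example: a query tagged with category 1 and two ngram features
print(to_libsvm_line({1, 7}, {"video card": 0.83, "driver": 0.55},
                     {"video card": 12, "driver": 40}))
# -> +1 12:0.830000 40:0.550000
```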
Data Statistics:
- The training dataset had only 6+/172- (i.e. 6 instances of this category and 172 instances that did not belong to it).
- The validation dataset had 23+/777-.
- Since the data is unbalanced, the results were as expected: horrible.
- I tried the RBF kernel with several different parameters, including the ones generated by easy.py and grid.py, but no cigar. All 23 positive instances were classified as negative, i.e. all 800 validation instances were classified as ~Computers/Hardware (see the weighted-SVM sketch after this list).
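One thing worth trying before switching tools: libsvm's -wi option scales the C parameter for a single class, which is the usual first remedy for a 6+/172- imbalance. Below is a rough sketch using libsvm's bundled Python wrapper (svmutil); the file names and the C/gamma values are placeholders, and the +1 weight is just the rough negative-to-positive ratio.

```python
# svmutil ships in libsvm's python/ directory
# (newer releases: from libsvm.svmutil import ...)
from svmutil import svm_read_problem, svm_train, svm_predict

# Placeholder file names, not the actual dataset paths.
y_train, x_train = svm_read_problem("category1.train.svm")
y_val,   x_val   = svm_read_problem("category1.val.svm")

# RBF kernel (-t 2) with illustrative C and gamma (grid.py would supply these),
# plus a heavier penalty on mistakes for the rare positive class:
# -w1 scales C for class +1, here by roughly the 172/6 ratio.
model = svm_train(y_train, x_train, "-t 2 -c 8 -g 0.5 -w1 29")

# Plain accuracy is misleading at 23+/777-; count the positives explicitly.
p_labels, p_acc, p_vals = svm_predict(y_val, x_val, model)
tp = sum(1 for y, p in zip(y_val, p_labels) if y == 1 and p == 1)
fn = sum(1 for y, p in zip(y_val, p_labels) if y == 1 and p != 1)
print("positives caught: %d, positives missed: %d" % (tp, fn))
```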
Next steps:
- Validate that the tf-idf calculations are correct (I need to double-check the code).
- Is it possible that my datasets are corrupt (incorrect)?
- Try another SVM package - SVMlight.
- Try another category that has more examples.