Hamotzi's Data Mining Log: Update: On Goal for this week

I generated the Berkeley datasets. This is definitely much faster than using MySQL. I now have the following datasets:

Validation dataset
Training dataset

Each of these datasets has the following tuples:
<>* *>
That is, for each query I have the categories the query belongs to and its ngrams. For each ngram, I have its frequency in the query, its tf-idf (term frequency, inverse document frequency) value, and its normalized tf-idf value.
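For reference, here is a minimal sketch of how per-query ngram frequency, tf-idf, and normalized tf-idf could be computed. It is not the code used for this post. It assumes whitespace tokenization, idf = log(N / df) over the set of queries, and L2 normalization within each query. The function names `ngrams` and `tfidf_per_query` are made up for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the word n-grams (joined by spaces) found in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_per_query(queries, n=1):
    """For each query, map ngram -> (frequency, tf-idf, normalized tf-idf).

    Assumptions (not taken from the post): idf = log(N / df), and the
    normalized value divides by the L2 norm of the query's tf-idf vector.
    """
    grams_per_query = [Counter(ngrams(q.lower().split(), n)) for q in queries]
    num_queries = len(queries)

    # Document frequency: the number of queries that contain each n-gram.
    doc_freq = Counter()
    for grams in grams_per_query:
        doc_freq.update(grams.keys())

    results = []
    for grams in grams_per_query:
        raw = {g: f * math.log(num_queries / doc_freq[g]) for g, f in grams.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        results.append({
            g: (grams[g], raw[g], raw[g] / norm if norm > 0 else 0.0)
            for g in grams
        })
    return results
```

Under these assumptions, an ngram that appears in every query gets a tf-idf of 0, so comparing a few hand-computed values against the stored ones is a quick way to sanity-check a dataset.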

I picked category 1 (Computers/Hardware) and generated training and validation SVM datasets from the Berkeley datasets above. The format of the training and validation datasets is:
*
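As an illustration, the sparse text format read by LIBSVM (and SVMlight) puts one example per line: a label followed by index:value pairs in ascending index order. The sketch below shows how one query could be written in that format. It assumes a +1/-1 label for "belongs to the category or not" and a hypothetical `vocab` dictionary that maps each ngram to a 1-based feature index. Neither is taken from the original code.

```python
def to_svm_line(label, features, vocab):
    """Format one example as '<label> <index>:<value> ...'.

    label    -- +1 or -1 (assumed encoding of in-category / not-in-category)
    features -- dict mapping ngram -> value (e.g. normalized tf-idf)
    vocab    -- hypothetical dict mapping ngram -> 1-based feature index
    N-grams that are missing from vocab are skipped.
    """
    # Sort by feature index so the indices appear in ascending order.
    pairs = sorted((vocab[g], v) for g, v in features.items() if g in vocab)
    body = " ".join(f"{idx}:{val:.6g}" for idx, val in pairs)
    return f"{label:+d} {body}".rstrip()


example_vocab = {"cheap": 1, "laptop": 2}
print(to_svm_line(1, {"laptop": 0.5, "cheap": 0.25}, example_vocab))
```

Building the vocabulary from the training set only, and reusing it for the validation set, keeps the feature indices consistent between the two files.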

Data Statistics:

The training dataset had only 6+/172- (i.e. 6 instances of this category and 172 instances that did not belong to it).
The validation dataset had 23+/777-.
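Counts like these can be read straight off an SVM-format file. The following is a small sketch, assuming each non-empty line starts with a numeric label where positive means "in the category". The helper name `label_counts` is made up for illustration.

```python
from collections import Counter


def label_counts(lines):
    """Count positive ('+') and negative ('-') examples in SVM-format lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        label = float(line.split()[0])  # the first field is the label
        counts["+" if label > 0 else "-"] += 1
    return counts


sample = ["+1 1:0.25 2:0.5", "-1 3:0.7", "-1 2:0.1"]
counts = label_counts(sample)
print(f"{counts['+']}+/{counts['-']}-")
```

With the numbers above, the positive class is 6 of 178 training instances (about 3.4%) and 23 of 800 validation instances (about 2.9%).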

Results:

Since the data is unbalanced, the results were as expected: horrible.
I tried the RBF kernel with several different parameters, including the ones generated by easy.py and grid.py. No cigar. All 23 positive instances were classified as negative, i.e. all 800 validation instances were classified as ~Computers/Hardware.
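To make the failure concrete, the sketch below works out the usual binary metrics for this outcome. It uses only the counts reported above (23 positives and 777 negatives, all predicted negative). The function `binary_metrics` is written for illustration and is not part of any SVM package.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, precision and recall from confusion-matrix counts.

    Precision and recall are None when their denominator is zero.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Everything predicted negative: no true or false positives at all.
metrics = binary_metrics(tp=0, fp=0, tn=777, fn=23)
print(metrics)
```

Accuracy comes out at 777/800, roughly 97.1%, while recall on Computers/Hardware is 0 and precision is undefined. This is why accuracy alone says very little on data this unbalanced.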

Possible Next steps:

Validate that the tf-idf calculations are correct (I need to double-check the code)
Check whether my datasets are corrupt or incorrect
Try another SVM package: SVMlight
Try another category that has more examples
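For the last item, one way to pick a better category is to count how many queries carry each category label and start with the most frequent one. This is a sketch under the assumption that the data can be read as (query, categories) pairs. The record layout and every category name in the example other than Computers/Hardware are made up.

```python
from collections import Counter


def category_counts(records):
    """Count how many queries carry each category label.

    records -- iterable of (query, categories) pairs (assumed layout)
    """
    counts = Counter()
    for _query, categories in records:
        counts.update(set(categories))  # count each category once per query
    return counts


# Made-up records, for illustration only.
records = [
    ("cheap laptop", ["Computers/Hardware", "Shopping"]),
    ("pizza recipe", ["Living/Food"]),
    ("buy running shoes", ["Shopping"]),
]
for category, count in category_counts(records).most_common():
    print(category, count)
```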

Wednesday, March 22, 2006
