Thursday, June 30, 2005

KDD Update

So my results using Multinomial Naive Bayes with some tweaking have worked out well.

Using only 1697 features, I could classify 75K queries.
This is remarkable considering that there are 83K features and 800K queries.
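
For reference, here is a minimal sketch of Multinomial Naive Bayes restricted to a small feature set. It is not the actual code behind the numbers above. The class name, the Laplace smoothing, the whitespace tokenizer, and the choice to leave a query unclassified when it contains none of the selected features are all assumptions made for illustration, and the tiny training set is made up.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Tiny multinomial Naive Bayes over a fixed vocabulary (Laplace smoothing)."""

    def __init__(self, vocabulary, alpha=1.0):
        self.vocabulary = set(vocabulary)
        self.alpha = alpha

    def _tokens(self, query):
        # Keep only tokens that belong to the selected feature set.
        return [t for t in query.lower().split() if t in self.vocabulary]

    def fit(self, labelled_queries):
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)
        for query, category in labelled_queries:
            self.class_counts[category] += 1
            self.token_counts[category].update(self._tokens(query))
        return self

    def predict(self, query):
        tokens = self._tokens(query)
        if not tokens:
            return None  # assumption: no known feature means "unclassified"
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best, best_score = None, -math.inf
        for category, n in self.class_counts.items():
            counts = self.token_counts[category]
            denom = sum(counts.values()) + self.alpha * vocab_size
            # log prior + sum of smoothed log likelihoods
            score = math.log(n / total)
            for t in tokens:
                score += math.log((counts[t] + self.alpha) / denom)
            if score > best_score:
                best, best_score = category, score
        return best


# Made-up training data, only to show the call pattern.
train = [
    ("legal action against creditors", "Information\\Law & Politics"),
    ("legal assistant jobs", "Living\\Career & Jobs"),
    ("jobs in new york", "Living\\Career & Jobs"),
]
features = {"legal", "action", "creditors", "jobs", "assistant"}
model = MultinomialNB(features).fit(train)
```

With a vocabulary this small, any query that shares no token with the feature set comes back as `None`, which is one plausible way a small feature set ends up covering only part of the queries.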

The other "tweak" that I might perform, given the constraints of time and hardware resources, is to "hand tweak" some of the queries.
For example, for "legal action against harrasing creditors" suggest "Information\Law & Politics".
Similarly, for "legal assistant jobs in news york city" suggest both "Information\Law & Politics" and "Living\Career & Jobs".
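
One possible way to store such hand tweaks, sketched here as an assumption rather than what I actually use: a mapping from query text to one or more categories, flattened into (query, category) pairs that can be appended to a training set. The names `hand_labels` and `to_training_pairs` are made up for this example.

```python
# Hypothetical hand-made suggestions: query text -> one or more categories.
hand_labels = {
    "legal action against harrasing creditors": ["Information\\Law & Politics"],
    "legal assistant jobs in news york city": [
        "Information\\Law & Politics",
        "Living\\Career & Jobs",
    ],
}


def to_training_pairs(labels):
    """Flatten {query: [categories]} into (query, category) training examples."""
    return [
        (query, category)
        for query, categories in labels.items()
        for category in categories
    ]


pairs = to_training_pairs(hand_labels)
```

A query with two categories simply contributes two training examples, one per category.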

So it's a genetic algorithm approach:
- classify the entire 800K
- scan through and pick manually k% "good" categorizations that you want replicated
- pick another k% that sucked and manually pick categories for them
- add these to the training set
- build a model
- run classifier

This iterative approach should produce better results.
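
The loop described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `train_keyword_model` is a toy stand-in for the real classifier, `refine` and `review` are hypothetical names, and the `review` function here fakes the manual step (confirming good categorizations and correcting bad ones) with a hard-coded answer instead of a person picking k% by hand.

```python
from collections import Counter, defaultdict


def train_keyword_model(labelled):
    """Toy stand-in for the real classifier: tokens vote for the categories they were seen with."""
    votes = defaultdict(Counter)
    for query, category in labelled:
        for token in query.lower().split():
            votes[token][category] += 1

    def classify(query):
        tally = Counter()
        for token in query.lower().split():
            tally.update(votes.get(token, {}))
        return tally.most_common(1)[0][0] if tally else None

    return classify


def refine(training_set, unlabelled, review, rounds=3, train=train_keyword_model):
    """Repeat: build a model, classify everything, fold reviewed labels back into training."""
    training_set = list(training_set)
    for _ in range(rounds):
        classify = train(training_set)
        predictions = {q: classify(q) for q in unlabelled}
        # Manual step: returns confirmed or corrected (query, category) pairs.
        reviewed = review(predictions)
        new = [pair for pair in reviewed if pair not in training_set]
        if not new:
            break  # nothing left to learn from this reviewer
        training_set.extend(new)
    return train(training_set), training_set


# Made-up seed data and queries, only to show the call pattern.
seed = [("divorce lawyer fees", "Information\\Law & Politics")]
queries = ["legal action against harrasing creditors", "lawyer near me"]


def review(predictions):
    # Hypothetical reviewer: corrects one known-bad query, confirms the rest.
    corrections = {
        "legal action against harrasing creditors": "Information\\Law & Politics",
    }
    out = []
    for query, predicted in predictions.items():
        if query in corrections:
            out.append((query, corrections[query]))
        elif predicted is not None:
            out.append((query, predicted))
    return out


classify, final_training = refine(seed, queries, review)
```

In this toy run the seed model cannot place the "legal action" query at all, the reviewer supplies a category for it, and the rebuilt model can then handle other queries containing "legal".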
