KDD Update
So my results using Multinomial Naive Bayes with some tweaking have worked out well.
Using only 1697 features, I could classify 75K queries.
This is remarkable considering the fact that there are 83K features and 800K queries.
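To make the idea concrete, here is a minimal sketch of a Multinomial Naive Bayes classifier that only looks at a small, pre-selected vocabulary. It is not the code I actually ran: the names `train_mnb` and `classify`, the toy queries, the six-word vocabulary, and the Laplace smoothing value are all assumptions for illustration, and returning `None` when a query has no known word is just one possible design choice.

```python
import math
from collections import Counter, defaultdict

LAW = "Information\\Law & Politics"
JOBS = "Living\\Career & Jobs"

def train_mnb(examples, vocabulary, alpha=1.0):
    """Train on (query, category) pairs, counting only words in `vocabulary`."""
    vocabulary = set(vocabulary)
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, category in examples:
        class_counts[category] += 1
        for word in text.lower().split():
            if word in vocabulary:
                word_counts[category][word] += 1
    total = sum(class_counts.values())
    classes = {}
    for category, n in class_counts.items():
        # Laplace (add-alpha) smoothing over the restricted vocabulary.
        denom = sum(word_counts[category].values()) + alpha * len(vocabulary)
        log_likelihood = {
            w: math.log((word_counts[category][w] + alpha) / denom)
            for w in vocabulary
        }
        classes[category] = (math.log(n / total), log_likelihood)
    return vocabulary, classes

def classify(model, text):
    """Return the most probable category, or None if no selected feature occurs."""
    vocabulary, classes = model
    known = [w for w in text.lower().split() if w in vocabulary]
    if not known:
        return None

    def score(category):
        log_prior, log_likelihood = classes[category]
        return log_prior + sum(log_likelihood[w] for w in known)

    # sorted() makes tie-breaking deterministic.
    return max(sorted(classes), key=score)

# Toy data, invented for this sketch.
examples = [
    ("legal action against creditors", LAW),
    ("lawyer court legal advice", LAW),
    ("legal assistant jobs", JOBS),
    ("nursing jobs hiring", JOBS),
]
vocabulary = {"legal", "court", "lawyer", "creditors", "jobs", "hiring"}
model = train_mnb(examples, vocabulary)
```

Calling `classify(model, "court lawyer")` picks the law category, while a query with no word from the vocabulary is left unclassified.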
The other "tweak" that I might perform, given the constraints of time and hardware resources, is to "hand tweak" some of the queries.
For example, for "legal action against harassing creditors" suggest "Information\Law & Politics".
"legal assistant jobs in new york city" => "Information\Law & Politics" and "Living\Career & Jobs"
So it's a genetic algorithm approach:
- classify the entire 800K
- scan through and pick manually k% "good" categorizations that you want replicated
- pick another k% that sucked and manually pick categories for them
- add these to the training set
- build a model
- run classifier
This iterative approach should produce better results.
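The loop above could be sketched roughly as follows. This is only an illustration under stated assumptions: `train`, `predict`, and `refine` are hypothetical names, the manual review step is simulated by a `review` callback backed by a small dict of hand-picked labels, the queries are toy data, and each query gets a single category (the real task allows several per query).

```python
import math
from collections import Counter, defaultdict

LAW = "Information\\Law & Politics"
JOBS = "Living\\Career & Jobs"

def train(examples, alpha=1.0):
    """Build a small Multinomial Naive Bayes model from (query, category) pairs."""
    class_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for text, category in examples:
        class_counts[category] += 1
        for word in text.lower().split():
            word_counts[category][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab, alpha

def predict(model, text):
    """Return the highest-scoring category for one query."""
    class_counts, word_counts, vocab, alpha = model
    total = sum(class_counts.values())
    words = [w for w in text.lower().split() if w in vocab]

    def score(category):
        denom = sum(word_counts[category].values()) + alpha * len(vocab)
        return math.log(class_counts[category] / total) + sum(
            math.log((word_counts[category][w] + alpha) / denom) for w in words
        )

    return max(sorted(class_counts), key=score)

def refine(training_set, queries, review, rounds=2):
    """Classify everything, fold reviewed labels into the training set, repeat."""
    training_set = list(training_set)
    for _ in range(rounds):
        model = train(training_set)
        predictions = {q: predict(model, q) for q in queries}
        # review() stands in for the manual step: it returns (query, category)
        # pairs that a person confirmed as good or corrected by hand.
        for pair in review(predictions):
            if pair not in training_set:
                training_set.append(pair)
    model = train(training_set)
    return {q: predict(model, q) for q in queries}

# Toy data, invented for this sketch.
seed = [("divorce lawyer", LAW), ("nursing jobs new york", JOBS)]
queries = ["new york court", "york court records", "nursing jobs hiring"]
hand_labels = {"new york court": LAW}  # one manual correction

def review(predictions):
    return [(q, hand_labels[q]) for q in predictions if q in hand_labels]

before = {q: predict(train(seed), q) for q in queries}
after = refine(seed, queries, review)
```

In this toy run a single hand correction also changes the category of a query nobody touched ("york court records"), which is the effect I am hoping for at scale.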