Saturday, June 04, 2005

KDD Cup: Brainstorming

Learning algorithm options:
  1. Supervised learning algorithms
  2. Unsupervised learning algorithms
  3. Latent Semantic Indexing (LSI) followed by supervised learning?
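
To make the supervised option concrete, here is a minimal sketch using multinomial Naive Bayes. That is just one possible supervised algorithm, picked for illustration, not a decision. The training pairs, the category names, and the helpers train_nb and rank_categories are all made up for this example. The only thing taken from the task is the cap of 5 categories per query.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (query, category) pairs. The category
# names are invented for illustration and are not the real task taxonomy.
TRAIN = [
    ("cheap flights to paris", "travel"),
    ("hotel deals paris", "travel"),
    ("python list sort", "programming"),
    ("java string compare", "programming"),
    ("nba playoff scores", "sports"),
    ("football scores today", "sports"),
]

def train_nb(examples):
    """Count words per category for a multinomial Naive Bayes model."""
    word_counts = defaultdict(Counter)
    cat_counts = Counter()
    vocab = set()
    for query, cat in examples:
        words = query.lower().split()
        word_counts[cat].update(words)
        cat_counts[cat] += 1
        vocab.update(words)
    return word_counts, cat_counts, vocab

def rank_categories(query, model, max_categories=5):
    """Return up to max_categories labels, best first, by log-probability."""
    word_counts, cat_counts, vocab = model
    total = sum(cat_counts.values())
    scores = {}
    for cat in cat_counts:
        n_words = sum(word_counts[cat].values())
        score = math.log(cat_counts[cat] / total)
        for w in query.lower().split():
            # Add-one (Laplace) smoothing so an unseen word does not
            # drive a category's probability to zero.
            score += math.log((word_counts[cat][w] + 1) / (n_words + len(vocab)))
        scores[cat] = score
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:max_categories]

model = train_nb(TRAIN)
top = rank_categories("paris hotel", model)
print(top)
```

The function always returns a ranked list, so a real version would probably also need a score threshold to decide when fewer than 5 categories is the better answer.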
Data Pre-processing:
  1. Use standard techniques such as Porter stemming
  2. Write custom code based on an analysis of the dataset's characteristics
  3. Build a custom synonym dictionary? (Or use LSI techniques instead?)
  4. Polysemy is not an issue, because we can map each query to up to 5 categories
  5. Need to greatly reduce the number of unique words from 799,000+ to something more "manageable"; what counts as "manageable" depends on the algorithm
  6. Need to enhance the training dataset so that it contains examples of all categories
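
A rough sketch of how the vocabulary reduction might go: drop stop words, collapse word forms, then discard rare terms. Everything here is an assumption for illustration, including the stop-word list, the min_count threshold, and the sample queries. The crude_stem helper is a deliberately simple suffix stripper standing in for a real Porter stemmer. It is not the Porter algorithm.

```python
import re
from collections import Counter

# Hypothetical stop-word list and suffix rules, chosen only for illustration.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "for", "and"}
SUFFIXES = ("ing", "es", "s")  # crude stand-in, NOT the Porter algorithm

def crude_stem(word):
    """Strip one common suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(query):
    """Lowercase, split on non-alphanumerics, drop stop words, then stem."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]

def build_vocabulary(queries, min_count=2):
    """Keep only terms that occur at least min_count times across all queries."""
    counts = Counter(t for q in queries for t in tokenize(q))
    return {term for term, n in counts.items() if n >= min_count}

# Made-up sample queries, not taken from the actual dataset.
queries = [
    "cheap flights to Paris",
    "flight booking in Paris",
    "booking a cheap hotel",
    "xyzzy123 flights",
]
vocab = build_vocabulary(queries)
print(sorted(vocab))
```

Raising min_count shrinks the vocabulary quickly, since one-off typos and rare strings like "xyzzy123" tend to make up much of the unique-word count. The right threshold would depend on the algorithm chosen.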
Determining validity of results:
  1. Hand-build a custom validation set
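
Once a hand-built validation set exists, predictions could be scored against it roughly as follows. This sketch assumes micro-averaged precision, recall and F1 are the measures of interest, which these notes do not settle. The evaluate function and the sample labels are hypothetical.

```python
def evaluate(predictions, gold, max_categories=5):
    """Micro-averaged precision, recall and F1 over multi-label predictions.

    predictions and gold both map a query to a list of category labels.
    """
    correct = predicted = relevant = 0
    for query, gold_labels in gold.items():
        # Only the first max_categories predictions count for each query.
        pred_labels = set(list(predictions.get(query, []))[:max_categories])
        gold_set = set(gold_labels)
        correct += len(pred_labels & gold_set)
        predicted += len(pred_labels)
        relevant += len(gold_set)
    precision = correct / predicted if predicted else 0.0
    recall = correct / relevant if relevant else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Hypothetical hand-labelled validation set and system output.
gold = {
    "cheap flights to paris": ["travel", "shopping"],
    "python list sort": ["programming"],
}
predictions = {
    "cheap flights to paris": ["travel"],
    "python list sort": ["programming", "sports"],
}
precision, recall, f1 = evaluate(predictions, gold)
print(precision, recall, f1)
```

Queries missing from predictions count against recall only, so an empty answer is penalized but not as a wrong guess.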
