Saturday, June 04, 2005

KDD Cup: The Challenge

It's a text categorization challenge.
You're provided with 3 files:
  1. Categories file. Contains a list of 74 categories.
  2. Queries file. Contains a list of 800,000 queries that need to be categorized. A query can belong to at most 5 categories.
  3. CategoriesQueriesSample file (training file). Contains 26 queries mapped to 52 of the 74 categories.
Categories
  • The categories are all hierarchical, for example "Shopping\Online Stores"
  • Not all categories are present in the training dataset (22 are missing), so a classifier trained only on the sample never sees an example for them.
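
To make the coverage gap concrete, here is a minimal sketch of how one might split a hierarchical category into its levels and list the categories that never appear in the training file. The function names and the toy data are hypothetical, and it assumes the backslash is the only level separator, as in the example above.

```python
def split_category(path):
    # Assumption: levels are separated by a single backslash,
    # as in "Shopping\Online Stores".
    return path.split("\\")


def missing_categories(all_categories, training_labels):
    # training_labels is a list of label lists, one per training query.
    seen = {label for labels in training_labels for label in labels}
    return sorted(set(all_categories) - seen)


# Hypothetical toy data; the real files are not reproduced here.
categories = ["Shopping\\Online Stores", "Shopping\\Auctions", "Sports\\Tennis"]
training = [
    ["Shopping\\Online Stores"],
    ["Sports\\Tennis", "Shopping\\Online Stores"],
]
uncovered = missing_categories(categories, training)
```

On the real data the same check should report the 22 categories that have no training example.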
Queries
  • Each query consists of a number of terms
  • Preparing the query data for training and predicting is one of the fundamental challenges of this competition. This involves:
    • Decomposing queries into a bag of words
    • Cleaning the words (stemming, splitting, etc.)
  • The size of the dataset and the number of features are another challenge
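
The pre-processing steps above can be sketched roughly as follows. This is only an illustration: the tokenization rule and the suffix stripping are assumptions, and `crude_stem` is a deliberately simple stand-in for a real stemmer (such as Porter's), not the method used for the competition.

```python
import re
from collections import Counter


def tokenize(query):
    # Lowercase, then split on anything that is not a letter or digit.
    return [t for t in re.split(r"[^a-z0-9]+", query.lower()) if t]


def crude_stem(token):
    # Very rough suffix stripping; it will mangle some words
    # (for example "tennis" becomes "tenni").
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def bag_of_words(query):
    # Map each stemmed term to the number of times it occurs in the query.
    return Counter(crude_stem(t) for t in tokenize(query))


bag = bag_of_words("Cheap online-stores, cheap shoes")
```

Here `bag` counts "cheap" twice and "online", "store" and "shoe" once each. With 800,000 queries, the number of distinct keys across all bags becomes the feature count, which is where the size problem comes from.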
To summarize, the challenges are:
  1. The dataset is vast: a large number of queries and a large number of unique terms (features)
  2. Data pre-processing: converting queries to bags of words
  3. The training dataset is too small
  4. The training dataset does not contain examples for all categories
  5. How do you measure success? There is no validation dataset, and the training dataset is too small to estimate prediction error reliably.
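
One possible way to get at least a rough error estimate from so few labeled queries is leave-one-out evaluation, scoring each held-out query by comparing its predicted category set with its true one. The sketch below assumes precision, recall and F1 as the per-query scores; the post does not say which metric the competition uses, and both function names are hypothetical.

```python
def precision_recall_f1(predicted, actual):
    # Compare the predicted category set for one query with the true set.
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def leave_one_out_splits(examples):
    # Yield (training_subset, held_out_example) pairs, one per labeled example.
    for i, held_out in enumerate(examples):
        yield examples[:i] + examples[i + 1:], held_out


p, r, f1 = precision_recall_f1(["a", "b"], ["b", "c", "d"])
splits = list(leave_one_out_splits([1, 2, 3]))
```

With only 26 labeled queries the resulting estimate would still be very noisy, and it says nothing about the 22 categories that have no examples at all.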
