Hamotzi's Data Mining Log: KDD Cup: The Challenge

Saturday, June 04, 2005

KDD Cup: The Challenge

It's a Text Categorization challenge.
You're provided with 3 files:

Categories file. Contains a list of 74 Categories.
Queries file. Contains a list of 800,000 queries that need to be categorized/classified. A query can belong to at most 5 categories.
CategoriesQueriesSample file (training file). Contains 26 queries mapped to 52 of the 74 categories.
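
The five-category limit can be enforced as a post-processing step on whatever scores a classifier produces. Below is a minimal sketch, assuming a hypothetical mapping of category name to score and a hypothetical score threshold; neither comes from the competition material.

```python
def top_categories(scores, max_categories=5, min_score=0.0):
    """Return up to max_categories category names, best score first.

    scores is a hypothetical mapping of category name -> classifier score.
    """
    # Drop categories whose score does not clear the (assumed) threshold.
    kept = [(name, s) for name, s in scores.items() if s > min_score]
    # Sort by descending score; break ties by name so the output is deterministic.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [name for name, _ in kept[:max_categories]]
```

A query that scores above the threshold for seven categories would keep only its five strongest; a query with no scores above the threshold gets an empty list.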

Categories

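The categories are all hierarchical, e.g. "Shopping\Online Stores".
Not all categories are present in the training dataset (22 of the 74 are missing), which is an issue: a classifier trained only on the sample never sees an example of those categories.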
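
To see where the gaps are, the category paths can be split on the backslash and the missing ones grouped by their top-level parent. This is only a sketch: the function names are made up for illustration, and it assumes the category list and the training categories have already been read into Python lists.

```python
from collections import defaultdict

def split_category(path, sep="\\"):
    """Split a hierarchical category such as 'Shopping\\Online Stores' into its levels."""
    return [part.strip() for part in path.split(sep) if part.strip()]

def missing_by_top_level(all_categories, training_categories):
    """Group categories that have no training example by their top-level parent."""
    missing = set(all_categories) - set(training_categories)
    grouped = defaultdict(list)
    for path in sorted(missing):
        levels = split_category(path)
        if levels:  # skip empty or malformed entries
            grouped[levels[0]].append(path)
    return dict(grouped)
```

If a top-level parent has some children in the training data and some missing, the covered siblings are at least a hint about what the missing ones might look like.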

Queries

Each query consists of a number of terms.
Preparing the query data for training and predicting is one of the fundamental challenges of this competition. This involves:

Decomposing queries into a bag of words
Cleaning the words (stemming, splitting, etc.)
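
The two steps above can be sketched roughly as follows. The tokenizer and the suffix stripping are deliberately crude assumptions for illustration; the suffix rule is a stand-in for a real stemmer (Porter's, for example), not a replacement for one.

```python
import re
from collections import Counter

# Runs of lowercase letters and digits; everything else acts as a separator.
TOKEN_RE = re.compile(r"[a-z0-9]+")

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(query):
    """Lowercase the query, split it into tokens, stem each token, and count."""
    tokens = TOKEN_RE.findall(query.lower())
    return Counter(crude_stem(t) for t in tokens)
```

Word order is thrown away on purpose: the result only records which (stemmed) terms occur in the query and how often.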

The size of the dataset and the number of features (unique terms) is another challenge.

To summarize, the challenges are:

Dataset vast - large number of queries, large number of unique terms (features)
Data pre-processing - Converting queries to bag of words
Training dataset too small
Training dataset does not contain data items for all categories
How do you measure success? There is no validation dataset, and the training dataset is too small to estimate the classifier's error reliably.
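
One possible way to get at least a rough number out of so few labeled queries is to hold each one out in turn and score the predicted category set against the labeled set. The sketch below assumes set-based precision, recall and F1 as the score; that is a common choice for multi-label problems, not something stated in the challenge description here, and the function names are made up for illustration.

```python
def precision_recall_f1(predicted, actual):
    """Score one query's predicted category set against its labeled set."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    # With zero hits both precision and recall are 0, so F1 is defined as 0.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

def leave_one_out(labeled):
    """Yield (training_items, held_out_item) pairs, holding out each item once."""
    for i, held_out in enumerate(labeled):
        yield labeled[:i] + labeled[i + 1:], held_out
```

With only 26 labeled queries the resulting averages will be very noisy, and holding out the only example of a category leaves nothing to learn that category from, so this gives a sanity check rather than a trustworthy error estimate.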
