KDD Cup: The Challenge
It's a Text Categorization challenge.
You're provided with 3 files:
- Categories file. Contains a list of 74 Categories.
- Queries file. Contains a list of 800,000 queries that need to be categorized/classified. A query can belong to at most 5 categories.
- CategoriesQueriesSample file (training file). Contains 26 queries mapped to 52 of the 74 categories.
- The categories are all hierarchical, for example "Shopping\Online Stores".
- Not all categories are present in the training dataset: 22 of the 74 are missing, so there are no labeled examples to learn them from.
- Each query consists of one or more terms.
- Preparing the query data for training and predicting is one of the fundamental challenges of this competition. This involves:
- Decomposing queries into a bag of words
- Cleaning the words (stemming, splitting, etc.)
- Size of the dataset and size of the features is another challenge
- Dataset vast - large number of queries, large number of unique terms (features)
- Data pre-processing - Converting queries to bag of words
- Training dataset too small
- Training dataset does not contain data items for all categories
- How do you measure success? There is no validation dataset, and the training dataset is too small to estimate the learning error reliably.
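
To make the pre-processing step concrete, here is a minimal Python sketch of turning queries into bags of words. It is only an illustration: the function names (`naive_stem`, `bag_of_words`, `vocabulary`) are made up for this post, the tokenizer assumes plain ASCII queries, and the "stemmer" is just a crude plural stripper standing in for a real one such as Porter's.

```python
import re
from collections import Counter

# Assumption: terms are runs of ASCII letters and digits; everything else is a separator.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def naive_stem(word):
    """Crude plural stripping; a placeholder for a real stemmer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def bag_of_words(query):
    """Lowercase the query, split it into terms, stem each term, and count them."""
    terms = TOKEN_RE.findall(query.lower())
    return Counter(naive_stem(term) for term in terms)


def vocabulary(queries):
    """Collect the unique terms (the feature space) across all queries."""
    vocab = set()
    for query in queries:
        vocab.update(bag_of_words(query))
    return vocab
```

With 800,000 queries the vocabulary will be large, so in practice you would probably also want to drop very rare terms before training.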