Tuesday, May 31, 2005

Software Link: SVMs

Researcher Link: Platt

Platt
Support Vector Machines Researcher - Sequential Minimal Optimization Technique

Monday, May 30, 2005

Reference Site: SVMs

SVM Information

There are several tutorials, publications, etc.

Review: Text Categorization with Support Vector Machines: Learning with Many Relevant Features

Author: Thorsten Joachims
Topic: Text Categorization
Approach: Supervised Learning with SVMs

Paper explores the use of SVMs to perform Text Categorization. The authors claim SVMs excel because they are fast, robust, efficient and fully automatic - no parameter selection required. (A background in SVMs is needed to comprehend the paper.)

Interesting points/concepts:
  1. Assignment of text to a category is treated as a binary classification problem - classifier determines if text belongs to this category or not.
  2. Used IDF to build a feature vector.
  3. Used Feature Selection to reduce the dimensions of the Feature Vector - should prevent overfitting.
    • Several FS options - DF Thresholding, Chi-Square test, term strength criterion
    • Used information gain criteria as proposed by Yang
    • ?? Feature Selection can hurt Text Categorization since there are very few irrelevant features, so removing features leads to loss of information
      • Isn't this a contradiction?
  4. Hypothesis of why SVMs should work well with Text Categorization
    1. SVMs work well with high d - they do not overfit
    2. SVMs work well with sparse vectors
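
The information-gain criterion mentioned in point 3 can be sketched in miniature - a toy version for a single binary term and a binary category (the function and variable names are mine, not from the paper):

```python
import math

def entropy(p):
    """Entropy of a binary distribution with P(positive) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(n_pos_with, n_neg_with, n_pos_without, n_neg_without):
    """Information gain of a term: entropy of the category label minus
    the expected entropy after observing whether the term is present.
    Arguments are document counts split by (category, term presence)."""
    n_with = n_pos_with + n_neg_with
    n_without = n_pos_without + n_neg_without
    n = n_with + n_without
    h_cat = entropy((n_pos_with + n_pos_without) / n)
    h_with = entropy(n_pos_with / n_with) if n_with else 0.0
    h_without = entropy(n_pos_without / n_without) if n_without else 0.0
    return h_cat - (n_with / n) * h_with - (n_without / n) * h_without
```

A term that perfectly predicts the category gets the maximum gain; a term distributed identically across categories gets zero, so ranking terms by this score gives the feature-selection ordering.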
Comments:
  1. Overview of SVMs is quite high level and theoretical. Details of the algorithm/implementation were not described



Researcher Link: Yiming Yang

Yiming Yang
CMU, Researcher in several topics including Text Categorization

Sunday, May 29, 2005

Review: Inductive Learning Algorithms and Representations for Text Categorization

Paper compares 5 techniques for Text Categorization based on learning speed, real-time classification speed and classification accuracy. The supervised learning techniques are Find Similar, Naive Bayes, Bayesian Networks, Decision Trees and Support Vector Machines (SVMs). In their opinion, Linear SVMs (in particular Platt's SMO - Sequential Minimal Optimization) are the most promising as they are accurate, quick to train and quick to evaluate.
Their dataset for comparison is a collection of hand tagged financial stories from Reuters.

Interesting points/concepts:
  1. Text Categorization is the assignment of text to one or more predefined categories based on its content.
  2. "Inductive Learning techniques automatically construct classifiers using labeled training data"
  3. Feature Selection is needed to improve efficiency and efficacy
    • They used Mutual Information(feature Xi, category c)
    • Use this to determine which features should be used
  4. Find Similar Classifier
    • Weight calculated for Terms based on judged relevant and irrelevant documents
  5. Naive Bayes Classifier was found to be very simple and quite effective
  6. Bayes Net
    • They used a 2-dependence Bayesian classifier that allows the probability of each feature to be directly influenced by the appearance/non-appearance of at most two other features
    • Provided very little improvement over Naive Bayes
  7. SVM
    • Used the simplest linear version of the SVM - fast and accurate classifiers
    • They used a method developed by Platt (see the references) to train the SVM classifier
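
The claim that linear SVMs are quick to evaluate follows from what evaluation actually is: once training has produced a weight vector w and bias b, classifying a document vector is a single dot product. A minimal sketch (the weights below are made up, not a trained model):

```python
def linear_svm_predict(w, b, x):
    """Classify by the sign of w.x + b - this dot product is the
    entire cost of evaluating a trained linear SVM on one document."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Example: a hypothetical 2-feature model that favors feature 0.
label = linear_svm_predict([1.0, -1.0], 0.0, [2.0, 1.0])
```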
Comments:
  1. Classifiers were not described well
  2. Naive Bayes seems like a quick and dirty solution that might be good enough
  3. SVM is the best way to perform text categorization?
References of Interest:
  1. "A comparative study on feature selection in text categorization"
  2. Fast Training of Support Vector Machines using Sequential Minimization Optimization
  3. Text categorization with support vector machines: Learning with many relevant features
  4. Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval

Review: Web Mining Research: A Survey

Web Mining Research: A Survey

Paper is an exhaustive survey of Web Mining techniques
Interesting points/concepts:
  1. Web Mining categories:
    • Web Content Mining (focus of my review)
    • Web Structure Mining
    • Web Usage Mining
  2. Table on Pg 5 - describes, for each of the above categories, the options for Data Representation, Methods and Application Categories
  3. As per cited research - the representation (bag of words, phrase-based and hypernym) had no significant effect on performance
  4. Table 3 on Pg 6 - maps authors to techniques used and the application of the techniques
    • Applications, techniques are quite varied
Comment - Paper has a very rich set of references.

Saturday, May 28, 2005

Review: Indexing by Latent Semantic Analysis

Indexing By Latent Semantic Analysis

Objective: "Take advantage of implicit higher-order structure in the association of terms with documents (semantic structure) in order to improve the detection of relevant documents on the basis of terms found in queries."

Approach: Large matrix of documents to terms is decomposed into a set of orthogonal factors using SVD. "Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned."

Problem in document retrieval: "The problem is that users want to retrieve on the basis of conceptual content, and individual words provide unreliable evidence about the conceptual topic or meaning of a document. There are usually many ways to express a given concept, so the literal terms in a user’s query may not match those of a relevant document. In addition, most words have multiple meanings, so terms in a user’s query will literally match terms in documents that are not of interest to the user."

Interesting points/concepts learnt:
  1. synonymy - many different ways to refer to the same thing - affects recall
  2. polysemy - words have more than one meaning - affects precision
  3. Related words are treated independently by current techniques - e.g. "New York" - even though the words co-occur in a large number of instances.
  4. Build a term-doc matrix. Use SVD (two mode factor analysis) to derive model.
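
The SVD step in point 4 can be sketched on a toy term-document matrix. The counts below are made up, and this is the standard truncated-SVD recipe with the usual pseudo-document fold-in for queries, not the paper's exact experimental setup:

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
# (Counts are invented for illustration.)
A = np.array([
    [1, 1, 0, 0],   # "car"
    [1, 0, 1, 0],   # "auto"
    [0, 1, 1, 0],   # "engine"
    [0, 0, 0, 3],   # "banana"
], dtype=float)

# Decompose, then keep only the k largest orthogonal factors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation

# A query is folded in as a pseudo-document in the reduced space.
q = np.array([1, 0, 0, 0], dtype=float)        # query: "car"
q_hat = q @ U[:, :k] @ np.diag(1 / s[:k])

# Rank documents by cosine similarity against the pseudo-document.
docs = Vt[:k, :].T
cos = docs @ q_hat / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat))
```

Even though document 1 never contains the literal term "car", it shares the car/auto/engine factor, which is exactly the higher-order structure the paper wants to exploit.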
Comments:
  1. Their paper is very interesting - it describes the challenges in information retrieval very well
  2. Their solution, in their own words, is "modestly encouraging".
  3. This paper is highly cited.
  4. Is this a good approach?

Reference: Data Clustering

Data Clustering: A Review

Exhaustive reference on Clustering - lots of references.

Paper review: A Comparison of Document Clustering Techniques

A Comparison of Document Clustering Techniques

They compare standard K-means, bisecting K-means and HAC (Hierarchical Agglomerative Clustering) techniques. They found Bisecting K-means to be the best solution.

Interesting points/concepts learnt:
  1. Vector Space Model. Each doc is a vector d in the "term-space".
    • Term Frequency representation: Vector d = {tf1, tf2,...tfn} [tf is the term frequency in doc d]
    • Inverse Doc Frequency (IDF): discounts words with high frequency since they have little discriminating power
  2. Calculation of Centroids (and inter-cluster similarity) for Clusters created using the Cosine Similarity measure
  3. Bisecting K-Means is far more efficient than K-Means and for document clustering, creates better quality clusters
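
The centroid and cosine-similarity calculations in point 2 are simple enough to sketch directly (a toy version, not the paper's code):

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two term-frequency vectors -
    the similarity measure used for document clustering."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def centroid(docs):
    """Centroid of a cluster: the component-wise mean of its
    document vectors."""
    n = len(docs)
    return [sum(col) / n for col in zip(*docs)]
```

Intra-cluster similarity can then be measured as the average cosine between each document and its cluster's centroid.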
Comments:
  1. Several interesting references:
    • "Fast and Intuitive Clustering of Web Documents" - describes the use of Document Clustering to organize the results returned by a search engine in response to a user's query
    • "Hierarchically classifying documents using very few words" - generating hierarchical clusters of documents
    • "On the merits of building categorization systems by supervised clustering" - finds natural clusters in an already existing document taxonomy and then uses these clusters to produce an effective document classifier for new documents
  2. Calculation of centroid and inter-cluster similarity - can use this in my ADS paper. Can use this as another metric to understand the nature of hacks and attempt to detect them based on any observed patterns.
  3. CLUTO is the perfect tool for the job - Cosine similarity measure, sparse and dense matrices, high-d and large datasets, and Bisecting K-Means
  4. Can I use the knowledge of the centroid to make my technique more scalable? Instead of maintaining the entire dataset in memory - maintain the centroid of each cluster? (like BIRCH)

Excellent document clustering paper

Paper on document clustering

Interesting points/concepts learnt:
  1. Vector Space Model - Map unstructured docs to a structured vector format based on text contained in the document. Dimensions of space are the complete set of terms (high d and sparse). Each document is mapped to a "concept" vector. The feature values could be the frequency of the terms and other frequency variants.
  2. Data Pre-processing - Normalize the text prior to creating the concept vectors - stemming, stop words, capitalization, etc. Remove words that provide little discriminating value - words with too high or too low frequency. Domain knowledge can be of great help.
  3. Cosine Similarity measure - Good for sparse text data. Measure of the angle between 2 vectors.
  4. Clustering options - K-means and Hierarchical Clustering.
  5. Hierarchical Clustering - Group Avg found to be the best of the HAC techniques.
  6. K-Means
    • Sensitive to parameter choices - k and initial start points
    • k determined using Cluster ensembles.
    • found to be much better than Hierarchical Clustering
  7. Cluster ensembles
    • Vary k and initial start points
    • Based on experimentation - come up with a consensus k and start points
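
The vector-space mapping in points 1-2 might look like this in miniature (the stop-word list and vocabulary below are made up for the example):

```python
import re
from collections import Counter

# A tiny illustrative stop-word list - real systems use much larger ones.
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and"}

def to_tf_vector(text, vocabulary):
    """Map an unstructured document to a term-frequency vector over a
    fixed vocabulary: lowercase, tokenize, drop stop words, count."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOP_WORDS]
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]
```

The resulting vectors are exactly the high-dimensional, sparse inputs that the cosine measure and K-Means then operate on.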
Challenges:
  1. High dimensional and sparse dataset
  2. Domain knowledge for data pre-processing
  3. (Related to above) Replacing words by their synonyms
Comments
  1. Finding the natural number of Clusters using Cluster ensembles. I could use the proposed technique in this paper to fix the hole in my Anomaly Detection IDS Approach.
  2. K-Means scalability. K-Means is fast - but it requires all the data to be in memory. Could Scalable K-Means or other online Clustering techniques such as BIRCH be a "better" approach?
  3. Maintaining synonyms seems like an onerous task. Is the "Latent Semantic Indexing" Approach a solution?

Friday, May 27, 2005

Machine Learning, Neural and Statistical Classification (Online Book)

Thursday, May 26, 2005

Website Created

Created website.

Under construction - lot more to come ... soon!

Wednesday, May 18, 2005

Google IDS resource

Tuesday, May 17, 2005

Distribution-based Artificial Anomaly Generation

Wei Fan et al. propose an interesting technique for the generation of artificial anomaly data. The technique is independent of the learner.
A possible issue is that they generate anomalies "dimension by dimension" - they treat each dimension independently. They suggest that anomaly generation could instead consider multiple dimensions concurrently, or place different weights on dimensions.
They use this technique to "define decision boundaries that separate the given class labels".

I plan to use this technique to train my Anomaly detector with the DARPA dataset.
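
The dimension-by-dimension idea can be sketched roughly as follows. This is my own simplification - replace one feature of a normal record with a different value observed for that feature - not Fan et al.'s exact sampling scheme:

```python
import random

def artificial_anomaly(record, value_pool, rng=random):
    """Rough sketch of dimension-by-dimension anomaly generation:
    pick one dimension of a normal record and swap its value for a
    different value seen elsewhere in the data for that dimension.
    value_pool[i] is the list of observed values for dimension i."""
    anomaly = list(record)
    dim = rng.randrange(len(record))
    alternatives = [v for v in value_pool[dim] if v != record[dim]]
    if alternatives:
        anomaly[dim] = rng.choice(alternatives)
    return anomaly
```

Each generated record lies just off the observed data in exactly one dimension, which is the flavor of "near the decision boundary" the paper is after.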

Researcher Link

Monday, May 16, 2005

Search Engines tailored for a domain

Google is a general-purpose Search Engine. Do specific domains (such as Network Security) require a more specialized Search Engine that understands the domain and presents search options tailored to it?

Anomaly detection for Intrusion Detection

Opportunity
  • Unlike Signature based techniques, Unsupervised Data Mining techniques can detect new/novel attacks
  • Data Mining techniques can be used in conjunction with Signature based Systems such as SNORT to handle its blindspot
    • SNORT is rules-based - what if it's a new attack?
Feasibility
  • Unsupervised Data Mining techniques are in their infancy
    • Cannot be effectively used in real-time (only offline analysis)
      • Large number of False Alerts
    • Some require attack-free data for training
  • Still opportunity exists
    • Clustering reduces the scale of data - from millions of records to a few hundred clusters. For each Cluster, you have a prototypical instance or a centroid that other elements are similar to. Can this be used to detect a Distributed DoS attack?
    • Outlier detection
  • Can be a pre-processing step used by Security companies to analyze attack data in order to write rules for SNORT
Whacky ideas
  • Mine CERT alerts to automatically build SNORT rules?
  • Build a search engine for Security alerts? (Like a LexisNexis but tailored for the domain of Security)

Analysis of the ANF Paper

ANF: A Fast and Scalable Tool for Data Mining in Massive Graphs

Authors
Christopher R. Palmer,
Phillip B. Gibbons
Christos Faloutsos, KDD 2002

Overview
What is it?
- fast and memory efficient algorithm for approximating the complete neighborhood function for a graph
- can use it to answer "questions" on graph-represented data
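
For intuition, the exact neighborhood function that ANF approximates can be computed by brute-force BFS on a small graph. This naive version is mine - it is precisely the computation that becomes infeasible on massive graphs:

```python
from collections import deque

def neighborhood_function(adj, max_h):
    """Exact neighborhood function: n_h[h] = number of ordered vertex
    pairs (u, v), counting u == v, with distance(u, v) <= h.
    Computed by a BFS from every vertex - the brute-force baseline
    that ANF approximates in far less time and memory."""
    n_h = [0] * (max_h + 1)
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for d in dist.values():
            for h in range(d, max_h + 1):
                n_h[h] += 1
    return n_h
```

On the 3-vertex path a-b-c this gives [3, 7, 9]: 3 self-pairs, plus 4 ordered pairs at distance 1, plus 2 more at distance 2.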

Sample applications
- How robust is the Internet to failures?
- Analyze calling graph from telephone companies to detect fraud or marketing opportunities
- Compare graphs (& subgraphs) - find similarities
- Analyze structure - is it hierarchical...
- Perform Clustering
- Determine important vertices

Claim to fame
- very, very fast
- approximate algorithm but still very accurate compared to others
- can handle disk resident graphs

Sample Usage Scenario in Paper
Tic Tac Toe
- graph where vertices are valid boards (|V|= ?)
- edge between vertices indicates a possible move
- compute the speed by which each of the 9 possible starting moves can lead to victory
- each position on the board is given an "ANF importance" factor - center had the highest factor

How can this be used for Anomaly detection for IDS?
High Level Approach
Represent your attack-free training data as a graph and the testing data as another graph. Perform delta analysis to find anomalies.
Does not seem like a natural fit - what should the vertices and edges represent?
Should vertices be unique IP addresses? Does an undirected edge represent that the two IP addresses have communicated? How does the edge map to the features of that communication?

Hello World

First Post. Does this work?