Monday, August 29, 2005

tf-idf and Cosine Similarity

tf-idf
In text mining, documents are represented by vectors. The components of a vector are the weights of the words that appear in the document. Each weight is computed using tf-idf (term frequency-inverse document frequency):
tf-idf(word) = tf(word) * log2(N / df(word))
Here tf(word) is the frequency of the word in the document, N is the number of documents in the collection, and df(word) is the number of documents the word appears in. The tf-idf value can be read as a measure of how important or relevant a word is to a document. It gives the most weight to words that occur often in the document but in few other documents of the collection. A word that appears in every document gets a weight of 0, since log2(1) = 0.
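
As a rough illustration of the formula above, here is a minimal Python sketch. It assumes that term frequency is the raw count of the word in the document and that documents are split into words on whitespace; both are simplifications, and the function name tf_idf_vectors and the sample documents are made up for this example.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Return one {word: weight} dict per document.

    Assumes raw counts for term frequency and log base 2 for idf.
    """
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {word: count * math.log2(n_docs / df[word]) for word, count in tf.items()}
        )
    return vectors

# Hypothetical sample collection.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cats and dogs",
]
vectors = tf_idf_vectors(docs)
```

In this sample, "the" appears in all three documents, so its weight is 0 everywhere, while "cat" appears in only one document and gets the highest possible idf for this collection, log2(3).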

Cosine Similarity
The most popular document similarity measure is cosine similarity: the cosine of the angle between two vectors. If the vectors are orthogonal, the similarity is 0. If they point in the same direction (for example, if they are identical), the similarity is 1.
Cosine(V1, V2) = (V1 . V2) / (|V1| * |V2|)
V1 . V2 = v1_1 * v2_1 + v1_2 * v2_2 + ... + v1_n * v2_n
|V1| = sqrt(v1_1^2 + v1_2^2 + ... + v1_n^2)
Here V1 . V2 is the dot product and |V1| is the length of V1. The document vectors are normally represented using tf-idf weights. Dividing by the lengths of the two vectors scales each one to unit length, which normalizes the frequency information in the tf-idf weights, since documents are of variable length.
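
The cosine calculation can be sketched in a few lines of Python. This is only an illustration: it assumes each document vector is stored as a sparse {word: weight} dictionary, and the function name cosine_similarity, the sample vectors, and the choice to return 0 for an all-zero vector are assumptions made for this example.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse {word: weight} vectors."""
    # Dot product: only words present in both vectors contribute.
    dot = sum(weight * v2.get(word, 0.0) for word, weight in v1.items())

    # Vector lengths: square root of the sum of squared weights.
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))

    # Assumption: a zero-length vector is treated as similar to nothing.
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)

# Hypothetical weights, not taken from a real collection.
a = {"cat": 1.0, "mat": 2.0}
b = {"cat": 2.0, "mat": 4.0}   # same direction as a, twice as long
c = {"dog": 3.0}               # shares no words with a

same_direction = cosine_similarity(a, b)   # close to 1
no_overlap = cosine_similarity(a, c)       # 0.0
```

Vector b is just a scaled by 2, as if the same text were repeated twice, yet the similarity stays at about 1. This is the length normalization at work.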
