Hamotzi's Data Mining Log: Review: Indexing by Latent Semantic Analysis

Indexing By Latent Semantic Analysis

Objective: "Take advantage of implicit higher-order structure in the association of terms with documents (semantic structure) in order to improve the detection of relevant documents on the basis of terms found in queries."

Approach: A large term-by-document matrix is decomposed into a set of orthogonal factors using singular value decomposition (SVD). "Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned."
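
To make the approach concrete, here is a minimal sketch in Python with NumPy. It is not the authors' code: the toy matrix, the function names (lsa_fit, fold_in_query, retrieve), the choice of k=2, the 0.5 threshold, and the decision to scale both query and documents by the singular values before taking cosines are all my own assumptions for illustration.

```python
import numpy as np

def lsa_fit(X, k):
    """Rank-k SVD of a term-by-document matrix X (rows: terms, cols: docs)."""
    T, s, Dt = np.linalg.svd(X, full_matrices=False)
    # Keep the k largest factors: term factors, singular values, doc factors.
    return T[:, :k], s[:k], Dt[:k, :].T

def fold_in_query(q, T_k, s_k):
    """Represent a query term vector as a pseudo-document: q^T T_k S_k^{-1}."""
    return (q @ T_k) / s_k

def retrieve(q, T_k, s_k, D_k, threshold):
    """Return (doc index, cosine) pairs whose cosine meets the threshold."""
    # Assumption: compare query and documents in coordinates scaled by S.
    q_vec = fold_in_query(q, T_k, s_k) * s_k
    doc_vecs = D_k * s_k
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec)
    cosines = doc_vecs @ q_vec / np.maximum(norms, 1e-12)
    return [(j, float(c)) for j, c in enumerate(cosines) if c >= threshold]

# Toy data (hypothetical). Rows: car, automobile, engine, flower, petal.
# Columns: four short documents; document 1 never mentions "car".
X = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
], dtype=float)

T_k, s_k, D_k = lsa_fit(X, k=2)
query = np.array([1, 0, 0, 0, 0], dtype=float)  # the single term "car"
hits = retrieve(query, T_k, s_k, D_k, threshold=0.5)
print(hits)
```

In this toy case the query "car" also retrieves document 1, which contains only "automobile" and "engine", because both documents load on the same factor. That is the synonymy effect the paper is after, though a four-document example says nothing about how well it holds on real collections.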

Problem in document retrieval: "The problem is that users want to retrieve on the basis of conceptual content, and individual words provide unreliable evidence about the conceptual topic or meaning of a document. There are usually many ways to express a given concept, so the literal terms in a user’s query may not match those of a relevant document. In addition, most words have multiple meanings, so terms in a user’s query will literally match terms in documents that are not of interest to the user."

Interesting points/concepts learnt:

synonymy - many different ways to refer to the same thing - affects recall
polysemy - words have more than one meaning - affects precision
Related words are treated independently by current techniques - e.g. "New York" - even though the words occur together in a large number of instances.
Build a term-document matrix, then use SVD (two-mode factor analysis) to derive the model.
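
The last point can be sketched in a few lines of Python with NumPy. This is only an illustration under my own assumptions: whitespace tokenization, raw counts with no term weighting, a made-up four-document corpus, and hypothetical helper names (term_document_matrix, truncated_svd).

```python
import numpy as np

def term_document_matrix(docs):
    """Return (vocab, X) where X[i, j] counts term i in document j."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    index = {t: i for i, t in enumerate(vocab)}
    X = np.zeros((len(vocab), len(docs)))
    for j, toks in enumerate(tokenized):
        for t in toks:
            X[index[t], j] += 1
    return vocab, X

def truncated_svd(X, k):
    """Return the rank-k approximation of X and all singular values."""
    T, s, Dt = np.linalg.svd(X, full_matrices=False)
    # Keep only the k largest factors: X_k = T_k S_k D_k^T.
    X_k = (T[:, :k] * s[:k]) @ Dt[:k, :]
    return X_k, s

# Hypothetical mini-corpus, two rough topics.
docs = [
    "car engine repair",
    "automobile engine repair",
    "flower petal garden",
    "flower garden",
]
vocab, X = term_document_matrix(docs)
X_k, s = truncated_svd(X, k=2)

# The Frobenius error of the rank-k approximation equals the root of the
# sum of squares of the discarded singular values.
err = np.linalg.norm(X - X_k)
print(vocab)
print(X.shape, round(err, 4))
```

The number of factors k is the main knob: keeping too few loses real structure, while keeping too many just reproduces the original word-matching behaviour.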

Comments:

The paper is very interesting - it describes the challenges in information retrieval very well.
In the authors' own words, the results are "modestly encouraging".
This paper is highly cited.
Is this a good approach?

Saturday, May 28, 2005
