Saturday, May 28, 2005

Review: Indexing by Latent Semantic Analysis

Indexing By Latent Semantic Analysis

Objective: "Take advantage of implicit higher-order structure in the association of terms with documents (semantic structure) in order to improve the detection of relevant documents on the basis of terms found in queries."

Approach: A large term-by-document matrix is decomposed into a set of orthogonal factors using singular value decomposition (SVD). "Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned."
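To make the approach concrete for myself, here is a minimal sketch in Python with NumPy. The toy vocabulary, the counts, the function name lsa_retrieve, the choice of k = 2 and the 0.9 threshold are all my own assumptions, not taken from the paper. Comparing the query against documents in coordinates scaled by the singular values is one common convention and may differ from what the authors did.

```python
import numpy as np

# Toy term-by-document count matrix (rows = terms, columns = documents).
# Vocabulary and counts are invented for illustration only.
terms = ["car", "automobile", "engine", "recipe", "flour"]
X = np.array([
    [1, 0, 0, 0],   # car
    [0, 1, 0, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 1, 1],   # recipe
    [0, 0, 1, 1],   # flour
], dtype=float)


def lsa_retrieve(X, q, k=2, threshold=0.9):
    """Return indices of documents whose cosine with the query exceeds
    `threshold` in a rank-k factor space (hypothetical helper)."""
    # Thin SVD: X = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

    # Documents in factor space, scaled by the singular values.
    doc_coords = V_k * s_k
    # Query as a pseudo-document: q^T U_k S_k^{-1}, rescaled by S_k,
    # which simplifies to q^T U_k.
    query_coords = q @ U_k

    norms = np.linalg.norm(doc_coords, axis=1) * np.linalg.norm(query_coords)
    cosines = doc_coords @ query_coords / np.maximum(norms, 1e-12)
    hits = [int(j) for j in np.flatnonzero(cosines > threshold)]
    return hits, cosines


# Query containing only the term "car".
query = np.array([1, 0, 0, 0, 0], dtype=float)
hits, cosines = lsa_retrieve(X, query, k=2, threshold=0.9)
print(hits)
```

In this toy corpus the query "car" returns documents 0 and 1, even though document 1 contains "automobile" and never "car". The shared term "engine" is what ties the two documents to the same factor.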

Problem in document retrieval: "The problem is that users want to retrieve on the basis of conceptual content, and individual words provide unreliable evidence about the conceptual topic or meaning of a document. There are usually many ways to express a given concept, so the literal terms in a user’s query may not match those of a relevant document. In addition, most words have multiple meanings, so terms in a user’s query will literally match terms in documents that are not of interest to the user."

Interesting points/concepts learnt:
  1. synonymy - many different ways to refer to the same thing - affects recall
  2. polysemy - words have more than one meaning - affects precision
  3. Related words are treated independently by current techniques - e.g. "New York" - even though the words occur together in a large number of instances.
  4. Build a term-document matrix. Use SVD (two-mode factor analysis) to derive the model.
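A small sketch of points 1 and 4, again in Python with NumPy. The four-document corpus, the helper names term_document_matrix and cosine, and the choice of k = 2 are my own assumptions for illustration, not the paper's data or settings. It builds the term-document matrix from raw text, then compares two synonyms in the raw term space and in the reduced factor space.

```python
import numpy as np

# Invented toy corpus: "doctor" and "physician" never share a document,
# but both co-occur with "hospital".
docs = [
    "doctor hospital",
    "physician hospital",
    "guitar chord",
    "guitar chord",
]


def term_document_matrix(docs):
    """Build a raw count matrix with one row per term, one column per document."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            X[index[w], j] += 1.0
    return vocab, X


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


vocab, X = term_document_matrix(docs)

# Thin SVD, then keep the k largest factors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_coords = U[:, :k] * s[:k]   # terms in the k-dimensional factor space

i, j = vocab.index("doctor"), vocab.index("physician")
g = vocab.index("guitar")

raw_sim = cosine(X[i], X[j])                          # literal co-occurrence
latent_sim = cosine(term_coords[i], term_coords[j])   # after rank-k reduction
unrelated_sim = cosine(term_coords[i], term_coords[g])

print(raw_sim, latent_sim, unrelated_sim)
```

In the raw matrix the two synonyms are orthogonal because they share no document. After the rank-2 reduction they end up pointing in essentially the same direction, while "doctor" and "guitar" stay unrelated. This is the synonymy effect the paper is after, shown on a corpus far too small to say anything about real collections.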

Comments:
  1. The paper is very interesting - it describes the challenges in information retrieval very well.
  2. Their assessment of the solution, in their own words - "modestly encouraging".
  3. This paper is highly cited.
  4. Is this a good approach?
