Review: Indexing by Latent Semantic Analysis
Objective: "Take advantage of implicit higher-order structure in the association of terms with documents (semantic structure) in order to improve the detection of relevant documents on the basis of terms found in queries."
Approach: A large term-by-document matrix is decomposed into a set of orthogonal factors using singular value decomposition (SVD). "Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned."
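A minimal sketch of the approach described above, using NumPy. The toy corpus, the query, the choice of k = 2 factors, the raw term counts (no weighting), and the 0.5 threshold are all assumptions made for illustration; none of them come from the paper. The query is folded into the factor space as a pseudo-document, and both it and the documents are scaled by the singular values before taking cosines, which is one common convention rather than the only one.

```python
import numpy as np

# Hypothetical toy corpus and query (illustrative only, not from the paper).
docs = [
    "car engine repair",
    "automobile engine repair",
    "automobile dealer",
    "fruit apple banana",
    "banana fruit salad",
]
query = "car"

vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

def term_vector(text):
    """Raw term-count vector over the vocabulary (no weighting applied)."""
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in index:
            v[index[w]] += 1.0
    return v

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Term-by-document matrix: rows are terms, columns are documents.
X = np.column_stack([term_vector(d) for d in docs])

# SVD, truncated to the k largest factors (k is an assumed value).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T  # rows of Vk are documents

# Fold the query in as a pseudo-document: q_hat = q^T U_k S_k^{-1}.
q_hat = term_vector(query) @ Uk / sk

# Cosine in the reduced space versus plain term matching.
lsi_scores = [cosine(q_hat * sk, Vk[j] * sk) for j in range(len(docs))]
raw_scores = [cosine(term_vector(query), X[:, j]) for j in range(len(docs))]

threshold = 0.5  # assumed cut-off for "supra-threshold" documents
hits = [j for j, score in enumerate(lsi_scores) if score > threshold]

for j, d in enumerate(docs):
    print(f"{d!r}: raw={raw_scores[j]:.2f} lsi={lsi_scores[j]:.2f}")
print("returned documents:", hits)
```

On this toy corpus the query "car" shares no terms with "automobile engine repair", so plain term matching scores that document at zero, while the reduced-space cosine places it well above the threshold because "car" and "automobile" co-occur with the same terms.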
Problem in document retrieval: "The problem is that users want to retrieve on the basis of conceptual content, and individual words provide unreliable evidence about the conceptual topic or meaning of a document. There are usually many ways to express a given concept, so the literal terms in a user’s query may not match those of a relevant document. In addition, most words have multiple meanings, so terms in a user’s query will literally match terms in documents that are not of interest to the user."
Interesting points/concepts learnt:
- synonymy: there are many different ways to refer to the same thing, which hurts recall
- polysemy: most words have more than one meaning, which hurts precision
- Current techniques treat related words independently, e.g. "New York", even though the words occur together in a large number of instances.
- Build a term-document matrix, then use SVD (two-mode factor analysis) to derive the model.
- The paper is very interesting and describes the challenges in information retrieval very well.
- The authors' own assessment of their solution: "modestly encouraging".
- This paper is highly cited.
- Is this a good approach?