Web-based Document Search Challenge
The book "Text Mining: Predictive Methods for Analyzing Unstructured Information" makes an interesting comparison between an information retrieval (IR) application, such as Web-based search, and text categorization.
Text categorization (or prediction) is a classification problem: we want to determine a label, or a set of labels, for a new document based on its similarity to known, labeled documents. In IR, one uses a query, which can be thought of as a very small document, to perform a similarity-based search against the documents in a collection. The difference is that we are interested in ranking the search results rather than categorizing the query. Because the query contains very few words and the number of documents to be searched is massive, a large number of documents will contain all the keywords in the query and will therefore have nearly identical similarity scores. The returned documents cannot be meaningfully ranked on such scores, so it falls to the user to refocus the query by adding more words or by substituting more specific ones until the desired documents turn up. This is the challenge of Web-based document search.
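To make the tie problem concrete, here is a minimal sketch. It assumes a deliberately crude similarity measure (the fraction of query words found in a document); the function name `match_score` and the four sample documents are made up for illustration and are not taken from the book. Real systems use richer measures, but a two-word query runs into the same problem.

```python
def match_score(query, document):
    """Fraction of distinct query words that appear in the document.

    A toy similarity measure, assumed here only for illustration.
    """
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words) / len(query_words)


# Hypothetical mini-collection standing in for a massive Web index.
documents = [
    "text mining methods for unstructured information",
    "a survey of text mining and web search",
    "mining text from the web with predictive methods",
    "cooking recipes for the weekend",
]

query = "text mining"
scores = [match_score(query, doc) for doc in documents]

for doc, score in zip(documents, scores):
    print(f"{score:.2f}  {doc}")
```

Every document that contains both query words gets the same score, so the score alone gives no basis for ordering them.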
The general approach to this problem is to perform link analysis on the documents in a query-independent manner. Each document is given a score by a page rank function; the larger the score, the higher the presumed quality of the page. The documents returned to the user are then ranked by these scores.
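As a rough sketch of what such a scoring function might look like, the code below runs a simple PageRank-style power iteration over a tiny link graph. The damping factor of 0.85, the fixed iteration count, the function name `page_rank`, and the four-page graph are all assumptions chosen for illustration; the post does not specify any of them.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Query-independent page scores from a link graph.

    `links` maps each page to the list of pages it links to.
    Every link target must also appear as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the same small base score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its damped score evenly among its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its score over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: "c" is linked to by three pages, "d" by none.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = page_rank(links)

# Suppose pages a, b and d all match the query equally well;
# the precomputed link scores break the tie.
matching = ["a", "b", "d"]
ordered = sorted(matching, key=lambda page: ranks[page], reverse=True)
print(ordered)
```

The scores depend only on the link structure, so they can be computed once for the whole collection and reused for every query.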