Information retrieval

Information retrieval is defined as "a branch of computer or library science relating to the storage, locating, searching, and selecting, upon demand, relevant data on a given subject." As noted by Carl Sagan, "human beings have, in the most recent few tenths of a percent of our existence, invented not only extra-genetic but also extrasomatic knowledge: information stored outside our bodies, of which writing is the most notable example." The benefits of enhancing personal knowledge with retrieval of extrasomatic knowledge have been shown in a controlled comparison with rote memory.

Although information retrieval is usually thought of as being done by computer, retrieval can also be done by humans for other humans. In addition, some Internet search engines, such as mahalo.com and http://www.chacha.com/, may have human supervision or editors.

Some Internet search engines, such as http://www.deeppeep.org and http://www.deepdyve.com/, attempt to index the Deep Web, which consists of web pages that are not normally public.

Classification by user purpose
Information retrieval can be divided into information discovery, information recovery, and information awareness.

Information discovery
Information discovery is searching for information that the searcher has not seen before and does not know for certain to exist. It includes searching to answer a question at hand, and searching on a topic without a specific question in order to improve knowledge of that topic.

Information recovery
Information recovery is searching for information that the searcher has seen before and knows to exist.

Information awareness
Information awareness has been described as "'systematic serendipity' - an organized process of information discovery of that which he [the searcher] did not know existed". Examples of this prior to the Internet include reading print and online periodicals. With the Internet, new methods include email newsletters, email alerts, and RSS feeds.

Document retrieval

 * Boolean
 * Vector space model (relevancy)
 * Probabilistic (Bayes)
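
As a rough illustration of the first two models, the following minimal sketch builds an inverted index for Boolean retrieval and ranks documents by cosine similarity for the vector space model. The toy collection, the whitespace tokenizer, the raw term counts used as weights, and all function names are assumptions made for this example, not part of any particular system. The probabilistic model is not sketched here.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy collection; the document IDs and texts are illustrative only.
DOCS = {
    1: "heart attack treatment with aspirin",
    2: "aspirin dosage for headache",
    3: "treatment of heart failure",
}

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def boolean_and(index, terms):
    """Boolean retrieval: documents that contain every query term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

def cosine(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(docs, query):
    """Vector space retrieval: matching documents, most similar first."""
    q = Counter(tokenize(query))
    scores = {d: cosine(q, Counter(tokenize(text))) for d, text in docs.items()}
    matching = [d for d in scores if scores[d] > 0]
    return sorted(matching, key=lambda d: scores[d], reverse=True)

index = build_index(DOCS)
print(boolean_and(index, ["heart", "treatment"]))   # unordered set of matches
print(rank(DOCS, "aspirin heart treatment"))        # ordered by relevance
```

The Boolean query returns an unordered set of exact matches, whereas the vector space query returns an ordered list in which partial matches still appear, which is why the latter is associated with relevancy ranking.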

Factors associated with unsuccessful retrieval
The field of medicine provides much research on the difficulties of information retrieval. Barriers to successful retrieval include:
 * Lack of prior experience with the information retrieval system being used
 * Low visual spatial ability
 * Poor formulation of the question to be searched
 * Difficulty designing a search strategy when multiple resources are available
 * "Uncertainty about how to know when all the relevant evidence has been found so that the search can stop"
 * Difficulty synthesizing an answer across multiple documents

Characteristics of how the information is stored
For storage of text content, the quality of the index to the content is important. For example, stemming, which truncates words by removing suffixes, may help.
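
A minimal sketch of suffix-stripping stemming follows. The suffix list, the minimum stem length, and the function names are assumptions chosen for illustration; established stemmers such as the Porter algorithm apply far more careful rules.

```python
# Illustrative suffix list (an assumption); longer suffixes are tried first.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Return the set of stemmed terms that would be stored in the index."""
    return {naive_stem(w) for w in text.split()}

query = "searching"
document = "the user searched and searches again"

# The query and the document share no identical word, but their stems match.
print(naive_stem(query) in index_terms(document))
```

Because both the indexed text and the query are reduced to the same stem, a search for one word form can retrieve documents that contain a different form of the word.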

Display of information
Structured information was found to be more effective in a controlled study. In addition, the structure should be layered, with a summary of the content being the first layer that the reader sees. This allows the reader to take only an overview or to choose more detail. Some Internet search engines, such as http://www.kosmix.com/, try to organize search results beyond a one-dimensional list of results.

Regarding display of results from search engines, an interface designed to reduce anchoring and order bias may improve decision making.

Characteristics of the search engine
John Battelle has described features of the perfect search engine of the future. For example, the use of Boolean searching may not be as efficient.

Characteristics of the searcher
In healthcare, searchers are more likely to be successful if their answer is correct before searching, they have experience with the system they are searching, and they have a high spatial visualization score. Also in healthcare, physicians with less experience are more likely to want more information. Physicians who report stress when uncertain are more likely to search textbooks than source evidence.

In healthcare, using expert searchers on behalf of physicians increased the physicians' satisfaction with the search results.

Evaluation of the quality of information retrieval
Various methods exist to evaluate the quality of information retrieval. Hersh noted the classification of evaluation developed by Lancaster and Warner in which the first level of evaluation is:
 * Costs/resources consumed in learning and using a system
 * Time needed to use the system
 * Quality of the results.
 * Coverage. An estimate of coverage can be crudely automated. However, more accurate judgment of relevance requires a human judge, which introduces subjectivity.
 * Precision and recall
 * Novelty. This has been judged by independent reviewers.
 * Completeness and accuracy of results. An easy method of assessing this is to let the searcher make a subjective assessment.

Precision and recall
Recall is the fraction of relevant documents that are successfully retrieved. This is the same as sensitivity.


 * $$ \mbox{recall}=\frac{|\{\mbox{relevant documents}\}\cap\{\mbox{retrieved documents}\}|}{|\{\mbox{relevant documents}\}|} $$

Precision is the fraction of retrieved documents that are relevant to the search. This is the same as positive predictive value.


 * $$ \mbox{precision}=\frac{|\{\mbox{relevant documents}\}\cap\{\mbox{retrieved documents}\}|}{|\{\mbox{retrieved documents}\}|} $$

F1 is the unweighted harmonic mean of the recall and precision.
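
The three measures can be computed directly from the set of relevant documents and the set of retrieved documents. The sketch below is a minimal illustration; the function name, the document identifiers, and the choice to return 0.0 when a denominator is empty are assumptions made for this example.

```python
def precision_recall_f1(relevant, retrieved):
    """Compute precision, recall, and F1 from two collections of document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # relevant documents that were retrieved

    # Assumption: a measure with an empty denominator is reported as 0.0.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0

    # F1 is the harmonic mean: 2 * P * R / (P + R).
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical judgments: four relevant documents, three retrieved, two in common.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d5"}

p, r, f = precision_recall_f1(relevant, retrieved)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

In this example, two of the three retrieved documents are relevant (precision 2/3) and two of the four relevant documents are retrieved (recall 1/2), giving an F1 of 4/7.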