IS2140: Unit 2 Reading Note (1/13)

IIR Section 1.2

Step of building index: Collect the documents to be indexed; Tokenize the text, turning each document into a list of tokens; Producing a list of normalized tokens; Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.

Data structure: linked lists (allow cheap insertion) or variable length arrays (avoid overhead for pointers)

IIR Chapter 2

Tokenization is the process of chopping character streams into tokens, while linguistic prepro- cessing then deals with building equivalence classes of tokens which are the set of terms that are indexed.

The problems with large document units can be alleviated by use of explicit or implicit proximity search.

Tokenization: token, type, term.

The general trend in IR systems: no stop list (stop words).

Normalization: mapping rules and maintaining relations between unnormalized tokens.

The increased flexibility that comes from distinct postings lists is appealing.

Lowercasing everything often remains the most practical solution.

Solution to foreign or complex words: to use heuristics to equiva- lence class or expand terms with phonetic equivalents.

Stemming and Lemmatization.

Faster postings list intersection are archived by skip pointers. The presence of skip pointers only helps for AND queries, not for OR queries.

Phrase queries are supported by Biword indexes (expend the size of vocabulary) and Positional indexes (check positions of terms).

Good queries to include in the phrase index are ones known to be common based on re- cent querying behavior. But this is not the only criterion: the most expensive phrase queries to evaluate are ones where the individual words are com- mon but the desired phrase is comparatively rare.

IIR Chapter 3

Dictionary solutions: hashing and search trees.

Wildcard query handling: Permuterm indexes ($ to mark the end) and k-gram indexes.

Isolated-term correction: edit distance (+permuterm index) and k-gram overlap (Jaccard coefficient, +edit distance).

Phonetic correction algorithms: Soundex Algorithms

IS2140

Thursday, January 9, 2014

Unit 2 Reading Note (1/13)

No comments:

Post a Comment