Thursday, January 23, 2014

Unit 3 Reading Note (1/27)

IIR Chapter 4 - Index Construction

Hardware basics: accessing data in memory is much faster than accessing it on disk (hence caching); seek time is the time needed to position the disk head over the data; a buffer is the part of main memory where a block being read or written is held.

Blocked sort-based indexing (BSBI) works on termID-docID pairs and uses an external sorting algorithm: within each block, sort the pairs, then collect all termID-docID pairs with the same termID into a postings list; finally merge the block indexes. O(T log T).
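The per-block inversion step can be sketched as follows (a minimal illustration of the sort-then-group idea, not the full on-disk algorithm; the function name is mine):

```python
from itertools import groupby

def invert_block(pairs):
    """Invert one block of (termID, docID) pairs: sort them, then group
    all pairs sharing a termID into one postings list (BSBI core step)."""
    pairs.sort()  # the O(T log T) sort dominates the cost
    return {term: sorted({doc for _, doc in group})
            for term, group in groupby(pairs, key=lambda p: p[0])}

index = invert_block([(2, 1), (1, 1), (2, 3), (1, 2), (2, 1)])
# termID 1 -> docs [1, 2]; termID 2 -> docs [1, 3]
```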

Single-pass in-memory indexing (SPIMI) works on term-docID pairs and adds each posting directly to its term's postings list, so no sorting of pairs is needed. Fast and memory saving. O(T).
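The contrast with BSBI can be sketched like this: postings lists grow in place as tokens stream by, with no global sort of pairs (a simplified in-memory sketch; SPIMI proper also writes full blocks to disk and merges them):

```python
from collections import defaultdict

def spimi_invert(token_stream):
    """SPIMI-style inversion: each (term, docID) pair is appended directly
    to the term's postings list as it arrives -- O(T) over the stream."""
    dictionary = defaultdict(list)
    for term, doc_id in token_stream:
        postings = dictionary[term]           # list created on first occurrence
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)           # docIDs arrive in order, so append
    return {term: postings for term, postings in sorted(dictionary.items())}
```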

Distributed indexing: MapReduce coordinated by a master node; parsers split their partition of the collection into term-docID pairs (map phase), and inverters collect all values (docIDs) for a given key (term) into one postings list (reduce phase).

Dynamic indexing (the naive solution is to reconstruct the index from scratch whenever the collection changes): maintain two indexes--a large main index and a small auxiliary index that stores new documents--and merge the auxiliary index into the main index when it grows too large. Logarithmic merging brings the total merging cost down to O(T log(T/n)).
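At query time the two indexes are searched together and deletions are filtered out (IIR uses an invalidation bit vector for deletions; a minimal sketch with illustrative names):

```python
def search(term, main_index, aux_index, deleted):
    """Query both the main and auxiliary indexes for a term, then drop
    docIDs marked as deleted (a set here standing in for the bit vector)."""
    hits = set(main_index.get(term, [])) | set(aux_index.get(term, []))
    return sorted(hits - deleted)

main = {"brutus": [1, 2]}      # large, rebuilt rarely
aux = {"brutus": [3]}          # small, holds newly added documents
results = search("brutus", main, aux, deleted={2})
```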

Chapter 5 - Index Compression

Benefits of index compression: increased use of caching; faster transfer of data from disk to memory; conserved disk space.

Statistical properties of terms: Heaps' law (estimating the number of distinct terms: M = kT^b) and Zipf's law (modeling the distribution of terms: cf_i ∝ 1/i).
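Heaps' law as a one-liner, using the parameter values IIR fits for the Reuters-RCV1 collection (k = 44, b = 0.49); the function name is mine:

```python
def heaps_vocab_size(num_tokens, k=44, b=0.49):
    """Heaps' law M = k * T^b: estimated vocabulary size M (distinct terms)
    as a function of collection size T in tokens. k and b are the
    Reuters-RCV1 values quoted in IIR chapter 5."""
    return k * num_tokens ** b

# For a 1,000,000-token collection this predicts roughly 38,000 distinct terms.
estimate = heaps_vocab_size(1_000_000)
```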

Dictionary Compression: fixed-width entries, dictionary-as-a-string (storing the dictionary terms as one long string with term pointers), blocked storage (keep a term pointer only for every kth term, eliminating k−1 term pointers per block but needing an additional k bytes to store the length of each term; lookup within a block is slower) and blocking with front coding (consecutive terms share a common prefix, which is stored only once).

Postings File Compression (using fewer bits for short gaps): variable byte codes (the first bit of each byte is a continuation bit, set to 1 for the last byte of a gap) and γ codes (a unary code for the length followed by the offset, i.e. each gap is split into length and offset).
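Both codes are short enough to sketch in full (following the conventions in IIR: the VB continuation bit is 1 on the last byte of a gap, and the γ offset is the gap's binary representation with its leading 1 removed):

```python
def vb_encode(gaps):
    """Variable byte encoding: 7 payload bits per byte; the high bit is the
    continuation bit, set to 1 only on the final byte of each gap."""
    out = bytearray()
    for gap in gaps:
        chunk = [gap & 0x7F]
        gap >>= 7
        while gap:
            chunk.append(gap & 0x7F)
            gap >>= 7
        chunk.reverse()
        chunk[-1] |= 0x80              # mark the last byte of this gap
        out.extend(chunk)
    return bytes(out)

def vb_decode(data):
    """Inverse of vb_encode: accumulate 7 bits per byte until the
    continuation bit signals the end of a gap."""
    gaps, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if byte & 0x80:
            gaps.append(n)
            n = 0
    return gaps

def gamma_encode(gap):
    """γ code as a bit string: unary length of the offset, a 0 separator,
    then the offset (binary form of the gap minus its leading 1)."""
    offset = bin(gap)[3:]              # strip '0b' and the leading 1 bit
    return "1" * len(offset) + "0" + offset

# IIR's worked example: the gap 824 becomes the two bytes 00000110 10111000.
encoded = vb_encode([824])
```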
