Thursday, January 30, 2014

Unit 4 Reading Note (2/3)

IIR 1.3 and 1.4

Processing Boolean queries: intersection by merge algorithm
Proximity operator (query closeness)
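The merge intersection above can be sketched in a few lines of Python, assuming each postings list is a sorted list of docIDs:

```python
def intersect(p1, p2):
    """Merge two sorted postings lists of docIDs in O(len(p1) + len(p2))."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:           # docID in both lists: part of the answer
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:          # advance the pointer with the smaller docID
            i += 1
        else:
            j += 1
    return answer
```

Both pointers only move forward, which is what makes the merge linear in the combined list length.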

Chapter 6

Scoring zones and learning weights (machine-learned relevance; the zone weight g is learned from training examples by minimizing total error)

Term frequency and weight: tf-idf weighting; tf: term frequency; idf: inverse document frequency (rare terms get high idf)
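A sketch of the weight, assuming the common form tf × log10(N/df); the log base and exact variant differ across systems:

```python
import math

def tf_idf(tf, df, N):
    """tf-idf weight of a term in a document.
    tf = raw term frequency in the document,
    df = number of documents containing the term,
    N  = total number of documents.
    Rare terms (small df) get a high idf."""
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log10(N / df)
```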

-->Vector space model for scoring: dot products (term weights; cosine similarity) -->queries as vectors (V(q) . V(d))
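The score V(q) . V(d) can be sketched as cosine similarity between two equal-length weight vectors (a toy illustration, not tied to any particular weighting scheme):

```python
import math

def cosine(v_q, v_d):
    """Cosine similarity: dot product of the two vectors
    divided by the product of their Euclidean lengths."""
    dot = sum(q * d for q, d in zip(v_q, v_d))
    norm_q = math.sqrt(sum(q * q for q in v_q))
    norm_d = math.sqrt(sum(d * d for d in v_d))
    return dot / (norm_q * norm_d)
```

Normalizing by vector length means documents are ranked by direction, not magnitude, so long documents are not automatically favored.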

Variant tf-idf functions (alternatives): sublinear tf scaling and maximum tf normalization; scheme notation ddd.qqq, e.g. lnc.ltc: log-weighted tf, no idf and cosine normalization (document); log-weighted tf, idf and cosine normalization (query)
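Sublinear tf scaling replaces raw tf with 1 + log(tf); a quick sketch (base 10 assumed here):

```python
import math

def sublinear_tf(tf):
    """Sublinear tf scaling: wf = 1 + log10(tf) if tf > 0, else 0.
    Dampens the effect of repeated occurrences of a term."""
    return 1 + math.log10(tf) if tf > 0 else 0.0
```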

Pivoted normalized document length (enables the dot product score to account for the effect of document length on relevance): a normalization factor for each document vector that is not the Euclidean length of that vector, but one that is larger than the Euclidean length for documents shorter than the pivot length, and smaller for longer documents.
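One common linear form of the pivoted normalization factor can be sketched as below; the slope value is an illustrative assumption, not from the notes:

```python
def pivoted_norm(doc_len, pivot, slope=0.75):
    """Pivoted length normalization factor:
    (1 - slope) * pivot + slope * doc_len.
    Larger than doc_len for documents shorter than the pivot,
    smaller than doc_len for documents longer than the pivot."""
    return (1 - slope) * pivot + slope * doc_len
```

Dividing scores by this factor (instead of the Euclidean length) boosts long documents and penalizes short ones relative to plain cosine normalization.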

Monday, January 27, 2014

Unit 3 Muddiest Points (1/27)

During the merge process of the dictionary, are the postings merged automatically along with the terms, or dealt with separately?

Thursday, January 23, 2014

Unit 3 Reading Note (1/27)

IIR Chapter 4 - Index Construction

Hardware basics: accessing main memory is faster than accessing disk (hence caching); seek time (the time to locate data on disk); buffer (the part of main memory where a block being read or written is held).

Blocked sort-based indexing (termID-docID pairs) uses an external sorting algorithm: sort the pairs, then collect all termID-docID pairs with the same termID into a postings list. O(T log T).

Single-pass in-memory indexing (term-docID pairs) adds a posting directly to its postings list. Faster and more memory-efficient. O(T).

Distributed indexing: MapReduce coordinated by a master node; parsers process document splits into term-docID pairs, and inverters collect all values (docIDs) for a given key (term) into one postings list.

Dynamic indexing (instead of periodically reconstructing the index from scratch): maintain two indexes--a large main index and a small auxiliary index that stores new documents. Logarithmic merging: O(T log(T/n)).

Chapter 5 - Index Compression

Benefits of index compression: increased use of caching; faster transfer of data from disk to memory; conserving disk space.

Statistical properties of terms: Heaps' law (estimating the number of terms) and Zipf's law (modeling the distribution of terms: cf_i ∝ 1/i)
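Both laws as simple functions; the default k and b here are the fitted values IIR reports for the RCV1 collection and vary by corpus:

```python
def heaps_law(T, k=44, b=0.49):
    """Heaps' law: vocabulary size M = k * T^b for a collection of T tokens."""
    return k * T ** b

def zipf_cf(i, c=1.0):
    """Zipf's law: collection frequency of the i-th most common term, cf_i ~ c / i."""
    return c / i
```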

Dictionary compression: fixed-width entries, dictionary as a string (storing the dictionary terms as one long string), blocked storage (eliminates k−1 term pointers per block but needs k additional bytes for storing term lengths; term lookup becomes slower), and blocking & front coding.

Postings file compression (using fewer bits for short gaps): variable byte codes (the high bit of each byte is a continuation bit, set to 1 in the last byte of a gap) and γ codes (unary length code plus the binary offset of each gap).
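Sketches of both schemes for a single gap, following the continuation-bit and length/offset conventions above:

```python
def vb_encode_number(n):
    """Variable byte encode one gap: 7 data bits per byte,
    high (continuation) bit set to 1 on the last byte."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128          # mark the final byte of this gap
    return out

def vb_decode(bytestream):
    """Decode a variable byte stream back into gap numbers."""
    numbers, n = [], 0
    for b in bytestream:
        if b < 128:
            n = 128 * n + b
        else:                # continuation bit set: last byte of this gap
            numbers.append(128 * n + (b - 128))
            n = 0
    return numbers

def gamma_encode(n):
    """Gamma code for n >= 1: offset = binary of n minus its leading 1;
    length = len(offset) in unary, terminated by 0."""
    offset = bin(n)[3:]      # strip '0b1', keeping the bits after the leading 1
    length = "1" * len(offset) + "0"
    return length + offset
```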

Monday, January 13, 2014

Unit 2 Muddiest Points (1/13)

What's the difference between "white_house", "white-house" and "white house" when indexing? Which is the best way to represent the phrase in the index?

Thursday, January 9, 2014

Unit 2 Reading Note (1/13)

IIR Section 1.2

Steps of building an index: collect the documents to be indexed; tokenize the text, turning each document into a list of tokens; produce a list of normalized tokens; index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
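The steps can be sketched end-to-end with a toy tokenizer (lowercasing plus whitespace splitting stands in for real tokenization and normalization):

```python
from collections import defaultdict

def build_index(docs):
    """Build a toy inverted index: term -> sorted postings list of docIDs."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():   # crude tokenize + normalize
            index[token].add(doc_id)
    # sorted postings lists make merge-based intersection possible
    return {term: sorted(ids) for term, ids in index.items()}
```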

Data structure: linked lists (allow cheap insertion) or variable length arrays (avoid overhead for pointers)

IIR Chapter 2

Tokenization is the process of chopping character streams into tokens, while linguistic preprocessing then deals with building equivalence classes of tokens, which are the set of terms that are indexed.

The problems with large document units can be alleviated by use of explicit or implicit proximity search.

Tokenization: token, type, term.

The general trend in IR systems has been away from stop lists (lists of stop words excluded from the index) toward no stop list at all.

Normalization: mapping rules and maintaining relations between unnormalized tokens.

The increased flexibility that comes from distinct postings lists is appealing.

Lowercasing everything often remains the most practical solution.

Solutions for foreign or complex words: use heuristics to equivalence-class terms, or expand terms with phonetic equivalents.

Stemming and Lemmatization.

Faster postings list intersection is achieved by skip pointers. The presence of skip pointers only helps for AND queries, not for OR queries.
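A sketch of intersection with sqrt(L)-spaced skip pointers, simulated over plain Python lists (real implementations store explicit skip pointers inside the postings structure):

```python
def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, jumping ahead by a fixed
    skip span when the other list's current docID is far ahead."""
    skip1 = max(1, int(len(p1) ** 0.5))   # sqrt(L) skip spacing
    skip2 = max(1, int(len(p2) ** 0.5))
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # follow skips on p1 only while they do not overshoot p2[j]
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                while i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                    i += skip1
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                while j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                    j += skip2
            else:
                j += 1
    return answer
```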

Phrase queries are supported by biword indexes (which expand the size of the vocabulary) and positional indexes (which check the positions of terms).

Good queries to include in the phrase index are ones known to be common based on recent querying behavior. But this is not the only criterion: the most expensive phrase queries to evaluate are ones where the individual words are common but the desired phrase is comparatively rare.

IIR Chapter 3

Dictionary solutions: hashing and search trees.

Wildcard query handling: Permuterm indexes ($ to mark the end) and k-gram indexes.
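A sketch of generating permuterm index keys, with $ marking the end of the term; a wildcard query X*Y is answered by looking up the rotation Y$X*:

```python
def permuterm(term):
    """All rotations of term + '$'; each rotation is a key
    pointing back to the original term in the permuterm index."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]
```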

Isolated-term correction: edit distance (+permuterm index) and k-gram overlap (Jaccard coefficient, +edit distance).
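The edit distance used for isolated-term correction is the standard Levenshtein dynamic program, sketched below:

```python
def edit_distance(s1, s2):
    """Levenshtein distance: minimum number of insertions,
    deletions and substitutions turning s1 into s2."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j                      # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution/match
    return dp[m][n]
```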

Phonetic correction: the Soundex algorithm.

Unit 1 Muddiest Points (1/6)

For searching data in a non-relational database such as MongoDB, which category does the search belong to: IR or database search?

Unit 1 Reading Note (1/6)

1. FOA section 1.1

Finding Out About a topic of special interest requires looking for things that are relevant, and enables us to know the meaning of things, i.e., the semantics of the words, sentences, questions and documents involved.

The basic process of FOA (ask, answer and assess) is specific to each component of a search engine. The fundamental operation of a search engine is a match between descriptions by users (queries) and documents (the corpus).

2. IES section 1.1 and 1.2

IR: represent, search, and manipulate large collections of electronic text and other human-language data.

Relevance ranking is a core problem in IR <- RSV (Retrieval Status Value), Probability Ranking Principle (PRP)

A major task of a search engine is to maintain and manipulate an inverted index for a document collection.

Search result presentation: documents (self-contained units), elements (pages/paragraphs) or snippets (text passages/video segments)

Performance Evaluation: efficiency, effectiveness, specificity, exhaustivity and novelty.

3. MIR sections 1.1-1.4

The IR Problem: the primary goal of an IR system is to retrieve all the documents that are relevant to a user query while retrieving as few non-relevant documents as possible.

The Web's effects on search come from: the characteristics of the document collection itself; the vast size of the collection and the volume of user queries submitted daily; the fact that the Web is not just a repository of documents and data, but also a medium for doing business; and Web advertising and other economic incentives.

Practical Issues: security, privacy, copyright and patent rights.