Friday, March 28, 2014

Unit 11 Reading Note (3/31)

IES Chapter 14 - Parallel Information Retrieval

Document Partitioning: the collection is divided into nonoverlapping subsets of documents, and each subset is indexed by a separate node. Many of the engine's features can be configured to operate differently for each partition. But it performs poorly when the index is stored on disk, since every node incurs disk seeks for every query.

Term partitioning addresses the disk seek problem by splitting the collection into sets of terms instead of sets of documents.

MapReduce jobs are highly parallelizable, because both the map and reduce phases can be executed in parallel on many different machines.
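A minimal in-memory sketch of the map/reduce split for index construction (the function names and toy corpus are my own illustration, not from the chapter):

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # Map: emit (term, doc_id) pairs; many mappers can run in parallel.
    return [(term, doc_id) for term in text.lower().split()]

def reduce_phase(pairs):
    # Reduce: group all doc_ids for the same term into one postings list.
    postings = defaultdict(list)
    for term, doc_id in sorted(set(pairs)):
        postings[term].append(doc_id)
    return dict(postings)

docs = {1: "hello world", 2: "hello mapreduce"}
pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
index = reduce_phase(pairs)
```

In a real deployment the pairs would be partitioned by key so that each reducer handles a disjoint range of terms.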

Unit 10 Muddiest Points (3/24)

I have no muddiest points this week.

Friday, March 21, 2014

Unit 10 Reading Note (3/24)

IIR chapters 19 and 21

In response to queries a search engine can return web pages whose contents it has not indexed.

Search engines generally organize their indexes in various tiers and partitions, not all of which are examined on every search.

In a Markov chain, the probability distribution of next states depends only on the current state, and not on how the chain arrived at that state.
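The random-surfer idea built on this property can be sketched as a power iteration over the chain's transition probabilities (a toy PageRank; the damping factor d = 0.85 and dangling-node handling are common conventions, not the book's exact formulation):

```python
def pagerank(links, d=0.85, iters=50):
    # links: {page: [outlinks]}; power iteration on the Markov chain.
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = d * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Dangling node: spread its mass evenly over all pages.
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank
```

The ranks always sum to 1, and pages with more (or better-ranked) in-links accumulate more probability mass.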

Thursday, February 27, 2014

Unit 8 Reading Note (3/3)

MIR chapter 10

Human-Computer Interaction Design Principles:
1, Offer informative feedback
2, Reduce working memory load
3, Provide alternative interfaces for novice and expert users

Marti A. Hearst. Ch. 1: The Design Of Search User Interfaces. Search User Interfaces.

Design of interface: offer informative feedback; support user control; reduce short-term memory load; provide shortcuts for skilled users; reduce errors and offer simple error handling; strive for consistency; permit easy reversal of actions; design for closure.

Marti A. Hearst. Ch. 11: Information Visualization For Text Analysis. Search User Interfaces.

Visualization is a promising tool for the analysis and understanding of text collections, including semi-structured text as found in citation collections, and for applications such as literary analysis. Visualization has also been applied to online conversations and other forms of social interaction which have textual components. It is likely that the use of visualization for analysis of text will only continue to grow in popularity.

Unit 7 Muddiest Points (2/24)

Since BIM is based on whether a document is relevant or nonrelevant, does BM25 get relevance feedback from BIM?

Thursday, February 20, 2014

Unit 7 Reading Note (2/24)

IIR Chapter 9

Query Refinement

--Local: Rocchio algorithm (relevance feedback: the optimal query is the vector difference between the centroids of the relevant and nonrelevant documents); probabilistic relevance feedback (Naive Bayes probabilistic model); when it works (the query must be close to the relevant documents, and the relevant documents must be clustered); pseudo relevance feedback (assume the top k results are relevant, without user interaction); indirect relevance feedback (DirectHit).

--Global: vocabulary tools; query expansion (by thesaurus, automatically generated by exploiting word co-occurrence and grammatical analysis).
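A sketch of the Rocchio update on sparse term-weight dicts, assuming the standard α, β, γ parameters (the default values here are illustrative, and negative weights are clipped to zero as is conventional):

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    # q_new = alpha*q + beta*centroid(relevant) - gamma*centroid(nonrelevant)
    terms = set(query)
    terms |= {t for d in relevant for t in d}
    terms |= {t for d in nonrelevant for t in d}
    new_q = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:
            w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        new_q[t] = max(w, 0.0)  # clip negative weights to zero
    return new_q
```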

Unit 6 Muddiest Points (2/17)

Since the Binary Independence Model (BIM) can incorporate relevance feedback, is it a document likelihood model?

Friday, February 14, 2014

Unit 6 Reading Note (2/17)

IIR Chapter 8

Evaluation of unranked retrieval sets: Precision (positive predictive value), Recall (sensitivity), F measure.
Evaluation of ranked retrieval results: interpolated precision, MAP (Mean Average Precision), Precision at k -> R-precision, ROC curve (sensitivity vs. 1-specificity) and cumulative gain.
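The unranked set measures as a sketch (the Fβ formula is the standard weighted harmonic mean of precision and recall):

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    # Precision = |retrieved ∩ relevant| / |retrieved|
    # Recall    = |retrieved ∩ relevant| / |relevant|
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    if p + r == 0:
        return p, r, 0.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r)
    return p, r, f
```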

Assessing relevance: kappa statistic (agreement between judges)
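A sketch of the kappa statistic for two judges' binary relevance judgments, using pooled marginals for the chance-agreement term:

```python
def kappa(j1, j2):
    # j1, j2: parallel lists of binary relevance judgments (0/1).
    n = len(j1)
    p_agree = sum(a == b for a, b in zip(j1, j2)) / n
    # Pooled marginal probability of a "relevant" judgment.
    p_yes = (sum(j1) + sum(j2)) / (2 * n)
    p_chance = p_yes ** 2 + (1 - p_yes) ** 2
    return (p_agree - p_chance) / (1 - p_chance)
```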

Marginal relevance: whether a document still has distinctive usefulness after the user has looked at certain other documents.

Monday, February 10, 2014

Unit 5 Muddiest Points (2/10)

Does BIM (Binary Independence Model) belong to the document likelihood model family?

Friday, February 7, 2014

Unit 5 Reading Note (2/10)

IIR Chapter 11

Binary Independence Model (BIM) assumption: the presence or absence of a word in a document is independent of the presence or absence of any other word (given the query); terms not occurring in the query are equally likely to occur in relevant and nonrelevant documents: that is, if qt = 0 then pt = ut

Retrieval Status Value (RSV) estimates: u_t ≈ df_t/N (nonrelevant documents approximated by the whole collection); p_t estimated from a fixed-size set V of known relevant documents.
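A sketch of the RSV term weights under these estimates, assuming u_t = df_t/N and p_t held fixed; with p_t = 0.5 the weight reduces to log((N − df_t)/df_t), an idf-like quantity:

```python
import math

def rsv_weights(df, N, p=0.5):
    # c_t = log( p_t(1 - u_t) / (u_t(1 - p_t)) ), with u_t = df_t / N
    weights = {}
    for t, d in df.items():
        u = d / N
        weights[t] = math.log(p * (1 - u) / (u * (1 - p)))
    return weights

def rsv(doc_terms, query_terms, weights):
    # RSV_d: sum the weights of query terms present in the document.
    return sum(weights[t] for t in query_terms if t in doc_terms and t in weights)
```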

Chapter 12

Language Model (LM): query likelihood model: P(d|q) ∝ P(d) ∏_{t∈q} ((1 − λ)P(t|Mc) + λP(t|Md))
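A sketch of query-likelihood scoring with this linear (Jelinek-Mercer) smoothing of the document model against the collection model; the term-count dicts and λ = 0.5 are illustrative:

```python
import math

def query_likelihood(query, doc, collection, lam=0.5):
    # log P(q|d) = sum over query terms of
    #   log( (1 - lam) * P(t|Mc) + lam * P(t|Md) )
    doc_len = sum(doc.values())
    coll_len = sum(collection.values())
    score = 0.0
    for t in query:
        p_md = doc.get(t, 0) / doc_len        # document model (MLE)
        p_mc = collection.get(t, 0) / coll_len  # collection model (MLE)
        score += math.log((1 - lam) * p_mc + lam * p_md)
    return score
```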

Different from BIM, the LM approach does away with explicitly modeling relevance.

Combining the query language model and the document language model: Kullback-Leibler (KL) divergence.

Translation Model: P(q|Md) = ∏_{t∈q} ∑_v P(v|Md) T(t|v)

Relating the New Language Models of Information Retrieval to the Traditional Retrieval Models

LM shares some characteristics with VS (vector space) and BIM: a justification for using tf-idf weights, and a new relevance weighting method (terms can be assigned a zero relevance weight; iterate the two steps until the value of the relevance weight does not change).

Extended Boolean retrieval: the probability of the disjunction of m possible translations; easy to add term/collection frequencies by OR; "grouping" by OR; disjunction into conjunctive normal form.

LM outperforms VS, BIM and Boolean Model.

Monday, February 3, 2014

Unit 4 Muddiest Points (2/3)

1, In using a heap for selecting the top K, why does construction of the heap take 2J operations, with each of the top k read off in 2 log J steps? As far as I know, heap construction takes n log n operations.

2, Why can "bags of words" be represented as vectors? For example, "Mary is quicker than John" and "John is quicker than Mary" end up with identical vector representations, I think.

Thursday, January 30, 2014

Unit 4 Reading Note (2/3)

IIR 1.3 and 1.4

Processing Boolean queries: intersection by merge algorithm
Proximity operator (query closeness)
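The merge algorithm for intersecting two sorted postings lists, as a sketch (linear in the combined list lengths):

```python
def intersect(p1, p2):
    # Merge two sorted postings lists in O(len(p1) + len(p2)).
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer
```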

Chapter 6

Scoring zones and learning weights (machine-learned relevance; the weight g is learned from training examples by minimizing the error).

Term frequency and weight: tf-idf weighting; tf: term frequency; idf: inverse document frequency (rare term is high)

-->Vector space model for scoring: dot products (weights of terms; cosine similarity) -->queries as vectors (v(q).v(d))

Variant tf-idf functions (alternatives): sublinear tf scaling and maximum tf normalization; weighting schemes (ddd.qqq notation, e.g. lnc.ltc: log-weighted tf, no idf, cosine normalization for the document; log-weighted tf, idf, cosine normalization for the query).
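A sketch of lnc.ltc scoring on term-frequency dicts (log base 10, as in IIR; the df map and N are illustrative inputs):

```python
import math

def lnc_ltc_score(query_tf, doc_tf, df, N):
    # Document side (lnc): log-weighted tf, no idf, cosine normalization.
    # Query side (ltc): log-weighted tf, idf, cosine normalization.
    def lw(tf):  # sublinear tf scaling: 1 + log10(tf)
        return 1 + math.log10(tf) if tf > 0 else 0.0
    d = {t: lw(tf) for t, tf in doc_tf.items()}
    q = {t: lw(tf) * math.log10(N / df[t]) for t, tf in query_tf.items() if t in df}
    dn = math.sqrt(sum(w * w for w in d.values()))
    qn = math.sqrt(sum(w * w for w in q.values()))
    if dn == 0 or qn == 0:
        return 0.0
    return sum(q[t] * d.get(t, 0.0) for t in q) / (dn * qn)
```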

Pivoted normalized document length (enable dot product score account for the effect of document length on relevance): a normalization factor for each document vector that is not the Euclidean length of that vector, but instead one that is larger than the Euclidean length for documents of length less than lp, and smaller for longer documents.

Monday, January 27, 2014

Unit 3 Muddiest Points (1/27)

During the merge process of dictionary, are the postings merged automatically with terms or dealt separately?

Thursday, January 23, 2014

Unit 3 Reading Note (1/27)

IIR Chapter 4 - Index Construction

Hardware basics: accessing memory is faster than accessing disk (hence caching); seek time (the time to locate data on disk); buffer (the part of main memory where a block being read or written is held).

Blocked sort-based indexing (termID-docID pairs) uses an external sorting algorithm: sort the pairs, then collect all termID-docID pairs with the same termID into a postings list. O(T log T).

Single-pass in-memory indexing (term-docID pairs) adds a posting directly to its postings list. Fast and memory saving. O(T).

Distributed indexing: MapReduce coordinated by a master node; parsers turn input splits into (termID, docID) pairs, and inverters collect all values for a given key into one postings list.

Dynamic indexing (instead of reconstructing the index from scratch): maintain two indexes--a large main index and a small auxiliary index that stores new documents. Logarithmic merging: O(T log(T/n)).

Chapter 5 - Index Compression

Benefits of index compression: increased use of caching; faster transfer of data from disk to memory; conserved disk space.

Statistical properties of terms: Heaps' law (estimating the number of terms) and Zipf's law (modeling the distribution of terms: cf_i ∝ 1/i)
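Both laws as one-liners (k = 44 and b = 0.49 are typical fitted values quoted in IIR, not universal constants):

```python
def heaps_vocab(T, k=44.0, b=0.49):
    # Heaps' law: vocabulary size M = k * T^b grows sublinearly in token count T.
    return k * T ** b

def zipf_cf(i, c):
    # Zipf's law: collection frequency of the i-th most common term ~ c / i.
    return c / i
```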

Dictionary Compression: fixed-width entries; dictionary as a string (storing the dictionary terms as one long string); blocked storage (eliminates k−1 term pointers but needs an additional k bytes for storing the length of each term; lookup is slower); blocking & front coding.

Postings File Compression (using fewer bits for short gaps): variable byte codes (the high bit of each byte is a continuation bit, set to 1 in the last byte of a gap) and γ codes (unary code; a gap is split into length and offset).
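A sketch of variable byte encoding and decoding of gap values, with the continuation bit in the high bit of the last byte:

```python
def vb_encode_number(n):
    # Split n into base-128 digits, low digit last.
    bytes_ = []
    while True:
        bytes_.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_[-1] += 128  # set the continuation bit on the final byte
    return bytes_

def vb_decode(bytestream):
    # Accumulate 7-bit payloads until a byte with the high bit set ends a number.
    numbers, n = [], 0
    for b in bytestream:
        if b < 128:
            n = 128 * n + b
        else:
            numbers.append(128 * n + (b - 128))
            n = 0
    return numbers
```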

Monday, January 13, 2014

Unit 2 Muddiest Points (1/13)

What's the difference between "white_house", "white-house" and "white house" when indexing? Which is the best way to represent the phrase in the index?

Thursday, January 9, 2014

Unit 2 Reading Note (1/13)

IIR Section 1.2

Steps of building an index: collect the documents to be indexed; tokenize the text, turning each document into a list of tokens; produce a list of normalized tokens; index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.

Data structure: linked lists (allow cheap insertion) or variable length arrays (avoid overhead for pointers)
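The steps above as a toy sketch (whitespace tokenization and lowercasing stand in for real tokenization and normalization):

```python
from collections import defaultdict

def build_index(docs):
    # docs: {doc_id: text}; returns {term: sorted postings list}.
    index = defaultdict(list)
    for doc_id in sorted(docs):       # process documents in doc_id order
        seen = set()
        for token in docs[doc_id].lower().split():  # tokenize + normalize
            if token not in seen:     # one posting per (term, document)
                index[token].append(doc_id)
                seen.add(token)
    return dict(index)
```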

IIR Chapter 2

Tokenization is the process of chopping character streams into tokens, while linguistic preprocessing then deals with building equivalence classes of tokens which are the set of terms that are indexed.

The problems with large document units can be alleviated by use of explicit or implicit proximity search.

Tokenization: token, type, term.

The general trend in IR systems: no stop list (keeping stop words in the index).

Normalization: mapping rules and maintaining relations between unnormalized tokens.

The increased flexibility that comes from distinct postings lists is appealing.

Lowercasing everything often remains the most practical solution.

Solutions for foreign or complex words: use heuristics to equivalence-class terms, or expand terms with phonetic equivalents.

Stemming and Lemmatization.

Faster postings list intersection is achieved by skip pointers. The presence of skip pointers only helps for AND queries, not for OR queries.
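A sketch of intersection with skip pointers, using the common heuristic of about √L evenly spaced skips for a list of length L:

```python
import math

def intersect_with_skips(p1, p2):
    # p1, p2: sorted postings lists; skip lengths ~ sqrt(list length).
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow the skip pointer if it does not overshoot the target.
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer
```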

Phrase queries are supported by biword indexes (which expand the size of the vocabulary) and positional indexes (which check the positions of terms).

Good queries to include in the phrase index are ones known to be common based on recent querying behavior. But this is not the only criterion: the most expensive phrase queries to evaluate are ones where the individual words are common but the desired phrase is comparatively rare.

IIR Chapter 3

Dictionary solutions: hashing and search trees.

Wildcard query handling: Permuterm indexes ($ to mark the end) and k-gram indexes.
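A sketch of permuterm rotations and the lookup-key rotation for a single-wildcard query (the helper names are mine):

```python
def permuterm_rotations(term):
    # Append the '$' end marker, then index every rotation of term$.
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def wildcard_key(query):
    # For a query X*Y, rotate so the wildcard falls at the end: look up Y$X*.
    prefix, suffix = query.split("*")
    return suffix + "$" + prefix
```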

Isolated-term correction: edit distance (+permuterm index) and k-gram overlap (Jaccard coefficient, +edit distance).
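Sketches of both correction tools: Levenshtein edit distance by dynamic programming, and k-gram overlap scored with the Jaccard coefficient:

```python
def edit_distance(s1, s2):
    # Levenshtein distance via the standard DP table.
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

def kgrams(term, k=2):
    # The set of k-character substrings of term.
    return {term[i:i + k] for i in range(len(term) - k + 1)}

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)
```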

Phonetic correction algorithms: Soundex Algorithms

Unit 1 Muddiest Points (1/6)

For searching data in a non-relational database such as MongoDB, which kind of search does it belong to: IR or database search?

Unit 1 Reading Note (1/6)

1. FOA section 1.1

Finding Out About (FOA) a topic of special interest requires looking for things that are relevant, which in turn requires knowing the meaning of things--the semantics of the words, sentences, questions and documents involved.

The basic process of FOA (Ask, Answer, Assess) maps onto the components of a search engine. The fundamental operation of a search engine is matching descriptions (queries) from users against documents (the corpus).

2. IES section 1.1 and 1.2

IR: represent, search, and manipulate large collections of electronic text and other human-language data.

Relevance ranking is a core problem in IR <- RSV (Retrieval Status Value), Probability Ranking Principle (PRP)

A major task of a search engine is to maintain and manipulate an inverted index for a document collection.

Search result presentation: documents (self-contained units), elements (pages/paragraphs) or snippets (text passages/video segments)

Performance Evaluation: efficiency, effectiveness, specificity, exhaustivity and novelty.

3. MIR sections 1.1-1.4

The IR Problem: the primary goal of an IR system is to retrieve all the documents that are relevant to a user query while retrieving as few non-relevant documents as possible.

The Web's effects on search come from: the characteristics and vast size of the document collection itself; the volume of user queries submitted daily; the fact that the Web is not just a repository of documents and data but also a medium for doing business; and Web advertising and other economic incentives.

Practical Issues: security, privacy, copyright and patent rights.