Thursday, February 27, 2014

Unit 8 Reading Note (3/3)

MIR chapter 10

Human-Computer Interaction Design Principles:
1, Offer informative feedback
2, Reduce working memory load
3, Provide alternative interfaces for novice and expert users

Marti A. Hearst. Ch. 1: The Design Of Search User Interfaces. Search User Interfaces.

Design of interface: offer informative feedback; support user control; reduce short-term memory load; provide shortcuts for skilled users; reduce errors, offer simple error handling; strive for consistency; permit easy reversal of actions; design for closure.

Marti A. Hearst. Ch. 11: Information Visualization For Text Analysis. Search User Interfaces.

Visualization is a promising tool for the analysis and understanding of text collections, including semi-structured text as found in citation collections, and for applications such as literary analysis. Visualization has also been applied to online conversations and other forms of social interaction which have textual components. It is likely that the use of visualization for analysis of text will only continue to grow in popularity.

Unit 7 Muddiest Points (2/24)

Since BIM is based on whether a document is relevant or nonrelevant, does BM25 get relevance feedback from BIM?

Thursday, February 20, 2014

Unit 7 Reading Note (2/24)

IIR Chapter 9

Query Refinement

--Local: Rocchio algorithm (relevance feedback: the optimal query is the vector difference between the centroids of the relevant and nonrelevant documents); probabilistic relevance feedback (Naive Bayes probabilistic model); when it works (the revised query moves close to the relevant documents; relevant documents cluster together); pseudo relevance feedback (assume the top k results are relevant, without user interaction); indirect relevance feedback (DirectHit).

--Global: vocabulary tools; query expansion via a thesaurus (automatically generated by exploiting word co-occurrence and grammatical analysis).
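The local (Rocchio) side of this can be sketched in a few lines. This is a minimal illustration, assuming dense NumPy term vectors and the textbook default weights (alpha=1.0, beta=0.75, gamma=0.15); the function name and signature are my own, not from IIR.

```python
import numpy as np

def rocchio(query_vec, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback: shift the query toward the centroid of
    the relevant documents and away from the centroid of the nonrelevant ones."""
    q = alpha * np.asarray(query_vec, dtype=float)
    if len(relevant):
        q += beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q -= gamma * np.mean(nonrelevant, axis=0)
    # Negative term weights are conventionally clipped to zero.
    return np.maximum(q, 0.0)
```

Pseudo relevance feedback is the same computation with `relevant` taken to be the top-k retrieved documents and `nonrelevant` left empty.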

Unit 6 Muddiest Points (2/17)

Since the Binary Independence Model (BIM) can incorporate relevance feedback, is it a document likelihood model?

Friday, February 14, 2014

Unit 6 Reading Note (2/17)

IIR Chapter 8

Evaluation of unranked retrieval sets: Precision (fraction of retrieved documents that are relevant), Recall (fraction of relevant documents retrieved, i.e., sensitivity), F measure.
Evaluation of ranked retrieval results: interpolated precision, MAP (Mean Average Precision), Precision at k -> R-precision, ROC curve and cumulative gain.

Assessing relevance: kappa statistic (agreement between judges)

Marginal relevance: whether a document still has distinctive usefulness after the user has looked at certain other documents.

Monday, February 10, 2014

Unit 5 Muddiest Points (2/10)

Does BIM (Binary Independence Model) belong to the document likelihood family of models?

Friday, February 7, 2014

Unit 5 Reading Note (2/10)

IIR Chapter 11

Binary Independence Model (BIM) assumptions: the presence or absence of a word in a document is independent of the presence or absence of any other word (given the query); terms not occurring in the query are equally likely to occur in relevant and nonrelevant documents: that is, if qt = 0 then pt = ut.

Retrieval Status Value (RSV) estimates: ut from document frequency (ut = dft/N, approximating the nonrelevant documents by the whole collection); pt estimated from V, a fixed-size set of top-ranked documents.
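The RSV is the sum of log odds ratios over terms shared by query and document. A sketch under the simplest estimates (ut = dft/N, pt fixed at 0.5, as in IIR); function and variable names are my own:

```python
import math

def bim_rsv(query_terms, doc_terms, df, N, p=0.5):
    """Binary Independence Model retrieval status value.
    df: term -> document frequency; N: collection size.
    With p_t fixed at 0.5, each term's contribution c_t = log((1-u)/u)
    behaves like an idf weight."""
    rsv = 0.0
    for t in set(query_terms) & set(doc_terms):
        u = df[t] / N
        rsv += math.log((p * (1 - u)) / (u * (1 - p)))
    return rsv
```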

Chapter 12

Language Model (LM): query likelihood model: P(d|q) ∝ P(q|d)P(d), where P(q|d) = ∏_{t∈q} ((1−λ)P(t|Mc) + λP(t|Md)) under linear interpolation smoothing.

Unlike BIM, the LM approach does away with explicitly modeling relevance.
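The smoothed query likelihood score above can be sketched directly. This is a minimal illustration with Jelinek-Mercer (linear interpolation) smoothing, assuming documents and the collection are given as token lists; names are my own:

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, lam=0.5):
    """log P(q|d): each query term is scored by interpolating the
    document model Md with the collection model Mc."""
    d, c = Counter(doc), Counter(collection)
    dlen, clen = len(doc), len(collection)
    score = 0.0
    for t in query:
        p_md = d[t] / dlen   # P(t|Md), maximum likelihood
        p_mc = c[t] / clen   # P(t|Mc), collection background
        score += math.log(lam * p_md + (1 - lam) * p_mc)
    return score
```

The collection model keeps the score finite when a query term is absent from the document, which is the whole point of the smoothing.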

Combining a query language model with a document language model: Kullback-Leibler (KL) divergence.

Translation Model: P(q|Md) = ∏_{t∈q} ∑_v P(v|Md) T(t|v)
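This product-of-sums is straightforward to compute given a translation table. A sketch assuming `doc_model` maps vocabulary terms to P(v|Md) and `T` maps (query term, document term) pairs to translation probabilities; the data structures are illustrative:

```python
def translation_likelihood(query, doc_model, T):
    """P(q|Md) under a translation model: each query term t can be
    'generated' by any document-model term v with probability T(t|v)."""
    p = 1.0
    for t in query:
        p *= sum(doc_model[v] * T.get((t, v), 0.0) for v in doc_model)
    return p
```

This lets a document about "cars" match a query for "auto" whenever T("auto"|"car") > 0, which plain query likelihood cannot do.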

Relating the New Language Models of Information Retrieval to the Traditional Retrieval Models

LM shares some characteristics with VS (vector space) and BIM: a justification for using tf.idf weights, and a new relevance weighting method (terms can be assigned a zero relevance weight; iterate the two steps until the value of the relevance weight no longer changes).

Extended Boolean retrieval: the probability of the disjunction of m possible translations; term/collection frequencies are easily added via OR; "grouping" by OR; the disjunction is converted into conjunctive normal form.

In the reported experiments, LM outperforms the VS, BIM, and Boolean models.

Monday, February 3, 2014

Unit 4 Muddiest Points (2/3)

1, In using a heap for selecting the top k, why does construction of the heap take 2J operations and reading the top k take 2 log J steps each? As far as I know, heap construction takes n log n operations.
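On this point: building a heap by repeated insertion is indeed O(n log n), but bottom-up heapify is linear, roughly 2J comparisons for J elements, which is where the 2J figure comes from; each subsequent pop costs O(log J). A sketch using Python's heapq (illustrative, not from the lecture):

```python
import heapq

def top_k(scores, k):
    """Select the k largest of J scores.
    heapq.heapify uses bottom-up sift-down and runs in O(J)
    (about 2J comparisons); each of the k pops then costs O(log J)."""
    heap = [-s for s in scores]   # simulate a max-heap by negating
    heapq.heapify(heap)           # O(J) bottom-up build
    return [-heapq.heappop(heap) for _ in range(k)]
```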

2, Why can "bags of words" be represented as vectors? For example, "Mary is quicker than John" and "John is quicker than Mary" have identical vector representations, I think.
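The two sentences do indeed collapse to the same vector, which is exactly the limitation of the bag-of-words representation: only term counts survive, not word order. A minimal sketch, with a hand-picked vocabulary for illustration:

```python
from collections import Counter

def bow_vector(text, vocab):
    """Map a document to term counts over a fixed vocabulary;
    word order is discarded entirely."""
    counts = Counter(text.lower().split())
    return [counts[t] for t in vocab]

vocab = ["john", "is", "mary", "quicker", "than"]
v1 = bow_vector("Mary is quicker than John", vocab)
v2 = bow_vector("John is quicker than Mary", vocab)
# v1 == v2: the model cannot distinguish who is quicker.
```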