[Exam] 106-2 鄭卜壬 Web Information Retrieval and Mining, Midterm

Author: xavier13540 (柊 四千)   2025-03-27 04:09:47
Course name: Web Information Retrieval and Mining
Course type: CSIE elective
Instructor: 鄭卜壬
College: College of Electrical Engineering and Computer Science
Department: Department of Computer Science and Information Engineering
Exam date (Y/M/D): 2018/04/27
Time limit (minutes): 180
Questions:
1. (26 pts) Two human judges used the pooling method to evaluate the performance
of ten information retrieval (IR) systems. The following table shows how they
rated the relevance of a collected pool of 20 documents to a certain query
topic, where R indicates relevant and N indicates non-relevant. Suppose that a
document is considered relevant only if the two judges agreed in the
evaluation.
Doc ID│ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
────┼──────────────────────────────
Judge 1│ N N R R N R N R N N N N N R R R R N N R
Judge 2│ N N R R N R R R N R N N N N R R N N N N
★ Here are the top-10 ranking lists returned by two of the ten systems, NTU-1
and NTU-2, respectively, for this query topic. Please answer the following
questions.
System│ Rank│ 1 2 3 4 5 6 7 8 9 10
───┼───┼───────────────
NTU-1│Doc ID│ 17 3 12 16 8 6 19 20 15 10
NTU-2│Doc ID│ 4 17 3 11 13 16 7 15 6 8
(a) (3 pts) Explain why pooling is shown to be a valid and practical method
even though we cannot exhaustively annotate all relevant documents.
(b) (3 pts) Calculate the kappa measure between the two judges. (A sketch of
the computation follows after (c).)
(c) (3 pts) Does increasing recall always reduce precision? Give an example
to explain your answer.
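For anyone checking the arithmetic in (b), here is a minimal sketch of Cohen's
kappa computed from the judgment table above (an illustrative aid, not the
official solution):

    # Judgments for docs 1..20, copied from the table above.
    j1 = "NNRRNRNRNNNNNRRRRNNR"
    j2 = "NNRRNRRRNRNNNNRRNNNN"
    n = len(j1)

    p_agree = sum(a == b for a, b in zip(j1, j2)) / n     # observed agreement P(A)
    p_chance = (j1.count("R") / n) * (j2.count("R") / n) \
             + (j1.count("N") / n) * (j2.count("N") / n)  # expected agreement P(E)
    kappa = (p_agree - p_chance) / (1 - p_chance)         # (P(A) - P(E)) / (1 - P(E))
    print(kappa)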
★ Mean Average Precision (MAP), Precision at 3 (P@3), Recall, and Mean
Reciprocal Rank (MRR) are common single-figure measures of retrieval quality.
For each of the following tasks (d)~(g), decide which measure is the most
appropriate for performance evaluation. Based on your choice, which system
performs better? Show your calculations for both systems. Assume that there
are 100 documents in the collection. (Generic implementations of these
measures are sketched after this list.)
(d) (3 pts) The ad-hoc IR task.
(e) (3 pts) The patent retrieval task.
(f) (3 pts) The Web retrieval task.
(g) (3 pts) The question-answering task.
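Not the official solution, but the following sketch gives generic
implementations of the measures named above; the relevant set
{3, 4, 6, 8, 15, 16} follows from the both-judges-agree rule stated in the
problem:

    def precision_at_k(ranking, relevant, k):
        return sum(1 for d in ranking[:k] if d in relevant) / k

    def average_precision(ranking, relevant):
        hits, score = 0, 0.0
        for i, d in enumerate(ranking, start=1):
            if d in relevant:
                hits += 1
                score += hits / i
        return score / len(relevant)     # unretrieved relevant docs contribute 0

    def reciprocal_rank(ranking, relevant):
        return next((1 / i for i, d in enumerate(ranking, start=1)
                     if d in relevant), 0.0)

    relevant = {3, 4, 6, 8, 15, 16}      # docs both judges rated R
    runs = {"NTU-1": [17, 3, 12, 16, 8, 6, 19, 20, 15, 10],
            "NTU-2": [4, 17, 3, 11, 13, 16, 7, 15, 6, 8]}
    for name, ranking in runs.items():
        print(name, precision_at_k(ranking, relevant, 3),
              average_precision(ranking, relevant),
              reciprocal_rank(ranking, relevant))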
★ Accuracy is the fraction of classifications that are correct.
(h) (2 pts) What is the meaning of "false positive" in terms of IR?
(i) (3 pts) Judge whether accuracy is a good measure for the ad-hoc IR task.
Why?
2. (26 pts) The vector space model (VSM) is an algebraic model that represents
documents as vectors of index terms.
★ Several variants of term-weighting for VSM have been developed.
(a) (4 pts) The logarithm function is often used when calculating certain
weights. Give one example formula for such a weight, and explain the
rationale behind the use of the logarithm as clearly as possible.
(b) (5 pts) Here is how term frequency (TF) is transformed in Okapi BM25:
\[\frac{(k+1) \cdot TF}{k + TF} \quad (k \text{ is a non-negative number})\]
What is the meaning of the parameter k? Discuss the cases k = 0 and k = ∞.
What is the upper bound of the transformed TF? Draw a figure showing the
relationship between the original TF and the transformed TF.
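A small plotting sketch of this transform (illustrative only; the k values
shown are arbitrary choices):

    import numpy as np
    import matplotlib.pyplot as plt

    tf = np.linspace(0.1, 20, 200)
    for k in [0, 1, 5, 100]:
        plt.plot(tf, (k + 1) * tf / (k + tf), label=f"k = {k}")
    plt.xlabel("original TF")
    plt.ylabel("transformed TF")
    plt.legend()
    plt.show()
    # Each curve saturates at k + 1; k = 0 yields a binary signal for TF > 0,
    # while a large k leaves TF nearly untransformed.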
★ Relevance feedback provides the VSM with useful information about what is
relevant and what is not.
(c) (3 pts) Explain why pseudo relevance feedback might produce worse
results.
(d) (3 pts) Explain why the Rocchio algorithm might also lead to worse
results.
★ The VSM assumes semantic independence of the terms in its basis. Latent
Semantic Indexing (LSI) helps alleviate the term-mismatching problem.
(e) (3 pts) In LSI, does increasing the dimension (i.e., the number of
concepts) of the latent space always improve recall? Why?
★ Consider a word-document matrix over words $w_1, \ldots, w_3$ (rows) and
documents $d_1, \ldots, d_4$ (columns), whose SVD is performed as follows:
\[
\begin{pmatrix} 5 & 3 & 0 & 1 \\ 3 & 2 & 2 & 6 \\ 0 & 0 & 8 & 7 \end{pmatrix}
=
\begin{pmatrix} .2 & .8 & -.6 \\ .5 & .4 & .7 \\ .8 & -.4 & -.4 \end{pmatrix}
\begin{pmatrix} 12.3 & .0 & .0 & .0 \\ .0 & 6.7 & .0 & .0 \\ .0 & .0 & 2.1 & .0 \end{pmatrix}
\begin{pmatrix} .2 & .1 & .6 & .7 \\ .8 & .5 & -.4 & .0 \\ -.3 & -.1 & -.7 & .7 \\ .0 & .0 & .0 & .0 \end{pmatrix}
\]
(f) (5 pts) Compare the similarity between $d_1$ and $d_2$ with the similarity
between $d_1$ and $d_4$ by computing their inner products in the original
space and in the latent space (with only the two most important latent
concepts, i.e., the rank-2 (k = 2) approximation), respectively. Which
result is more reasonable? Show your calculations. Do NOT reconstruct the
original matrix here. (A computational sketch follows after (g).)
(g) (3 pts) Compute the reconstructed version of document $d_2$ using only
the two most important latent concepts, i.e., the rank-2 (k = 2)
approximation.
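A sketch of the inner-product comparison in (f), using the factor values
exactly as printed above (illustrative, not the official solution):

    import numpy as np

    A = np.array([[5, 3, 0, 1],
                  [3, 2, 2, 6],
                  [0, 0, 8, 7]], dtype=float)    # documents are columns
    sigma2 = np.diag([12.3, 6.7])                # two largest singular values
    Vt2 = np.array([[.2, .1, .6, .7],            # first two rows of V^T
                    [.8, .5, -.4, .0]])

    # Original space: inner products of document columns.
    print(A[:, 0] @ A[:, 1], A[:, 0] @ A[:, 3])  # d1·d2 vs. d1·d4

    # Rank-2 latent space: document j is column j of Sigma_2 V^T_2.
    D = sigma2 @ Vt2
    print(D[:, 0] @ D[:, 1], D[:, 0] @ D[:, 3])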
3. (27 pts) A language model (LM) estimates the probability of a sequence of
words.
(a) (4 pts) Under what circumstance is the query likelihood model, ranking by
p(q|d), equivalent to ranking by p(d|q)? Give an IR application in which
p(d|q) is different from p(q|d).
(b) (4 pts) Compare the query likelihood model with the document likelihood
model. Which one is more likely to be poorly estimated? Why?
(c) (4 pts) Compare the difference between the way to smooth a query LM and
the way to smooth a document LM.
(d) (4 pts) What is the Probability Ranking Principle (PRP)? Can the query
likelihood model be justified by the PRP? Explain your answer.
(e) (3 pts) Suppose query q has n words, i.e., $q = w_1 \ldots w_n$. Develop
a bi-gram LM for p(q|d) that is smoothed with a uni-gram LM. Write down
your formula.
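One standard construction (a sketch of a possible form, not necessarily the
intended answer) interpolates a document bigram model with the document
unigram model:
\[
p(q|d) = p(w_1|d) \prod_{i=2}^{n} \left[ \beta\, p(w_i|w_{i-1}, d) + (1-\beta)\, p(w_i|d) \right], \qquad 0 \le \beta \le 1.
\]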
(f) (8 pts) Given a document collection D with a vocabulary of $w_1, \ldots,
w_6$, you are asked to rank two documents $d_1$ and $d_2$ by query
likelihood as follows:
\[p(q|d) = \prod_{w_i \in q} [\lambda p(w_i|d) + (1-\lambda) p(w_i|D)],\]
where q and d stand for the query and the document, respectively. The
following table shows the counts of each word $w_i$ in $d_1$, $d_2$, and D.
Give an example query, such as $q = w_2 w_3$, showing that ranking with
smoothing is more reasonable than ranking without smoothing. Explain your
answer by calculating $p(q|d_1)$ and $p(q|d_2)$. (A computational sketch
follows after the table.)
Word count│ $d_1$│ $d_2$│ D
─────┼───┼───┼───
$w_1$│ 2│ 7│ 8000
$w_2$│ 0│ 1│ 100
$w_3$│ 3│ 1│ 1000
$w_4$│ 1│ 1│ 400
$w_5$│ 1│ 0│ 200
$w_6$│ 3│ 0│ 300
Sum│ 10│ 10│ 10000
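A sketch for experimenting with candidate queries in (f) (the query below is
just the example from the problem statement; λ = 0.5 is an arbitrary choice):

    counts = {                       # (count in d1, count in d2, count in D)
        "w1": (2, 7, 8000), "w2": (0, 1, 100),  "w3": (3, 1, 1000),
        "w4": (1, 1, 400),  "w5": (1, 0, 200),  "w6": (3, 0, 300),
    }
    LEN_D, LEN_C = 10, 10000         # document / collection lengths (Sum row)

    def likelihood(query, doc, lam):
        p = 1.0
        for w in query:
            c_d, c_C = counts[w][doc], counts[w][2]
            p *= lam * (c_d / LEN_D) + (1 - lam) * (c_C / LEN_C)
        return p

    q = ["w2", "w3"]
    for lam in (1.0, 0.5):           # lam = 1.0 means no smoothing
        print(lam, likelihood(q, 0, lam), likelihood(q, 1, lam))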
4. (8 pts) The general form of Zipf's law is $r \times p(w_r|C) = 0.1$, where r
is the rank of a word in descending order of frequency, $w_r$ is the word at
rank r, and $p(w_r|C)$ is the probability (relative frequency) of word $w_r$.
(a) (4 pts) What is the fewest number of most frequent words that together
account for more than 20% of word occurrences? Show the calculation. (The
harmonic-sum arithmetic is sketched after (b).)
(b) (4 pts) Which strategy is more effective for reducing the size of an
inverted index:
Strategy A: removing low-frequency words
Strategy B: removing high-frequency words
if (i) Zipf's law is considered, or (ii) postings-list compression is
considered?
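For (a), the arithmetic reduces to a partial harmonic sum (a sketch, not the
official solution): under $p(w_r|C) = 0.1/r$, the top m words cover
\[
\sum_{r=1}^{m} p(w_r|C) = 0.1 \sum_{r=1}^{m} \frac{1}{r} = 0.1\,H_m,
\qquad 0.1\,H_3 = 0.1\left(1 + \tfrac{1}{2} + \tfrac{1}{3}\right) \approx 0.183,
\quad 0.1\,H_4 \approx 0.208.
\]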
5. (13 pts) When modeling documents with multivariate Bernoulli distributions,
we represent document $d_k$ as a binary vector indicating whether a word
occurs or not in $d_k$. More specifically, given vocabulary $V = \{w_1,
\ldots, w_M\}$ with M words, document $d_k$ is represented as $d_k = (x_1,
x_2, \ldots, x_M)$, where $x_i$ is either 0 or 1. $x_i = 1$ if word $w_i$ can
be observed in $d_k$; otherwise, $x_i = 0$. Assume that there are N documents
in total in corpus $C = \{d_1, \ldots, d_N\}$, i.e., k = 1..N. We want to
model the N documents with a mixture model with two multivariate Bernoulli
distributions $\theta_1$ and $\theta_2$. Each component $\theta_j (j = 1..2)$
has M parameters $\{p(w_i=1|\theta_j)\} (i = 1..M)$, where $p(w_i=1|
\theta_j)$ means the probability that $w_i$ would show up when using
$\theta_j$ to generate a document. Similarly, $p(w_i=0|\theta_j)$ means the
probability that $w_i$ would not show up when using $\theta_j$ to generate a
document. $p(w_i=1|\theta_j) + p(w_i=0|\theta_j) = 1$. Suppose we choose
$\theta_1$ with probability $\lambda_1$ and $\theta_2$ with probability
$\lambda_2$. $\lambda_1 + \lambda_2 = 1$.
(a) (5 pts) Please define the log-likelihood function for $p(d_k|\theta_1
+\theta_2)$ given such a two-component mixture model.
(b) (8 pts) Suppose we know $p(w_i=1|\theta_1)$ and $p(w_i=1|\theta_2)$.
Write down the E-step and M-step formulas for estimating $\lambda_1$ and
$\lambda_2$. Explain your formulas.
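One possible implementation sketch of the E-step and M-step in (b), with the
component parameters held fixed (as the problem states) and synthetic data
standing in for the corpus (all names and values here are illustrative
assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    N, M = 50, 6
    X = rng.integers(0, 2, size=(N, M))   # N binary document vectors
    P = rng.random(size=(2, M))           # P[j, i] = p(w_i = 1 | theta_j), known
    lam = np.array([0.5, 0.5])            # initial lambda_1, lambda_2

    for _ in range(20):
        # E-step: responsibility gamma[k, j] of component j for document k,
        # proportional to lambda_j * prod_i P[j,i]^x_ki * (1 - P[j,i])^(1 - x_ki).
        like = np.prod(np.where(X[:, None, :] == 1, P, 1 - P), axis=2)  # N x 2
        gamma = lam * like
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: each lambda_j is the average responsibility of component j.
        lam = gamma.mean(axis=0)

    print(lam)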
Poster's notes:
1. Equations that were hard to typeset are given in TeX syntax.
2. Since the matrix multiplication would not look much better even in TeX,
please view this post in a monospaced font.
