ECE448/CS440 Artificial Intelligence

https://github.com/illinois-cs-coursework

https://courses.grainger.illinois.edu/cs440/fa2025/readings.html

[!IMPORTANT]

https://courses.grainger.illinois.edu/cs440/fa2025/lectures/probability-review.html

Introduction

https://courses.grainger.illinois.edu/cs440/fa2025/lectures/intro.html

Historical and other trivia

We've seen a lot of trivia, most of it not worth memorizing. The following items are the exceptions. Be able to explain (very briefly) what they are and (approximately) what time period they come from.

  • McCulloch and Pitts

    • Time Period: 1940s
    • Contribution: They introduced the first mathematical model of a neural network. Their work was foundational, proposing that networks of simple computational units (neurons) could perform complex logical operations. These were theoretical models on paper, as the hardware to implement them didn't exist yet.
  • Fred Jelinek

    • Time Period: 1980s - 1990s
    • Contribution: A key figure in speech recognition. He pioneered the use of statistical models, specifically n-gram language models and Hidden Markov Models (HMMs), which dramatically improved the accuracy and utility of speech recognition systems.
  • Pantel and Lin (SpamCop)
    • Time Period: Late 1990s
    • Contribution: They were pioneers in using Naive Bayes classifiers for spam detection. Their work showed that this statistical approach was highly effective for classifying emails, forming the basis of many modern spam filters.
      • 朴素贝叶斯垃圾邮件分类器
  • Boulis and Ostendorf
  • The Plato System
    • Time Period: Started in the 1960s
    • Contribution: An early and influential computer-assisted instruction system developed at the University of Illinois. It was a precursor to modern e-learning platforms and online communities.
  • The Golem of Prague
    • Time Period: 16th-century Jewish folklore
    • Contribution: An early myth or story related to artificial intelligence. It tells of an artificial humanoid creature created from clay to protect the Jewish community. It represents an ancient human desire to create intelligent, autonomous beings.

Probability

https://courses.grainger.illinois.edu/cs440/fa2025/lectures/probability-review.html

Random variables, axioms of probability:

  • A random variable is a variable whose value is a numerical outcome of a random phenomenon.
  • The axioms of probability (Kolmogorov's axioms of probability 柯尔莫哥洛夫概率公理) are fundamental rules:
    1. The probability of any event is non-negative.
    2. The probability of the entire sample space (a certain event) is 1.
    3. The probability of the union of mutually exclusive events is the sum of their individual probabilities.

$$ 0 \le P(A) \qquad P(\text{True}) = 1 \qquad P(A \lor B) = P(A) + P(B), \text{ if A and B are mutually exclusive events} $$

Joint, marginal, conditional probability:

  • Joint Probability 联合概率 \(P(A,B)\): The probability of two events occurring together.
  • Marginal Probability 边际概率 \(P(A)\): The probability of a single event occurring, irrespective of other events. It can be calculated by summing the joint probabilities over all outcomes of the other variable: \(P(A)=\sum_B P(A,B)\)
  • Conditional Probability 条件概率 \(P(A\mid B)\): The probability of event A occurring given that event B has already occurred. It is calculated as \(P(A \mid B) = \frac{P(A,B)}{P(B)}\)
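
A tiny worked example of these three quantities, using a made-up joint distribution over weather and traffic (a minimal sketch, not from the course notes):

```python
# A made-up joint distribution P(Weather, Traffic) for illustration.
joint = {
    ("rain", "jam"): 0.20, ("rain", "clear"): 0.10,
    ("sun",  "jam"): 0.15, ("sun",  "clear"): 0.55,
}

# Marginal: P(Weather = rain) = sum over Traffic of P(rain, Traffic)
p_rain = sum(p for (w, _), p in joint.items() if w == "rain")

# Conditional: P(Traffic = jam | Weather = rain) = P(rain, jam) / P(rain)
p_jam_given_rain = joint[("rain", "jam")] / p_rain

print(p_rain)            # 0.30
print(p_jam_given_rain)  # 0.666...
```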

Modelling Text Data

Word types vs. word tokens: - Tokens: The total number of words in a document (e.g., "the cat sat on the mat" has 6 tokens). 单词总数 - Types: The number of unique words in a document (e.g., "the cat sat on the mat" has 5 types: "the", "cat", "sat", "on", "mat"). 词典条目,唯一的单词数

The Bag of Words model: We can use the individual words as features. A bag-of-words model determines the class of a document based on the frequency of occurrence of each word. It ignores the order in which words occur, which ones occur together, etc. So it will miss some details, e.g. the difference between "poisonous" and "not poisonous." 忽略语法甚至词序但保持多样性

Bigrams, ngrams: - N-grams are contiguous sequences of n items (e.g., words, letters) from a given sample of text. - N-gram 是来自给定文本样本的 n 个项目(例如单词、字母)的连续序列。 - A bigram is a specific n-gram where n=2 (a two-word sequence). For example, in "the cat sat", the bigrams are "the cat" and "cat sat". - 特定的 n-gram,其中 n=2(即两个单词的序列)。例如,在“the cat sat”中,二元语法是“the cat”和“cat sat”。
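
A minimal sketch of tokens, types, and bigrams, reusing the example sentence above:

```python
text = "the cat sat on the mat"
tokens = text.split()                      # word tokens: 6 of them
types = set(tokens)                        # word types: 5 unique words
bigrams = list(zip(tokens, tokens[1:]))    # adjacent word pairs

print(len(tokens), len(types))   # 6 5
print(bigrams[:2])               # [('the', 'cat'), ('cat', 'sat')]
```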

Data cleaning: - Tokenization: The process of splitting a stream of text into words, phrases, symbols, or other meaningful elements called tokens. - 标记化: 将文本流拆分为单词、短语、符号或其他有意义的元素(称为标记)的过程。定义单词得到 a clean string of words - divide at whitespace 在空白处划分 - normalize punctuation, html tags, capitalization, etc 规范标点符号、html 标签、大写字母等 - perhaps use a stemmer to remove word endings 使用词干分析器来删除单词结尾 - Stemming 分词: The process of reducing inflected (or sometimes derived) words to their word stem, base or root form. Julie Lovins (1968) created one of the first stemming algorithms, and Martin Porter (1980) developed the Porter Stemmer, which is one of the most widely used. - 词干提取: 将词形变化的词简化为词干、基词或词根形式的过程。Julie Lovins 创建了最早的词干提取算法之一, Martin Porter 开发了 Porter 词干提取器,它是目前使用最广泛的算法之一。 - Making units of useful size: This involves either breaking long words into smaller pieces (common in languages like German) or grouping characters into words (necessary for languages without spaces, like Chinese). - 将长单词分成更小的部分,特别是中文(没有空格)

Special types of words: - Stop words: Very common words (e.g., "the", "a", "is") that are often removed before processing because they carry little semantic weight. - 非常常见的词:function words, fillers, backchannel - Rare words: Words that appear very infrequently. They can be problematic for statistical models and are sometimes removed or replaced with a generic "UNK" (unknown) token. - 生僻词:出现频率极低的词,删除一部分或都用UNK标记(视为一个单独的项目) - Hapax legomena: Words that occur only once in a corpus. They are an extreme case of rare words. - 罕见词的极端情况,只出现一次 - Filler: Words or sounds used to pause in a conversation (e.g., "um," "uh," "like"). - 填充词 - Backchannel: Signals from a listener that indicate they are paying attention (e.g., "uh-huh," "yeah," "I see"). - 听众发出的信号词 - Function vs. content words: Content words (nouns, main verbs, adjectives) carry the primary meaning. Function words (prepositions, articles, conjunctions) provide grammatical structure. - 实词 (名词、主要动词、形容词)承载主要含义 - 功能词 (介词、冠词、连词)提供语法结构

Testing

Roles of training, development, test datasets:

  • Training set 训练集: The data used to train the model and learn its parameters.
  • Development set (or validation set) 验证集: The data used to tune the model's hyperparameters and make design choices. It helps prevent overfitting to the training set. 防止过度拟合
  • Test set 测试集: The data held back until the very end to provide an unbiased, final evaluation of the model's performance. 保留到最后的数据,以对模型的性能提供公正的最终评估。

Evaluation metrics for classification:


Confusion Matrix 混淆矩阵 (columns = labels from algorithm):

| | Algorithm: happen | Algorithm: not happen |
|---|---|---|
| Correct = happen | True Positive (TP) | False Negative (FN) |
| Correct = not happen | False Positive (FP) | True Negative (TN) |

  • False positive rate = \(FP/(FP+TN)\) [what fraction of the actual negatives did we wrongly label positive?]
  • False negative rate = \(FN/(TP+FN)\) [what fraction of the actual positives did we miss?]
  • Accuracy = \((TP+TN)/(TP+TN+FP+FN)\)
    • The fraction of predictions the model got right
  • Error rate = \(1-\text{accuracy}\)
  • precision (p) = \(TP/(TP+FP)\) [how many of our positive outputs were correct?]
  • recall (r) = \(TP/(TP+FN)\) [how many of the correct answers did we find?]
  • F1 = \(2pr/(p+r)\)

    • F1 is the harmonic mean of precision and recall. Both recall and precision need to be good to get a high F1 value.
  • Confusion Matrix: A table that visualizes the performance of a classifier, showing the counts of true positives, true negatives, false positives, and false negatives.
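
The metrics above are simple ratios of the confusion-matrix counts. A minimal sketch (the counts passed in are made up for illustration):

```python
def metrics(tp, fp, fn, tn):
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)      # how many of our positive outputs were correct
    recall    = tp / (tp + fn)      # how many of the correct answers did we find
    f1        = 2 * precision * recall / (precision + recall)
    fpr       = fp / (fp + tn)      # false positive rate
    fnr       = fn / (tp + fn)      # false negative rate
    return accuracy, precision, recall, f1, fpr, fnr

print(metrics(tp=40, fp=10, fn=20, tn=30))
```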

Naive Bayes

https://courses.grainger.illinois.edu/cs440/fa2025/lectures/bayes.html

Basic definitions and mathematical model

  • \(P(A | C)\) is the probability of A in a context where C is true

    • Definition of conditional probability: \(P(A | C) = \frac{P(A,C)}{P(C)}\)
      • Marginalization: \(P(A)=\sum_Z P(A\mid Z)\,P(Z)\)
    • \(P(A,C) = P(A) \times P(C | A) = P(C,A)=P(C) \times P(A | C)\)
    • Bayes Rule: \(P(C|A)=\frac{P(A|C)\times P(C)}{P(A)}\)
      • \(P(cause|evidence)=\frac{P(evidence|cause)P(cause)}{P(evidence)}\)
      • (posterior = likelihood × prior / normalization)
  • Likelihood: \(P(evidence∣cause)\) 概率

  • Prior: \(P(cause)\) 先验
  • Posterior: \(P(cause∣evidence)\) 后验
  • argmax operator: Returns the input value that maximizes a function. In classification, we use it to find the class with the highest posterior probability.
    • 表示返回使函数最大化的输入值的符号,用来找到后验概率最高的类
  • Independence vs. Conditional Independence: Naive Bayes makes a "naive" assumption of conditional independence of features: features are independent of each other given the class. This is a stronger assumption than simple independence.
    • Independence 独立性
      • Two events A and B are independent iff \(P(A,B) = P(A) \times P(B)\)
    • Conditional Independence 条件独立性
      • Definition: \(P(A, B \mid C) = P(A\mid C) \times P(B\mid C)\), equivalently \(P(A \mid B, C) = P(A\mid C)\) and \(P(B \mid A, C) = P(B\mid C)\)
    • Plain independence rarely holds; conditional independence asks whether two variables are independent within a specific context, which is often an approximately reasonable assumption.
  • MAP vs. ML estimate
    • Maximum Likelihood (ML) chooses the parameters that maximize the likelihood of the data. 概率最大化
    • Maximum a Posteriori (MAP) incorporates a prior probability, choosing parameters that maximize the posterior probability. The prior acts as a regularizer. 后验概率最大化
  • Combining evidence: Under the conditional independence assumption, the likelihood of all evidence is simply the product of the likelihoods of each individual piece of evidence:
    • \(P(evidence_1,\dots,evidence_n \mid cause)=\prod_i P(evidence_i \mid cause)\).
  • Model size: Naive Bayes dramatically reduces the number of parameters needed compared to a full joint distribution table, making it computationally feasible and less prone to overfitting on small datasets.

Applying Naive Bayes to text classification

  • equations: You estimate the prior probability of a class by its frequency in the training data, and the likelihood of a word given a class by its frequency within documents of that class.

    • 估计方程,通过训练数据中某个类别的频率来估计该类别的先验概率
  • Avoiding underflow: Since you are multiplying many small probabilities, the result can become too small for a computer to store (underflow). To fix this, you work with the sum of log probabilities instead: \(log(A⋅B)=log(A)+log(B)\).

    • 对数防止乘以太小的数字影响准确性(计算精度),将朴素贝叶斯算法最大化
  • Avoiding overfitting (Smoothing) 平滑处理

    • Why it's important: If a word never appears in the training data for a certain class, its probability will be zero, causing the entire posterior probability for that class to become zero. 过度拟合,0会破坏朴素贝叶斯算法

    • Laplace smoothing: Adds a small constant (usually 1) to every count, ensuring no probability is ever zero.

      • Laplace smoothing estimate (see the code sketch after this list): \(P(W \mid C) = \frac{count(W) + \alpha}{n + \alpha(V+1)}\), where the \(V+1\) reserves probability mass for unseen (UNK) words

      • \(n\) = number of words in our Class C training data; \(count(W)\) = number of times W appeared in Class C training data; \(\alpha\) = a constant positive number; \(V\) = number of word types seen in training data

    • Deleted estimation 删除估计: A cross-validation technique used to find optimal smoothing parameters.

      • For each observed count r, we compute a corrected count \(Corr(r)\). Suppose \(W_1,...,W_n\) are the words that occurred r times in the first half of the dataset. For each word \(W_k\) in this set, find its count \(C(W_k)\) in the second half of the dataset. Averaging these counts gives the corrected count: \(Corr(r) = \frac{1}{n}\sum_{k=1}^{n} C(W_k)\)
      • \(Corr(r)\) predicts the future count of a word that appeared r times in the training data.
      • Deleted estimation has been shown to be more accurate than Laplace smoothing.
    • N-gram smoothing: Refers to more advanced techniques (like Good-Turing, Kneser-Ney) used for n-gram models to handle unseen n-grams by redistributing probability mass from seen n-grams. The high-level idea is to "borrow" probability from more frequent events to assign to rare or unseen events.

      • For a unigram model we need to estimate n probabilities, while for a bigram model we need to estimate \(n^2\) probabilities, yet we still have only the same m words of training data.
      • Idea 1: If we haven't seen an ngram, guess its probability from the probabilites of its prefix (e.g. "the angry") and the last word ("armadillo").
      • Idea 2: Guess that an unseen word is more likely in contexts where we've seen many different words.
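
A minimal sketch tying the pieces above together: Laplace-smoothed estimates combined with log probabilities to avoid underflow. The toy training documents, class labels, and α value are made-up assumptions for illustration, not course data:

```python
import math
from collections import Counter

# Toy training data: (document tokens, class label). Purely illustrative.
train = [
    ("win money now".split(), "spam"),
    ("meeting schedule today".split(), "ham"),
    ("win a prize now".split(), "spam"),
    ("project meeting notes".split(), "ham"),
]
alpha = 1.0                                   # Laplace smoothing constant
vocab = {w for doc, _ in train for w in doc}  # word types seen in training
V = len(vocab)

counts = {}    # class -> Counter of word occurrences
n_words = {}   # class -> total number of words in that class
n_docs = Counter(label for _, label in train)
for doc, label in train:
    counts.setdefault(label, Counter()).update(doc)
    n_words[label] = n_words.get(label, 0) + len(doc)

def log_posterior(doc, c):
    # log P(c) + sum_i log P(w_i | c), with Laplace smoothing (V+1 leaves room for unseen words)
    lp = math.log(n_docs[c] / sum(n_docs.values()))
    for w in doc:
        lp += math.log((counts[c][w] + alpha) / (n_words[c] + alpha * (V + 1)))
    return lp

doc = "win a meeting now".split()
print(max(n_docs, key=lambda c: log_posterior(doc, c)))   # argmax over classes -> 'spam'
```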

Search

Search is the process of finding a sequence of actions (a path) to get from a starting point to a goal.

Example tasks

  • Roadmap Search: Finding the best route from Champaign to Chicago on Google Maps.
  • Maze Search: Finding a path from the entrance to the exit of a maze.
  • 8-Puzzle: Sliding numbered tiles in a 3x3 grid to get them into a sorted order. Each move (sliding a tile) creates a new state.
  • Edit Distance: Finding the minimum number of edits (insert, delete, substitute) to change one word into another, like "cat" to "car."
  • Speech Recognition: The AI toy from the article has to "search" through a massive library of sounds, words, and phrases to find the most probable interpretation of what a child said.
  • Missionaries and Cannibals puzzle: A classic logic puzzle. You have 3 missionaries and 3 cannibals who need to cross a river in a boat that holds 2 people. If cannibals ever outnumber missionaries on either bank, they get eaten. The goal is to find a sequence of boat trips that gets everyone across safely.
  • Adventure Game: Each state in a search problem must contain all the information about the world that is relevant to defining and reaching the goal.

Basic search methods

  • State Space: The set of all possible configurations in the problem (e.g., every possible arrangement of tiles in an 8-puzzle, every possible location in a maze).
  • Initial State: Where you begin.
  • Actions: The set of possible moves you can make from a given state (e.g., slide tile up, down, left, right).
  • Transition Model: The rule that tells you what state you get to if you perform an action in another state. Result(state, action) -> new_state
  • Goal State(s): The state(s) you want to reach. 目标状态
  • Path Cost: A function that assigns a cost to a path (e.g., distance in miles, number of moves). In uniform cost search, we want the path with the lowest cost. In other searches, we want the shortest path (fewest steps).

Outline for search algorithms

Most search algorithms follow this general pattern:

  1. Initialize a frontier (a data structure like a queue or stack) with the initial state.
  2. Initialize a visited states set (or table) to keep track of where you've already been, to avoid loops.
  3. Loop until the frontier is empty or the goal is found:
    1. Pop a node from the frontier.
    2. If this node is the goal, you're done! You can reconstruct the path back using backpointers (pointers from each node to the node that discovered it).
    3. Mark the current node as visited.
    4. Expand the node: find all its neighbors (the states you can reach with one action).
    5. For each neighbor, if it's not in the visited set and not already in the frontier, add it to the frontier.
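
A minimal BFS instance of this outline (the tiny graph and state names are made up); the frontier is a FIFO queue and the backpointer table doubles as the visited set:

```python
from collections import deque

def bfs(start, goal, neighbors):
    """Return a start->goal path as a list of states, or None if no path exists."""
    frontier = deque([start])
    backpointer = {start: None}           # also serves as the visited set
    while frontier:
        state = frontier.popleft()
        if state == goal:                 # reconstruct the path via backpointers
            path = []
            while state is not None:
                path.append(state)
                state = backpointer[state]
            return path[::-1]
        for nxt in neighbors(state):
            if nxt not in backpointer:    # not visited and not already in the frontier
                backpointer[nxt] = state
                frontier.append(nxt)
    return None

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs("A", "D", lambda s: graph[s]))  # ['A', 'B', 'D']
```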

State graph representations 状态图

A state graph has the following key parts:

  • states (graph nodes) 状态(图节点)
  • actions (graph edges, with costs)
  • start state 起始状态
  • goal states (explicit list, or a goal condition to test) 目标状态(明确列表,或要测试的目标条件)

Key Search Algorithms

  • Breadth-First Search (BFS): Explores level by level. It finds the shortest path in terms of the number of steps. Starting from a node, it first visits all of that node's neighbors; only after all of them have been visited does it move on to their neighbors.
  • Uniform-Cost Search (UCS): Explores by path cost. It's like BFS but always expands the node with the lowest total path cost so far. It's guaranteed to find the cheapest path. Starting from the start state, we repeatedly pick, among all unvisited states adjacent to already-visited states, the one with the lowest path cost (note that we do not expand past a goal state). Even after reaching one goal we may keep searching other possible paths if there are multiple goals. A priority queue is maintained to choose the next lowest-cost state adjacent to the visited set.
  • Depth-First Search (DFS): Dives as deep as it can down one path before backtracking. It's fast and memory-efficient but can get stuck in infinite loops and doesn't guarantee the shortest or cheapest path. When we move to an adjacent vertex, we completely explore everything reachable through it before backtracking, much like a depth-first tree traversal that finishes the left subtree before moving to the right subtree. Unlike a tree, a graph may contain cycles (a node may be reachable more than once), so we keep a boolean visited array to avoid processing a node twice.
  • Iterative Deepening Search (IDS): A hybrid. It does a DFS with a limited depth (e.g., depth 1, then depth 2, etc.). It combines the optimality of BFS with the memory efficiency of DFS: starting from an initial bound, it calls a depth-limited DFS for increasing depth limits, so it essentially performs DFS in a BFS-like order, getting DFS's space efficiency together with BFS's fast discovery of nodes close to the root.

A* Search

A* (pronounced "A-star") is the go-to search algorithm. It's a smarter version of Uniform Cost Search, and one of the best and most popular techniques for path finding and graph traversal.

A* balances the cost to get to a node with an estimate of the cost to get from that node to the goal. It evaluates nodes using the formula:

\(f(n)=g(n)+h(n)\)

  • \(g(n)\) = the actual cost of the path from the start node to node n. 实际成本
  • \(h(n)\) = the heuristic (estimated) cost from node n to the goal. 启发式成本
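
A minimal A* sketch on a small grid, with the frontier kept in a priority queue ordered by \(f(n)=g(n)+h(n)\) and Manhattan distance as the heuristic. The 5x5 grid, uniform step cost of 1, and goal location are assumptions for illustration:

```python
import heapq
from itertools import count

def astar(start, goal, neighbors, h):
    """neighbors(s) yields (next_state, step_cost) pairs; h(s) estimates the cost from s to the goal."""
    tie = count()                                          # tie-breaker so the heap never compares states
    frontier = [(h(start), 0, next(tie), start, None)]     # entries: (f = g + h, g, tie, state, parent)
    best_g, parent = {}, {}
    while frontier:
        f, g, _, s, par = heapq.heappop(frontier)
        if s in best_g:                                    # already expanded via a path at least as cheap
            continue
        best_g[s], parent[s] = g, par
        if s == goal:                                      # reconstruct the path with backpointers
            path = []
            while s is not None:
                path.append(s)
                s = parent[s]
            return path[::-1], g
        for nxt, cost in neighbors(s):
            if nxt not in best_g:
                heapq.heappush(frontier, (g + cost + h(nxt), g + cost, next(tie), nxt, s))
    return None, float("inf")

goal = (3, 4)
manhattan = lambda s: abs(s[0] - goal[0]) + abs(s[1] - goal[1])
grid_moves = lambda s: [((s[0] + dx, s[1] + dy), 1)
                        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                        if 0 <= s[0] + dx <= 4 and 0 <= s[1] + dy <= 4]
print(astar((0, 0), goal, grid_moves, manhattan))          # cheapest path, cost 7
```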

Heuristics

A heuristic is a "rule of thumb" or an educated guess.

  • Manhattan vs. Straight-Line Distance:
    • Straight-Line (Euclidean) Distance 欧几里得距离: The "as the crow flies" distance. This is a good heuristic for finding a driving route.
    • Manhattan Distance 曼哈顿距离: The distance if you can only move on a grid (like the streets of Manhattan). You sum the absolute differences of the x and y coordinates: \(∣x_1−x_2∣+∣y_1−y_2∣\). This is a perfect heuristic for the 8-puzzle.
  • Admissible and Consistent Heuristics:
    • Admissible 可接受: An admissible heuristic never overestimates the true cost. It's always optimistic. h(n) <= true_cost. Straight-line distance is admissible because you can never get somewhere faster than a straight line. A* is guaranteed to find the optimal path if its heuristic is admissible.
    • Consistent 一致性: A stronger property. A heuristic is consistent if, for any node n and its successor n′, the estimated cost from n is less than or equal to the cost of stepping to n′ plus the estimated cost from n′. This basically means the heuristic gets more accurate as you get closer to the goal. h(n) <= cost(n, n') + h(n'). All consistent heuristics are also admissible. Consistency is useful because it guarantees that when A* selects a node to expand, it has already found the optimal path to that node.
  • What is a good heuristic? A good heuristic is as close to the true cost as possible without ever going over. A heuristic \(h_2(n)\) is better than \(h_1(n)\) if \(h_2(n)>h_1(n)\) for all nodes, because it provides more information and helps A* explore more directly towards the goal, reducing the number of nodes it needs to check.

[!NOTE]

In practice, admissible heuristics are typically consistent. For example, the straight-line distance is consistent because of the triangle inequality. Uniform cost search is A∗ with a heuristic function that always returns zero. That heuristic is consistent. However, if you are writing general-purpose search code, it's best to use a method that will still work if the heuristic isn't consistent, such as pushing a duplicate copy of the state onto the queue.

Admissibility guarantees that the output solution is optimal. Here's a sketch of the proof: 可接纳性保证了输出解是最优的。以下是证明的概要:

Search stops when the goal state becomes the top option in the frontier (not when the goal is first discovered). Suppose we have found a goal with path P and cost C. Consider a partial path X in the frontier. Its estimated cost (heuristic cost) must be ≥C. Since our heuristic is admissible, the true cost of extending X to reach the goal must also be ≥C. So extending X can't give us a better path than P. 当目标状态成为边界中的最优选项时(而不是目标首次被发现时),搜索停止。假设我们已经找到了路径为 P、成本为 C 的目标。考虑边界中的部分路径 X。它的估计成本h必定为 ≥C 。由于我们的启发式方法可行,因此扩展 X 以达到目标的真实成本也必定为 ≥C 。因此,扩展 X 不可能提供比 P 更好的路径。

  • Non-admissible heuristics: Sometimes you don't need the best path, you just need a good enough path, fast. Using a non-admissible (overestimating) heuristic can make A* run much faster, but it forfeits the guarantee of finding the optimal path.
  • Beam Search 集束搜索: A memory-optimized version of search. At each level of the search tree, it only keeps the best k nodes (the "beam width"). It's fast but can easily miss the optimal solution if it falls outside the beam.

Details for Each Algorithm

| Algorithm | Frontier Management | Loop Prevention | When a shorter path is found... | Termination | When to use |
|---|---|---|---|---|---|
| BFS | FIFO Queue. Explores the oldest nodes first. | Uses a visited set. Won't re-add a visited node. | The first time a node is found is guaranteed to be the shortest path to it. | Finds goal or frontier is empty. | When you need the shortest path (i.e., the fewest steps), and step costs are uniform. |
| DFS | LIFO Stack (or recursion). Explores the newest nodes first. | Uses a visited set. | Can find a suboptimal path first. Needs modification to update. | Finds goal or frontier is empty. | When memory is tight, you think the solution is deep, and optimality isn't required. |
| UCS | Priority Queue ordered by path cost \(g(n)\). | Uses a visited set. | If it finds a shorter path to a node already in the frontier, it updates the path cost of that node. | Finds goal or frontier is empty. | When you need the cheapest path, and step costs are not uniform. |
| A* | Priority Queue ordered by \(f(n)=g(n)+h(n)\). | Uses a visited set. | Same as UCS: if it finds a shorter path to a node in the frontier, it updates its cost. | Finds goal or frontier is empty. | When you need the cheapest path and have a good heuristic to speed things up. It's usually the best choice. |

Iterative deepening A∗: Like iterative deepening, but using A∗ rather than DFS. If the current bound is k, the search stops when the estimated total cost f(s) is larger than k. 类似于迭代加深,但使用 A∗ 而不是 DFS。如果当前边界为 k,则当估计总成本 f(s) 大于 k 时,搜索停止。

Pattern databases 模式数据库: A pattern database is a database of partial problems, together with the cost to solve the partial problem. The true solution cost for a search state must be at least as high as the cost for any pattern it contains. 模式数据库是包含部分问题及其求解成本的数据库。搜索状态的真实求解成本必须至少与其包含的任何模式的求解成本相同。

Backwards and bidirectional search 向后搜索和双向搜索 : Basically what they sound like. Backwards search is like forwards search except that you start from the goal. In bidirectional search, you run two parallel searches, starting from start and goal states, in hopes that they will connect at a shared state. These approaches are useful only in very specific situations. 顾名思义,向后搜索类似于向前搜索,只不过是从目标开始。双向搜索则运行两个并行搜索,分别从起始状态和目标状态开始,希望它们能够在一个共享状态下连接起来。这些方法仅在非常特殊的情况下有用。

  • Backwards search makes sense when you have relatively few goal states, known at the start. This often fails, e.g. chess has a huge number of winning board configurations. Also, actions must be easy to run backwards (often true). 当目标状态相对较少且一开始已知时,反向搜索是有意义的。但这通常会失败,例如国际象棋中存在大量的制胜棋盘配置。此外,动作必须易于反向执行(通常如此)。

  • Example: James Bond is driving his Lexus in central Boston and wants to get to Dr. Evil's lair in Hawley, MA (population 337). There's a vast number of ways to get out of Boston but only one road of significant size going through Hawley. In that situation, working backwards from the goal might be much better than working forwards from the starting state. 例如:詹姆斯·邦德驾驶着他的雷克萨斯在波士顿市中心行驶,想要前往位于马萨诸塞州霍利(人口 337)的邪恶博士老巢。离开波士顿的方法有很多,但只有一条相当长的道路穿过霍利。在这种情况下,从目标位置倒推可能比从起始位置正推要好得多。

Robotics and Configuration Space

History

Shakey and Blocks world (1966-72)

  • Shakey was the first mobile robot to reason about its own actions. It lived in a controlled environment called "Blocks World," which was just a room with several large blocks. Shakey could be given commands like "push the block from Room A to Room B." To do this, it had to perceive its environment, break the problem down into smaller steps (e.g., find the block, move to the block, push it, navigate doorways), and execute them. It was a landmark project in robotics, planning, and computer vision.

  • Boston Dynamics(Founded 1992, prominent from 2005-present)

    • famous for creating incredibly agile and dynamic robots like BigDog, Atlas (the humanoid one that does parkour), and Spot (the dog-like one)
    • Legged robots (Marc Raibert and Boston Dynamics): early ones and newish humanoid one (audio is mechanical noises)
  • Google self-driving bike(April 1, 2016)
    • Fake: highlights the gap between public perception and actual AI capabilities.
  • Coning a self-driving car (2023-Present)
    • an adversarial attack on an AI system

Motion planning is a type of search problem, but for physical objects.

  • Configuration Space 配置空间 (C-Space): An abstract space where each point represents a complete specification of the robot's position and orientation. The goal is to transform a motion problem with a complex robot and obstacles into a path-finding problem for a single point in C-space.
    • Circular Robot: If a circular robot is moving among obstacles, its C-space is found by "growing" the obstacles by the robot's radius. The robot itself can then be treated as a point.
    • N-Link Arm N 连杆机械臂: For a robotic arm with N joints, the C-space is N-dimensional. For a 2-link arm, the configuration is defined by two angles (θ1,θ2), so its C-space can be visualized as a 2D square.
    • Rectangular Robot: If it can't rotate, its C-space is just like the circular robot's. If it can rotate, you add another dimension for its orientation, making the C-space 3D \((x,y,θ)\).

General Knowledge

  • Types of Robots: Arms (industrial robots), mobile carts (like in a warehouse), snakes (for pipe inspection), humanoids (like Atlas).
  • Links, Joints, Degrees of Freedom (DOF): Links are the rigid parts of a robot, and joints connect them. DOF 自由度 is the number of independent parameters you need to define the robot's configuration (e.g., a 3-link arm moving in a 2D plane has 3 DOF; an object moving in 3D has 6 degrees of freedom: (x,y,z) position and three ways to change its orientation (roll, pitch, yaw)).
  • Ideal Robot Path: The best path is not just short, but also safe (far from obstacles) and smooth (no jerky movements).
  • Graph-Based Planning: Methods to simplify the world into a graph that can be searched.
    • Visibility Graphs 可见性图表: Nodes are the start, goal, and corners of obstacles. Edges connect nodes that have a line of sight. Guarantees the shortest path.
      • It produces paths of minimal length. However, these paths aren't very safe because they skim the edges of obstacles. It's risky to get too close to obstacles.
    • Skeletonization 骨架化
      • Roadmap converts freespace into a skeleton that goes through the middle of regions. Paths via the skeleton stay far from obstacles, so they tend to be safe but perhaps overly long.
      • Generalized Voronoi diagrams places the skeleton along curves that are equidistant from two or more obstacles. The waypoints are the places where these curves intersect.
    • Cell Decompositions 单元分解: The free space is divided into simple cells. A path is found by searching for a sequence of adjacent cells.
    • Probabilistic Roadmaps 概率路线图 (PRM): Random points are scattered in the free space and connected if a simple path exists between them. Great for high-dimensional C-spaces.
  • Challenges:
    • Big Open Areas: Can be computationally expensive because there are so many possible paths.
    • Narrow Passages: These are "bottlenecks" that are hard for planners (especially random ones like PRM) to find.
    • High DOF: As the degrees of freedom increase, the volume of the C-space explodes (the "curse of dimensionality"), making it exponentially harder to search.

Natural Language

Natural Language and HMM History

This section covers key people, systems, and algorithms that were milestones in the history of Natural Language Processing (NLP) and related fields.

  • Julie Lovins (1968): An American linguist who developed one of the earliest stemming algorithms 词干提取算法, the Lovins Stemmer. Stemming is the process of reducing a word to its root or base form (e.g., "connecting," "connected" -> "connect").
  • SHRDLU (1970): An early NLP program created by Terry Winograd at MIT. It operated in a simulated "blocks world" and could understand and execute commands in natural English like "pick up the big red block and put it on the green one." It was a landmark example of "deep" understanding within a very limited domain.
  • Zork transcript (late 1970s): A classic text-based adventure game. It was notable for its relatively sophisticated text parser that could understand more complex commands than its contemporaries, making the game feel more interactive and intelligent.
  • Pattern Playback (1950s): A device developed at Haskins Laboratories that converted visual patterns of speech (spectrograms) back into sound 将声谱图转换回声音. It was crucial for discovering the acoustic cues that humans use to perceive speech sounds.
  • John Kelly and Louis Gerstman (1961): Researchers at Bell Labs who programmed an IBM 704 computer to synthesize speech 合成语音. They famously made it sing the song "Daisy Bell," which later inspired the iconic scene in 2001: A Space Odyssey.
  • Parsey McParseface (2016): An open-source English language parser 开源英语语言解释器 released by Google. It was noteworthy for its high accuracy, achieved using a deep learning model (SyntaxNet).
  • Baum-Welch algorithm (c. 1970): An algorithm used to train a Hidden Markov Model (HMM). It's an unsupervised learning algorithm, meaning it can determine the transition and emission probabilities of an HMM when you only have the observed sequences (e.g., words) but not the hidden states (e.g., POS tags). It's a specific instance of the Expectation-Maximization (EM) algorithm. 训练HMM的算法
  • Viterbi algorithm (1967): A dynamic programming algorithm used for decoding in an HMM. 在HMM中进行解码的动态规划算法. Given a sequence of observations (like words), it finds the single most likely sequence of hidden states (like POS tags) that could have produced them.
  • GPT-3 (2020): A massive language model developed by OpenAI. As a transformer-based model with 175 billion parameters, it demonstrated a remarkable ability to generate fluent, human-like text and perform a wide variety of NLP tasks with little to no specific training ("few-shot" learning).
  • ILEX system (late 1990s): An intelligent labeling system for a museum 博物馆的智能标签系统. It generated dynamic descriptions of exhibits (in this case, jewelry) that were tailored to the individual user, taking into account what they had already seen to provide context and comparisons.

Outline

The area of natural language and speech processing covers algorithms for three types of tasks:

  • convert text/speech into structured information usable by an AI reasoning system 将文本/语音转换为人工智能推理系统可用的结构化信息
  • generate fluent text/speech from structured information 从结构化信息生成流畅的文本/语音
  • translate directly between two types of text/speech (e.g. between two languages, simple types of question answering) 直接在两种类型的文本/语音之间进行翻译(例如两种语言之间、简单类型的问答)

Natural language processing has roughly four layers 自然语言处理大致有四层:

  • Semantics (meaning, dialog structure) 语义(含义、对话结构)
  • Mid-level processing (e.g. part of speech, parsing) 中级处理(例如词性、解析)
  • Low-level text processing (e.g. forming words, finding morphemes) 低级文本处理(例如,形成单词、查找词素)
  • Speech 演讲

Types of systems

  • Sample tasks 自然语言处理的任务: Common goals in NLP include:

    • Translation: Converting text from one language to another.
    • Question Answering: Providing a direct answer to a question posed in natural language.
    • Summarization: Creating a short, coherent summary of a longer text.
    • Sentiment Analysis: Determining the emotional tone (positive, negative, neutral) of a piece of text.
    • Information Extraction: Pulling structured information from unstructured text (e.g., finding dates, names, and locations in a news article).
  • Deep vs. shallow: This describes the level of linguistic understanding a system aims for.

    • Shallow Systems 浅层: Operate on surface-level text features. They use statistics, pattern matching (like regular expressions), and simple rules without building a full representation of meaning.
      • Examples: Keyword-based search engines, spam filters looking for specific phrases, most sentiment analysis tools, spelling correction.
    • Deep Systems 深层: Attempt to understand the syntactic and semantic structure(句法和语义结构) of the language. They build an internal representation of meaning to perform their task.
      • Examples: SHRDLU (in its limited world), advanced question-answering systems, modern LLMs like GPT-3.
  • Danger of seeming too fluent and the "uncanny valley" 恐怖谷效应: The uncanny valley is a phenomenon where a system that is almost perfectly human-like is perceived as eerie or unsettling. A chatbot that is 99% fluent but makes one bizarre logical error can be more disturbing than a simple, obviously robotic bot because it violates our expectations of a seemingly intelligent agent.

    恐怖谷效应是指一个几乎完美模仿人类的系统,却会让人感觉怪异或不安。一个聊天机器人即使能达到 99% 的流利程度,却犯了一个奇怪的逻辑错误,可能比一个简单、明显是机器人的机器人更令人不安,因为它违背了我们对看似智能的代理的预期。

Processing pipeline

A classic test is a circular translation: X to Y and then back to X.

This is a typical sequence of steps to process raw text.

  1. Speech: Raw audio is processed to understand its components.
    • Waveform 波形: The raw audio signal, representing air pressure changes over time.
    • Spectrogram 声谱图: A visual representation showing the intensity of different frequencies in the sound over time. It's how we "see" sound.
    • Formants 共振峰: The resonant frequency peaks in the spectrogram, which are characteristic of vowel 元音 sounds. They are the key acoustic features that let us distinguish 'ee' from 'oo'.
    • Phones 音素: The smallest distinct units of sound in a language (e.g., the sounds [b], [æ], [t] in "bat").
    • Synthesis 合成: Creating artificial speech. Formant synthesis generates sounds electronically based on formant frequencies, while concatenative synthesis stitches together pre-recorded snippets of human speech.
  2. Dividing into words (Tokenization): Splitting a stream of text into individual words or tokens.
  3. Normalizing words (Stemming/Lemmatization): Reducing words to their base form (e.g., "running" -> "run").
  4. Finding morphemes 词素: Breaking words into their smallest meaningful units (e.g., un-systematic-ally).
    • Morpheme: smallest meaningful chunk
  5. Part of speech (POS) tagging: Assigning a grammatical category (noun, verb, etc.) to each word.
  6. Parsing: Analyzing the grammatical structure of a sentence.
    • Constituency 选区 vs. dependency trees: Two ways to represent sentence structure.
      • A constituency tree breaks a sentence down into nested phrases (Noun Phrase, Verb Phrase).
      • A dependency tree shows how words are related to each other through directed links (e.g., an adjective modifying a noun, a subject related to a verb).
    • Shallow parsers 浅层解析器 (Chunking): These parsers don't build a full, detailed tree. Instead, they just identify the main, non-nested phrases (chunks) like noun phrases and verb phrases.
    • Unlexicalized, class-based, Lexicalized: These refer to the information used by a statistical parser. Unlexicalized parsers use only POS tags. Class-based parsers use semantic classes (e.g., "city," "person"). Lexicalized parsers use rules tied to specific words, making them more powerful but data-hungry.
  7. Semantics: Understanding the meaning of the text.
    • Shallow semantics 浅层语义: Extracting limited, specific meaning without full understanding.
      • word classes 词类
      • word sense disambiguation 词义消歧
      • semantic role labelling: we know that a noun phrase X relates to a verb Y. Is X the subject/actor? the object that the action was done to? a tool used to help with the action?
      • co-reference resolution (see below)
    • Sentiment analysis: Identifying positive/negative opinions.
    • Semantic Role Labeling (SRL) 语义角色标记: Figuring out "who did what to whom" (e.g., identifying the agent, patient, and instrument of an action).
    • Co-reference resolution 指代相同: Determining which words refer to the same entity (e.g., in "Sue said she was tired," resolving "she" to "Sue").
    • Dialog coherency: Analyzing how well sentences in a conversation logically follow one another.

POS Tagging

Part-of-Speech tagging 词性标注 is a fundamental NLP task.

  • Common POS tags 标签集: You should be familiar with the basics like Noun (NN), Proper Noun (NNP), Verb (VB), Adjective (JJ), Determiner (DT, e.g., "the", "a"), and Preposition (IN).
  • Size of typical POS tag sets: Tag sets vary. Simple ones might have ~12-15 universal tags, while more detailed ones like the Penn Treebank tag set (very common in English NLP) have around 45 tags.
  • Single word with multiple possible tags: Many words are ambiguous. For example, "book" can be a noun ("read the book") or a verb ("book a flight"). The context determines the correct tag.
  • Baseline tagging algorithm 基线算法: The simplest possible method, with an accuracy of about 91%. For each word in the test set, assign it the tag that it was most frequently associated with in the training data. For a word that was never seen in training, you might just default to the most common tag overall (like Noun).
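
A minimal sketch of this baseline tagger; the tiny tagged corpus and the fallback tag it happens to pick are made-up assumptions:

```python
from collections import Counter, defaultdict

# Toy tagged training data: (word, tag) pairs. Purely illustrative.
train = [("the", "DT"), ("book", "NN"), ("is", "VB"), ("on", "IN"),
         ("the", "DT"), ("table", "NN"), ("book", "VB"), ("a", "DT"), ("flight", "NN")]

tag_counts = defaultdict(Counter)
for word, tag in train:
    tag_counts[word][tag] += 1

overall_most_common = Counter(tag for _, tag in train).most_common(1)[0][0]

def baseline_tag(word):
    # Most frequent tag for this word in training; fall back to the overall
    # most common tag for unseen words (a real tagger often defaults to NN).
    if word in tag_counts:
        return tag_counts[word].most_common(1)[0][0]
    return overall_most_common

print([baseline_tag(w) for w in ["the", "book", "unicorn"]])   # e.g. ['DT', 'NN', 'DT']
```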

HMMs

Hidden Markov Models(隐马尔可夫模型) are a statistical tool perfect for sequence labeling tasks like POS tagging. 类似贝叶斯网络

\(W\) is the input word sequence and \(T\) is a candidate tag sequence.

Goal: maximize \(P(T|W)\)

  • Markov assumptions: An HMM is based on two simplifying assumptions:

    1. Transition Assumption 转移: The probability of the current state (tag) depends only on the previous state (tag). \(P(t_i∣t_{i−1},t_{i−2},...,t_1)≈P(t_i∣t_{i−1})\).
    2. Emission Assumption 排放: The probability of the current observation (word) depends only on the current state (tag). \(P(w_i∣w_1,t_1,...,w_{i−1},t_{i−1},t_i)≈P(w_i∣t_i)\).
  • Graphical model picture: An HMM has a sequence of hidden states (tags) on top and a sequence of observed states (words) on the bottom. There are transition arrows between adjacent hidden states and emission arrows from each hidden state down to its corresponding observed state.

  • Component probabilities:

    • Initial: \(P(t_1)\) - The probability that a sentence starts with a certain tag.
    • Emission: \(P(w_i∣t_i)\) - The probability of seeing word \(w_i\) given that its tag is \(t_i\).
    • Transition: \(P(t_i∣t_{i−1})\) - The probability that tag \(t_i\) will follow tag \(t_{i−1}\).
  • Equations for computing probability: The probability of a given word sequence W and a given tag sequence T is calculated by multiplying all the relevant probabilities together: $$ P(W\mid T)=\prod_{i=1}^{n}P(w_i\mid t_i) \qquad P(T)=P(t_1\mid START)\times\prod_{k=2}^{n}P(t_k\mid t_{k-1}) $$ where \(P(t_1\mid START)\) is the probability of tag \(t_1\) at the start of a sentence, so that \(P(W,T)=P(W\mid T)\,P(T)\).

  • Bayes: \(P(T \mid W) = P(W\mid T)\,P(T)/P(W)\), so $$ P(T\mid W)\propto P(W\mid T)\times P(T) = \prod_{i=1}^{n}P(w_i\mid t_i)\times P(t_1\mid START)\times\prod_{k=2}^{n}P(t_k\mid t_{k-1}) $$

  • Tag transition as a finite-state automaton(FSA) 有限状态机: You can view the tags as states in an automaton. 每条弧上的数字表发生转换发生的概率 The transition probabilities \(P(t_i∣t_{i−1})\) are the probabilities on the arcs connecting the states.

  • When does an HMM need smoothing 光滑? Smoothing is required whenever you encounter a zero-count event in your training data. This happens for:

    • Emission probabilities: A word that appeared in training is seen with a new, valid tag in the test data. \(P_E(w_i|t_i)\) 排放概率
    • Transition probabilities: A sequence of two tags that is possible but never occurred in the training data. \(P_T(t_k|t_{k−1})\) 转移概率
      • Without smoothing (like Add-1 smoothing), these events would be assigned a probability of 0, making the entire sequence probability 0. 防止概率为0
    • It's essential to smooth, i.e. reserve some of each tag's probability for new words and familiar words seen with a new tag
  • Estimating probability of unseen phenomena:

    • Unseen words: Words that were not in the training data at all. A common technique is to replace all rare words (e.g., hapax words that appear only once) in the training data with a special <UNK> (unknown) token. You can then learn emission probabilities for <UNK>, which you apply to any new word you encounter.
    • Open-class 实词 vs. closed-class 功能词 words: This helps with unknown words. Closed-class words (like determiners, prepositions) form a fixed set. An unknown word is very unlikely to be one of these. Open-class words (nouns, verbs, adjectives) frequently get new members. Therefore, an unknown word is most likely an open-class word, often a proper noun.

Viterbi (trellis 棚架) decoding algorithm 维特比算法

This is the workhorse algorithm for finding the best tag sequence in an HMM.

The basic data structure is an n by m array (v), where n is the length of the input word sequence and m is the number of different (unique) tags. Each cell (k,t) in the array contains the probability v(k,t) of the best sequence of tags for w1…wk that ends with tag t. We also store b(k,t), which is the previous tag in that best sequence. This data structure is called a "trellis."

基本数据结构是一个 \(n \times m\) 的数组 (v),其中 n 是输入词序列的长度,m 是不同(唯一)标签的数量。数组中的每个单元格 (k,t) 包含 w1…wk 中以标签 t 结尾的最佳标签序列的概率 v(k,t)。我们还存储了 b(k,t),它是该最佳序列中的前一个标签。这种数据结构称为“网格”。

A path ending in some specific tag at time k might continue to any one of the possible tags at time k+1. So each cell in column k is connected to all cells in column k+1. The name "trellis" comes from this dense pattern of connections between each timestep and the next. 在时间 k 处以某个特定标签结尾的路径,可能会在时间 k+1 处延伸至任意一个可能的标签。因此,第 k 列中的每个单元格都与第 k+1 列中的所有单元格相连。“网格”这一名称源于每个时间步与下一个时间步之间这种密集的连接模式。

  • Building the trellis: The trellis is a chart where columns represent the words in the sentence and rows represent all possible tags.

    1. Initialization (First Column): For the first word, each cell (tag) is filled with \(v(1,t) = P_S(t)\times P_E(word_1\mid t)\).

    2. Recursion (Subsequent Columns): To calculate the value for a cell at word i and tag j, you look at all the cells in the previous column (i-1). For each previous tag k, you calculate the probability of the path coming from it: prev_prob(k) * transition_prob(k -> j) * emission_prob(word_i | tag_j).

      For each tag \(tagB\), $$ v(k+1,tagB)=\max_{tagA}\; v(k,tagA)\times P_T(tagB\mid tagA)\times P_E(w_{k+1}\mid tagB) \qquad b(k+1,tagB)=\operatorname{argmax}_{tagA}\; v(k,tagA)\times P_T(tagB\mid tagA)\times P_E(w_{k+1}\mid tagB) $$ That is, we compute \(v(k,tagA)\times P_T(tagB\mid tagA)\times P_E(w_{k+1}\mid tagB)\) for every possible tag \(tagA\). The maximum is stored in trellis cell \(v(k+1,tagB)\), and the corresponding \(tagA\) is stored in \(b(k+1,tagB)\). (See the code sketch at the end of this section.)

    3. The value for the cell (i, j) is the maximum of these calculated path probabilities. You also store a backpointer indicating which previous cell (k) gave you this maximum value. Once the whole trellis is filled, pick the tag with the best (highest) v value in the last column (time = n) and trace backward from it using the values in b to produce the output tag sequence.

  • Extracting the output: After filling the whole trellis, find the cell with the highest probability in the very last column. This is the end of the most likely path. Then, follow the backpointers from that cell backward to the beginning of the sentence to reconstruct the most probable sequence of tags.

  • Big-O running time: If you have T words and N possible tags, the complexity is \(O(T×N^2)\). For each word (T), you calculate values for each tag (N), and each of those calculations involves looking at all previous tags (N).

  • Extensions:

    • Bigram: The standard HMM uses a bigram model for tags \(P(t_i∣t_{i−1})\). You can extend this to a trigram model \(P(t_i∣t_{i−2},t_{i−1})\) for more accuracy, but this squares the number of states (from N to \(N^2\)) and increases runtime to \(O(T×N^3)\).
    • Guessing from word form: To handle unknown words, you can supplement the HMM with a guesser based on morphology 形态学. If an unknown word is capitalized, it's likely a Proper Noun. If it ends in "-ly," it's likely an Adverb. This significantly improves performance on real-world text.
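
A minimal Viterbi sketch matching the trellis description above, working in log space to avoid underflow. The toy tag set and the initial/transition/emission tables are made up, and smoothing is omitted for brevity:

```python
import math

def viterbi(words, tags, p_start, p_trans, p_emit):
    """Return the most likely tag sequence. Probabilities are plain dicts:
    p_start[t], p_trans[(prev_tag, t)], p_emit[(t, word)]."""
    n = len(words)
    v = [{} for _ in range(n)]   # v[k][t] = log prob of best tag sequence for words[0..k] ending in t
    b = [{} for _ in range(n)]   # b[k][t] = previous tag on that best sequence
    for t in tags:
        v[0][t] = math.log(p_start[t]) + math.log(p_emit[(t, words[0])])
    for k in range(1, n):
        for t in tags:
            best_prev = max(tags, key=lambda s: v[k - 1][s] + math.log(p_trans[(s, t)]))
            v[k][t] = (v[k - 1][best_prev] + math.log(p_trans[(best_prev, t)])
                       + math.log(p_emit[(t, words[k])]))
            b[k][t] = best_prev
    # Trace backward from the best tag in the last column.
    best = max(tags, key=lambda t: v[n - 1][t])
    path = [best]
    for k in range(n - 1, 0, -1):
        path.append(b[k][path[-1]])
    return path[::-1]

tags = ["DT", "NN", "VB"]
p_start = {"DT": 0.6, "NN": 0.2, "VB": 0.2}
p_trans = {("DT", "NN"): 0.7, ("DT", "VB"): 0.2, ("DT", "DT"): 0.1,
           ("NN", "VB"): 0.6, ("NN", "NN"): 0.3, ("NN", "DT"): 0.1,
           ("VB", "DT"): 0.5, ("VB", "NN"): 0.3, ("VB", "VB"): 0.2}
p_emit = {("DT", "the"): 0.8, ("DT", "dog"): 0.1, ("DT", "runs"): 0.1,
          ("NN", "the"): 0.1, ("NN", "dog"): 0.8, ("NN", "runs"): 0.1,
          ("VB", "the"): 0.1, ("VB", "dog"): 0.1, ("VB", "runs"): 0.8}
print(viterbi(["the", "dog", "runs"], tags, p_start, p_trans, p_emit))  # ['DT', 'NN', 'VB']
```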

Computer Vision

Applications of Computer Vision

Computer vision can be used for several basic types of tasks:

  • Obstacle avoidance 避障
  • Classification 分类
  • 3D reconstruction
  • Image generation
  • Making predictions

Image formation

  • Image formation 图像生成(pinhole camera 针孔相机模型, real lenses真实镜头, human eye)
  • Digitization (computer, human, including color)
  • Edge detection, segmentation

Relating 2D to 3D: why might an object look different in two pictures?

  1. Viewpoint: The camera's position and angle relative to the object.

  2. Illumination: The direction, color, and intensity of light.

  3. Occlusion: The object being partially hidden by another object.

  4. Scale: The object's distance from the camera.

Classification:

  • Identifying/naming objects in a picture
  • Localizing/registering objects within a picture
  • Visual question answering, captioning, semantic role labelling for a picture

Reconstructing 3D geometry

  • why is it useful?

    • it allows a computer system to understand the true shape, size, and layout of the world from flat 2D images. This understanding is critical for interacting with or analyzing a 3D environment.
  • from multiple 2D views of a scene

  • from a single picture

Other tasks

  • Image generation
  • Predicting the future

Classifiers

Important Names & Datasets

Know brief facts about the following people and systems/models:

  • William Labov: A linguist语言学家, not a computer scientist. His work showed that language variation isn't random but is correlated with social factors. This is relevant to AI because it highlights the importance of context and bias in language data.
  • The BERT language model(Bidirectional Encoder Representations from Transformers) 基于 Transformer 的双向编码器表示: A powerful language model that reads an entire sentence at once (bidirectionally) to understand context. It's a foundational model for many modern NLP tasks.
  • CIFAR-10 dataset: A classic computer vision dataset consisting of 60,000 small (32x32 pixel) color images in 10 classes (e.g., airplane, dog, truck). It's commonly used to benchmark new classification algorithms.

General design

  • Uses for classification: labelling objects, making decisions
  • Multi-layer systems
  • What can we tune? 调整
    • parameters 参数 (e.g. weights)
      • values learned directly from the training set (e.g., probabilities in Bayes nets, weights on the elements in neural nets) 模型直接从训练数据中学习的值。可以将它们视为模型的内部知识。例如神经网络中的权重或贝叶斯网络中的概率
    • hyper-parameters 超参数 (e.g. tuning constants)
      • constants adjusted using the development data (e.g. the Laplace smoothing constant in naive Bayes) 在训练开始前设置的值,用于配置学习过程。您可以使用开发(验证)数据来调整它们。例如,学习率、k-NN 中的“k”值或拉普拉斯平滑常数
    • design, network topology
  • Challenges with determining the correct answer
    • How specific/general should the class label be?
    • unfamiliar objects, unfamiliar words
    • context may affect best label to choose
    • deciding what's important in complex scenes, extended sentences
  • Data for supervised training 监督学习
    • "gold" answers
    • Noise in "correct" answers/annotation
    • Annotators with limited training
    • Data scraped off the web
    • Data available only for final output of system
  • Workarounds of limited training data
    • Re-purposing layers 重新利用层(迁移学习) trained for another purpose
    • Creating training pairs by removing information
    • Self-supervised, semi-supervised, unsupervised methods
  • Batch vs. incremental training

K-nn and Decision Trees

Specific techniques

k-nearest neighbors

(how it works, what happens if you change k)

  • k-Nearest Neighbors (k-NN): A simple classification algorithm. To classify a new data point, it looks at the 'k' closest points in the training data and takes a majority vote of their labels.

    • Effect of k:
      • Small k: The model is sensitive to noise and can create complex, "wiggly" decision boundaries. Prone to overfitting. 容易过拟合
      • Large k: The model is more robust to noise and creates smoother boundaries, but may misclassify points in less dense regions. Prone to underfitting. 容易欠拟合
  • L1 vs. L2 norm(measure distance)

    • L1 Norm (Manhattan Distance 曼哈顿距离): \(d(p, q) = \sum_{i=1}^{n} |p_i - q_i|\). Imagine moving only along grid lines in a city.
    • L2 Norm (Euclidean Distance 欧几里得距离): \(d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}\). A straight line, "as the crow flies."
  • Decision trees 决策树
    • A tree-like model where each internal node represents a "test" on an attribute (e.g., "is color blue?"), and each leaf node represents a class label. 理想情况下,每个叶节点只包含同一类型的对象
  • Random forests 随机森林
    • A random forest is a set of shallow trees (bounded depth) 深度有限的浅层树 that vote on the best label to return. The set of shallow trees is created by choosing variables and possible split locations randomly.
  • Entropy 熵:
    • Definition: A measure of impurity or uncertainty in a set of examples. 衡量一组样本中不纯度或不确定性的指标
      • A good way to measure the homogeneity of a collection of discrete values, such as the values in the decision tree pools. 熵是衡量离散值集合(例如决策树池中的值)同质性的一个非常好的方法
    • How it relates to evaluating possible splits in a decision tree: In a decision tree, you want to choose the split (the question to ask) that causes the greatest reduction in entropy, moving you closer to pure, single-class subsets.
    • 最优压缩方案:Suppose that we have a (finite) set of letter types S. Suppose that M is a finite list of letters from S, with each letter type (e.g. "e") possibly included more than once in M. Each letter type c has a probability \(P(c)\) (i.e., its fraction of the members of M). An optimal compression scheme would require \(\log(\frac1{P(c)})\) bits in each code. (All logs in this discussion are base 2.)
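
A minimal sketch of this entropy computation (the letter pools are made-up examples); in a decision tree we would prefer the split that most reduces the size-weighted entropy of the resulting pools:

```python
import math
from collections import Counter

def entropy(values):
    """H = sum over letter types c of P(c) * log2(1 / P(c)),
    the average bits per item under an optimal compression scheme."""
    counts = Counter(values)
    total = len(values)
    return sum((n / total) * math.log2(total / n) for n in counts.values())

M = list("eeeeettaoi")            # a small mixed pool of letters
print(entropy(M))                 # mixed pool: fairly high entropy
print(entropy(["a"] * 10))        # pure pool: entropy 0.0
```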

Perceptrons 感知机

  • "Linearly separable" 线性可分

    • A dataset is linearly separable if we can draw a line (or, in higher dimensions, a hyperplane) that divides one class from the other. The lefthand dataset below is linearly separable; the righthand one is not. 如果我们可以画一条线(或者在高维空间中,画一个超平面)将一个类与另一个类分开,则该数据集是线性可分的。
  • Basics of how perceptrons work

    • A perceptron is a simple model of a single neuron. It takes multiple inputs, multiplies each by a weight, sums them up, and passes the result through an activation function (like a simple step function) to produce an output (e.g., 0 or 1). The bias term can be treated as a weight for an input that is always 1.
    • 感知器是单个神经元的简单模型。它接受多个输入,将每个输入乘以权重,然后相加,最后将结果传入激活函数(例如简单的阶跃函数)以产生输出(例如 0 或 1)。 偏差项可以视为输入的权重,该输入始终为 1。
  • Overall training algorithm (e.g. epochs, random processing order)

    1. Initialize all weights to small random numbers.
    2. Repeat for a set number of epochs (passes through the entire training set):
    3. For each training example (in a shuffled, random order):
    • Calculate the perceptron's output.

    • If the output is incorrect, update the weights.

  • For the perceptron, the activation function is the sign function: 1 if above the threshold, -1 if below it. The loss function for a perceptron is "zero-one loss". We get 1 for each wrong answer, 0 for each correct one. 对于感知器来说,激活函数是符号函数:高于阈值则为 1,低于阈值则为 -1。感知器的损失函数是“零一损失”。每个错误答案得 1,每个正确答案得 0。

  • Softmax: Softmax is a differentiable function 可微函数 that approximates this discrete behavior. It's best thought of as a version of argmax.

    • Suppose we have a sequence of classifier outputs, one per class. \(v_1,…,v_n\). Softmax maps 映射 each \(v_i\) to \(\frac{e^{v_i}}{∑_ie^{v_i}}\). The exponentiation accentuates the contrast between the largest value and the others, and forces all the values to be positive. The denominator normalizes all the values into the range [0,1].
  • Rule for updating perceptron weights 权重更新规则

    • If the prediction is wrong, nudge 调整 the weights to make the output closer to the correct answer (see the code sketch at the end of this section). The rule is: \(w_{new} = w_{old} + \alpha(y - \hat{y})x\)
      • Where: \(\alpha\) is the learning rate, \(y\) is the true label, \(\hat{y}\) is the predicted label, \(x\) is the input feature vector.
      • This training procedure will converge 收敛 if
        • data are linearly separable or
        • we throttle 限制 the size of the updates as training proceeds by decreasing \(α\)
  • Limitations of perceptrons and ways to address them

    • A single perceptron cannot solve non-linearly separable problems, with the classic example being the XOR problem.

    • The decision boundary can only be a line. 决策边界只能是一条线。

      Fixes:

      • use multiple units (neural net) 使用多个单元(神经网络)
      • massage input features to make the boundary linear 按摩输入特征使边界线性化
    • If there is overlap between the two categories, the learning process can thrash between different boundary positions. 如果两个类别之间存在重叠,学习过程就会在不同的边界位置之间波动。

      Fixes:

      • reduce learning rate as learning progresses随着学习的进展降低学习率
      • don't update weights any more than needed to fix mistake in current example 不要过度更新权重,除非是为了修复当前示例中的错误
      • cap the maximum change in weights (e.g. this example may have been a mistake) 限制权重的最大变化(例如,这个例子可能是一个错误)
    • If there is a gap between the two categories, the training process may have trouble deciding where to place the boundary line. 如果两个类别之间存在差距,训练过程可能难以决定将边界线放在哪里。

      Fix: switch to a closely-related learning method called a "support vector machines" (SVM's). The big idea for an SVM is that only examples near the boundary matter. So we try to maximize the distance between the closest sample point and the boundary (the "margin"). 解决方法:切换到一种密切相关的学习方法,称为“支持向量机”(SVM)。SVM 的核心思想是只有边界附近的样本才重要。因此,我们尝试最大化最近样本点与边界之间的距离(“边距”)。

  • Multi-class perceptrons

    • To handle more than two classes, you can use a one-vs-all approach, where you train a separate perceptron for each class to distinguish it from all the others. 一对多方法,即为每个类别训练一个单独的感知器. Each has its own set of weights. Join the individual outputs into one output using argmax (i.e., pick the class that has the largest output value).
  • Comparison to Naive Bayes

    • Perceptron is not naive bayes, because:
      • The feature values in Naive Bayes were probabilities, but the values here are just numbers, not in the range [0,1]. We don't know whether all of them are on the same scale.
      • Naive Bayes assumes features are conditionally independent, while the Perceptron does not. It learns a weight for each feature, and these weights are adjusted based on the entire input vector simultaneously. This allows it to learn the relationships between correlated features. For example, it can learn to down-weight "Francisco" if "San" is already present, treating "San Francisco" more like a single signal rather than two independent ones.
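
A minimal sketch of the perceptron training loop and update rule described above (epochs, shuffled processing order, update only on mistakes). The tiny linearly separable dataset, learning rate, and epoch count are assumptions for illustration:

```python
import random

# Toy linearly separable data: (features, label) with labels in {-1, +1}.
data = [([2.0, 1.0], 1), ([1.5, 2.0], 1), ([-1.0, -0.5], -1), ([-2.0, -1.5], -1)]

w = [0.0, 0.0, 0.0]      # weights; w[-1] is the bias (weight for a constant input of 1)
alpha = 0.1              # learning rate

def predict(x):
    s = sum(wi * xi for wi, xi in zip(w, x + [1.0]))   # weighted sum plus bias
    return 1 if s >= 0 else -1                         # sign activation function

for epoch in range(10):                 # several passes (epochs) over the training set
    random.shuffle(data)                # random processing order each epoch
    for x, y in data:
        y_hat = predict(x)
        if y_hat != y:                  # update weights only on mistakes
            for i, xi in enumerate(x + [1.0]):
                w[i] += alpha * (y - y_hat) * xi

print(w, [predict(x) for x, _ in data])
```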

Neural Nets

Historical Figures

  • Yann LeCun: A Turing Award-winning computer scientist and a "godfather" of deep learning.
    • Relation to AI: He is the primary pioneer of Convolutional Neural Networks (CNNs). His 1998 work on LeNet-5 for recognizing handwritten digits laid the foundation for modern computer vision. He is currently the Chief AI Scientist at Meta (FAIR).

Linear Classifiers

  • Sample activation functions: An activation function introduces non-linearity, allowing the model to learn more than just a simple line or plane.

    • Sigmoid: \(f(x) = \frac{1}{1 + e^{-x}}\). It "squashes" any real-valued number into a (0, 1) range. This is useful for binary classification, as the output can be interpreted as a probability.
    • ReLU (Rectified Linear Unit): \(f(x) = \max(0, x)\). It's very simple: if the input \(x\) is positive, the output is \(x\); if it's negative, the output is 0. This is the most common activation function because it's computationally fast and helps prevent the vanishing gradient problem.
  • Sample loss functions: The loss function is a "penalty score"惩罚分数 that measures how wrong the model's prediction is compared to the true label.

    • 0/1 Loss: Score = 0 if correct, 1 if wrong. Simple to understand, but not differentiable, so it can't be used for training with gradient descent.
    • L1 Loss (Mean Absolute Error): \(\sum |y_{true} - y_{pred}|\).
    • L2 Loss (Mean Squared Error): \(\sum (y_{true} - y_{pred})^2\). This penalizes large errors more heavily than L1.
    • Cross-Entropy Loss: The standard for classification tasks. It measures the "distance" between the predicted probability distribution and the true distribution (e.g., the one-hot label). It heavily penalizes the model for being confidently wrong.
  • What are we minimizing? We are minimizing the final loss score. This score is the result of a chain of functions: Loss(Activation(Weighted_Sum_of_Features), True_Label). We adjust the weights to make this final score as low as possible.

  • Adjusting weights (Gradient Descent): The main update equation is:

    \(W_{new} = W_{old} - \eta \cdot \nabla L\)

    • \(W\) is a weight.
    • \(\eta\) (eta) is the learning rate (a small number that controls the step size).
    • \(\nabla L\) is the gradient of the loss with respect to the weight. The gradient is a vector that points in the direction of the steepest ascent of the loss. By subtracting it, we "step downhill" toward a lower loss.
  • Why differentiable? 可微 Gradient descent requires a gradient. The gradient is a derivative. If the activation and loss functions are not differentiable, we can't compute the gradient (\(\nabla L\)), and therefore we have no idea how to update the weights to improve the model.

  • One-hot representations 独热编码: A way to represent categorical data. You create a vector as long as your vocabulary (or number of classes), fill it with all zeros, and place a single '1' at the index corresponding to the specific category. (e.g., cat = [1, 0, 0], dog = [0, 1, 0], fish = [0, 0, 1]).

  • Softmax: An activation function used for multi-class classification 一种用于多分类的激活函数. It takes a vector of raw scores (logits) from the final layer and converts them into a probability distribution, where all outputs are between 0 and 1 and sum to 1. \(f(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}\).
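
A minimal NumPy sketch of how these pieces fit together (toy numbers and variable names of my own choosing): sigmoid, ReLU, softmax, cross-entropy loss, and one gradient-descent update of a single linear layer.

```python
import numpy as np

def sigmoid(x):            # squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):               # max(0, x), elementwise
    return np.maximum(0.0, x)

def softmax(z):            # turns raw scores into a probability distribution
    e = np.exp(z - z.max())               # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(p, one_hot_y):          # penalty for being confidently wrong
    return -np.sum(one_hot_y * np.log(p + 1e-12))

# One gradient-descent step for a softmax classifier with scores = W x
x = np.array([1.0, 2.0, -1.0])            # 3 input features
y = np.array([0.0, 1.0, 0.0])             # one-hot true label (class 1 of 3)
W = np.zeros((3, 3))                      # weights: 3 classes x 3 features
eta = 0.1                                 # learning rate

p = softmax(W @ x)                        # forward pass
loss = cross_entropy(p, y)
grad_W = np.outer(p - y, x)               # dL/dW for softmax + cross-entropy
W = W - eta * grad_W                      # W_new = W_old - eta * gradient
print(round(float(loss), 3), cross_entropy(softmax(W @ x), y) < loss)  # loss drops
```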

Neural Nets

  • Design (Hidden Layers): A neural net is simply a stack of linear classifiers, where the output of one layer (after its activation) becomes the input for the next. The layers between the input and output are called hidden layers.
  • Why non-linear activation? If you stack multiple linear layers, the result is just another linear function. (e.g., \(W_2 \cdot (W_1 \cdot x) = (W_2 \cdot W_1) \cdot x\)). You gain no expressive power. The non-linear activation (like ReLU) "bends" the decision boundary, allowing the network to learn complex, non-linear relationships.
  • Advantages of deep networks: They can learn a hierarchy of features. Early layers learn simple patterns (e.g., edges), and subsequent layers combine those patterns to learn more complex ones (e.g., shapes, textures, object parts).
  • What can they approximate? The Universal Approximation Theorem states that a neural net with just one hidden layer (and a non-linear activation) can approximate any continuous function to any desired accuracy, given enough neurons.
  • Top-level update equation 顶层更新方程: It's the same as for a linear classifier (\(W_{new} = W_{old} - \eta \cdot \nabla L\)), but the gradient \(\nabla L\) (which is \(\frac{\partial L}{\partial W}\)) is much more complex to compute.
  • Backpropagation 反向传播:
    • What it computes: It is the algorithm used to efficiently compute the gradient of the loss function with respect to every single weight in the network.
    • High-level picture: It uses the chain rule of calculus. It starts by calculating the error at the output layer. Then, it propagates this error signal backward, layer by layer, to determine how much each weight in the previous layer contributed to the final error.
    • Why forward values? To compute the gradient at a specific layer, you need two things: the error signal coming back from the next layer, and the activation value (from the forward pass) that was fed into that layer. 要计算特定层的梯度,你需要两样东西:来自一层的误差信号 ,以及输入该层的激活值 (来自前向传播)。
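
A toy NumPy sketch of backpropagation on a two-layer net with a ReLU hidden layer and squared-error loss (all shapes and values invented); note how the saved forward values `x` and `h` reappear in the backward pass.

```python
import numpy as np

# Tiny 2-layer net: x -> (W1, ReLU) -> h -> (W2) -> y_hat, squared-error loss.
np.random.seed(0)
x = np.array([1.0, -2.0])
y_true = np.array([0.5])
W1 = np.random.randn(3, 2) * 0.1      # hidden layer: 3 units
W2 = np.random.randn(1, 3) * 0.1      # output layer: 1 unit

# Forward pass (keep the intermediate values for the backward pass).
a = W1 @ x                            # pre-activation
h = np.maximum(0.0, a)                # ReLU activation
y_hat = W2 @ h
loss = 0.5 * np.sum((y_hat - y_true) ** 2)

# Backward pass: apply the chain rule layer by layer.
d_yhat = y_hat - y_true               # dL/dy_hat
dW2 = np.outer(d_yhat, h)             # needs the forward value h
d_h = W2.T @ d_yhat                   # error signal sent back to the hidden layer
d_a = d_h * (a > 0)                   # ReLU gradient: 1 where a > 0, else 0
dW1 = np.outer(d_a, x)                # needs the forward value x

eta = 0.1
W1 -= eta * dW1                       # gradient-descent updates
W2 -= eta * dW2
print(float(loss))
```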

Three challenges in training

  • Symmetry breaking (Why random weights?): If you initialize all weights to 0, every neuron in a hidden layer will compute the exact same function. They will all get the exact same gradient during backpropagation and all update to the same new value. They will never specialize. Random initialization breaks this symmetry, allowing each neuron to learn a different feature.
  • Regularization正则化 (Preventing Overfitting):
    • Dropout: During training, randomly "turn off" (set to zero) a fraction of neurons in a layer for each training pass. This forces the network to learn redundant representations and prevents it from relying too heavily on any single neuron.
    • Weight regularization 权重正则化 (L1/L2): Adding a regularization term to the loss function based on the size of the weights. This discourages the model from learning "extreme" (very large) weights, which are often a sign of overfitting.
  • Overfitting 过拟合: When the model learns the training data perfectly (including its noise) but fails to generalize to new, unseen data. It has high training accuracy but low test accuracy. The dropout technique will reduce this problem. Another method is "data augmentation".

    • Data augmentation 数据增强: Creating "new" training data by applying transformations to your existing data (e.g., for images: rotating, cropping, flipping, changing brightness). This helps the model generalize better and prevents overfitting.
  • Vanishing/exploding gradients 消失/爆炸梯度: A problem in very deep networks. During backpropagation, you multiply many gradients together.

    • If these gradients are consistently small (< 1), the signal shrinks exponentially and vanishes by the time it reaches the early layers, so they stop learning. 梯度过小:可能导致数值下溢,训练速度也会变慢。
    • If the gradients are large (> 1), the signal grows exponentially and explodes, leading to unstable training (loss becomes NaN). 梯度过大:数值可能会溢出
    • Solutions
      • Leaky ReLU: A variant 变体 of ReLU: \(f(x) = \max(\alpha \cdot x, x)\), where \(\alpha\) is a small number like 0.01. Instead of outputting 0 for negative inputs, it outputs a small "leaky" value. This ensures there is always some gradient, preventing neurons from "dying" (getting stuck in the zero-gradient region).
  • Epochs vs. (Mini-)Batches:

    • Epoch: One complete pass through the entire training dataset. 完整遍历整个训练数据集一次。
    • (Mini-)Batch: The training data is too large to process at once. So, we divide it into small, random chunks called mini-batches.
    • Iteration: A single update to the model's weights (one forward pass + one backward pass) using one mini-batch. 使用一个 mini-batch 对模型的权重进行一次更新(一次前向传播 + 一次后向传播)。
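
A schematic training loop (my own toy example; the actual forward/backward pass is elided) illustrating the epoch / mini-batch / iteration vocabulary:

```python
import numpy as np

# Toy dataset: 1000 examples, 10 features each (labels are random here).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 10)), rng.integers(0, 2, size=1000)

batch_size = 32
num_epochs = 3
iteration = 0

for epoch in range(num_epochs):                  # one epoch = one full pass over the data
    order = rng.permutation(len(X))              # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]    # one mini-batch
        X_batch, y_batch = X[idx], y[idx]
        # ... forward pass, loss, backward pass, weight update would go here ...
        iteration += 1                           # one iteration = one weight update

print(num_epochs, iteration)   # 3 epochs, ceil(1000/32) * 3 = 96 iterations
```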

Convolutional Neural Networks (CNNs)

  • What is convolution? A mathematical operation of sliding a small matrix (called a filter or kernel) over a larger matrix (the image). At each position, you compute the element-wise product and sum, creating a new "feature map."

  • How does a convolutional layer work? The layer is the filter (or a set of filters). The network learns the values (weights) inside the filter. 卷积层本质上就是一个滤波器(或一组滤波器)。神经网络会学习滤波器内部的值(权重)。In a CNN, each unit reads input only from a local region of the preceding layer 每个单元仅从前一层的一个局部区域读取输入

  • Emphasizing edges: A filter with weights like [[-1, -1, -1], [0, 0, 0], [1, 1, 1]] will detect horizontal edges. A filter with [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]] will detect vertical edges. The network learns these filters automatically.

  • Depth and stride:

    • Depth: the number of filters in the layer, i.e., how many stacked feature maps there are at each level. “深度” = 每个层级中堆叠的特征数量(滤波器的数量)。A depth of 64 means the layer is learning 64 different features simultaneously.
    • Stride: how many pixels we shift the mask sideways between output units. “步长” = 我们在输出单元之间横向移动掩码的像素数。Stride 1 moves 1 pixel at a time (high resolution); stride 2 moves 2 pixels (downsampling the output). (A small convolution-and-pooling sketch follows this list.)
  • Pooling layer 池化层: A layer that downsamples the feature map to reduce computational cost and make the model more robust to the exact location of a feature. 对特征图进行下采样以降低计算成本,并使模型对特征的确切位置更加鲁棒的层。

    • Max Pooling: Slides a window (e.g., 2x2) over the map and takes only the maximum value from that window. This is useful because it "keeps" the strongest signal that a feature was detected in that region.
  • Weight/parameter sharing 权重/参数共享: This is the key concept of CNNs. The exact same filter (with the same weights) is applied across the entire image. 这是卷积神经网络(CNN)的核心概念。 它将完全相同的滤波器 (权重相同)应用于整幅图像。This is powerful for two reasons:

    1. Efficiency 效率: Drastically reduces the number of parameters. You learn one 3x3 filter, not a separate weight for every pixel.
    2. Translation Invariance 平移不变性: A feature (like a horizontal edge) is detected the same way whether it's in the top-left or bottom-right corner.
  • Overall architecture:

    • A complete CNN typically contains three types of layers

      • convolutional 卷积
      • pooling 池化
      • fully-connected 完全互联

      Convolutional layers would typically be found in the early parts of the network, where we're looking at large amounts of image data. Fully-connected layers would make final decisions towards the end of the process, when we've reduced the image data to a smallish set of high-level features. 卷积层通常位于网络的早期阶段,用于处理大量图像数据。全连接层则在处理过程的后期阶段做出最终决策,此时图像数据已被简化为一组较小的高级特征。

    • Early vs. Late layers: Early layers (near the input) learn simple, low-level features (edges, colors, corners). Deeper, later layers combine these to learn complex, high-level features (textures, shapes, object parts, full objects).

    • Conv vs. Fully-connected (FC): Convolutional layers are used for feature extraction from spatial data. Fully-connected layers are used at the end of the network; they take the final, high-level feature maps (usually flattened into a 1D vector) and perform the classification task.

  • Data loaders: Utilities (like in PyTorch) that efficiently load your data. They handle creating mini-batches, shuffling the data, and loading it onto the GPU in the background so the GPU is never idle.
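
A minimal sketch of "valid" 2D convolution (strictly, cross-correlation, as in most deep-learning libraries) with a stride parameter, plus 2x2 max pooling, using the horizontal-edge filter mentioned above; the image and helper names are invented.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """'Valid' cross-correlation: slide the kernel, multiply elementwise, sum."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)
    return out

def max_pool(fmap, size=2):
    """Keep only the strongest response in each size x size window."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

# Image with a horizontal edge: dark rows on top, bright rows below.
image = np.vstack([np.zeros((4, 6)), np.ones((4, 6))])
horiz_edge = np.array([[-1, -1, -1],
                       [ 0,  0,  0],
                       [ 1,  1,  1]])   # responds strongly at the edge row
fmap = conv2d(image, horiz_edge, stride=1)
print(fmap)               # large values only where the edge is
print(max_pool(fmap))     # downsampled feature map
```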

Generative & Adversarial

  • Generative Adversarial Neural Network (GAN)生成式对抗网络: A generative model (one that creates new data) consisting of two competing networks:

    1. The Generator (G): A "counterfeiter" that takes random noise as input and tries to create fake data (e.g., an image) that looks real.
    2. The Discriminator (D): A "detective" (just a binary classifier) that tries to tell the difference between real data (from the training set) and fake data (from the Generator).
    • They are trained in a zero-sum game: the Generator gets better at fooling the Discriminator, and the Discriminator gets better at catching the fakes.
    • A GAN consists of two neural networks that jointly learn a model of the input data: the discriminator tries to distinguish real training images from similar fakes, while the generator tries to produce convincing fakes. Such networks can generate strikingly realistic images (e.g., photorealistic dogs), but they also fail in strange ways (e.g., malformed frogs).
  • Adversarial examples:

    • Definition: Inputs that are intentionally perturbed with a tiny bit of noise 加噪声, imperceptible to humans, that causes a trained model to make a completely wrong (and often high-confidence) prediction.
    • In computer vision: Adding a specific, "noisy" pattern to an image of a "panda" can make a model classify it as a "gibbon" with 99% confidence.
    • In natural language: Adding invisible characters, swapping synonyms, or making subtle spelling changes to trick a spam filter or sentiment analyzer.
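
A toy FGSM-style sketch of the idea (my own construction, not the panda/gibbon networks above): against a fixed logistic-regression "classifier", nudge the input a small step in the direction of the sign of the loss gradient. The step size is exaggerated for this 4-feature example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A fixed, already-trained linear "classifier": P(class 1) = sigmoid(w . x)
w = np.array([2.0, -4.0, 1.0, 6.0])
x = np.array([0.2,  0.1, 0.4, 0.3])    # the original input
y = 1.0                                # true label

# Gradient of the cross-entropy loss w.r.t. the INPUT (not the weights):
# dL/dx = (sigmoid(w.x) - y) * w
grad_x = (sigmoid(w @ x) - y) * w

# FGSM-style attack: small step in the direction that INCREASES the loss.
eps = 0.25
x_adv = x + eps * np.sign(grad_x)

print(sigmoid(w @ x))       # ~0.90: confident and correct on the clean input
print(sigmoid(w @ x_adv))   # ~0.26: prediction flips on the perturbed input
```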

Vector Semantics 向量语义

Historical Figures

  • Eve Clark: A prominent professor of Linguistics at Stanford. Her work focuses on language acquisition in children, particularly how they learn the meaning of words (semantics). “对比原则”指出,形式的差异蕴含着意义上的差异。
    • Relation to AI: Her research, especially on the Principle of Contrast (a word's meaning is defined by what it is not), provides a cognitive and linguistic foundation for computational models. This idea is practically implemented in NLP models like Word2vec through negative sampling, where a model learns what a word is by contrasting it with what it is not.
  • Tomas Mikolov: An AI researcher (formerly at Google, now at CIIRC) who is a key figure in modern NLP.
    • Relation to AI: He is the lead author of the Word2vec papers(2013). He developed the highly efficient Skip-gram and CBOW models, which revolutionized NLP by making it possible to train high-quality word embeddings on massive datasets.
  • J. R. Firth: A 20th-century British linguist famous for the quote: "You shall know a word by the company it keeps."(1957) 从词语所处的语境就能判断其含义
    • Relation to AI: This quote is the single most important concept underpinning distributional semantics. This is the entire theory behind modern word embeddings: a word's meaning can be represented by a vector of the words that commonly appear near it.

Word meaning

  • Logic-based vs. Context-based:
    • Logic-based: Defines meaning by its relationships in a formal system (e.g., WordNet). Poodle -> is-a -> Dog -> is-a -> Canine. Relies on human experts.
    • Context-based (Distributional): Defines meaning by its usage and neighbors (Firth's "company it keeps"). This is the basis of all modern embeddings.
  • Principle of Contrast: (See Eve Clark). The "Principle of Contrast" states that differences in form imply differences in meaning. A word's meaning is partly defined by what it is not. This is used to distinguish "dog" from "cat," not just to relate "dog" to "pet."
  • Vector representations:
    • One-hot 独热编码: A sparse vector as long as the vocab, with a '1' at the word's index. Useless for semantics, as the dot product of any two different words is 0 (meaning "no similarity").
    • Word embeddings 词嵌入: represent each word as a vector of numerical feature values. Dense, low-dimensional (e.g., 300) vectors. Similar words will have similar vectors (a high cosine similarity).
  • Cosine similarity: Measures the angle between two vectors, ignoring their magnitude. It's the standard way to measure the similarity between two word embeddings.
  • Analogies & Compositionality:
    • Analogies: Relationships are captured as vector offsets. The classic example: \(vector('King') - vector('Man') + vector('Woman') \approx vector('Queen')\).
    • Compositionality 组合性: You can (naively) get a meaning for a phrase by adding or averaging the vectors of its words (e.g., \(vector('blue') + vector('cheese')\)).
  • How to evaluate them:
    • Intrinsic 内在: Test on analogy tasks (e.g., man:king :: woman:?) or by comparing vector similarity scores to human-rated similarity scores.
    • Extrinsic 外在: Use the embeddings as the input layer for a downstream task (like sentiment analysis). If the task's performance improves, the embeddings are good.
  • Building feature vectors:
    • Relating words to documents (e.g., TF-IDF): A word's vector is based on which documents it appears in.
    • Relating words to words (e.g., Word2vec): A word's vector is based on which other words appear nearby. This is the dominant method.
  • Normalization and smoothing 归一和平滑处理:
    • What's wrong with raw vectors? Raw co-occurrence counts are skewed. They are dominated by frequent, semantically-weak words (like "the," "a") that appear with everything.
    • PMI (Pointwise Mutual Information 点互信息): A better measure. It asks: "How much more often do word \(w\) and context \(c\) co-occur than we'd expect by chance?" \(PMI(w, c) = \log \frac{P(w, c)}{P(w)P(c)}\)
    • When is PMI negative? When two words co-occur less frequently than expected by chance.
    • Reliable vs. Noise? Negative PMI values are often unreliable (just noise). This is because for rare words, we may never see them together, but this lack of evidence doesn't mean they are anti-correlated. 负的 PMI 值通常不可靠 (只是噪音)
    • PPMI (Positive PMI): We fix this by replacing all negative PMI values with 0. 将所有负 PMI 值替换为 0。\(PPMI(w, c) = \max(0, PMI(w, c))\).
  • SVD (Singular Value Decomposition) / PCA 奇异值分解: A dimensionality reduction 降维technique. You can take the huge, sparse PPMI matrix and use SVD to "compress" it into short, dense vectors. These dense vectors are your word embeddings.
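
A toy sketch of the pipeline above: invented word-context co-occurrence counts, PPMI weighting, and cosine similarity between the resulting row vectors.

```python
import numpy as np

words    = ["dog", "cat", "car"]
contexts = ["pet", "fur", "engine"]
# Invented co-occurrence counts: counts[i, j] = #(word i appears near context j)
counts = np.array([[10.0, 8.0, 1.0],     # dog
                   [ 9.0, 9.0, 0.0],     # cat
                   [ 1.0, 0.0, 12.0]])   # car

total = counts.sum()
P_wc = counts / total                       # joint P(w, c)
P_w  = P_wc.sum(axis=1, keepdims=True)      # marginal P(w)
P_c  = P_wc.sum(axis=0, keepdims=True)      # marginal P(c)

with np.errstate(divide="ignore"):          # log(0) -> -inf, clipped below
    pmi = np.log(P_wc / (P_w * P_c))
ppmi = np.maximum(pmi, 0.0)                 # PPMI: negative PMI is noisy, so set it to 0

def cosine(u, v):                           # angle-based similarity, ignores vector length
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(ppmi[0], ppmi[1]))   # dog vs cat: high
print(cosine(ppmi[0], ppmi[2]))   # dog vs car: low
```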

Sequential Neural Nets

Byte-pair encoding (BPE)

BPE (byte-pair encoding) 字节对编码 and WordPiece rely primarily on compression-type criteria rather than linguistic structure (prefixes and suffixes).

  • BPE builds a dictionary (vocabulary) of all the tokens.
  • It is a tokenization method.
  • At each training iteration, BPE counts how often each pair of adjacent tokens (within-word tokens only) occurs in the training corpus, then merges the most frequent pair into a new token.
  • BPE tokens often divide words in ways that aren't linguistically meaningful, e.g. "cats" might be divided into "ca" and "ts"
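
A minimal sketch of the BPE training loop described above (counting adjacent within-word token pairs and merging the most frequent pair); the corpus and helper names are invented.

```python
from collections import Counter

# Tiny "corpus": each word is a tuple of tokens (start from characters).
corpus = {
    ("c", "a", "t"): 5,        # word -> frequency
    ("c", "a", "t", "s"): 3,
    ("c", "a", "r"): 2,
}

def most_frequent_pair(corpus):
    """Count adjacent token pairs WITHIN each word, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Replace every occurrence of the pair with a single merged token."""
    merged = {}
    for word, freq in corpus.items():
        new_word, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

for _ in range(2):                       # two BPE training iterations
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(pair, corpus)
# First merge is ("c", "a") -> "ca", so "cats" is happily split as "ca" + "t" + "s".
```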

📈 Word2vec

  • Main outline of algorithm (Skip-gram):
    • It's a "fake" task. The goal: given a focus word (e.g., "fox"), predict its context words (e.g., "quick," "brown," "jumps").
    • The "model" is a simple, shallow neural net. 简单的浅层神经网络 The weights of its single hidden layer are the word embeddings.
    • We train this simple net on billions of (focus, context) pairs. When we're done, we throw away the network and keep the hidden layer weights (the embedding matrix).
  • Why words and contexts embedded separately? Word2vec actually learns two embedding matrices (\(W_{word}\) and \(W_{context}\)). This is a design choice that makes the math and optimization (like negative sampling) work out cleanly. At the end, we either discard the context matrix or average them.
  • Negative sampling 负采样: A highly efficient training method. Instead of using a huge, costly Softmax to predict the correct context word out of the entire vocab, we:
    1. Take the true pair (e.g., "fox," "jumps") as a positive example.
    2. Randomly sample \(k\) negative examples from the vocab (e.g., "fox," "table"; "fox," "sky").
    3. Train a simple logistic regression model to distinguish the positive example (output '1') from the \(k\) negative examples (output '0').
  • Sigmoid function: Used in negative sampling. The model outputs a raw score for a (word, context) pair. The sigmoid function turns that score into a probability (0-1) of that pair being "real."
  • Word2vec details:
    • Uses more negative examples: We typically use \(k=5\) to \(20\) negative examples for every 1 positive example. This is how the model learns the "contrast" (see Eve Clark).
    • Raising context counts to a power 将上下文信息提升到幂次方: When sampling negative examples, we don't pick from the raw word frequencies. We sample from a distribution of \(P(w)^{0.75}\). This increases the probability of sampling rare words, making for better "hard" negative examples.
    • Weighting context words: Words closer to the focus word are more important. This is often handled by using a dynamic context window, where words closer to the center are more likely to be sampled as positive examples.
    • Deleting rare words, subsampling frequent ones:
      • Rare: Words appearing < 5 times are often just noise and are discarded.
      • Frequent (Subsampling): Extremely common words like "the" or "a" provide little semantic value. They are randomly deleted from the training data with a high probability. This speeds up training and allows the model to learn more from rarer, more meaningful words.
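
A compact sketch of skip-gram with negative sampling under the choices above (separate word and context matrices, \(k\) negatives drawn from \(P(w)^{0.75}\)); the corpus, dimensions, and hyperparameters are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
sentence = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(sentence))
idx = {w: i for i, w in enumerate(vocab)}

# Negative-sampling distribution: raw counts raised to the 0.75 power.
counts = np.array([sentence.count(w) for w in vocab], dtype=float)
neg_probs = counts ** 0.75
neg_probs /= neg_probs.sum()

dim, k, eta = 8, 5, 0.05
W_word = rng.normal(0, 0.1, (len(vocab), dim))    # focus-word embeddings
W_ctx  = rng.normal(0, 0.1, (len(vocab), dim))    # separate context embeddings

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_pair(focus, context):
    """One skip-gram-with-negative-sampling update: 1 real pair + k fake pairs."""
    f = idx[focus]
    targets = [(idx[context], 1.0)]                                 # real pair -> label 1
    targets += [(c, 0.0) for c in rng.choice(len(vocab), size=k, p=neg_probs)]
    for c, label in targets:                                        # sampled pairs -> label 0
        wf, wc = W_word[f].copy(), W_ctx[c].copy()
        grad = sigmoid(wf @ wc) - label                             # logistic-regression gradient
        W_word[f] -= eta * grad * wc
        W_ctx[c]  -= eta * grad * wf

# Slide a window of radius 2 over the sentence to get (focus, context) pairs.
for i, focus in enumerate(sentence):
    for j in range(max(0, i - 2), min(len(sentence), i + 3)):
        if j != i:
            train_pair(focus, sentence[j])
```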

Sequential Neural Nets

  • These are designed to handle sequences of data, where the output for the current element depends on previous ones. Examples include RNNs and Transformers.

Historical Names (Transformers)

  • Vaswani et al. (2017): Authors of the paper "Attention Is All You Need", which introduced the Transformer architecture, completely replacing recurrent layers with self-attention mechanisms.

BERT (Bidirectional Encoder Representations from Transformers)

基于transformer的双向编码器表示

  • Type: A Masked Language Model (MLM) and next sentence prediction model. It is bidirectional (it considers both left and right context simultaneously).
  • Structure: Only uses the Encoder stacks of the Transformer.
  • Training: Pre-trained on two tasks: Masked Language Modeling (predicting masked words) and Next Sentence Prediction (predicting if two sentences follow each other).
  • Good for: Classification and sequence-level tasks, such as sentiment analysis, question answering (SQuAD), and sentence similarity.

Example Autoregressive LLMs

  • GPT (Generative Pre-trained Transformer), Llama, DeepSeek, Claude
  • Autoregressive: These models predict the next token in a sequence, conditioning on all previously generated tokens. They are generally used for generation tasks.
| Feature 特征 | Masked LLM (e.g., BERT) 掩码 LLM(例如 BERT) | Autoregressive LLM (e.g., GPT) 自回归语言模型(例如 GPT) |
|---|---|---|
| Prediction 预测 | Predicts tokens that are masked out in the input. 预测输入中被屏蔽的标记。 | Predicts the next token in the sequence. 预测序列中的下一个标记。 |
| Context 语境 | Bidirectional (sees context from both sides). 双向(同时看到左右两侧的上下文)。 | Unidirectional (sees only preceding context). 单向(仅看到前面的上下文)。 |
| Use Case 用例 | Classification, filling in the blanks, understanding. 分类、填空、理解。 | Generation, translation, conversation. 生成、翻译、对话。 |

Language Modeling Fundamentals

N-gram Language Model Folks

  • Markov 马尔柯夫: The foundation of the Markov assumption used in \(N\)-gram models: the probability of the current word depends only on the previous \(N-1\) words, \(P(w_i | w_{i-1}, \ldots, w_1) \approx P(w_i | w_{i-1}, \ldots, w_{i-N+1})\).
  • Shannon 香农: Developed Information Theory and demonstrated how \(N\)-grams could be used to generate text that is statistically similar to human language.
  • Jelinek & Baker: Key figures in the development of statistical language modeling, particularly for speech recognition in the 1970s and 80s (e.g., the Jelinek-Mercer smoothing method).

Input and Output

  • Tokenization分词 (e.g., Byte-pair encoding, BPE): The process of breaking down raw text into smaller units (tokens). BPE is a data-compression technique that finds the most frequently occurring adjacent characters/tokens and merges them into a new single token, creating a vocabulary that balances size and word coverage.
  • Convert tokens to vectors (e.g., word2vec): Tokens are converted into dense, continuous numerical embeddings (vectors) that capture semantic meaning. Word2vec is a method for learning these embeddings.
  • Positional Encoding: Since Transformers process all tokens simultaneously (no inherent sequence order), positional encoding is added to the input embeddings to inject information about the relative or absolute position of the tokens in the sequence.

Recurrent Neural Networks (RNNs) 循环神经网络

Recurrent neural nets (RNNs) are neural nets that have connections that loop back from a layer to the same layer. 一种具有从一层到同一层的环路连接的神经网络

High-Level View of How They Work

  • An RNN maintains a hidden state vector (\(h_t\)) that is a function of the current input (\(x_t\)) and the previous hidden state (\(h_{t-1}\)). This allows information to persist and flow through the sequence.
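
A minimal sketch of that recurrence, \(h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b)\), applied over a short invented sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
W_xh = rng.normal(0, 0.3, (hidden_dim, input_dim))   # input -> hidden
W_hh = rng.normal(0, 0.3, (hidden_dim, hidden_dim))  # hidden -> hidden (the "loop")
b    = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """h_t = tanh(W_xh x_t + W_hh h_{t-1} + b): new state mixes input and memory."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

# Run the same cell over a whole sequence, carrying the hidden state forward.
sequence = rng.normal(size=(5, input_dim))            # 5 time steps
h = np.zeros(hidden_dim)                              # initial state
for x_t in sequence:
    h = rnn_step(x_t, h)
print(h)   # final hidden state summarizes the whole sequence (sequence-to-one use)
```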

When to Compute Loss

  • Loss from last unit only: Used for sequence-to-one tasks, like classification (e.g., sentiment analysis of a movie review, only the final state is used for the output).
  • Loss summed over all units: Used for sequence-to-sequence or sequence-to-label tasks, like language modeling (predicting the next word at every step) or Part-of-Speech tagging.

Bidirectional RNN (BiRNN)

双向循环神经网络

  • Consists of two independent RNNs that process the sequence: one forward (left-to-right) and one backward (right-to-left). The output at any step is a concatenation of the hidden states from both directions, allowing the model to incorporate future context.

Gated RNN (e.g., LSTM, GRU)

门控循环神经网络(例如,LSTM、GRU)

  • A "Gated RNN" 门控RNN (like LSTM - Long Short-Term Memory, or GRU - Gated Recurrent Unit) differs from a standard RNN by introducing gates (multiplicative units based on the \(\sigma\) and \(\tanh\) activation functions).
  • Purpose: These gates regulate the flow of information, specifically solving the vanishing/exploding gradient problem 梯度爆炸 and allowing the model to selectively remember or forget past information.

Encoder-Decoder and Attention

Encoder-Decoder Architecture

  • Basic Idea: Used for Sequence-to-Sequence (Seq2Seq) tasks (e.g., Machine Translation).
    • Encoder: Reads the entire input sequence (source language) and converts it into a single context vector (or a sequence of context vectors).
    • Decoder: Takes the context vector and generates the output sequence (target language) one element at a time.

(Figure: Encoder-Decoder architecture)

What Input to Each Step in Decoder?

  • The input to the decoder at time step \(t\) is typically a combination of:
    1. The context vector (from the encoder). 上下文
    2. The previous hidden state (\(h_{t-1}\)) of the decoder. 隐藏状态
    3. The token/word generated in the previous time step (\(y_{t-1}\)). 前一个时间生成的词

Teacher Forcing

  • A training technique where, instead of feeding the decoder's own predicted output \(y'_{t-1}\) from the previous step as input for the current step, we feed the true target output \(y_{t-1}\) from the training data. This stabilizes training and speeds up convergence.

Attention

  • Weighted sum of vectors in context window: Attention calculates a context vector as a weighted sum of the values (vectors) from the input sequence (context window). The weights determine how much "attention" the model pays to each input element.
  • Assessing similarity: The attention mechanism calculates the alignment/similarity between a Query vector (from the decoder/current token) and Key vectors (from the encoder/other tokens). This similarity is often computed using a dot product of the vectors, sometimes after linear transformations with learned weight matrices (\(W_Q, W_K, W_V\)).
  • "Attention Head": A single complete attention mechanism (Query, Key, Value computation). Multi-Head Attention 多头注意uses multiple heads to allow the model to jointly attend to information from different representation subspaces at different positions.

Transformer Blocks and LLMs

Transformer Blocks

  • High-level components: A standard Transformer Block contains two main sub-layers:

    1. Multi-Head Self-Attention mechanism.
    2. A position-wise Feed-Forward Network (FFN). 前馈神经网络
    • Both sub-layers are followed by a Layer Normalization and a Residual Connection. 归一化、残差连接

Residual Connections

  • Also known as skip connections 跳跃连接, they add the input of a sub-layer to its output, i.e., \(Output = Input + \text{SubLayer}(Input)\).
  • Purpose: Helps with training very deep networks by allowing gradients to flow directly back through the network, mitigating the vanishing gradient problem.

LLMs: Masked vs. Autoregressive

| Feature | Masked LLM (e.g., BERT) | Autoregressive LLM (e.g., GPT) |
|---|---|---|
| Prediction | Predicts tokens that are masked out in the input. | Predicts the next token in the sequence. |
| Context | Bidirectional (sees context from both sides). | Unidirectional (sees only preceding context). |
| Use Case | Classification, filling in the blanks, understanding. | Generation, translation, conversation. |

Pre-training, Fine-tuning, Task Head

  • Pre-training: The initial, computationally intensive phase where the model learns general language representations (grammar, facts, semantics) on a massive, unlabeled text corpus using tasks like MLM or next-token prediction.
  • Fine-tuning: The phase after pre-training where the model is further trained on a smaller, labeled dataset for a specific downstream task (e.g., classification, translation). This adjusts the pre-trained weights to specialize the model.
  • Task Head: A small, simple neural network layer (e.g., a linear layer with a softmax) added on top of the pre-trained model's output, responsible for converting the final hidden state into the specific output format required by the task (e.g., probability distribution over classes).

Self-training Autoregressive Model (Generative Pre-training)

  • The primary method for training models like GPT. The model is trained to predict the next token in the sequence, based on the tokens that came before it. The labels are simply the next tokens in the input text itself—hence, self-supervised or autoregressive training.

Using Autoregressive Model

  • The process is sequential:
    1. User provides a prompt.
    2. The model predicts the first new token based on the prompt.
    3. The predicted token is appended to the prompt.
    4. The model uses the new, extended sequence to predict the next token.
    5. This repeats until a stop condition (e.g., end-of-sequence token) is met.
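
A sketch of that loop with greedy decoding; `next_token_probs` is a hard-coded stand-in for the LLM, not a real model:

```python
import numpy as np

vocab = ["<eos>", "the", "cat", "sat", "on", "mat"]

def next_token_probs(tokens):
    """Stand-in for an LLM: returns P(next token | tokens). Here it is a hard-coded
    toy that walks through 'the cat sat on the mat' and then emits <eos>."""
    canned = ["the", "cat", "sat", "on", "the", "mat", "<eos>"]
    target = canned[min(len(tokens), len(canned) - 1)]
    probs = np.full(len(vocab), 0.01)
    probs[vocab.index(target)] = 1.0
    return probs / probs.sum()

prompt = ["the"]                              # 1. user provides a prompt
tokens = list(prompt)
while True:
    probs = next_token_probs(tokens)          # 2. model predicts the next token
    token = vocab[int(np.argmax(probs))]      #    (greedy decoding: take the argmax)
    if token == "<eos>" or len(tokens) > 20:  # 5. stop condition
        break
    tokens.append(token)                      # 3./4. append and feed the longer sequence back in
print(" ".join(tokens))                       # "the cat sat on the mat"
```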

Prompt Engineering

  • The art and science of designing input prompts to effectively guide an LLM to generate a desired, high-quality, and specific output. Techniques include:
    • In-context learning/Few-shot prompting: Providing examples within the prompt.
    • Chain-of-Thought (CoT) prompting: Asking the model to "think step-by-step."
    • Role-playing: Assigning the model a persona.

Approximate Scale of LLMs

  • Parameters: Ranging from billions (e.g., Llama 7B) to trillions (e.g., GPT-4 is estimated to be over a trillion parameters).
  • Training Data: Terabytes of text data (e.g., CommonCrawl, books, Wikipedia).
  • Compute: Training requires enormous computational resources, often measured in Petaflop/s-days (e.g., training a large model can cost millions of dollars).

Current Limitations of LLMs and Testing LLMs

  • Limitations:
    • Hallucination: Generating factually incorrect or nonsensical information with high confidence.
    • Lack of common sense/reasoning: Struggles with complex, multi-step logical reasoning or basic common sense.
    • Bias: Reflecting biases present in the training data (social, ethical, racial, etc.).
    • Context Window Limits: Difficulty maintaining coherence and relevant context over very long sequences.
  • Testing: Requires diverse benchmarks that test: factual knowledge, reasoning/math, robustness to adversarial prompts, and safety/bias assessment.

Model Collapse

  • A phenomenon where a model's performance degrades over time, often due to being trained primarily on data generated by other models (Synthetic Data). This leads to a loss of diversity and quality in the generated data, with the model eventually forgetting how to generate data that reflects the true, original distribution.

Markov Decision Processes (MDPs)

Model and Terminology for an MDP

An MDP is a mathematical framework for sequential decision-making where outcomes are partly random and partly under the control of a decision-maker.

  • Components:
    • \(\mathbf{S}\): Set of States.
    • \(\mathbf{A}\): Set of Actions.
    • \(\mathbf{T}(s, a, s')\) or \(\mathbf{P}(s' | s, a)\): Transition Probabilities (probability of ending in state \(s'\) given state \(s\) and action \(a\)).
    • \(\mathbf{R}(s, a)\) or \(\mathbf{R}(s, a, s')\): Reward Function (the reward received for being in state \(s\) and taking action \(a\)).
    • \(\mathbf{\gamma}\): Discount Factor (\(0 \le \gamma \le 1\)), which discounts future rewards.

Quantized Representation of Continuous State Variables

  • When state variables are continuous (e.g., joint angles in robotics), an MDP requires quantization (discretization) of the state space to make it solvable.
  • Randomized Actions: Used in stochastic environments. Even when the action is deterministic, the transition to the next state might be governed by a probability distribution (e.g., a robot's wheel might slip).

Bellman Equation

  • The core equation for solving MDPs. It states that the optimal utility (or value) of a state \(s\), \(V^*(s)\), is equal to the maximum expected immediate reward plus the discounted expected utility of the next state.

    \[V^*(s) = \max_{a} \left( R(s, a) + \gamma \sum_{s'} P(s'|s, a) V^*(s') \right)\]
  • It is a self-consistency equation that an optimal value function must satisfy.

Methods of Solving the Bellman Equation

  • Goal: Find \(V^*\) (the optimal utility function) and \(\pi^*\) (the optimal policy).

  • Value Iteration: Starts with an arbitrary \(V_0(s)\) and iteratively applies the Bellman update rule until convergence. It finds the optimal value function \(V^*\).

  • Policy Iteration:

    1. Policy Evaluation: For a fixed policy \(\pi\), calculate its utility \(V^\pi(s)\).
    2. Policy Improvement: Use \(V^\pi(s)\) to find a better policy \(\pi'\) by acting greedily with respect to \(V^\pi\).
    • Repeats until the policy no longer improves. Often converges faster than Value Iteration.
  • Asynchronous Dynamic Programming: Updates the utilities of states one at a time, in an arbitrary order, instead of simultaneously updating all states in one sweep. This is useful when the state space is too large for full sweeps (e.g., Prioritized Sweeping).
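
A minimal Value Iteration sketch on an invented two-state, two-action MDP (the Bellman update applied until convergence):

```python
import numpy as np

# Tiny MDP: 2 states, 2 actions. P[s, a, s'] and R[s, a] are invented numbers.
P = np.array([[[0.8, 0.2],    # state 0, action 0
               [0.1, 0.9]],   # state 0, action 1
              [[0.5, 0.5],    # state 1, action 0
               [0.0, 1.0]]])  # state 1, action 1
R = np.array([[1.0, 0.0],     # R(s, a)
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(200):                        # Bellman update until (near) convergence
    Q = R + gamma * (P @ V)                 # Q[s, a] = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
    V_new = Q.max(axis=1)                   # V*(s) = max_a Q(s, a)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)                   # pi*(s) = argmax_a Q(s, a)
print(V.round(3), policy)
```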

How to Choose a Policy?

  • Once the optimal utility function \(V^*(s)\) is known, the optimal policy \(\pi^*(s)\) is chosen by taking the action that maximizes the expected discounted future reward for each state:

    \[\pi^*(s) = \arg\max_{a} \left( R(s, a) + \gamma \sum_{s'} P(s'|s, a) V^*(s') \right)\]

🎯 Reinforcement Learning (RL)

Basic Setup for Reinforcement Learning (Main Loop)

  • RL Agent interacts with an Environment in discrete time steps.
    1. Observe the current State \(s_t\).
    2. Select and execute an Action \(a_t\) based on its Policy \(\pi\).
    3. Receive a Reward \(r_{t+1}\) and the Next State \(s_{t+1}\).
    4. Update the policy/value function based on the experience \(\langle s_t, a_t, r_{t+1}, s_{t+1} \rangle\).
    5. Repeat.

Model-based Reinforcement Learning

  • The agent learns or is given an explicit model of the MDP, i.e., the Transition Probabilities \(P(s'|s, a)\) and the Reward Function \(R(s, a)\).
  • Process: Learns the model, then uses planning methods (like Value or Policy Iteration) to solve the MDP and find the optimal policy.

Model-free Reinforcement Learning

  • The agent does not learn or use an explicit model of \(P(s'|s, a)\) or \(R(s, a)\).
  • Instead, it learns the Value Function \(V(s)\) or the Action-Value Function \(Q(s, a)\) directly from the experience gained by interacting with the environment.

Q-learning Version of Bellman Equation (Q-learning Update Rule)

  • Q-learning is an off-policy model-free RL algorithm. It learns the optimal action-value function \(Q^*(s, a)\).

  • The target value in the update is based on the maximum Q-value in the next state, independent of the policy being followed (off-policy).

    \[Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]\]
  • The \(Q\) function is expressed in terms of itself without explicit reference to \(V\) or \(P\).
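
A toy Q-learning sketch with \(\epsilon\)-greedy exploration on an invented four-state chain environment (all constants are arbitrary):

```python
import numpy as np

# Toy 4-state chain: action 1 ("right") moves toward state 3, which gives reward +1
# and ends the episode; action 0 ("left") drifts back. All numbers are invented.
n_states, n_actions = 4, 2
alpha, gamma, epsilon = 0.5, 0.9, 0.2
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward, s_next == n_states - 1     # next state, reward, done?

for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy: explore with probability epsilon, otherwise exploit
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s_next, r, done = step(s, a)
        # Q-learning (off-policy): the target uses max over next actions
        target = r + gamma * (0.0 if done else Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next

print(Q.round(2))          # greedy policy = argmax along each row
print(Q.argmax(axis=1))    # expected: action 1 ("right") in states 0-2
```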

TD Update Algorithm (Temporal Difference Learning)

  • A model-free method that updates the value function \(V(s)\) based on the difference between two successive estimates:

    \[V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]\]
  • The term \(\left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]\) is the TD Error.

SARSA Update Algorithm (State-Action-Reward-State-Action)

  • A model-free method that updates the action-value function \(Q(s, a)\). It is an on-policy algorithm.

    \[Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]\]
  • Key: The next action \(a'\) is chosen using the current policy \(\pi\) (on-policy).

How do TD and SARSA Differ?

  • TD: Learns the state-value function \(V(s)\).
  • SARSA: Learns the action-value function \(Q(s, a)\).
  • Core Difference (On-policy vs. Off-policy):
    • SARSA (On-policy): The update uses the \(Q\)-value for the action \(a'\) that the policy *actually* took in the next state \(s'\).
    • Q-Learning (Off-policy): The update uses the maximum \(Q\)-value in the next state \(s'\), regardless of which action \(a'\) the current policy would have taken.

Selecting an Action

  • From Utility Values (\(V\)): Cannot directly select an action. Requires knowledge of the model (\(P\) and \(R\)) to choose \(\pi^*(s)\) (the action that maximizes the expected reward + discounted next state utility).
  • From Q Values (\(Q\)): Action selection is direct and model-free. The policy \(\pi(s)\) is chosen by taking the action with the highest Q-value in that state: \(\pi(s) = \arg\max_{a} Q(s, a)\).

Incorporating Exploration

  • To learn an optimal policy, the agent must explore the state-action space, not just exploit what it currently knows.
  • Exploration strategies:
    • \(\epsilon\)-greedy: Choose the action with the max \(Q\)-value (exploitation) with probability \(1-\epsilon\), and choose a random action (exploration) with probability \(\epsilon\).
    • Softmax/Boltzmann Exploration: Choose an action with probability proportional to \(e^{Q(s,a)/\tau}\), so higher-valued actions are more likely but every action keeps some probability of being tried.

Online Learning, Offline Learning, Experience Replay

  • Online Learning: Updates the model/policy immediately after each step/episode, using only the most recent experience.
  • Offline Learning: Updates the model/policy using an entire batch of collected experiences, which is often a fixed dataset not generated by the current policy.
  • Experience Replay: Stores past experiences \(\langle s, a, r, s' \rangle\) in a replay buffer. The agent then samples and reuses these past experiences (i.e., trains on them again) to perform updates, significantly increasing sample efficiency and breaking temporal correlations in the data.

🧐 Constraint Satisfaction Problems (CSPs)

Constraint Satisfaction Problems

  • Core Idea: Defined by a set of Variables, a Domain of possible values for each variable, and a set of Constraints that restrict the values that variables can take simultaneously.
  • Goal: Find an assignment of values to all variables such that all constraints are satisfied.

Historical Trivia and Key Examples

  • Waltz Line Labelling: A technique developed by David Waltz (1970s) to perform constraint propagation for interpreting 2D line drawings as 3D objects. It assigned labels to lines (e.g., convex, concave, boundary) and used constraints to eliminate inconsistent interpretations.
  • 4-Color Theorem: States that any plane map can be colored using no more than four colors such that no two adjacent regions have the same color. It was the first major theorem to be proved using a computer-assisted method (1976).
  • N-queens problem: Place \(N\) chess queens on an \(N \times N\) board such that no two queens attack each other (i.e., no two queens share the same row, column, or diagonal).
  • Map/Graph Coloring: A general form of the 4-color problem. Graph Coloring is the task of assigning colors to the vertices of a graph such that no two adjacent vertices have the same color.
  • Graph Coloring is NP-complete: The problem of determining the minimum number of colors needed (the chromatic number) is NP-complete, meaning there is no known polynomial-time algorithm for solving it exactly.

⛰️ Search Algorithms for CSPs

Hill-climbing

  • High-level idea: A local search algorithm that starts with a complete but potentially invalid assignment and iteratively moves from the current state to a better neighboring state (e.g., one that violates fewer constraints).
  • How it differs from backtracking search:
    • Hill-climbing: Operates on complete assignments, works by local improvement, and does not backtrack. It is susceptible to getting stuck in local optima.
    • Backtracking: Operates on partial assignments, performs a systematic depth-first search, and backtracks when it reaches an inconsistency. It guarantees a solution if one exists.

Backtracking Search (DFS)

  • Mechanism: A Depth-First Search (DFS) for CSPs. It assigns a value to one variable at a time. If an assignment violates a constraint, it backtracks to the previous variable and tries a different value.
  • Variable/Value Order: Variable assignments can be done in any order, but heuristics are crucial for efficiency. The search proceeds to a known depth (the number of variables).
  • Why does DFS work well? The search tree has a fixed, known depth (one level per variable), so DFS needs very little memory and systematically explores assignments without any risk of running down an infinitely deep path.
  • Why isn't looping a worry? In a CSP, the search is always to a fixed depth (number of variables), and once a variable is assigned a value, it is not re-visited within the same path unless backtracking occurs. The state is an assignment, and the depth is finite.

Heuristics for Variable and Value Selection

  • Variable Selection (which variable to assign next):
    • Most Constrained Variable (Minimum Remaining Values, MRV): Choose the variable with the fewest legal values remaining in its domain. This prunes the search tree earliest if failure is inevitable.
    • Most Constraining Variable: Choose the variable that is involved in the largest number of constraints with unassigned variables. This imposes more constraints earlier.
  • Value Selection (which value to try first):
    • Least Constraining Value: Choose the value that rules out the fewest choices for neighboring unassigned variables. This leaves more options open for future assignments.
  • Exploit any symmetries: If a problem has multiple equivalent solutions, adding a constraint to break the symmetry can significantly reduce the search space (e.g., forcing a queen in the N-queens problem to be on the left half of the board).

Forward Checking, Constraint Propagation

  • Forward Checking: After a variable \(X\) is assigned a value \(x\), this technique looks at all unassigned neighboring variables \(Y\) and deletes any value from \(Y\)'s domain that conflicts with \(X=x\). It stops there.
  • Constraint Propagation: A more general technique where the removal of a value from a variable's domain triggers a cascade of further domain reductions in neighboring variables. Arc Consistency is a powerful form of constraint propagation.

AC-3 Algorithm (Arc Consistency Algorithm)

  • An algorithm for establishing Arc Consistency in a CSP. An arc \(\left(X_i, X_j\right)\) is consistent if for every value \(x_i\) in the domain of \(X_i\), there exists some value \(x_j\) in the domain of \(X_j\) that satisfies the constraint \(\left(X_i, X_j\right)\).
  • Process: AC-3 maintains a queue of arcs. When the domain of a variable is reduced, all relevant arcs connected to it are added back to the queue to be checked again until no further domain reductions can be made.
  • Constraint propagation (like AC-3 or Forward Checking) is used after each variable assignment (and before the next assignment is chosen).
  • If propagation causes any variable's domain to become empty, the current partial assignment is inconsistent, and the algorithm immediately backtracks without exploring the sub-tree. This significantly prunes the search space.
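
A compact AC-3 sketch on an invented three-region map-coloring CSP; `revise` deletes values of \(X_i\) that have no consistent partner left in \(X_j\)'s domain:

```python
from collections import deque

# Tiny map-coloring CSP: three regions, all pairs adjacent (colors must differ).
domains = {
    "A": {"red"},                      # A is already assigned red
    "B": {"red", "green"},
    "C": {"red", "green", "blue"},
}
neighbors = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B"}}

def consistent(x, y):                  # binary constraint: adjacent regions differ
    return x != y

def revise(xi, xj):
    """Remove values of xi that have NO consistent partner in xj's domain."""
    removed = False
    for x in set(domains[xi]):
        if not any(consistent(x, y) for y in domains[xj]):
            domains[xi].discard(x)
            removed = True
    return removed

def ac3():
    queue = deque((xi, xj) for xi in domains for xj in neighbors[xi])
    while queue:
        xi, xj = queue.popleft()
        if revise(xi, xj):
            if not domains[xi]:
                return False                       # empty domain -> must backtrack
            for xk in neighbors[xi] - {xj}:        # recheck arcs pointing at xi
                queue.append((xk, xi))
    return True

print(ac3(), domains)   # A={red}, B={green}, C={blue}: propagation alone solved it
```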

Planning

Planning involves finding a sequence of actions to achieve a goal state from an initial state.

1. Historical Figures & Representation

  • STRIPS规划器 (Stanford Research Institute Problem Solver): The classic planning language.
    • Representation: States are sets of logical facts (e.g., At(Home)). Actions have Preconditions (what must be true to start) and Effects (what becomes true/false after).
    • Tip: Remember that in STRIPS, the "World" is only what is explicitly stated; anything not mentioned is assumed false (Closed World Assumption).
  • Roger Schank(罗杰·尚克): Known for Scripts and Case-Based Reasoning. In planning, this refers to using memory of specific past scenarios (e.g., a "Restaurant Script") to guide future actions rather than planning from scratch every time.

2. Approaches to Planning

  • Sussman Anomaly 萨斯曼异常: A famous problem in the "Blocks World" domain. 层级式思维失败
    • The Problem: Solving Goal A undoes Goal B, and solving Goal B undoes Goal A.
    • Significance: It proved that Linear Planners 线性规划器 (solving sub-goals one by one, in sequence) cannot solve all problems. You need interleaved planning.
  • Situation Space vs. Plan Space:
    • Situation (State) Space: Searching through "states" of the world (e.g., A is on B). Nodes are world states; edges are actions.
    • Plan Space: Searching through "plans." Nodes are partial plans; edges are operations like "add a step" or "order these steps."
  • Partial Order Planning (POP):
    • The Idea: Don't force an order on steps unless necessary. The plan is a graph of actions, not a straight line.
    • Plan Modification Operations: You can add a step to achieve an open precondition or add a causal link.
    • Defects:
      1. Open Precondition: A step needs X to be true, but nothing supplies it. Fix: Add a step that provides X.
      2. Threats: Step A establishes X for Step B, but Step C deletes X. Fix: Demotion (put C before A) or Promotion (put C after B).

3. Objects & Environment

  • Fluents: Functions or predicates that change over time (e.g., FuelLevel(car)).
  • Count vs. Mass Nouns: Count is discrete (3 apples); Mass is continuous (3.5 liters of water). Mass nouns usually require different logic.
  • Contingent Planning 应急计划: Planning where steps involve sensing the world. You create branches: "If the sensor says A, do X; if B, do Y."
  • Truth Maintenance Systems (TMS): A method to track why a fact is believed. If a premise (assumption) is removed, the TMS automatically retracts the conclusions derived from it.

4. Algorithms

  • SATplan 可满足: Translates a planning problem into Boolean Logic (propositional logic). If the formula is Satisfiable (SAT), the variable assignments represent the plan.
  • GraphPlan: Constructs a Planning Graph containing levels of propositions and actions. It looks for "mutexes" (mutually exclusive actions). If the goal literals appear in the final level without being mutex, a plan might exist.

Games and Adversarial Search

This focuses on Adversarial Search 对抗搜索 (Agent vs. Opponent).

1. Game Tree Basics

  • Ply vs. Move:
    • Ply: One turn taken by one player.
    • Move: Usually refers to a full cycle (one turn by White, one turn by Black). A move = 2 plies.
  • Zero-sum: One player's gain is exactly the other player's loss (e.g., Chess, +1/-1). 一方的收益恰好等于另一方的损失
  • Stochastic 随机: Includes an element of chance (e.g., rolling dice in Backgammon).
  • Fully Observable: You know the entire state of the world (Chess). Contrast with Poker (Partially Observable).

2. Minimax & Alpha-Beta

  • Minimax 极大极小算法: A recursive algorithm where MAX tries to maximize the score and MIN tries to minimize it. It uses Depth-First Search (DFS).
  • Alpha-Beta Pruning 修剪:
    • Concept: "If I find a move that is clearly worse than a move I have already examined, I stop looking at it."
    • \(\alpha\) (Alpha): The best value MAX has found so far (MAX wants to increase this lower bound).
    • \(\beta\) (Beta): The best value MIN has found so far (MIN wants to decrease this upper bound).
    • Pruning Condition: If \(\alpha \geq \beta\), prune.
    • Move Ordering: Crucial! If you check the best moves first, you prune much more. In the best case, complexity drops from \(O(b^d)\) to \(O(b^{d/2})\).
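
A minimax-with-alpha-beta sketch on a hand-built game tree (a classic three-branch example with invented leaf values); pruning fires when \(\alpha \geq \beta\):

```python
import math

# A tiny hand-built game tree: internal nodes are lists of children,
# leaves are numeric evaluations (from MAX's point of view).
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]

def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf):
    if not isinstance(node, list):           # leaf: return its evaluation
        return node
    if maximizing:                           # MAX raises alpha (its lower bound)
        best = -math.inf
        for child in node:
            best = max(best, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, best)
            if alpha >= beta:                # MIN above will never allow this branch
                break                        # -> prune the remaining children
        return best
    else:                                    # MIN lowers beta (its upper bound)
        best = math.inf
        for child in node:
            best = min(best, alphabeta(child, True, alpha, beta))
            beta = min(beta, best)
            if alpha >= beta:
                break
        return best

# Root is a MAX node whose children are MIN nodes over the leaves.
print(alphabeta(tree, True))   # 3: the second branch is pruned after its first leaf (2 <= alpha = 3)
```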

3. Optimizations (The "Horizon" Issues)

  • Horizon Effect 视界效应: The search depth is fixed (e.g., 5 plies). A disaster might happen at ply 6, but the computer thinks ply 5 is "safe" because it pushed the disaster just over the horizon.
  • Quiescence Search 平静期搜索: Do not stop the search (cutoff) if the board is "unstable" (e.g., in the middle of a piece exchange). Continue searching until the situation quiets down.
  • Singular Extension: If one move is vastly better than all others, search that specific move deeper than the standard limit.

4. Recent Algorithms

  • Monte Carlo Tree Search 蒙特卡洛树搜索(可以随机化) (MCTS): Used in AlphaGo. Instead of looking at every branch, it plays out random simulations (rollouts) to the end of the game to estimate how good a state is. It balances Exploration (trying new moves) vs. Exploitation (focusing on good moves).

Bayes Nets

This deals with Probabilistic Reasoning and modeling uncertainty.

1. Structure

  • DAG 有向无环图: A Bayes Net must be a Directed Acyclic Graph. 对于图中的每个节点,我们用其父节点的概率来表示它的概率。
  • Nodes: Random variables. Edges: Direct influence.
  • The Table: Each node \(X\) has a Conditional Probability Table 条件概率表 (CPT) representing \(P(X \mid Parents(X))\). 给定其父节点,每个节点与其祖先节点都是条件独立的。也就是说,祖先节点只能通过影响父节点 A 来影响节点 A。
  • Our conditional independence condition is: For each k, \(P(X_k|X_1,...,X_{k−1})=P(X_k|parents(X_k))\)

2. Independence & Geometry

  • Local Semantics: A node is conditionally independent of its non-descendants, given its parents.
  • Explaining Away 选择解释偏差 (V-Structure): This is a specific structure: \(A \to C \leftarrow B\).
    • If \(C\) (Effect) is observed, \(A\) and \(B\) (Causes) become dependent.
    • Example: Grass is wet (\(C\)). Causes: Rain (\(A\)) or Sprinkler (\(B\)). If I confirm it Rained (\(A\)), the probability that the Sprinkler was on (\(B\)) goes down. The rain "explained away" the wet grass.

3. Size & Construction

  • Size Efficiency (Compact 紧凑): A full Joint Probability Distribution table requires \(2^n\) entries (exponential). A Bayes Net is much smaller because it exploits independence: at most \(n2^m\) probabilities, where \(m\) is the maximum number of parents of any node in the Bayes net.

  • Reconstructing Joint 重建联合分布: You can recover the full joint probability by multiplying all the CPTs:

    \[P(x_1, ... x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(x_i))\]
  • Building a Net: The order matters. You should choose "Root causes" first, then effects. If you build it backwards (Effects \(\to\) Causes), you will add many unnecessary edges, thereby losing the model's compactness.

  • Topological Sort: A linear ordering of nodes such that for every edge \(u \to v\), \(u\) comes before \(v\). Used to process the net in order.
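
A small sketch of recovering joint probabilities from CPTs for the Rain/Sprinkler/WetGrass V-structure above; the CPT numbers are invented for illustration:

```python
# Bayes net:  Rain -> WetGrass <- Sprinkler  (the "explaining away" V-structure).
# CPT values below are invented for illustration.
P_rain = {True: 0.2, False: 0.8}                         # P(Rain)
P_sprinkler = {True: 0.3, False: 0.7}                    # P(Sprinkler)
P_wet_given = {                                          # P(WetGrass=True | Rain, Sprinkler)
    (True,  True):  0.99,
    (True,  False): 0.90,
    (False, True):  0.85,
    (False, False): 0.01,
}

def joint(rain, sprinkler, wet):
    """P(R, S, W) = P(R) * P(S) * P(W | R, S): the product of the CPTs."""
    p_wet_true = P_wet_given[(rain, sprinkler)]
    p_wet = p_wet_true if wet else 1.0 - p_wet_true
    return P_rain[rain] * P_sprinkler[sprinkler] * p_wet

# Full joint recovered from 3 small tables instead of one 2^3-entry table.
total = sum(joint(r, s, w) for r in (True, False)
                            for s in (True, False)
                            for w in (True, False))
print(joint(True, False, True))   # P(Rain, no Sprinkler, WetGrass) = 0.2 * 0.7 * 0.90
print(round(total, 10))           # sums to 1.0, as a joint distribution must
```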

4. Inference

  • Polytrees 多重树: A graph with no undirected cycles (essentially a tree structure). Inference here is efficient (linear time).
    • Exact inference can be performed in time linear to the number of nodes. Because there are no loops, we can use message-passing algorithms 消息传递算法 that are very efficient compared to the general NP-hard case.
  • General Case: Inference in a general Bayes Net (multiply connected) is NP-Complete.
    • A problem X is NP-complete if X is itself in NP and a polynomial-time algorithm for X would allow us to solve all problems in NP in polynomial time.
    • NP (non-deterministic polynomial time)
  • Junction Tree Algorithm: A method to turn a loopy graph into a tree (by clustering nodes) so we can run efficient inference, though the clusters might become very large.

Readings

Quiz 1

An algorithm for suffix stripping 后缀剥离

The main goal of the paper is to introduce a simple, fast, and effective algorithm for suffix stripping (also known as stemming). This process removes common endings from words to get to their root form, or "stem."

Why is Stemming Important? 🤔
  • The primary use is in Information Retrieval (IR), like search engines.
  • It groups related words together. For example, "connect," "connected," "connecting," and "connection" all get reduced to the single stem "connect."
  • This improves search results because a search for "connecting" will also find documents that only mention "connection."
  • It also reduces the total number of unique words a system has to store, making the system smaller and more efficient.
How the Algorithm Works ⚙️

The algorithm's cleverness lies in its simplicity. It doesn't use a dictionary. Instead, it uses a set of rules based on the structure of the word itself.

The "Measure" of a Word

The core concept is the measure (m) of a stem, which roughly corresponds to the number of vowel-consonant sequences it contains.

A word is first broken down into a sequence of vowel groups (V) and consonant groups (C). The form of any word can be represented as:

\([C](VC)^m[V]\)
  • C: one or more consonants
  • V: one or more vowels
  • m: the measure, or how many times the (VC) group repeats.

Here are some examples:

  • TR, EE, TREE → m = 0
  • TROUBLE, OATS → m = 1
  • TROUBLES, PRIVATE → m = 2

This measure (m) is used to prevent the algorithm from stripping suffixes from stems that are already very short. For example, it will remove -ATE from "ACTIVATE" (stem ACTIV has m > 1) but not from "RELATE" (stem REL has m = 1).
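
A toy sketch of computing the measure \(m\) by collapsing a stem into consonant/vowel groups (simplified: "y" is just treated as a consonant here, unlike the real algorithm):

```python
def measure(stem):
    """Porter 'measure' m: the number of (vowel group, consonant group) alternations,
    i.e. the m in [C](VC)^m[V]. Toy version: 'y' is treated as a plain consonant."""
    vowels = set("aeiou")
    groups = ""                      # collapse the word into group labels, e.g. "troubles" -> "CVCVC"
    for ch in stem.lower():
        label = "V" if ch in vowels else "C"
        if not groups or groups[-1] != label:
            groups += label
    return groups.count("VC")

for w in ["tr", "ee", "tree", "trouble", "oats", "troubles", "private", "activ", "rel"]:
    print(w, measure(w))
# tr/ee/tree -> 0, trouble/oats -> 1, troubles/private -> 2, activ -> 2, rel -> 1
```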

The 5 Steps of the Algorithm

The algorithm removes suffixes sequentially in five steps. A word goes through each step in order.

  • Step 1: Deals with plurals and past participles. It changes suffixes like -SSES to -SS (e.g., caresses → caress) and -IES to -I (e.g., ponies → poni), and removes -ED or -ING if the stem meets certain conditions.

  • Step 2: Handles other common suffixes. If the stem's measure (m) is greater than 0, it will change suffixes like -ATIONAL to -ATE (e.g., relational → relate) or -IZATION to -IZE (e.g., vietnamization → vietnamize).

  • Step 3: Continues with another set of suffixes for stems where m > 0. It changes -ICAL to -IC (e.g., electrical → electric) or removes -NESS (e.g., goodness → good).

  • Step 4: Removes a final set of suffixes like -ANT, -ENCE, -ER, and -IVE, but only if the stem is long enough (m > 1). This is the step that would turn GENERALIZE into GENERAL.

  • Step 5: This is a final cleanup step. It removes a trailing -E if the measure is large enough (e.g., probate → probat) and reduces a double L at the end of a word (e.g., controll → control).

Key Takeaways ✅
  • It's Pragmatic, Not Perfect: The algorithm is designed for IR performance, not perfect linguistics. It will make errors, such as conflating "WAND" and "WANDER," but these errors are acceptable because the overall benefit is positive.
  • Simple is Better: Despite its simplicity, it performed slightly better in tests than a much more complex stemming system.
  • Effective: In a test with 10,000 words, the algorithm reduced the number of unique stems to 6,370, a reduction of about one-third.

The Psychological Functions of Function Words

The central argument of the paper is that function words—small, common words like pronouns (I, you, we), prepositions (to, for), and articles (a, the)—are powerful indicators of our psychological state, personality, and social dynamics. Because we use these words unconsciously, they provide an unfiltered look into how we think and relate to others.

Function Words vs. Content Words
  • Content words are nouns and regular verbs that carry the primary meaning or topic of what we're saying (e.g., "family," "health," "money").
  • Function words are the "cement" that holds language together4. They don't have much meaning on their own but show how we are expressing ourselves.
  • Though there are fewer than 400 function words in English, they account for over 50% of the words we use in daily life5.
  • We have very little conscious control over or memory of using them, which makes them a great tool for psychological analysis.
Key Findings from Word Analysis 🧑‍🔬

The authors used a computer program called LIWC (Linguistic Inquiry and Word Count) to analyze millions of words from blogs, speeches, emails, and experiments. Here are their most important findings:

  1. Depression and Self-Focus
  • A higher frequency of first-person singular pronouns ("I," "me," "my") is strongly linked to depression and negative emotions. In fact, pronoun use is a better marker of depression than the use of negative emotion words like "sad" or "angry".
  • Poets who committed suicide used "I" at significantly higher rates than non-suicidal poets, suggesting greater self-focus and less social integration.
  2. Reactions to Stress
  • Individual Stress: When Mayor Rudy Giuliani faced personal crises (divorce, a cancer diagnosis), his use of "I" words shot up from about 2% to over 7%. This shows that personal distress often leads to an intense focus on the self.
  • Shared Stress: After the 9/11 attacks, the opposite happened. People's use of "I" words dropped, while their use of first-person plural pronouns ("we," "us") increased. This suggests that a shared tragedy causes people to focus less on themselves and more on their community and social connections.
  3. Honesty and Deception
  • When people are telling the truth, they tend to use more first-person singular pronouns ("I") and more exclusive words (e.g., "but," "except," "without").
  • This suggests that truthful accounts are more personal and cognitively complex, while lies are simpler and more detached.
  4. Social Status
  • In a conversation, the person with higher status consistently uses fewer "I" words.
  • For example, President Nixon used "I" far less when speaking to his subordinates John Dean and John Ehrlichman than they did when speaking to him. The lower-status person focuses on their own perspective, while the higher-status person focuses on the broader picture.
  5. Demographics (Sex and Age)
  • Sex: Females tend to use "I" words more than males. Males use more articles ("a," "the") and nouns, which is associated with concrete, categorical thinking. Females use more verbs, which reflects a more relational focus.
  • Age: As people get older, they use "I" less and "we" more. They also use more future-tense verbs and fewer past-tense verbs, suggesting their focus shifts over the lifespan.
  6. Culture
  • Counterintuitively, translated Japanese texts used "I" more than American texts. The authors suggest this may be because collectivist values (like harmony and self-criticism) require a high degree of self-focus.
  • American texts used more articles ("a," "the"), which supports the idea that Western thought is more categorical.
Conclusion: Words as Reflections

The way we use function words is a reflection of our underlying psychological state, not a cause of it. The authors tried to change people's feelings by forcing them to use different pronouns in experiments, but it didn't work. This means you can't just say "we" more to feel more connected; rather, feeling connected causes you to naturally say "we" more.

In short, these tiny "junk words" are a window into our minds, revealing everything from our emotional state and social status to how honest we're being.

Quiz 2

"Banned In Germany, Kids' Doll Is Labeled An 'Espionage Device'" (NPR)ew

This article discusses the "My Friend Cayla" doll, which was banned by German regulators. The key issues were:

  • Espionage Concerns (间谍): The doll was classified as an "illegal espionage apparatus" because its insecure Bluetooth connection allowed anyone within range to potentially listen in on and even speak to the child playing with it.
  • Data Privacy: The doll recorded children's conversations and sent them to a U.S.-based company that specializes in voice recognition. This was done without clear disclosure or parental consent, raising significant privacy concerns.
  • Hidden Marketing: The doll was pre-programmed with phrases that endorsed Disney products, essentially acting as a hidden marketing tool aimed at children.
  • Lack of Security: The fundamental design of the toy lacked basic security measures, making it vulnerable to hacking and unauthorized access.

"‘I love you too!’ My family’s creepy, unsettling week with an AI toy" (The Guardian)

This article provides a more recent, personal perspective on living with a modern AI-powered toy. The important points include:

  • Emotional Attachment: The child in the article quickly formed a strong emotional bond with the AI toy, "Grem," telling it "I love you" and treating it as a constant companion. The toy's reciprocal and effusive declarations of love were unsettling for the parents.
  • Constant Surveillance: The toy is "always listening" unless manually turned off, creating a sense of constant surveillance within the home and raising privacy concerns for the family.
  • Influence on Behavior: The article touches on how the AI toy's personality—designed to be agreeable and complimentary—could potentially shape a child's expectations and interactions in the real world.
  • The Digital Divide and Developmental Impact: The author and other experts raise broader questions about the impact of such toys on child development, social skills, and the disparity between children who have access to these "educational" AI companions and those who do not.

In summary, both articles highlight the potential dangers and ethical dilemmas presented by internet-connected and AI-powered toys, ranging from concrete security and privacy violations to more subtle concerns about their impact on children's emotional development and family life.

Quiz 3

Summary of "Automatic Labeling of Semantic Roles"

This paper presents a statistical system for automatically identifying and labeling semantic roles of sentence constituents within a semantic frame. 🧠

Key Points:

  • The system is trained on the FrameNet database, which contains roughly 50,000 hand-labeled sentences.
  • The task is broken down into two subtasks: identifying the boundaries of frame elements and then labeling them with the correct semantic role.
  • The system uses various features derived from parse trees, including phrase type, grammatical function, position, voice, and head word.
  • The final system achieved an accuracy of 80.4% on the development set and 76.9% on the test set, a significant improvement over the 40.9% baseline.
  • Lexical statistics based on head words were found to be the most important feature. The distribution P(r|h,t) — the probability of role r given the head word h and the target (predicate) word t — was correct in 86.7% of the cases where data was available, which covered 56.0% of the data (a count-based version of this estimate is sketched after this list).
  • When automatically identifying frame element boundaries, the system correctly identified constituents with exactly matching boundaries 66% of the time.
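The head-word statistic above can be estimated by simple relative-frequency counting over the hand-labeled training data; a minimal sketch (with hypothetical toy data, not FrameNet):

```python
from collections import Counter, defaultdict

def estimate_role_distribution(training_examples):
    """Maximum-likelihood estimate of P(role | head word, target) from
    (role, head_word, target_word) tuples taken from hand-labeled sentences."""
    counts = defaultdict(Counter)              # (head, target) -> Counter over roles
    for role, head, target in training_examples:
        counts[(head, target)][role] += 1

    def p_role(role, head, target):
        c = counts.get((head, target))
        if not c:                              # no data for this (h, t) pair;
            return None                        # a real system backs off to other features
        return c[role] / sum(c.values())

    return p_role

# Toy usage:
p = estimate_role_distribution([
    ("Speaker", "she", "say"), ("Speaker", "she", "say"), ("Message", "that", "say"),
])
print(p("Speaker", "she", "say"))  # 1.0
```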

Summary of "Open source AI models show gender bias in hiring"

A recent study reveals that open-source AI models exhibit gender bias in job recommendations, particularly for high-paying positions. This highlights the ongoing challenge of bias in AI as it becomes more integrated into hiring processes. 🤖

Key Points:

  • Researchers analyzed several mid-sized open-source large language models (LLMs) for gender bias using a dataset of 332,044 real job ads.
  • There was significant variation in callback recommendations across different models, with female callback rates ranging from as low as 1.4% for Ministral to as high as 87.3% for Gemma.
  • The most balanced model was Llama-3.1, with a female callback rate of 41%, and it was also the most likely to refuse to consider gender, doing so in 6% of cases.
  • When models were adjusted for callback parity, jobs with female callbacks tended to have lower pay. The wage gap varied from approximately 9 log points for Granite and Llama-3.1 to about 84 log points for Ministral.
  • Interestingly, the Llama-3 model showed a wage premium for women of about 15 log points.
  • Simulating historical figures as "personas" influenced the models' outputs. For example, using women's rights advocates like Mary Wollstonecraft and Margaret Sanger resulted in female callback rates exceeding 95%.

Quiz 4

📄 Paper 1: "Recognition Using Visual Phrases"

This paper argues that detecting complex visual scenes is often more effective when you look for "visual phrases" as a whole, rather than identifying individual objects and then trying to figure out their relationships.

Main Idea 💡
  • It's often easier and more accurate to detect a complex concept like "person riding a horse" as a single unit, rather than detecting a "person" and a "horse" separately and then inferring the "riding" action.
  • Why this works: When objects interact, their appearance and poses become much more predictable and less varied, which reduces the visual complexity the detector has to handle. For example, there's a limited number of ways a person can sit on a horse.
Research Method 🔬
  • New Dataset: The authors created the first "phrasal recognition dataset," containing 17 visual phrases (e.g., "dog lying on sofa") and 8 object classes across 2,769 images (822 of them negative), with on average 120 images per category. In total there are 5,067 bounding boxes in this dataset (1,796 for visual phrases + 3,271 for 1,747 objects).
  • Baseline for Comparison: They compared their phrase detectors against a strong baseline, which used state-of-the-art object detectors for individual components and then tried to model the relationship between them.
  • Decoding Algorithm: They developed a "decoding" process that takes all the raw detections from both object and phrase detectors and intelligently decides on the final output, using context to remove incorrect detections and add correct ones.
Key Facts & Numbers 📊
  • The visual phrase detectors performed exceptionally well, even though they were trained on very little data—as few as 50 positive images. This is in contrast to the object detectors, which were trained on thousands of examples.
  • The performance gains were significant. For example, the "horse and rider jumping" detector achieved an Average Precision (AP) of 0.870, a massive improvement over the baseline's 0.0359.
  • The "person riding bicycle" detector's AP was 0.669, beating the baseline's 0.188 by a large margin.
  • Including visual phrase detectors in their decoding algorithm also improved the detection of individual objects. For example, the AP score for detecting a "sofa" increased from 0.204 to 0.260 when the model also considered phrases like "dog lying on sofa".

The paper's stated contributions are: 1) introducing visual phrases as categories for recognition; 2) introducing a novel dataset for phrasal recognition; 3) showing that considering visual phrases provides a significant gain over state-of-the-art object detectors coupled with state-of-the-art methods of modeling interactions; 4) introducing a decoding algorithm that takes into account specific properties of interacting objects at multiple levels of abstraction; 5) producing state-of-the-art performance in multi-class object recognition.

Related Works

  • Object Recognition
  • Object Interactions
  • Scene Understanding
  • Machine Translation

The decoding weights are learned with max-margin structure learning.

In this paper, we introduce visual phrases, show significant gains in considering them, introduce the phrasal recognition dataset, and a decoding algorithm that outperforms state of the art methods. Building long enough phrase tables is still a challenge. The dimensionality of our features grows with the number of categories. However, there is no need to consider all of the categories when we model the interactions. For this reason, one might only consider a fixed number of categories for each bounding box. We speculate that the relations between attributes and objects, parts and objects, visual phrases and scenes, and objects and visual phrases mirror one another. Future work will investigate systems to decode complete sets of detections covering the semantic spectrum.

📄 Papers 2 & 3: AI Usage in Scientific Publishing

These articles discuss how researchers are detecting the increasing use of AI chatbots (like ChatGPT) to write scientific papers and the ethical problems this creates.

Main Problem 😟
  • There's growing evidence that scientists are using AI to write parts of their research papers.
  • This is a major concern for scientific integrity because AI models can "hallucinate"—that is, they can make up facts, figures, and even fake citations for research that doesn't exist.
Research Method (How AI use is detected) 🔎
  • The primary method is to track the frequency of certain words that AI tends to overuse compared to human writers. These are sometimes called "AI shibboleths".
  • Key indicator words include "meticulous," "commendable," "intricate," "pivotal," "realm," "showcasing," and "delve".
Key Facts & Numbers 📈
  • How many papers?

    • One study estimated that at least 1% of all scientific papers published in 2023 (around 60,000 total) showed signs of AI involvement.
    • Another study from Stanford estimated the number could be as high as 17.5% in certain fields like computer science.
  • Word Usage Spikes:

    • After 2022, the use of "commendable" increased by 83%, "intricate" by 117%, and "meticulously" by 137%.
    • The word "pivotal" saw its usage jump by nearly 160%.
    • Use of the word "delve" in PubMed abstracts skyrocketed by 654% between 2020 and 2024.

  • Field Differences: Computer science and electrical engineering were the fields with the highest rates of AI-preferred language.
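For concreteness, these spikes are relative increases in word frequency, \(\text{increase} = f_{\text{post}}/f_{\text{pre}} - 1\); a 117% increase for "intricate" therefore means its post-LLM rate is roughly \(2.17\times\) its 2019–2022 rate.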

Other Important Points
  • Why are scientists doing this? The intense "publish or perish" culture in academia incentivizes speed and high volume, which AI helps with. It's also used by non-native English speakers to improve their writing.
  • Ethical Concerns: Using AI to write papers is considered scientific misconduct by some publishers. Experts warn that it threatens the integrity and independence of science.

AI Chatbots Have Thoroughly Infiltrated Scientific Publishing

  • The analysis identified a few key words and phrases (such as "complex and multifaceted") that tend to appear more often in AI-generated sentences than in typical human writing.
  • It draws on a database that tracks more than 140 million papers worldwide.
  • At least 60,000 papers—slightly more than 1 percent of all scientific articles published globally last year—may have used an LLM.
  • Up to 17.5 percent of recent computer science papers exhibit signs of AI writing.
  • Other section headings from the article: "AI buzzwords," "Gray's approach," and "More than mere words."

Some scientists can't stop using AI to write research papers

  • One study, published in March by Andrew Gray of University College London in the UK, suggests at least one percent – 60,000 or more – of all papers published in 2023 were written at least partially by AI.
  • A second paper, published in April by a Stanford University team in the US, claims this figure might range between 6.3 and 17.5 percent, depending on the topic.
  • Both papers looked for certain words that large language models (LLMs) use habitually, such as "intricate," "pivotal," and "meticulously."
  • The prevalence of these adjectives and adverbs in scientific literature hardly changed between 2019 and 2022, but in 2023 (termed the post-LLM year by Gray) use of "meticulous," "commendable," and "intricate" rose by 59, 83, and 117 percent respectively. The word with the single biggest increase post-2022 was "meticulously," up 137 percent.
  • The Stanford team also noted that authors posting more preprints, working in more crowded fields, and writing shorter papers seem to use AI more frequently. Their paper suggests that a general lack of time and a need to write as much as possible encourage the use of LLMs, which can help increase output.
  • The Stanford researchers also raised similar concerns, writing that use of generative AI in scientific literature could create "risks to the security and independence of scientific practice."

Quiz 5

📄"Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings"

This paper demonstrates that standard word embeddings (vector representations of words) trained on large text corpora, such as Google News, contain and can even amplify human-like gender biases. 放大性别偏见

  • Core Problem: The geometry of word embeddings captures gender stereotypes. For instance, while the model correctly solves the analogy "man is to king as woman is to queen," it also produces biased analogies like "man is to computer programmer as woman is to homemaker" and "father is to doctor as mother is to nurse".
  • Key Finding: The paper finds that gender bias is not random but is captured by a specific direction (a subspace) within the high-dimensional vector space. This "gender direction" can be identified by taking the first principal component of the difference vectors from several gender-defining word pairs (e.g., \(\vec{she} - \vec{he}\), \(\vec{woman} - \vec{man}\)).
  • Proposed Solution: The authors present an algorithm to "debias" the embeddings.
    1. Neutralize: For gender-neutral words (like doctor or receptionist), the algorithm removes their projection onto the gender direction, making them equidistant from both male and female terms. 中性化中性词
    2. Equalize: For gender-specific words (like grandmother and grandfather), the algorithm ensures they are equidistant to all neutral words (e.g., babysit is no longer closer to grandmother than grandfather), while preserving their definitional gender component. 均衡化带有性别指向的词语
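A minimal NumPy sketch of these two steps (simplified: the paper's Equalize also renormalizes vectors to the unit sphere, and the gender direction g is usually the first principal component of several pair-difference vectors rather than a single she − he difference):

```python
import numpy as np

def neutralize(w, g):
    """Remove a word vector's component along the (unit-normalized) gender direction g."""
    g = g / np.linalg.norm(g)
    return w - np.dot(w, g) * g

def equalize(w_a, w_b, g):
    """Simplified Equalize: give a definitional pair (e.g. grandmother/grandfather)
    the same gender-neutral part and equal-but-opposite components along g,
    so both end up equidistant from every neutralized word."""
    g = g / np.linalg.norm(g)
    mu = (w_a + w_b) / 2
    mu_orth = mu - np.dot(mu, g) * g            # shared part orthogonal to g
    nu = (np.dot(w_a, g) - np.dot(w_b, g)) / 2  # signed half-difference along g
    return mu_orth + nu * g, mu_orth - nu * g
```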

📊 Important Facts & Statistics:

  • Bias in Occupations: When projecting occupation words onto the "she-he" gender direction, the most "she" associated words included: homemaker, nurse, receptionist, librarian, and socialite. The most "he" associated words included: maestro, skipper, protege, philosopher, and captain.
  • Stereotypical Analogies: In an evaluation of analogies automatically generated from the original she-he vector, crowd-workers judged 19% of the top 150 analogies to be gender-stereotypical.
  • Debiasing Efficacy 去偏倚效果: After applying the paper's hard debiasing algorithm, the percentage of stereotypical analogies generated dropped from 19% to 6%.
  • Example of Debiasing: In the analogy puzzle "he is to doctor as she is to X," the original embedding returned X = nurse. The hard-debiased embedding returned X = physician.
  • Preserving Utility: The debiasing algorithm successfully reduced bias without degrading the embedding's performance on standard evaluation benchmarks (like word similarity and analogy solving), showing that utility was preserved.

📄 Summary 2: "Context Encoders: Feature Learning by Inpainting"

This paper proposes an unsupervised method for learning visual features. The model, called a "Context Encoder," is a convolutional neural network (CNN) trained to perform image inpainting—that is, to generate the content of a missing image region based only on its surrounding context.

  • Core Problem: How to train a deep neural network to understand images without using millions of human-provided labels (i.e., unsupervised learning).
  • Key Finding: By training a network on the "pretext task" of inpainting, the model is forced to learn a semantic understanding of the visual world. To successfully fill in a missing window on a building, for example, the network must first understand the concept of "building" and "window" from the context. This is analogous to how word2vec learns word meanings by predicting a word from its context.
  • Proposed Solution: The model uses an encoder-decoder architecture.
    1. Encoder: A CNN that processes the image with the missing region and creates a compact feature representation of the context.
    2. Decoder: Uses this feature representation to generate the pixels for the missing patch.
  • Key Technique (Joint Loss): The paper finds that a simple pixel-wise (L2) loss function produces blurry, "averaged" images. To fix this, they add an adversarial loss (from Generative Adversarial Networks, or GANs). This second loss function trains a "discriminator" to judge if the generated patch looks real or fake, forcing the encoder to produce sharp, plausible images. The final model is trained with a joint loss (\(\lambda_{rec}\mathcal{L}_{rec} + \lambda_{adv}\mathcal{L}_{adv}\)).
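A short PyTorch-style sketch of that joint generator loss (illustrative only: `disc` is a hypothetical discriminator module, and the λ weights shown are example values, not necessarily the paper's):

```python
import torch
import torch.nn.functional as F

def context_encoder_joint_loss(pred_patch, real_patch, disc, lam_rec=0.999, lam_adv=0.001):
    """Sketch of L = lam_rec * L_rec + lam_adv * L_adv for the inpainting generator."""
    # Pixel-wise reconstruction loss (the paper uses an L2-style loss on the missing region).
    l_rec = F.mse_loss(pred_patch, real_patch)
    # Adversarial loss: the generator tries to make the discriminator call its patch "real".
    logits_fake = disc(pred_patch)
    l_adv = F.binary_cross_entropy_with_logits(logits_fake, torch.ones_like(logits_fake))
    return lam_rec * l_rec + lam_adv * l_adv
```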

📊 Important Facts & Statistics:

  • Inpainting Quality (Table 1): The "Context Encoder" model was quantitatively evaluated for its inpainting ability on the Paris StreetView dataset.
    • The model trained with the joint (L2 + Adversarial) loss achieved a Mean L2 Loss of 2.35%.
    • This was significantly better than traditional nearest-neighbor inpainting using HOG features (Mean L2 Loss of 6.92%).
  • Masking Strategy: The features learned were not general when the same central region was masked every time. The best-performing models were trained using random block or arbitrary random region masks, which prevented the network from "cheating" by just learning the mask boundary.
  • Feature Learning (Table 2): The primary goal was to learn features for other tasks. The authors pre-trained their encoder on unlabeled images and then fine-tuned it for classification, detection, and segmentation.
    • Classification: On PASCAL VOC 2007, the Context Encoder pre-training achieved 56.5% mAP, significantly outperforming training from scratch ("Random Gaussian," 53.3%).
    • Detection: On PASCAL VOC 2007, the model achieved 44.5% mAP (vs. 43.4% from scratch).
    • Segmentation: On PASCAL VOC 2012, the model achieved 29.7% mean IoU, a large improvement over training from scratch (19.8%) and a standard Autoencoder (25.2%).

Quiz 6

State of What Art? A Call for Multi-Prompt LLM Evaluation

| Category 🧠 | Key Facts and Statistics 📊 |
| --- | --- |
| Scale of Study | The evaluation analyzed over 6.5 million instances. It involved 20 different LLMs and 39 tasks across three benchmarks (LMENTRY, BIG-bench Lite (BBL), and BIG-bench Hard (BBH)). |
| Performance Variation | The performance and relative ranking of models vary widely with prompt phrasing. |
| Ranking Instability | Kendall's W (measuring ranking correlation across prompts) was generally smaller than 0.85 for most tasks, indicating weak to moderate agreement in model rankings. |
| Divergence | The performance on the original, single instruction often deviated by more than one standard deviation from the average performance across all paraphrases (a phenomenon called 'divergence'). |
| Minimal Edits | Even minor phrasing changes (e.g., substituting 'excludes' with 'lacks') resulted in large and often opposite performance changes across different models. |
| Proposed Metrics | The paper proposes distinct metrics for different use cases: |
| MaxP (Maximum Performance) | The maximum score a model achieves across all instruction templates (\(\max_{i\in I_{T}}\epsilon(M,T,i)\)). This is useful for downstream application developers looking to find the single best prompt. |
| AvgP (Average Performance) | The mean score across all templates. This is useful for LLM developers to measure robustness to instruction paraphrasing. |
| CPS (Combined Performance Score) | \(CPS = Sat \cdot MaxP\). Combines MaxP and Saturation (\(Sat = 1 - (MaxP - AvgP)\)) to measure both peak capability and robustness. Useful for multi-functional systems. |
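A small Python sketch of how these metrics combine, given a hypothetical list of one model's per-template scores on one task:

```python
def prompt_robustness_metrics(scores):
    """Compute MaxP, AvgP, Saturation, and CPS from per-instruction-template scores."""
    max_p = max(scores)                 # MaxP: best single prompt
    avg_p = sum(scores) / len(scores)   # AvgP: mean over all paraphrases
    sat = 1 - (max_p - avg_p)           # Saturation: how close the average is to the peak
    cps = sat * max_p                   # Combined Performance Score
    return {"MaxP": max_p, "AvgP": avg_p, "Sat": sat, "CPS": cps}

print(prompt_robustness_metrics([0.62, 0.55, 0.70, 0.41]))  # toy scores for illustration
```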

SceneGen: Learning to Generate Realistic Traffic Scenes

| Category 🛠️ | Key Facts and Methodology 💡 |
| --- | --- |
| Core Idea | SceneGen is a neural autoregressive model that eschews hand-crafted rules and heuristics. Previous procedural methods were limited by requiring specific heuristics (e.g., 'vehicles should drive along lane centerlines'), which creates a sim2real content gap. |
| Generation Process | The model sequentially inserts actors (vehicles, pedestrians, and bicyclists) one at a time, conditioned on the ego-vehicle state and a high-definition (HD) map. |
| Model Architecture | Uses a recurrent neural network based on the ConvLSTM architecture to capture long-range dependencies. |
| Actor Parameterization | Each actor \(a_i\) is parameterized by: class label (\(c_i\)), location (\(x_i, y_i\)), bounding box (\(b_i\), including size and heading \(\theta_i\)), and velocity (\(v_i\)). |
| Key Distributions | The model uses probabilistic distributions that capture real-world multi-modality: categorical (for class/quantized location), mixture of log-normals (for bounding box size), and mixture of von Mises (for heading angle and velocity direction). |
| Evaluation | Evaluated on two large-scale datasets: ATG4D and Argoverse. Metrics include negative log-likelihood (NLL) and maximum mean discrepancy (MMD). |
| Key Result | SceneGen achieved the best (lowest) NLL and MMD compared to baselines (e.g., MetaSim, LayoutVAE). It also showed the lowest sim2real gap when training a 3D object detector on its synthetic data. |
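As a rough illustration of those mixture distributions (not SceneGen's actual network heads), here is how one might sample a heading angle and a box dimension once mixture weights and component parameters are given; all parameter values below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture_von_mises(weights, mus, kappas):
    """Sample an angle (radians) from a mixture of von Mises components."""
    k = rng.choice(len(weights), p=weights)     # pick a mixture component
    return rng.vonmises(mus[k], kappas[k])

def sample_mixture_log_normal(weights, means, sigmas):
    """Sample a positive quantity (e.g., box length in meters) from a mixture of log-normals."""
    k = rng.choice(len(weights), p=weights)
    return rng.lognormal(means[k], sigmas[k])

heading = sample_mixture_von_mises([0.7, 0.3], mus=[0.0, np.pi], kappas=[8.0, 8.0])
length = sample_mixture_log_normal([0.6, 0.4], means=[1.5, 1.0], sigmas=[0.1, 0.2])
print(heading, length)
```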

Final

The ORCA Benchmark: Evaluating Real-World Calculation Accuracy in Large Language Models

This paper introduces ORCA, a benchmark designed to test how well LLMs perform real-world, everyday quantitative reasoning tasks compared to verified calculator engines.

  • The Problem: Previous benchmarks (like GSM8K or MATH) focus on academic or competition-style math. ORCA focuses on practical queries users actually ask, such as finance, health, and physics calculations.
  • The Setup:
    • Dataset: 500 natural-language prompts across 13 domains (e.g., Finance, Construction, Everyday Life).
    • Verification: LLM answers were compared against deterministic, verified outputs from Omni Calculator engines.
    • Models Tested: ChatGPT-5, Gemini 2.5 Flash, Claude 4.5 Sonnet, Grok 4, and DeepSeek V3.2.
  • Key Results:
    • Overall Accuracy: The models achieved between 45% and 63% accuracy. Gemini 2.5 Flash (63.0%) and Grok 4 (62.8%) were the top performers.
    • Common Errors: The vast majority of failures were Precision/Rounding issues (34.7%) and Calculation errors (33.4%).
    • Domain Variation: Models performed best in Statistics & Probability and Math & Conversions but struggled significantly in Physics and Biology/Chemistry.
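Since most failures were precision/rounding and calculation errors, grading hinges on how numeric answers are compared; a minimal sketch of a tolerance-based check (an illustration, not ORCA's published grading code):

```python
import math

def grade_numeric_answer(model_value, reference_value, rel_tol=1e-2):
    """Illustrative rule: accept the model's number if it is within a relative
    tolerance of the verified calculator output."""
    return math.isclose(model_value, reference_value, rel_tol=rel_tol)

print(grade_numeric_answer(104.7, 104.72))  # True: a small rounding slip passes at 1%
print(grade_numeric_answer(98.0, 104.72))   # False: a genuine calculation error does not
```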

Faster sorting algorithms discovered using deep reinforcement learning

This paper presents AlphaDev, a Deep Reinforcement Learning (DRL) agent that discovered new, more efficient sorting algorithms by treating algorithm generation as a single-player game. It leverages MCTS as a policy improvement operator.

  • The Method (AssemblyGame):
    • AlphaDev plays a game where it selects low-level CPU assembly instructions to build an algorithm.
    • It uses an extension of AlphaZero 深度强化学习智能体 (combining Monte Carlo Tree Search with neural networks) to explore the vast combinatorial space of possible instructions.
      • c = correctness + α × performance (a toy version of this reward is sketched after this list)
    • The reward signal is based on both correctness (does it sort the list?) and latency (how fast is it?).
  • Key Discoveries:
    • Fixed Sorts: AlphaDev found algorithms for sorting short sequences (Sort 3, Sort 4, Sort 5) that are shorter and faster than the previous state-of-the-art human benchmarks.
    • Novel Moves: It discovered specific sequences of instructions, dubbed the AlphaDev swap move and AlphaDev copy move, which reduce instruction counts.
    • Real-World Impact: These new algorithms were reverse-engineered into C++ and integrated into the LLVM standard C++ library, replacing sub-routines that hadn't changed in over a decade.
      • Reverse-engineering sort 3, sort 4, and sort 5 into C++ led to improvements of up to 70% for sequences of length five and roughly 1.7% for sequences exceeding 250,000 elements.
    • The paper classifies prior approaches to optimizing assembly programs into three groups: enumerative search, stochastic search, and symbolic search 枚举搜索、随机搜索和符号搜索.
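A toy Python sketch of that kind of reward signal (an illustration of the idea, not DeepMind's implementation): run the candidate routine on test inputs, score correctness against the expected outputs, and fold a latency-based performance term in with weight α:

```python
import time

def reward(candidate_sort, test_cases, alpha=0.5):
    """Toy reward = correctness + alpha * performance for a candidate sorting routine.
    `candidate_sort` takes a list and returns a (hopefully) sorted copy."""
    correct = 0
    start = time.perf_counter()
    for case in test_cases:
        if candidate_sort(list(case)) == sorted(case):
            correct += 1
    elapsed = time.perf_counter() - start
    correctness = correct / len(test_cases)   # fraction of tests passed
    performance = 1.0 / (1.0 + elapsed)       # toy latency term: faster -> higher
    return correctness + alpha * performance

# Example with Python's built-in sort standing in for a candidate program:
print(reward(sorted, [[3, 1, 2], [5, 4], [1]]))
```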
