ECE448/CS440 Artificial Intelligence¶
CS 440 Artificial Intelligence
https://github.com/illinois-cs-coursework
https://courses.grainger.illinois.edu/cs440/fa2025/readings.html
> [!IMPORTANT]
> https://courses.grainger.illinois.edu/cs440/fa2025/lectures/probability-review.html
Introduction¶
https://courses.grainger.illinois.edu/cs440/fa2025/lectures/intro.html
Historical and other trivia¶
We've seen a lot of trivia, most of it not worth memorizing. The following items are the exceptions. Be able to explain (very briefly) what they are and (approximately) what time period they come from.
- McCulloch and Pitts
    - early neural nets, 1940's
    - developed on-paper models that were insightful, but they couldn't actually implement them
- Fred Jelinek
    - 1980's-1990's
    - started to get speech recognition working usefully using statistical ngram models
- Pantel and Lin (SpamCop)
- Boulis and Ostendorf
- The Plato System
Probability¶
https://courses.grainger.illinois.edu/cs440/fa2025/lectures/probability-review.html
Random variables
Kolmogorov's axioms of probability
$$
0 \le P(A) \\
P(\text{True}) = 1 \\
P(A \lor B) = P(A) + P(B), \text{ if } A \text{ and } B \text{ are mutually exclusive events}
$$
Joint, marginal probability
- \(P(X=x)=\sum_{k=1}^{n}P(X=x, Y=y^k)\)
- all of the probabilities must sum to 1
Conditional probability
- \(P(A | C) = \frac{P(A,C)}{P(C)}\)
Independence
- Two events A and B are independent iff \(P(A,B) = P(A) \times P(B)\)
- This equation is equivalent to each of the following: \(P(A \mid B) = P(A)\) and \(P(B \mid A) = P(B)\) (see the sketch below)
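As a concrete illustration of these definitions, here is a minimal Python sketch (the joint-distribution numbers are made up for the example) that computes a marginal and a conditional probability from a small joint table and checks independence:

```python
# Toy joint distribution P(X, Y) over two binary variables (made-up numbers).
joint = {
    ("rain", "umbrella"): 0.30,
    ("rain", "no_umbrella"): 0.10,
    ("no_rain", "umbrella"): 0.15,
    ("no_rain", "no_umbrella"): 0.45,
}

# Kolmogorov: every entry is non-negative and the whole table sums to 1.
assert all(p >= 0 for p in joint.values())
assert abs(sum(joint.values()) - 1.0) < 1e-9

def marginal_x(x):
    # P(X=x) = sum over y of P(x, y)
    return sum(p for (xv, yv), p in joint.items() if xv == x)

def marginal_y(y):
    # P(Y=y) = sum over x of P(x, y)
    return sum(p for (xv, yv), p in joint.items() if yv == y)

def conditional(y, x):
    # P(Y=y | X=x) = P(x, y) / P(x)
    return joint[(x, y)] / marginal_x(x)

print(marginal_x("rain"))               # 0.4
print(conditional("umbrella", "rain"))  # 0.75

# X and Y are independent iff P(x, y) == P(x) * P(y) for every cell.
independent = all(abs(p - marginal_x(xv) * marginal_y(yv)) < 1e-9
                  for (xv, yv), p in joint.items())
print(independent)  # False for this table
```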
Metrics for Evaluating a Classifier¶
|  | Algorithm label: Cancer | Algorithm label: Not Cancer |
| --- | --- | --- |
| Correct = Cancer | True Positive (TP) | False Negative (FN) |
| Correct = Not Cancer | False Positive (FP) | True Negative (TN) |
- False positive rate = FP/(FP+TN) [what fraction of the actual negatives did we wrongly label positive?]
- False negative rate = FN/(TP+FN) [what fraction of the actual positives did we miss?]
- Accuracy = (TP+TN)/(TP+TN+FP+FN)
- Error rate = 1-accuracy
- Precision (p) = TP/(TP+FP) [how many of our positive outputs were correct?]
- Recall (r) = TP/(TP+FN) [how many of the correct answers did we find?]
- F1 = 2pr/(p+r)
- F1 is the harmonic mean of precision and recall. Both recall and precision need to be good to get a high F1 value.
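A minimal sketch of these metrics, assuming the confusion-matrix counts have already been tallied (guards against zero denominators are omitted for brevity):

```python
def classifier_metrics(tp, fp, fn, tn):
    """Compute the metrics listed above from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": accuracy,
        "error rate": 1 - accuracy,
        "false positive rate": fp / (fp + tn),
        "false negative rate": fn / (tp + fn),
        "precision": precision,
        "recall": recall,
        "F1": 2 * precision * recall / (precision + recall),  # harmonic mean
    }

# Example with made-up counts for the cancer table above.
print(classifier_metrics(tp=40, fp=10, fn=20, tn=30))
```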
Modelling text data¶
Word types vs. word tokens¶
- word type (unique word): a dictionary entry such as "cat"
- word token: a word in a specific position in the text
The Bag of Words model¶
- use the individual words as features
- A bag-of-words model determines the class of a document based on the frequency of occurrence of each word. It ignores the order in which words occur, which ones occur together, etc. So it will miss some details, e.g. the difference between "poisonous" and "not poisonous." (see the sketch after this list)
- Bigrams, ngrams
- Data cleaning:
    - tokenization
    - stemming (including Julie Lovins and Martin Porter)
    - making units of useful size: dividing words or grouping characters
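A minimal sketch of turning raw text into bag-of-words and bigram features; the tokenizer here just lowercases and splits on whitespace, a deliberately crude stand-in for real tokenization and stemming:

```python
from collections import Counter

def tokenize(text):
    # Crude tokenization: lowercase and split on whitespace.
    return text.lower().split()

def bag_of_words(text):
    # Bag of words: count each word token, ignoring order.
    return Counter(tokenize(text))

def bigrams(text):
    # Bigrams: count adjacent token pairs, preserving a little local order.
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))

doc = "the mushroom is not poisonous but that berry is poisonous"
print(bag_of_words(doc))  # {'is': 2, 'poisonous': 2, 'the': 1, ...}
print(bigrams(doc))       # ('not', 'poisonous') and ('is', 'poisonous') are now distinct features
```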
Special types of words and how we might handle them¶
- stop words
- rare words
- hapax legomena
- filler
- backchannel
- function vs. content
Testing¶
- Roles of training, development, test datasets.
- Evaluation metrics for classification (true positive rate, accuracy, recall, confusion matrix, ...)
Naive Bayes¶
https://courses.grainger.illinois.edu/cs440/fa2025/lectures/bayes.html
Basic definitions and mathematical model:
Bayes rule¶
P(A | C) is the probability of A in a context where C is true
- Definition of conditional probability: \(P(A | C) = \frac{P(A,C)}{P(C)}\)
- \(P(A,C) = P(A) \times P(C | A) = P(C,A)=P(C) \times P(A | C)\)
- Bayes Rule: \(P(C|A)=\frac{P(A|C)\times P(C)}{P(A)}\)
- P(cause | evidence) = P(evidence | cause) * P(cause) / P(evidence)
    - in order, these terms are the posterior, the likelihood, the prior, and the normalization
- Likelihood, prior, posterior
- argmax operator
- Independence and conditional independence
- Maximum a posteriori (MAP) estimate, Maximum likelihood (ML) estimate, factoring out P(evidence)
Maximum a posteriori (MAP) estimate¶
MAP estimate chooses the type with the highest posterior probability: pick the type \(X\) such that \(P(X | evidence)\) is highest
Ignoring the normalizing factor¶
\(P(evidence)\) can usually be factored out of these equations, because we are only trying to determine the relative probabilities. For the MAP estimate, in fact, we are only trying to find the largest probability.
\(P(evidence)\) is the same for all the causes we are considering. Said another way, \(P(evidence)\) is the probability that we would see this evidence if we did a lot of observations.
Then, \(P(cause \mid evidence) \propto P(evidence \mid cause) \times P(cause)\)
The MLE estimate¶
If we know (or assume) that all causes are equally likely, we can set \(P(cause)\) to the same value for every cause. In that case, we have \(P(cause \mid evidence) \propto P(evidence \mid cause)\), so we simply pick the cause with the highest likelihood.
- How does prior affect these estimates?
- How do we combine several conditionally independent pieces of evidence into one estimate of P(cause | evidence)?
- How do we choose the best value for the cause/class?
- How does the size of a Naive Bayes model compare to a full joint distribution?
- Why does it matter that Naive Bayes reduces the number of parameters we need to estimate?
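To make the MAP vs. MLE distinction (and the effect of the prior) concrete, here is a small sketch with made-up numbers for one piece of evidence and two candidate causes:

```python
# Made-up numbers: likelihoods P(evidence | cause) and priors P(cause).
likelihood = {"spam": 0.30, "not_spam": 0.05}
prior      = {"spam": 0.10, "not_spam": 0.90}

# MAP: argmax over causes of P(evidence | cause) * P(cause).
# P(evidence) is the same for every cause, so it can be dropped.
map_estimate = max(prior, key=lambda c: likelihood[c] * prior[c])

# MLE: treat all priors as equal, so just argmax of the likelihood.
mle_estimate = max(likelihood, key=lambda c: likelihood[c])

print(map_estimate)  # "not_spam": 0.05 * 0.90 = 0.045 beats 0.30 * 0.10 = 0.030
print(mle_estimate)  # "spam": the likelihood alone favors spam
```

The strong prior flips the answer here, which is exactly how the prior affects these estimates.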
Applying Naive Bayes to text classification¶
- MAP and MLE versions of the estimation equations
- Estimating probabilities from data
- Avoiding underflow (log transforms)
- Avoiding overfitting
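A minimal sketch of the classification step, assuming the per-class priors and word likelihoods have already been estimated (and smoothed) from training data; summing logs avoids the underflow you would get from multiplying many small probabilities:

```python
import math

def naive_bayes_classify(tokens, log_prior, log_likelihood):
    """Return the class c maximizing log P(c) + sum of log P(word | c) over the tokens."""
    best_class, best_score = None, -math.inf
    for c in log_prior:
        score = log_prior[c]
        for w in tokens:
            # Unseen words should get a small smoothed probability at training time;
            # this sketch simply skips words that are missing from the model.
            if w in log_likelihood[c]:
                score += log_likelihood[c][w]
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Tiny made-up model: two classes, three words.
log_prior = {"spam": math.log(0.4), "ham": math.log(0.6)}
log_likelihood = {
    "spam": {"free": math.log(0.05), "money": math.log(0.04), "meeting": math.log(0.001)},
    "ham":  {"free": math.log(0.01), "money": math.log(0.005), "meeting": math.log(0.03)},
}
print(naive_bayes_classify(["free", "money"], log_prior, log_likelihood))  # spam
```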
Smoothing¶
- Why is it important?
- Zeroes destroy the Naive Bayes algorithm. We use "smoothing" algorithms to assign non-zero probabilities to unseen words.
- Laplace smoothing (see the sketch below)
- Deleted estimation
- Ngram smoothing (high level ideas only)
- Headline results: spam detection (SpamCop, Pantel and Lin), gender classification (Boulis and Ostendorf)
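One common way to write add-\(\alpha\) (Laplace) smoothing for the Naive Bayes word probabilities, where count(w, c) is how often word w appears in training documents of class c, |V| is the vocabulary size, and \(\alpha > 0\) is the smoothing constant (\(\alpha = 1\) gives classic Laplace smoothing); variants differ in how unseen words are handled:

$$
P(w \mid c) = \frac{\text{count}(w, c) + \alpha}{\sum_{w' \in V} \text{count}(w', c) + \alpha |V|}
$$

Every word, including ones never seen with class c, now gets a non-zero probability, which keeps a single zero from wiping out the whole product.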