Assigned Reading
- Overview of algorithms for natural language processing PubMed►
- Introduction to text analysis Slides►
- General introduction to sentiment analysis of patient reviews Read►
- Priyal Makwana's Teach One Slides►
- Predicting who will attempt suicide Slides► Video► YouTube►
- A large database of reviews of physicians Zip►
Assignment
For this assignment you can use any statistical software.
Question 1: Use the following corpus of training data to classify whether this sentence is a complaint: "He loves his patients and I can tell it's about us and not the money."
The corpus is organized as in the following table. The comment ID identifies the comment in the training data. The columns on the right of the table show where in the training comment each word appears; a 0 indicates that the word does not appear in that comment. For example, in training comment 57685 the word "patients" is the third word in the comment.
comment Id | Type Id | Classification (TRUE = Complaint, FALSE = Praise) | loves | patients | tell | about | not | money
57685 | 1 | TRUE | 0 | 3 | 0 | 2 | 0 | 0
57688 | 1 | TRUE | 0 | 0 | 1 | 0 | 0 | 0
57703 | 1 | TRUE | 0 | 0 | 0 | 0 | 0 | 9
57704 | 1 | TRUE | 0 | 0 | 0 | 3 | 0 | 0
57711 | 1 | FALSE | 0 | 8 | 0 | 0 | 0 | 0
57712 | 1 | TRUE | 0 | 0 | 0 | 0 | 2 | 0
Predict the classification of the target phrase based on the likelihood ratio
of the longest phrase in the target phrase that matches the training corpus.
- Generate phrases of different lengths (1 through 6 words) that keep the order of the words in the target comment while skipping 0, 1, or 2 words between consecutive words of the phrase. If we want phrases of length 2 that keep the order in the target comment, then this code will be useful:
SELECT T.Word, T_1.Word,
       [T_1].[order]-[T].[order]-1 AS WordsDropped,
       T.Order, T_1.Order
FROM T, T AS T_1
WHERE ([T_1].[order]-[T].[order]-1) <= 2
  AND ([T_1].[order]-[T].[order]-1) >= 0
  AND (T_1.Order) > [T].[Order]
ORDER BY T.Order, T_1.Order;
In this code, T and T_1 both refer to the following table, which captures the words in the target comment:
Order | Word
1 | Loves
2 | patients
3 | tell
4 | about
5 | not
6 | money
The result of the code is the following phrases of length 2:
T.Word | T_1.Word | Words Dropped | T.Order | T_1.Order
Loves | patients | 0 | 1 | 2
Loves | tell | 1 | 1 | 3
Loves | about | 2 | 1 | 4
patients | tell | 0 | 2 | 3
patients | about | 1 | 2 | 4
patients | not | 2 | 2 | 5
tell | about | 0 | 3 | 4
tell | not | 1 | 3 | 5
tell | money | 2 | 3 | 6
about | not | 0 | 4 | 5
about | money | 1 | 4 | 6
not | money | 0 | 5 | 6
For phrases of length 3, the following code can be used:
SELECT T.Word, T_1.Word, T_2.Word,
       [T_1].[order]-[T].[order]-1 AS WordsDropped,
       T.Order, T_1.Order, T_2.Order,
       [T_2].[order]-[T_1].[order]-1 AS WordsDropped2
FROM T, T AS T_1, T AS T_2
WHERE ([T_1].[order]-[T].[order]-1) <= 2
  AND ([T_1].[order]-[T].[order]-1) >= 0
  AND (T_1.Order) > [T].[Order]
  AND ([T_2].[order]-[T_1].[order]-1) <= 2
  AND ([T_2].[order]-[T_1].[order]-1) >= 0
ORDER BY T.Order, T_1.Order, T_2.Order;
This code produces the following phrases of length 3:
T.Word | T_1.Word | T_2.Word | Words Dropped | T.Order | T_1.Order | T_2.Order | Words Dropped 2
Loves | patients | tell | 0 | 1 | 2 | 3 | 0
Loves | patients | about | 0 | 1 | 2 | 4 | 1
Loves | patients | not | 0 | 1 | 2 | 5 | 2
Loves | tell | about | 1 | 1 | 3 | 4 | 0
Loves | tell | not | 1 | 1 | 3 | 5 | 1
Loves | tell | money | 1 | 1 | 3 | 6 | 2
Loves | about | not | 2 | 1 | 4 | 5 | 0
Loves | about | money | 2 | 1 | 4 | 6 | 1
patients | tell | about | 0 | 2 | 3 | 4 | 0
patients | tell | not | 0 | 2 | 3 | 5 | 1
patients | tell | money | 0 | 2 | 3 | 6 | 2
patients | about | not | 1 | 2 | 4 | 5 | 0
patients | about | money | 1 | 2 | 4 | 6 | 1
patients | not | money | 2 | 2 | 5 | 6 | 0
tell | about | not | 0 | 3 | 4 | 5 | 0
tell | about | money | 0 | 3 | 4 | 6 | 1
tell | not | money | 1 | 3 | 5 | 6 | 0
about | not | money | 0 | 4 | 5 | 6 | 0
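For readers not working in SQL, the same phrase generation can be sketched in Python for any phrase length. This is a minimal sketch and not part of the course materials: the function name `ordered_phrases` and the parameter `max_dropped` are hypothetical, and the word list is lowercased for convenience. It reproduces the rule used in the queries above, which is to keep the word order and skip at most 2 words between consecutive words of the phrase.

```python
from itertools import combinations

# Words of the target comment in order, taken from the Order/Word table.
TARGET = ["loves", "patients", "tell", "about", "not", "money"]


def ordered_phrases(words, length, max_dropped=2):
    """Return phrases of `length` words that keep the original word order
    and skip at most `max_dropped` words between consecutive words."""
    phrases = []
    # combinations() yields index tuples in increasing order, so word order
    # is preserved and rows come out sorted like the SQL ORDER BY clause.
    for idx in combinations(range(len(words)), length):
        gaps = [b - a - 1 for a, b in zip(idx, idx[1:])]
        if all(gap <= max_dropped for gap in gaps):
            phrases.append(tuple(words[i] for i in idx))
    return phrases
```

Calling `ordered_phrases(TARGET, 2)` and `ordered_phrases(TARGET, 3)` should give the same phrases as the two tables above; lengths 4 through 6 follow by changing the `length` argument.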
- Starting from the longest phrase, match the phrases in the target comment to comments in the training set. In matching, the order of the matched words in the training comment should be the same as the order of the words in the phrase from the target comment. For example, when looking for matches to "patients about" from the target comment, training comment 57685 is not a match even though it contains both words, because in the training comment the words appear in the opposite order. It would match "about patients" but not "patients about".
- Calculate the likelihood ratios associated with the longest phrases. A likelihood ratio is calculated as the prevalence of the phrase among complaints divided by the prevalence of the same phrase among comments that are not complaints. For phrases that are used only in complaints, use as the likelihood ratio the number of times the phrase is matched plus one. For phrases that never appear in complaints, use as the likelihood ratio one divided by the number of times the phrase is matched plus one. Details on the calculation of likelihood ratios are available elsewhere. See Slides► Video►
- Among phrases of the same length, select the phrase with the likelihood ratio most different from 1.
- Set the prediction to complaint if the likelihood ratio is above 1.
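The matching and likelihood ratio steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the course's reference solution: the function names are hypothetical, the training data are copied from the table in Question 1 with absent words (position 0) omitted, and "most different from 1" is measured by the absolute value of the log of the ratio, which is one possible reading of that step.

```python
import math

# Training corpus from the Question 1 table: id -> (is_complaint, positions).
# Words with position 0 in the table are absent and are left out here.
TRAINING = {
    57685: (True, {"patients": 3, "about": 2}),
    57688: (True, {"tell": 1}),
    57703: (True, {"money": 9}),
    57704: (True, {"about": 3}),
    57711: (False, {"patients": 8}),
    57712: (True, {"not": 2}),
}


def matches(phrase, positions):
    """True if every word of the phrase occurs in the comment and the
    words appear in the same order as in the phrase."""
    if any(word not in positions for word in phrase):
        return False
    order = [positions[word] for word in phrase]
    return all(a < b for a, b in zip(order, order[1:]))


def likelihood_ratio(phrase, training=TRAINING):
    """Likelihood ratio of a phrase, or None if no training comment matches."""
    n_complaints = sum(1 for is_c, _ in training.values() if is_c)
    n_praise = len(training) - n_complaints
    hit_c = sum(1 for is_c, pos in training.values() if is_c and matches(phrase, pos))
    hit_p = sum(1 for is_c, pos in training.values() if not is_c and matches(phrase, pos))
    if hit_c == 0 and hit_p == 0:
        return None
    if hit_p == 0:  # phrase used only in complaints
        return hit_c + 1
    if hit_c == 0:  # phrase never used in complaints
        return 1 / (hit_p + 1)
    return (hit_c / n_complaints) / (hit_p / n_praise)


def most_informative(phrases, training=TRAINING):
    """Among matched phrases, return (phrase, ratio) with the ratio farthest
    from 1 on the log scale (an assumption), or None if nothing matches."""
    scored = [(p, likelihood_ratio(p, training)) for p in phrases]
    scored = [(p, lr) for p, lr in scored if lr is not None]
    return max(scored, key=lambda item: abs(math.log(item[1])), default=None)
```

Phrases are passed as tuples of words, for example `likelihood_ratio(("patients", "about"))`. Working from the longest phrases down to single words, the first length at which `most_informative` returns a result gives the ratio to compare against 1.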
Use weighted likelihood ratios
- Calculate the weight w for each comment in the training set as follows: w = Both / (Both + 0.8 * More + 0.2 * Less). In this calculation, "Both" refers to the number of words in both the target and training comments; "More" refers to the number of words in the training comment but not the target comment; and "Less" refers to the number of words in the target comment but not the training comment.
- Carry out the steps above, but count each matched comment in proportion to its weight.
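The weight formula can be sketched as a small function. This is a sketch under assumptions: the name `comment_weight` is hypothetical, words are counted as distinct words (set membership) and not as repeated tokens, and a comment that shares nothing with an empty target is given weight 0 to avoid dividing by zero.

```python
def comment_weight(target_words, training_words):
    """w = Both / (Both + 0.8 * More + 0.2 * Less), using distinct words."""
    target, training = set(target_words), set(training_words)
    both = len(target & training)   # words in both comments
    more = len(training - target)   # words only in the training comment
    less = len(target - training)   # words only in the target comment
    denominator = both + 0.8 * more + 0.2 * less
    return both / denominator if denominator else 0.0
```

For a made-up training comment containing "about", "patients", and "rude", compared against the six target words, Both = 2, More = 1, and Less = 4, so w = 2 / (2 + 0.8 + 0.8). In the weighted likelihood ratio, each matched comment would then contribute w in place of a count of 1.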
Use regression to predict the classification of the target comment.
- Regress the classification labels on the words in the target sentence. Assume that you estimate the parameters in this equation: Y = β0 + β1 Loves + β2 Patients + β3 Tell + β4 About + β5 Not + β6 Money + ϵ
- Use the predicted value for the target sentence to classify the sentence. Because all six words appear in the target sentence, each word indicator equals 1 and the coefficients are simply added. The predicted value is calculated using the following formula: Predicted Probability = EXP(β0 + β1 + β2 + β3 + β4 + β5 + β6) / (1 + EXP(β0 + β1 + β2 + β3 + β4 + β5 + β6))
- Set the comment as a complaint if the predicted probability exceeds 0.50.
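The prediction step can be sketched as follows, once coefficients have been estimated in the statistical software of your choice. This is a sketch only: the function names are hypothetical, the coefficients are supplied by the caller and are not estimated here, and each word is coded as 1 if its position in the table is above 0 and as 0 otherwise.

```python
import math

WORDS = ["loves", "patients", "tell", "about", "not", "money"]


def indicator_row(positions):
    """Turn a {word: position} mapping into 0/1 indicators in WORDS order."""
    return [1 if positions.get(word, 0) > 0 else 0 for word in WORDS]


def predicted_probability(intercept, coefficients, indicators):
    """EXP(z) / (1 + EXP(z)), where z is the intercept plus the sum of the
    coefficients of the words that are present."""
    z = intercept + sum(b * x for b, x in zip(coefficients, indicators))
    if z >= 0:
        # Algebraically the same formula, written to avoid overflow.
        return 1 / (1 + math.exp(-z))
    e = math.exp(z)
    return e / (1 + e)


def is_complaint(probability, threshold=0.5):
    """Classify as a complaint when the probability exceeds the threshold."""
    return probability > threshold
```

For the target sentence every indicator is 1, so the call would be `predicted_probability(b0, [b1, b2, b3, b4, b5, b6], [1] * 6)`, where `b0` through `b6` stand for your own estimates.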
Use weighted regression to classify the target comment.
- Use the weights described in the weighted likelihood ratio method.
- Run a weighted regression and follow the same steps as in the regression approach.
Create a network model for predicting the sentiment of the
sentence.
- Create the structure of the network:
- Using LASSO, regress the classification variable on all of
the words in the training data
- Using LASSO, regress each word on its preceding words.
- Statistically significant variables are parents in the
Markov blanket of the regression response variable. Draw
the network using Netica.
- Estimate the parameters of the network
- Using the LASSO regression, calculate the predicted value
for all combinations of the parents in the Markov blanket of
the regression's response variable.
- Use the network to calculate the probability of complaint.
The following resources may be helpful:
- Labeled training corpus Data►
- Madhukar Reddy Vongala's use of regression in sentiment analysis Python►
- Calculating likelihood ratios SQL► (has errors)
Question 2: Repeat the analysis described in Question 1 for 25 different target comments. The corpus used to analyze each comment is provided below; it is different for each comment and follows a structure similar to the one shown in the following table. As before, the comment ID is given in the first column, and the locations of the words of the target sentence are given in the columns on the right. The target sentence is given in the name of the file:
comment Id | comment | type Id | classification | time | valuable
57689 | A little less waiting time in the waiting room would be nice. | 1 | TRUE | 4 | 0
57716 | Shorter period of time waiting in the waiting room. | 1 | TRUE | 3 | 0
57723 | I had to wait about an hour and a half to see the doctor for my time with my scheduled appointment. | 1 | TRUE | 8 | 0
57748 | He took an abundance of time answering questions. | 1 | FALSE | 3 | 0
57749 | The only thing that could have been done, I waited a long time for the doctor. | 1 | TRUE | 9 | 0
57754 | Less time in the waiting room. | 1 | TRUE | 2 | 0
57760 | The checkout time was extremely slow. | 1 | TRUE | 2 | 0
Complete the following three analyses:
- For each of the 25 target sentences, calculate the likelihood ratio associated with the longest phrase in the target sentence that matches any of the training comments within its corpus. If more than one phrase ties for the longest, select the one with the likelihood ratio most different from 1.
- Regress the classification labels on the words in the target sentence, and use this regression to calculate the predicted value for the target comment.
- Use a weighted regression of classification on the words in the target sentence, and use this regression to calculate the predicted value for the target comment. Weights are calculated as before: w = Both / (Both + 0.8 * More + 0.2 * Less), where "Both" is the number of words in both the target and training comments; "More" is the number of words in the training comment but not the target comment; and "Less" is the number of words in the target comment but not the training comment.
The following resources may be helpful:
- Labeled training corpus Data►
- Blaine Donley's code and related data
More
For additional information (not part of the required reading), please see the following links:
- Predicting psychosis through text analysis Read►
- Depression and speech Read►
- Context sensitive text analysis Read►
This page is part of the course on Comparative Effectiveness by Farrokh Alemi, Ph.D.