Lecture: Context Dependent Text Analysis  

 

Assigned Reading

Assignment

For this assignment you can use any statistical software. 

Question 1: Use the following corpus of training data to classify whether the following sentence is a complaint: "He loves his patients and I can tell it's about us and not the money." The corpus is organized as in the table below. The comment ID identifies the comment in the training data. The columns on the right of the table show where in the training comment each word appears; a value of 0 means the word does not appear in that comment. For example, in training comment 57685 the word "patients" is the third word in the comment.

Comment Id  Type Id  Classification  loves  patients  tell  about  not  money
57685       1        TRUE            0      3         0     2      0    0
57688       1        TRUE            0      0         1     0      0    0
57703       1        TRUE            0      0         0     0      0    9
57704       1        TRUE            0      0         0     3      0    0
57711       1        FALSE           0      8         0     0      0    0
57712       1        TRUE            0      0         0     0      2    0

Classification: TRUE = Complaint, FALSE = Praise
 

Predict the classification of the target phrase based on the likelihood ratio of the longest phrase in the target phrase that matches the training corpus. 

  1. Generate phrases of different lengths (1 through 6 words) that keep the order of the words in the target comment but drop 1 or 2 words at a time. For phrases of length 2 that keep the order in the target comment, the following code is useful:

    SELECT T.Word, T_1.Word,
           [T_1].[order]-[T].[order]-1 AS WordsDropped,
           T.Order, T_1.Order
    FROM T, T AS T_1
    WHERE ([T_1].[order]-[T].[order]-1) <= 2
      AND ([T_1].[order]-[T].[order]-1) >= 0
      AND T_1.Order > [T].[Order]
    ORDER BY T.Order, T_1.Order;

    In this code, T and T_1 both refer to the following table, which lists the words in the target comment:

    Order Word
    1 Loves
    2 patients
    3 tell
    4 about
    5 not
    6 money

    The code produces the following two-word phrases:
    T.Word    T_1.Word  WordsDropped  T.Order  T_1.Order
    Loves     patients  0             1        2
    Loves     tell      1             1        3
    Loves     about     2             1        4
    patients  tell      0             2        3
    patients  about     1             2        4
    patients  not       2             2        5
    tell      about     0             3        4
    tell      not       1             3        5
    tell      money     2             3        6
    about     not       0             4        5
    about     money     1             4        6
    not       money     0             5        6

    For phrases of length 3, the following code can be used:

    SELECT T.Word, T_1.Word, T_2.Word,
           [T_1].[order]-[T].[order]-1 AS WordsDropped,
           T.Order, T_1.Order, T_2.Order,
           [T_2].[order]-[T_1].[order]-1 AS WordsDropped2
    FROM T, T AS T_1, T AS T_2
    WHERE ([T_1].[order]-[T].[order]-1) <= 2
      AND ([T_1].[order]-[T].[order]-1) >= 0
      AND T_1.Order > [T].[Order]
      AND ([T_2].[order]-[T_1].[order]-1) <= 2
      AND ([T_2].[order]-[T_1].[order]-1) >= 0
    ORDER BY T.Order, T_1.Order, T_2.Order;

    This code produces the following phrases of length 3:

    T.Word    T_1.Word  T_2.Word  WordsDropped  T.Order  T_1.Order  T_2.Order  WordsDropped2
    Loves     patients  tell      0             1        2          3          0
    Loves     patients  about     0             1        2          4          1
    Loves     patients  not       0             1        2          5          2
    Loves     tell      about     1             1        3          4          0
    Loves     tell      not       1             1        3          5          1
    Loves     tell      money     1             1        3          6          2
    Loves     about     not       2             1        4          5          0
    Loves     about     money     2             1        4          6          1
    patients  tell      about     0             2        3          4          0
    patients  tell      not       0             2        3          5          1
    patients  tell      money     0             2        3          6          2
    patients  about     not       1             2        4          5          0
    patients  about     money     1             2        4          6          1
    patients  not       money     2             2        5          6          0
    tell      about     not       0             3        4          5          0
    tell      about     money     0             3        4          6          1
    tell      not       money     1             3        5          6          0
    about     not       money     0             4        5          6          0

  2. Starting from the longest phrase, match the phrases in the target comment to comments in the training set. The words matched in the training comment must appear in the same order as in the phrase from the target comment. For example, training comment 57685 does not match the phrase "patients about" from the target comment, even though it contains both words, because in the training comment "about" (position 2) comes before "patients" (position 3). It would match "about patients" but not "patients about".
  3. Calculate the likelihood ratios associated with the longest phrases. A likelihood ratio is the prevalence of the phrase among complaints divided by the prevalence of the same phrase among comments that are not complaints. For phrases that are used only in complaints, use as the likelihood ratio the number of times the phrase is matched plus one. For phrases that never appear in complaints, use as the likelihood ratio one divided by "the number of times the phrase is matched plus one." Details on the calculation of likelihood ratios are available elsewhere. See Slides► Video►
  4. Among phrases of the same length, select the phrase with the likelihood ratio most different from 1.
  5. Set the prediction to complaint if the likelihood ratio is above 1.
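
The five steps above can be sketched in Python as follows. This is a minimal illustration, not the course's reference solution. The function names are hypothetical, prevalence is taken as the share of comments in each class that match, and "most different from 1" is measured on the log scale so that ratios of 3 and 1/3 count as equally far from 1. That last choice is an assumption; measuring the absolute difference from 1 instead can select a different phrase.

```python
import math
from itertools import combinations

# Target comment reduced to its six words, in order (from the assignment).
TARGET = ["loves", "patients", "tell", "about", "not", "money"]

# Training corpus from the table: comment id -> (is_complaint, {word: position}).
# Words with position 0 in the table are absent, so they are left out here.
CORPUS = {
    57685: (True, {"patients": 3, "about": 2}),
    57688: (True, {"tell": 1}),
    57703: (True, {"money": 9}),
    57704: (True, {"about": 3}),
    57711: (False, {"patients": 8}),
    57712: (True, {"not": 2}),
}

def phrases(words, length, max_dropped=2):
    """Ordered phrases of `length` words, dropping at most `max_dropped`
    words between consecutive words (mirrors the SQL WHERE clause)."""
    result = []
    for idx in combinations(range(len(words)), length):
        if all(b - a - 1 <= max_dropped for a, b in zip(idx, idx[1:])):
            result.append(tuple(words[i] for i in idx))
    return result

def matches(phrase, positions):
    """True if the comment contains every word of the phrase in the same order."""
    if any(word not in positions for word in phrase):
        return False
    pos = [positions[word] for word in phrase]
    return all(a < b for a, b in zip(pos, pos[1:]))

def likelihood_ratio(phrase, corpus):
    """Likelihood ratio of a phrase, or None if no training comment matches."""
    complaints = [p for is_c, p in corpus.values() if is_c]
    praises = [p for is_c, p in corpus.values() if not is_c]
    hit_c = sum(matches(phrase, p) for p in complaints)
    hit_p = sum(matches(phrase, p) for p in praises)
    if hit_c == 0 and hit_p == 0:
        return None
    if hit_p == 0:                      # phrase used only in complaints
        return hit_c + 1
    if hit_c == 0:                      # phrase never used in complaints
        return 1 / (hit_p + 1)
    return (hit_c / len(complaints)) / (hit_p / len(praises))

def classify(target, corpus):
    """Return (phrase, likelihood ratio, is_complaint) for the longest match."""
    for length in range(len(target), 0, -1):
        scored = [(p, likelihood_ratio(p, corpus)) for p in phrases(target, length)]
        scored = [(p, lr) for p, lr in scored if lr is not None]
        if scored:
            # Assumption: distance from 1 is measured as |log(LR)|.
            phrase, lr = max(scored, key=lambda s: abs(math.log(s[1])))
            return phrase, lr, lr > 1
    return None

if __name__ == "__main__":
    print(classify(TARGET, CORPUS))
```

Calling `phrases(TARGET, 2)` and `phrases(TARGET, 3)` reproduces the 12 two-word and 18 three-word phrases listed in step 1.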

Use weighted likelihood ratios

  1. Calculate the weight w for each comment in the training set as follows: w = Both / (Both + 0.8 * More + 0.2 * Less). In this calculation, "Both" refers to the number of words in both the target and training comments; "More" refers to the number of words in the training comment but not the target comment; and "Less" refers to the number of words in the target comment but not the training comment.
  2. Carry out the steps above, but count each matched comment in proportion to its weight.
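
As a rough sketch of the weight formula, the function below computes w from two word lists. The function name and the example training comment are hypothetical. The sketch also assumes that each distinct word is counted once, since the assignment does not say how repeated words are handled.

```python
def overlap_weight(target_words, training_words, more_penalty=0.8, less_penalty=0.2):
    """w = Both / (Both + 0.8 * More + 0.2 * Less), counting distinct words."""
    target = set(target_words)
    training = set(training_words)
    both = len(target & training)   # words in both comments
    more = len(training - target)   # words in the training comment only
    less = len(target - training)   # words in the target comment only
    denominator = both + more_penalty * more + less_penalty * less
    return both / denominator if denominator else 0.0

# Hypothetical training comment that shares two words with the target.
target = ["loves", "patients", "tell", "about", "not", "money"]
training = ["about", "patients", "care"]
w = overlap_weight(target, training)  # Both = 2, More = 1, Less = 4
print(round(w, 4))
```

A training comment identical to the target gets a weight of 1, and a comment with no shared words gets a weight of 0.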

Use regression to predict the classification of the target comment.

  1. Regress the classification labels on the words in the target sentence.  Assume that you estimate the parameters in this equation: Y=β0 + β1 Loves + β2 Patients + β3 Tell + β4 About + β5 Not + β6 Money + ϵ
  2. Use the predicted value for the target sentence to classify the sentence. Because the target sentence contains all six words, every word indicator equals 1 and the predicted value is calculated using the following formula: Predicted Probability = EXP(β0 + β1 + β2 + β3 + β4 + β5 + β6) / (1 + EXP(β0 + β1 + β2 + β3 + β4 + β5 + β6))
  3. Set the comment as a complaint if the predicted probability exceeds 0.50.
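
The prediction formula in step 2 can be sketched as a small function. The coefficient values below are made up purely to show the arithmetic; they are not estimates from the training corpus, and the function name is hypothetical.

```python
import math

def predicted_probability(intercept, coefficients, indicators):
    """EXP(z) / (1 + EXP(z)), where z = intercept + sum(beta_i * x_i)."""
    z = intercept + sum(b * x for b, x in zip(coefficients, indicators))
    return math.exp(z) / (1 + math.exp(z))

# HYPOTHETICAL coefficients for loves, patients, tell, about, not, money.
beta0 = 0.4
betas = [0.0, -1.2, 0.3, 0.6, 0.3, 0.3]

# The target sentence contains all six words, so every indicator is 1.
p = predicted_probability(beta0, betas, [1, 1, 1, 1, 1, 1])
is_complaint = p > 0.50
print(round(p, 3), is_complaint)
```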

Use weighted regression to classify the target comment.

  1. Use the weights described in the weighted likelihood ratio method.
  2. Use weighted regression and follow the same steps as in the regression approach.
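
One way to picture weighted regression is a logistic regression fitted by gradient ascent in which each training comment contributes to the fit in proportion to its weight. The sketch below uses only the standard library and assumes a logistic model. The function name, learning rate, number of iterations, and the weights in the example are hypothetical, because the table does not give the full training comments needed to compute real weights. A statistical package would normally be used instead.

```python
import math

def fit_weighted_logistic(X, y, weights, learning_rate=0.1, epochs=2000):
    """Gradient ascent on the weighted log-likelihood; returns [b0, b1, ...]."""
    beta = [0.0] * (len(X[0]) + 1)          # beta[0] is the intercept
    total_weight = sum(weights)
    for _ in range(epochs):
        gradient = [0.0] * len(beta)
        for row, label, w in zip(X, y, weights):
            z = beta[0] + sum(b * x for b, x in zip(beta[1:], row))
            p = 1.0 / (1.0 + math.exp(-z))
            error = w * (label - p)         # the weight scales each comment's pull
            gradient[0] += error
            for j, x in enumerate(row):
                gradient[j + 1] += error * x
        beta = [b + learning_rate * g / total_weight for b, g in zip(beta, gradient)]
    return beta

# Word indicators (1 = word present) for loves, patients, tell, about, not, money,
# taken from the Question 1 table; labels are 1 for complaint, 0 for praise.
X = [
    [0, 1, 0, 1, 0, 0],  # 57685
    [0, 0, 1, 0, 0, 0],  # 57688
    [0, 0, 0, 0, 0, 1],  # 57703
    [0, 0, 0, 1, 0, 0],  # 57704
    [0, 1, 0, 0, 0, 0],  # 57711
    [0, 0, 0, 0, 1, 0],  # 57712
]
y = [1, 1, 1, 1, 0, 1]
hypothetical_weights = [0.5, 0.3, 0.2, 0.3, 0.2, 0.3]  # placeholders, not computed

beta = fit_weighted_logistic(X, y, hypothetical_weights)
z = sum(beta)                               # target sentence: all indicators are 1
probability = 1.0 / (1.0 + math.exp(-z))
print(round(probability, 3))
```

With so few comments the classes are perfectly separable, so the coefficients keep growing with more iterations. The fixed iteration count here only keeps the illustration finite.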

Create a network model for predicting the sentiment of the sentence.

  1. Create the structure of the network:
    • Using LASSO, regress the classification variable on all of the words in the training data.
    • Using LASSO, regress each word on its preceding words.
    • Statistically significant variables are parents in the Markov blanket of the regression response variable.  Draw the network using Netica.
  2. Estimate the parameters of the network
    • Using the LASSO regression, calculate the predicted value for all combinations of the parents in the Markov blanket of the regression's response variable.
  3. Use the network to calculate the probability of complaint. 
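
Step 2 amounts to building a conditional probability table for each node: one predicted value per combination of its parents. The sketch below enumerates those combinations for a hypothetical node with two hypothetical parents and made-up coefficients. It assumes a logistic form for the regression. The LASSO fit itself and the Netica drawing are not shown.

```python
import math
from itertools import product

def conditional_table(parents, intercept, coefficients):
    """Predicted probability for every present/absent combination of the parents."""
    table = {}
    for combo in product([0, 1], repeat=len(parents)):
        z = intercept + sum(coefficients[name] * value
                            for name, value in zip(parents, combo))
        table[combo] = math.exp(z) / (1 + math.exp(z))
    return table

# HYPOTHETICAL parents and coefficients, standing in for the LASSO output.
parents = ["not", "money"]
coefficients = {"not": 0.7, "money": 0.4}
table = conditional_table(parents, intercept=0.0, coefficients=coefficients)

for combo, probability in table.items():
    print(dict(zip(parents, combo)), round(probability, 3))
```

The resulting table has 2^k rows for a node with k parents, which is the form in which the network's parameters would be entered.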

The following resources may be helpful:

  • Labeled training corpus Data►
  • Madhukar Reddy Vongala's use of regression in sentiment analysis Python►
  • Calculating likelihood ratios SQL► (has errors)

Question 2: Repeat the analysis described in Question 1 for 25 different target comments. The corpus used in the analysis of these comments is provided below. It is different for each comment and follows a structure similar to the following table. As before, the comment ID is given in the first column and the locations of the target sentence's words are given in the columns to the right. The target sentence is given in the name of the file.

 
Comment Id  Comment  Type Id  Classification  time  valuable
57689  A little less waiting time in the waiting room would be nice.  1  TRUE  4  0
57716  Shorter period of time waiting in the waiting room.  1  TRUE  3  0
57723  I had to wait about an hour and a half to see the doctor for my time with my scheduled appointment.  1  TRUE  8  0
57748  He took an abundance of time answering questions.  1  FALSE  3  0
57749  The only thing that could have been done, I waited a long time for the doctor.  1  TRUE  9  0
57754  Less time in the waiting room.  1  TRUE  2  0
57760  The checkout time was extremely slow.  1  TRUE  2  0

Complete the following three analyses:

  1. Calculate the likelihood ratio associated with the longest phrase in each of the 25 target sentences that matches any of the training comments within the corpus. If there is more than one longest phrase, select the one with the likelihood ratio most different from 1.
  2. Regress the classification labels on the words in the target sentence, and use this regression to calculate the predicted value for the target comment.
  3. Use a weighted regression of classification on the words in the target sentence, and use this regression to calculate the predicted value for the target comment. Weights are calculated as before: w = Both / (Both + 0.8 * More + 0.2 * Less), where "Both" is the number of words in both the target and training comments; "More" is the number of words in the training comment but not the target comment; and "Less" is the number of words in the target comment but not the training comment.

The following resources may be helpful:

More

For additional information (not part of the required reading), please see the following links:

  1. Predicting psychosis through text analysis Read►
  2. Depression and speech Read►
  3. Context sensitive text analysis Read►

This page is part of the course on Comparative Effectiveness by Farrokh Alemi, Ph.D. Home► Email►