| 
   Overview
		  Transformations
			  Regression assumptions and transformations
			  Review►Quantile Transformation
			  PubMed►Adjustments for missing values 
		  
		  Regressions with rare independent variables
		  PubMed►Construction of new features
			  Use of ontology in feature construction
			  Read►SNOMED CT database
			  Download►SNOMED CT codes for social determinants of diseases
			  Bard►Example of feature construction in lung cancer patients (use instructor's last name as password)
			  Read►

Learning Objectives

In this session, we review some of the traditional methods of adjusting regressions and then pose additional questions about improving the accuracy of regression model fitting. For some of these questions, we do not have a reasonable answer, so this session is more speculative than other sessions in the class. After completing the activities in this module, you should be able to:
		  Adjust for missing values in the dependent and independent data
		  Reason through patterns of missing values
		  Include redundant variables in regression

Adjustments for Missing Values

Several approaches are available for replacing missing values so that the data form a complete matrix (no variable has missing information). None of the existing approaches works when there are extensive non-random missing values.

Assignments

Assignments should be submitted in Blackboard. The submission must have a summary statement, with one statement per question. All assignments should be done in R if possible.

Question 1 Demo of Reasoning through Patterns of Missing Values: Regress incidence of diabetes on all other variables indicating progression of diseases in body systems. You can first run the analysis on a 10% sample before running it on the entire data set, which may take several hours.
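The data preparation this question asks for (dropping cases with unknown diabetes status, adding a missingness indicator per independent variable, and zero-filling missing values) can be sketched roughly as follows. This is only an illustration in Python; the assignment itself asks for R where possible. The records, the column names (`cardiac`, `renal`), and the helper names `prepare` and `mcfadden_r2` are hypothetical and not part of the course data. The last function shows the usual definition of the McFadden R-squared, assuming you already have the log-likelihoods of the fitted and intercept-only models.

```python
# Hypothetical records; None marks a missing value. Column names are illustrative only.
records = [
    {"diabetes": 1, "cardiac": 2.0, "renal": None},
    {"diabetes": 0, "cardiac": None, "renal": 1.0},
    {"diabetes": None, "cardiac": 3.0, "renal": 0.5},
    {"diabetes": 0, "cardiac": 1.0, "renal": 2.0},
]
outcome = "diabetes"
predictors = ["cardiac", "renal"]


def prepare(rows, outcome, predictors):
    """Drop rows with unknown outcome, add missingness indicators, zero-fill predictors."""
    prepared = []
    for row in rows:
        if row[outcome] is None:
            continue  # outcome not known: remove the case
        new_row = {outcome: row[outcome]}
        for name in predictors:
            missing = row[name] is None
            new_row[name + "_missing"] = 1 if missing else 0  # indicator of missingness
            new_row[name] = 0.0 if missing else row[name]     # set missing value to 0
        prepared.append(new_row)
    return prepared


def mcfadden_r2(loglik_model, loglik_null):
    """McFadden pseudo R-squared: 1 - lnL(fitted model) / lnL(intercept-only model)."""
    return 1.0 - loglik_model / loglik_null


clean = prepare(records, outcome, predictors)
print(len(clean), "cases remain")
```

Keeping the indicator next to the zero-filled variable lets the regression tell a true 0 apart from a value that was missing.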
		  Remove all instances where incidence of diabetes is not known. Report the number of cases that remain.
		  For each independent variable, define an indicator that is 1 when the variable is missing and 0 otherwise. Set the independent variable to 0 if it is missing.
		  Use LASSO to regress diabetes on all independent variables, pairs of independent variables, and triplets of independent variables. Report the McFadden R-squared and the coefficients for the variables.
		  Use LASSO to regress diabetes on all indicators of missing variables, pairs of indicators, and triplets of indicators. Report the McFadden R-squared and the coefficients for the variables.
		  Regress diabetes on all independent variables, indicators of missingness, and pairwise and triple combinations of these variables. Report the McFadden R-squared.

Question 2 Demo of How to Include Redundant Variables: We generated 1000 data points for repeated orthogonal factorial combinations of 3 variables. We set the response variable to be the same as the three-way interaction among these 3 independent variables.
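A data set of this shape could be generated roughly as shown below. This is a sketch under assumptions the assignment does not state: the three variables are taken to be binary (0/1), so the full factorial design has 8 combinations, and it is repeated 125 times to reach 1000 rows. The names `x1`, `x2`, `x3`, and `x1x2x3` are hypothetical, and the data actually posted for the course may be coded differently.

```python
from itertools import product

# Assumption (not stated in the assignment): the 3 variables are binary (0/1),
# so the full factorial design has 2**3 = 8 combinations, repeated 125 times.
N_REPEATS = 125

rows = []
for _ in range(N_REPEATS):
    for x1, x2, x3 in product([0, 1], repeat=3):
        interaction = x1 * x2 * x3  # three-way interaction term
        rows.append({"x1": x1, "x2": x2, "x3": x3, "x1x2x3": interaction, "y": interaction})

n_positive = sum(r["y"] for r in rows)
print(len(rows), "rows;", n_positive, "with y = 1")
```

Under this coding the interaction term is 1 in only one of the eight cells, so it is a comparatively rare variable that coincides exactly with the positive responses.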
		  Regress Y on the independent variables with and without the rare variables. What is the meaning of the error messages? What conclusion can you reach regarding the effectiveness of rare variables?
		  What conclusion can you reach about rare variable combinations that correspond exactly to positive Y values?

Question 3 Demo of Constructing Features: Body systems are broader codes within SNOMED CT that contain diseases. Use the body system ontology to create a new feature within the All of Us database. The steps include:
		  For the "Breast and Endocrine Structures" body system, create a feature that scores the worst diagnosis within this body system. Here is a list of all body systems within SNOMED CT and their associated codes.
 
			  
				  
			  
			  
				  | Body system |
				  | Breast and endocrine structures |
				  | Integumentary system structure |
				  | Lung and mediastinal structures |
				  | Lymphatic system structure |
				  | Lymphoreticular structure |
				  | Musculoskeletal structure |
				  | Respiratory and intrathoracic structures |
				  | Structure of hematological system |
				  | Structure of skin and/or mucous membrane |
				  | Structure of skin and/or surface epithelium |
				  | Structure of special senses organ system |
				  | Social determinants of diseases (see Bard) |
Identify diseases that are part of "Breast and Endocrine Structures". You can do this using Bard, but make sure that you insist that it provide a complete list. I have done so and posted it below. Create a binary variable for each disease, coded 1 if the patient has the disease and 0 otherwise.
		  Excel►
 
Regress your cancer on the members of this body system. Estimate the coefficients associated with these diseases. In this regression you do not need any variables besides these diseases, and you do not need to include combinations of diseases. Your cancer is the response variable. The independent variables are binary variables, one for each of these diseases. The relationship between the coefficients of a logistic regression and the odds ratio of the response variable is described in the following document.
		  Read►
		  
 
Create a new variable in the All of Us database called "Worst of Breast and Endocrine Structures Body System" or "Worst of Endocrine", where each patient is assigned the exponential of the highest coefficient among the diseases in this body system that the patient has (that is, the odds ratio of the patient's worst diagnosis).

More

For additional information (not part of the required reading), please see the following links:
			Open introduction to statistics
			Read► 
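Returning to Question 3, the feature construction steps (one binary variable per disease, then a "Worst of Endocrine" score taken from the regression coefficients) can be sketched roughly as follows. This is an illustration in Python, not the course's R solution. The disease names, coefficient values, patient records, and helper names are all hypothetical; in the assignment the disease list comes from the posted file and the coefficients from your own logistic regression. Assigning 1.0 (that is, exp(0)) to a patient with none of the diseases is also an assumption, since the steps do not say how such patients are scored.

```python
import math

# Hypothetical disease names and logistic regression coefficients; in the
# assignment these come from the posted disease list and your own regression.
coefficients = {"disease_a": 0.9, "disease_b": 0.2, "disease_c": -0.4}

# Hypothetical patients, each mapped to the set of diseases they have.
patients = {
    "p1": {"disease_a", "disease_c"},
    "p2": {"disease_b"},
    "p3": set(),
}


def binary_indicators(diagnoses, diseases):
    """One 0/1 variable per disease: 1 if the patient has it, 0 otherwise."""
    return {d: 1 if d in diagnoses else 0 for d in diseases}


def worst_of_body_system(diagnoses, coefs):
    """Exponential of the highest coefficient among the patient's diseases.

    Assumption: a patient with none of the diseases is scored exp(0) = 1.0.
    """
    present = [coefs[d] for d in diagnoses if d in coefs]
    return math.exp(max(present)) if present else 1.0


for pid, dx in patients.items():
    indicators = binary_indicators(dx, sorted(coefficients))
    score = worst_of_body_system(dx, coefficients)
    print(pid, indicators, round(score, 3))
```

Exponentiating the coefficient puts the score on the odds ratio scale, so a patient's value reflects the single diagnosis most strongly associated with the cancer rather than a sum over diagnoses.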
This page is part of the HAP 819 course on Advanced Statistics and was organized by Farrokh Alemi, PhD.
Home► Email►
        
         |