Please note that you are expected to answer the questions clearly in this document. Use the included template where relevant. Present the R outputs, comments, and discussion clearly and logically. Attach all R commands in the Appendix. Write the resulting model equation under the relevant questions.
Once completed, submit the answer script as a PDF via the Turnitin link on the vUWS site.
Please note that 10 marks are allocated for organization, reasoning, logical flow, and the inclusion of all correct R code and outputs in the Appendix for both Part A and Part B.
Recent public health data indicate a troubling increase in kidney disease rates within specific suburban areas, attracting significant attention from public health practitioners.
Determined to uncover the root causes and identify actionable risk factors to address this issue, the public health team has embarked on a comprehensive study.
They have collected patient records and relevant information on medical factors and water quality, as provided in the dataset.
Variable – Description
PatientID: Unique identifier of each patient
Age: Age of the individual
Gender: Gender of the individual
BloodPressure: Systolic blood pressure in mmHg
BloodSugar: Fasting blood sugar levels in mg/dL
Cholesterol: Total cholesterol level in mg/dL
BodyMassIndex: BMI, a measure of body fat based on height and weight
SmokingStatus: Smoking status of the individual [Never / Former / Current]
ElectricConductivity: Water’s ability to conduct electricity, in μS/cm; high values can indicate contamination
pH: pH level of the water
DissolvedOxygen: Amount of oxygen dissolved in water in mg/L
Turbidity: Measure of water clarity in NTU
TotalDissolvedSolids: Measure of dissolved substances in water in mg/L
NitriteLevel: Nitrite concentration in water in mg/L
NitrateLevel: Nitrate concentration in water in mg/L
LeadConcentration: Lead concentration in water in mg/L
ArsenicConcentration: Arsenic concentration in water in mg/L
Humidity: Ambient humidity level in %
KidneyDisease: Presence or absence of kidney disease
0 – Absence of kidney disease
1 – Presence of kidney disease
Please note that this is simulated data, generated to resemble real-world data for the purpose of this assignment.
Consider the scenario described and the dataset provided (KidneyData.csv) to answer the following questions.
Question 1
Identify the target variable and clearly specify the research question.
Target variable:
Research Question:
Question 2
Understand the data and perform the necessary data pre-processing.
Clearly explain the steps taken.
(Hint: data cleaning, make sure to divide the data into training and test set, etc.)
[Write the steps taken here.]
Print the structure of the data before cleaning and pre-processing here.
Print the structure of the training data after cleaning and pre-processing here.
Question 3
Perform a thorough data exploration using the provided dataset.
You may use various visualization techniques (such as histograms, scatter plots, box plots, correlation matrices, etc.) to uncover significant patterns and insights.
Interpret your outputs and discuss key findings.
(Hint: You may use as many plots as necessary and make sure to interpret them.)
Question 4
Use logistic regression to answer the research question.
Clearly explain the process or all the steps involved.
(Hint: model building, model improvement, evaluation)
Question 5
Give your resultant model.
The COMP7006: Data Science Computer-Based Assignment aims to assess students’ ability to apply data science and statistical analysis techniques to real-world health data using R programming. The task involves analyzing a dataset related to kidney disease, exploring the data, performing pre-processing, building predictive models, and interpreting outcomes.
The key requirements of the assessment are as follows:
Understanding the Scenario and Dataset:
The dataset simulates real-world public health data related to kidney disease cases in suburban areas.
It includes variables such as patient demographics, medical factors (blood pressure, sugar, cholesterol, BMI), and environmental factors (water quality, chemical concentrations, humidity, etc.).
Questions to be Addressed:
Question 1: Identify the target variable and define the research question.
Question 2: Perform data understanding and pre-processing steps (cleaning, transformation, and data splitting).
Question 3: Conduct exploratory data analysis (EDA) with suitable visualizations and discuss insights.
Question 4: Build and evaluate a logistic regression model to answer the research question.
Question 5: Present the final model equation and summarize the findings.
Technical Requirements:
Include all R outputs, discussions, and interpretations clearly in the document.
Attach all R commands in the appendix.
Ensure the report has logical organization, clear reasoning, and a cohesive narrative.
Submit the final solution as a PDF via the Turnitin link on the vUWS site.
A total of 10 marks are allocated for structure, reasoning, and inclusion of complete code and outputs.
The academic mentor guided the student through each stage of the assignment systematically, ensuring a solid understanding of data science principles and the correct application of R programming techniques.
The mentor began by helping the student analyze the assignment brief and interpret the dataset context. The focus was on understanding the public health implications of kidney disease and how environmental and medical factors may influence its occurrence.
Key guidance: Identify the target variable (“KidneyDisease”) and define a clear research question, such as:
“Which medical and environmental factors significantly contribute to the presence of kidney disease among patients?”
This established the foundation for building an analytical model later in the assessment.
The mentor then emphasized the importance of data quality and preparation before modeling. Together, they reviewed the dataset structure, variable types, and potential data issues.
Checking data structure: The mentor explained how to examine the dataset’s dimensions, data types, and summary statistics.
Handling missing values: Guidance was given on identifying and addressing missing or inconsistent entries.
Encoding categorical variables: The student learned how to convert variables like Gender and SmokingStatus into appropriate formats for analysis.
Splitting the dataset: The data was divided into training and testing sets (e.g., 70% training, 30% testing) to ensure unbiased model evaluation.
Mentor’s focus: Explain each pre-processing step’s purpose and ensure the student could justify why each transformation was necessary for accurate model performance.
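The pre-processing steps above can be sketched in R roughly as follows. This is a minimal illustration, not the student's actual code: the data frame is simulated as a hypothetical stand-in for KidneyData.csv, only a few of the variables are included, and dropping incomplete rows is just one of several reasonable ways to handle missing values.

```r
# Hypothetical stand-in for KidneyData.csv so the sketch runs on its own.
# With the real file, replace this block with: kidney <- read.csv("KidneyData.csv")
set.seed(42)
n <- 200
kidney <- data.frame(
  PatientID     = seq_len(n),
  Age           = round(runif(n, 20, 80)),
  Gender        = sample(c("Male", "Female"), n, replace = TRUE),
  SmokingStatus = sample(c("Never", "Former", "Current"), n, replace = TRUE),
  BloodSugar    = round(rnorm(n, mean = 100, sd = 15), 1),
  KidneyDisease = rbinom(n, 1, 0.3)
)
kidney$BloodSugar[c(5, 17)] <- NA  # inject two missing values for illustration

# 1. Check structure, dimensions and summary statistics before cleaning
str(kidney)
dim(kidney)
summary(kidney)

# 2. Count missing values per column, then drop incomplete rows
#    (assumption: so few rows are affected that imputation is unnecessary)
colSums(is.na(kidney))
kidney <- na.omit(kidney)

# 3. Encode categorical variables as factors and drop the identifier,
#    which carries no predictive information
kidney$Gender        <- factor(kidney$Gender)
kidney$SmokingStatus <- factor(kidney$SmokingStatus,
                               levels = c("Never", "Former", "Current"))
kidney$KidneyDisease <- factor(kidney$KidneyDisease, levels = c(0, 1))
kidney$PatientID     <- NULL

# 4. Split into 70% training and 30% test data
train_idx <- sample(seq_len(nrow(kidney)), size = floor(0.7 * nrow(kidney)))
train <- kidney[train_idx, ]
test  <- kidney[-train_idx, ]

# Structure of the training data after cleaning and pre-processing
str(train)
```

Fixing the seed with set.seed() keeps the split reproducible, which matters when the same output has to appear in both the report body and the Appendix.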
Once the dataset was clean, the mentor guided the student through an in-depth exploratory analysis to identify trends, relationships, and outliers.
Distribution Analysis: Creating histograms and density plots for continuous variables such as Age, BloodPressure, BloodSugar, and Cholesterol to understand their spread and detect skewness.
Box Plots: Used to identify outliers and compare distributions between individuals with and without kidney disease.
Correlation Analysis: The mentor explained how correlation matrices help detect relationships between numerical variables (e.g., BloodSugar vs Cholesterol).
Categorical Analysis: Bar charts were used to analyze categorical factors like Gender and SmokingStatus in relation to the disease outcome.
Environmental Insights: The student explored how water quality variables such as pH, NitriteLevel, and LeadConcentration correlate with disease prevalence.
The mentor encouraged the student to interpret each visual carefully: not merely to describe the charts, but to explain what the data reveal about potential risk factors.
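The exploration techniques listed above could look something like the following in base R. The data frame here is simulated and purely hypothetical (it stands in for the cleaned training set), and the built-in relationship between BloodSugar and the outcome is an assumption made only so the plots have something to show.

```r
# Hypothetical simulated data standing in for the cleaned training set
set.seed(1)
n <- 300
train <- data.frame(
  Age               = round(runif(n, 20, 80)),
  BloodSugar        = rnorm(n, mean = 100, sd = 15),
  Cholesterol       = rnorm(n, mean = 200, sd = 30),
  LeadConcentration = runif(n, min = 0, max = 0.05),
  SmokingStatus     = factor(sample(c("Never", "Former", "Current"), n, replace = TRUE),
                             levels = c("Never", "Former", "Current"))
)
# Assumed data-generating process, for illustration only
train$KidneyDisease <- factor(rbinom(n, 1, plogis(-8 + 0.07 * train$BloodSugar)))

# Distribution analysis: histogram and density plot of continuous variables
hist(train$BloodSugar, main = "Fasting blood sugar", xlab = "BloodSugar (mg/dL)")
plot(density(train$Cholesterol), main = "Total cholesterol", xlab = "Cholesterol (mg/dL)")

# Box plots: compare distributions by disease status and spot outliers
boxplot(BloodSugar ~ KidneyDisease, data = train,
        xlab = "KidneyDisease (0 = absent, 1 = present)", ylab = "BloodSugar (mg/dL)")
boxplot(LeadConcentration ~ KidneyDisease, data = train,
        xlab = "KidneyDisease (0 = absent, 1 = present)", ylab = "LeadConcentration (mg/L)")

# Correlation analysis: correlation matrix of the numeric variables
num_vars <- train[sapply(train, is.numeric)]
cor_mat  <- round(cor(num_vars), 2)
print(cor_mat)

# Categorical analysis: proportion with and without disease per smoking status
smoke_tab  <- table(SmokingStatus = train$SmokingStatus,
                    KidneyDisease = train$KidneyDisease)
smoke_prop <- prop.table(smoke_tab, margin = 1)  # each row sums to 1
print(round(smoke_prop, 2))
barplot(t(smoke_prop), beside = TRUE,
        legend.text = c("No disease", "Disease"),
        xlab = "SmokingStatus", ylab = "Proportion of patients")
```

Plotting row proportions rather than raw counts makes the smoking groups comparable even when they differ in size.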
After exploring the data, the next step involved predictive modeling using logistic regression, as the target variable (KidneyDisease) is binary (0 = No disease, 1 = Disease present).
Model Specification: The mentor explained how logistic regression estimates the probability of an event (disease presence) based on predictor variables.
Variable Selection: The student learned to start from candidate predictors such as Age, BloodPressure, BloodSugar, Cholesterol, BMI, and the environmental contaminants, and to retain those that proved statistically significant.
Model Fitting: The logistic model was fitted using the training dataset.
Model Interpretation: The mentor guided the student on interpreting coefficients, that is, understanding which variables increase or decrease the odds of kidney disease.
Model Evaluation: The model’s accuracy was validated using the test data. The mentor discussed using confusion matrices, accuracy scores, sensitivity, specificity, and ROC curves to evaluate performance.
The student understood that a well-performing model is not just accurate but also interpretable, enabling insights that are actionable in public health decision-making.
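One possible shape for the fit, improve, and evaluate workflow described above is sketched below in base R. It is an illustration under stated assumptions rather than the model from the actual assignment: the data are simulated, the coefficients in the data-generating step are invented, backward selection by AIC is just one way to improve a model, and 0.5 is an arbitrary classification threshold. The AUC is computed from ranks (the Mann-Whitney formulation) to avoid depending on an extra package.

```r
# Hypothetical simulated data; Humidity is deliberately unrelated to the outcome
set.seed(123)
n <- 500
dat <- data.frame(
  Age               = round(runif(n, 20, 80)),
  BloodSugar        = rnorm(n, mean = 100, sd = 15),
  LeadConcentration = runif(n, min = 0, max = 0.05),
  Humidity          = runif(n, min = 30, max = 90)
)
# Assumed data-generating process, for illustration only
lin <- -8.5 + 0.03 * dat$Age + 0.05 * dat$BloodSugar + 40 * dat$LeadConcentration
dat$KidneyDisease <- rbinom(n, 1, plogis(lin))

# 70/30 train/test split
idx   <- sample(seq_len(n), size = floor(0.7 * n))
train <- dat[idx, ]
test  <- dat[-idx, ]

# Model building: full logistic regression on the training data
full_fit <- glm(KidneyDisease ~ ., data = train, family = binomial)
summary(full_fit)

# Model improvement: backward stepwise selection by AIC
fit <- step(full_fit, direction = "backward", trace = 0)

# Interpretation: exponentiated coefficients are odds ratios
# (> 1 raises the odds of disease, < 1 lowers them)
print(exp(coef(fit)))

# Evaluation on the held-out test data
prob   <- predict(fit, newdata = test, type = "response")
pred   <- factor(ifelse(prob >= 0.5, 1, 0), levels = c(0, 1))
actual <- factor(test$KidneyDisease, levels = c(0, 1))
cm     <- table(Predicted = pred, Actual = actual)
print(cm)

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # true positive rate
specificity <- cm["0", "0"] / sum(cm[, "0"])  # true negative rate

# Area under the ROC curve via the rank-sum formula
r   <- rank(prob)
n1  <- sum(actual == 1)
n0  <- sum(actual == 0)
auc <- (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

print(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, auc = auc))
```

Reporting sensitivity and specificity alongside accuracy matters here because disease cases are usually the minority class, so a model can look accurate while missing most of them.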
Once the final model was optimized, the mentor helped the student document the model equation and interpret the results clearly.
The final logistic regression model was expressed in a mathematical form linking independent variables (medical and environmental factors) to the probability of kidney disease.
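For reference, the general form of such a model is shown below. The symbols are generic notation and not taken from the assignment: $p$ is the probability that KidneyDisease equals 1, $x_1, \dots, x_k$ are the retained predictors, and $\beta_0, \dots, \beta_k$ are the estimated coefficients.

```latex
\[
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,
\qquad
p = P(\text{KidneyDisease} = 1 \mid x_1, \dots, x_k)
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
\]
```

Each $e^{\beta_j}$ is the odds ratio for a one-unit increase in $x_j$ with the other predictors held fixed.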
The student learned to:
Present the model equation concisely.
Discuss which predictors are statistically significant (e.g., BloodSugar and LeadConcentration showing a strong association with disease presence).
Summarize key insights in simple, non-technical language suitable for public health interpretation.
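As a rough sketch of how the points above might be supported in R, the fitted coefficients can be turned into a written equation and a small significance table. The data and the two-predictor model below are hypothetical and simulated only so the example runs by itself; the printed numbers are not results from the assignment dataset.

```r
# Hypothetical simulated training data and model, for illustration only
set.seed(7)
n <- 400
train <- data.frame(
  BloodSugar        = rnorm(n, mean = 100, sd = 15),
  LeadConcentration = runif(n, min = 0, max = 0.05)
)
train$KidneyDisease <- rbinom(
  n, 1, plogis(-7 + 0.05 * train$BloodSugar + 40 * train$LeadConcentration)
)
fit <- glm(KidneyDisease ~ BloodSugar + LeadConcentration,
           data = train, family = binomial)

# Build the model equation as text from the rounded coefficients
b <- round(coef(fit), 3)
rhs_terms <- paste0(ifelse(b[-1] >= 0, " + ", " - "), abs(b[-1]),
                    " * ", names(b)[-1], collapse = "")
model_eq <- paste0("logit(p) = ", b[1], rhs_terms)
cat(model_eq, "\n")

# Coefficient table with odds ratios and p-values for the discussion
coefs <- summary(fit)$coefficients
signif_tab <- data.frame(
  Estimate  = coefs[, "Estimate"],
  OddsRatio = exp(coefs[, "Estimate"]),
  PValue    = coefs[, "Pr(>|z|)"]
)
print(round(signif_tab, 4))
```

Generating the equation from coef() rather than typing it by hand keeps the reported equation consistent with the R output in the Appendix.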
In the final stage, the mentor emphasized the importance of clarity, structure, and professionalism in report writing.
Ensure that each question in the assignment is answered under a distinct heading.
Present R outputs and visualizations with descriptive captions.
Add a logical discussion for every result rather than leaving raw outputs unexplained.
Include all R code in the appendix section for transparency and reproducibility.
Maintain a cohesive flow from data understanding → exploration → modeling → conclusion.
The mentor also reviewed the Turnitin submission process, reminding the student to ensure originality and avoid plagiarism by properly citing any external references or datasets used.
By the end of the assignment, the student achieved several critical learning objectives of the COMP7006 unit:
The student successfully executed the complete data science process, from data preparation to model building and evaluation, using R.
Through mentor-guided exploration, the student developed the ability to critically analyze data, detect trends, and identify variables contributing to kidney disease risk.
The assessment strengthened the student’s R programming skills, particularly in data manipulation, visualization, and logistic regression modeling.
The student learned how to interpret statistical results and communicate findings effectively in a professional, academic format suitable for public health analysis.
By documenting R code, maintaining logical reasoning, and ensuring transparency, the student met academic integrity standards and demonstrated sound research ethics.
Through the academic mentor’s structured and stepwise approach, the student effectively met all assessment requirements of the COMP7006 Data Science Computer-Based Assignment. The process not only reinforced technical competencies in data analysis and modeling but also deepened the student’s understanding of how data-driven insights can inform public health decision-making.
The final submission reflected clarity, analytical depth, and strong technical execution, fulfilling the objectives of demonstrating real-world data science application in a healthcare context.
© Copyright 2025 My Uni Papers – Student Hustle Made Hassle Free. All rights reserved.