COMP7006: Data Science Computer Based Assignment

Assignment Overview

Instructions

Please note that you are expected to answer the questions clearly in this document. Use the included template where relevant. Present the R outputs, comments, and discussion clearly and logically. Attach all the R commands in the Appendix. Write the resulting model equation under the relevant questions.

Once completed, submit the answer script as a PDF via the Turnitin link on the vUWS site.

Please note that 10 marks are allocated for organization, reasoning, logical flow, and the inclusion of all correct R code and outputs in the Appendix for both Part A and Part B.

Scenario

Recent public health data indicate a troubling increase in kidney disease rates within specific suburban areas, attracting significant attention from public health practitioners.

Determined to uncover the root causes and identify actionable risk factors to address this issue, the public health team has embarked on a comprehensive study.

They have collected patient records and relevant information on medical factors and water quality, as provided in the dataset.

Data Description

Variable – Description

  • PatientID: Unique identifier of each patient

  • Age: Age of the individual

  • Gender: Gender of the individual

  • BloodPressure: Systolic blood pressure in mmHg

  • BloodSugar: Fasting blood sugar levels in mg/dL

  • Cholesterol: Total cholesterol level in mg/dL

  • BodyMassIndex: BMI, a measure of body fat based on height and weight

  • SmokingStatus: Smoking status of the individual [Never / Former / Current]

  • ElectricConductivity: Measurement of the water’s ability to conduct electricity, which can indicate contamination in μS/cm

  • pH: pH level of the water

  • DissolvedOxygen: Amount of oxygen dissolved in water in mg/L

  • Turbidity: Measure of water clarity in NTU

  • TotalDissolvedSolids: Measure of dissolved substances in water in mg/L

  • NitriteLevel: Nitrite concentration in water in mg/L

  • NitrateLevel: Nitrate concentration in water in mg/L

  • LeadConcentration: Lead concentration in water in mg/L

  • ArsenicConcentration: Arsenic concentration in water in mg/L

  • Humidity: Ambient humidity level in %

  • KidneyDisease: Presence or absence of kidney disease

    • 0 – Absence of kidney disease

    • 1 – Presence of kidney disease

Please note that this is simulated data, generated to resemble real-world data for the purpose of this assignment.

Assignment Questions

Consider the scenario described and the dataset provided (KidneyData.csv) to answer the following questions.

Question 1 

Identify the target variable and clearly specify the research question.

  • Target variable:

  • Research Question:

Question 2 

Understand the data and perform the necessary data pre-processing.

Clearly explain the steps taken.
(Hint: data cleaning, make sure to divide the data into training and test set, etc.)

  • [Write the steps taken here.]

Print the structure of the data before cleaning and pre-processing here.

Print the structure of the training data after cleaning and pre-processing here.

Question 3 

Perform a thorough data exploration using the provided dataset.

You may use various visualization techniques (such as histograms, scatter plots, box plots, correlation matrices, etc.) to uncover significant patterns and insights.

Interpret your outputs and discuss key findings.
(Hint: You may use as many plots as necessary and make sure to interpret them.)

Question 4

Use logistic regression to answer the research question.

Clearly explain the process or all the steps involved.
(Hint: model building, model improvement, evaluation)

Question 5 

Give your resultant model.

Assessment Requirements – Overview

The COMP7006: Data Science Computer-Based Assignment aims to assess students’ ability to apply data science and statistical analysis techniques to real-world health data using R programming. The task involves analyzing a dataset related to kidney disease, exploring the data, performing pre-processing, building predictive models, and interpreting outcomes.

The key requirements of the assessment are as follows:

  1. Understanding the Scenario and Dataset:

    • The dataset simulates real-world public health data related to kidney disease cases in suburban areas.

    • It includes variables such as patient demographics, medical factors (blood pressure, sugar, cholesterol, BMI), and environmental factors (water quality, chemical concentrations, humidity, etc.).

  2. Questions to be Addressed:

    • Question 1: Identify the target variable and define the research question.

    • Question 2: Perform data understanding and pre-processing steps (cleaning, transformation, and data splitting).

    • Question 3: Conduct exploratory data analysis (EDA) with suitable visualizations and discuss insights.

    • Question 4: Build and evaluate a logistic regression model to answer the research question.

    • Question 5: Present the final model equation and summarize the findings.

  3. Technical Requirements:

    • Include all R outputs, discussions, and interpretations clearly in the document.

    • Attach all R commands in the appendix.

    • Ensure the report has logical organization, clear reasoning, and a cohesive narrative.

    • Submit the final solution as a PDF via the Turnitin link on the vUWS site.

    • A total of 10 marks are allocated for structure, reasoning, and inclusion of complete code and outputs.

Academic Mentor’s Approach and Guidance

The academic mentor guided the student through each stage of the assignment systematically, ensuring a solid understanding of data science principles and the correct application of R programming techniques.

Step 1: Understanding the Scenario and Defining Objectives

The mentor began by helping the student analyze the assignment brief and interpret the dataset context. The focus was on understanding the public health implications of kidney disease and how environmental and medical factors may influence its occurrence.

  • Key guidance: Identify the target variable (“KidneyDisease”) and define a clear research question, such as:
    “Which medical and environmental factors significantly contribute to the presence of kidney disease among patients?”

  • This established the foundation for building an analytical model later in the assessment.

Step 2: Data Understanding and Pre-Processing

The mentor then emphasized the importance of data quality and preparation before modeling. Together, they reviewed the dataset structure, variable types, and potential data issues.

Tasks Undertaken:

  • Checking data structure: The mentor explained how to examine the dataset’s dimensions, data types, and summary statistics.

  • Handling missing values: Guidance was given on identifying and addressing missing or inconsistent entries.

  • Encoding categorical variables: The student learned how to convert variables like Gender and SmokingStatus into appropriate formats for analysis.

  • Splitting the dataset: The data was divided into training and testing sets (e.g., 70% training, 30% testing) to ensure unbiased model evaluation.

Mentor’s focus: Explain each pre-processing step’s purpose and ensure the student could justify why each transformation was necessary for accurate model performance.
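
The pre-processing tasks listed in this step can be sketched in base R as follows. This is a minimal illustration, not the student's actual script: a small simulated data frame stands in for KidneyData.csv (which would normally be loaded with read.csv), only a few of the documented columns are included, and dropping incomplete rows is an assumed choice where imputation would be an equally valid alternative.

```r
# Sketch only: simulated values stand in for KidneyData.csv,
# which would normally be loaded with read.csv("KidneyData.csv").
set.seed(42)
n <- 200
kidney <- data.frame(
  PatientID     = seq_len(n),
  Age           = round(runif(n, 20, 80)),
  Gender        = sample(c("Male", "Female"), n, replace = TRUE),
  BloodSugar    = rnorm(n, mean = 100, sd = 20),
  SmokingStatus = sample(c("Never", "Former", "Current"), n, replace = TRUE),
  KidneyDisease = rbinom(n, 1, 0.3),
  stringsAsFactors = FALSE
)
kidney$BloodSugar[c(5, 17)] <- NA  # inject two missing values for illustration

# 1. Inspect structure, types, and summary statistics
str(kidney)
summary(kidney)

# 2. Count missing values per column, then drop incomplete rows
#    (assumed choice; imputation is an alternative)
missing_per_column <- colSums(is.na(kidney))
kidney_clean <- na.omit(kidney)

# 3. Encode categorical variables and the binary target as factors
kidney_clean$Gender        <- factor(kidney_clean$Gender)
kidney_clean$SmokingStatus <- factor(kidney_clean$SmokingStatus,
                                     levels = c("Never", "Former", "Current"))
kidney_clean$KidneyDisease <- factor(kidney_clean$KidneyDisease, levels = c(0, 1))

# 4. Drop the identifier, which carries no predictive information
kidney_clean$PatientID <- NULL

# 5. 70/30 train/test split
train_idx <- sample(seq_len(nrow(kidney_clean)),
                    size = floor(0.7 * nrow(kidney_clean)))
train <- kidney_clean[train_idx, ]
test  <- kidney_clean[-train_idx, ]

str(train)  # structure of the training data after pre-processing
```

Setting the seed before the split keeps the partition reproducible, which matters when the same R commands are attached in an appendix.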

Step 3: Exploratory Data Analysis (EDA)

Once the dataset was clean, the mentor guided the student through an in-depth exploratory analysis to identify trends, relationships, and outliers.

Key EDA Components:

  • Distribution Analysis: Creating histograms and density plots for continuous variables such as Age, BloodPressure, BloodSugar, and Cholesterol to understand their spread and detect skewness.

  • Box Plots: Used to identify outliers and compare distributions between individuals with and without kidney disease.

  • Correlation Analysis: The mentor explained how correlation matrices help detect relationships between numerical variables (e.g., BloodSugar vs Cholesterol).

  • Categorical Analysis: Bar charts were used to analyze categorical factors like Gender and SmokingStatus in relation to the disease outcome.

  • Environmental Insights: The student explored how water quality variables such as pH, NitriteLevel, and LeadConcentration correlate with disease prevalence.

Outcome:

The mentor encouraged the student to interpret each visual carefully, not merely describing the charts but explaining what the data reveal about potential risk factors.
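
The EDA components described in this step could look roughly like the base R sketch below. The data frame eda is simulated and hypothetical, so any pattern it shows says nothing about the real dataset; the column names follow the data description, and the plotting choices are assumptions rather than the student's actual figures.

```r
# Sketch only: simulated values stand in for KidneyData.csv.
set.seed(1)
n <- 300
eda <- data.frame(
  BloodSugar        = rnorm(n, 100, 20),
  Cholesterol       = rnorm(n, 200, 30),
  LeadConcentration = rexp(n, rate = 100),
  SmokingStatus     = factor(sample(c("Never", "Former", "Current"), n, replace = TRUE),
                             levels = c("Never", "Former", "Current")),
  KidneyDisease     = factor(rbinom(n, 1, 0.3), levels = c(0, 1))
)

# Distribution of a continuous variable
hist(eda$BloodSugar, main = "Fasting blood sugar", xlab = "mg/dL")

# Box plot comparing a predictor across the two outcome groups
boxplot(BloodSugar ~ KidneyDisease, data = eda,
        xlab = "KidneyDisease", ylab = "BloodSugar (mg/dL)")

# Correlation matrix for the numeric columns only
numeric_cols <- sapply(eda, is.numeric)
cor_matrix <- cor(eda[, numeric_cols], use = "complete.obs")
print(round(cor_matrix, 2))

# Categorical predictor against the outcome: counts and row proportions
smoking_table <- table(eda$SmokingStatus, eda$KidneyDisease)
smoking_props <- prop.table(smoking_table, margin = 1)
barplot(t(smoking_props), beside = TRUE, legend.text = c("0", "1"),
        xlab = "SmokingStatus", ylab = "Proportion")
```

Row proportions are used for the bar chart so that smoking groups of different sizes can be compared on the same scale.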

Step 4: Building the Logistic Regression Model

After exploring the data, the next step involved predictive modeling using logistic regression, as the target variable (KidneyDisease) is binary (0 = No disease, 1 = Disease present).

Mentor’s Guidance:

  1. Model Specification: The mentor explained how logistic regression estimates the probability of an event (disease presence) based on predictor variables.

  2. Variable Selection: The student learned to consider candidate predictors such as Age, BloodPressure, BloodSugar, Cholesterol, BMI, and environmental contaminants, and to retain those that proved statistically significant.

  3. Model Fitting: The logistic model was fitted using the training dataset.

  4. Model Interpretation: The mentor guided the student on interpreting coefficients — understanding which variables increase or decrease the odds of kidney disease.

  5. Model Evaluation: The model’s accuracy was validated using the test data. The mentor discussed using confusion matrices, accuracy scores, sensitivity, specificity, and ROC curves to evaluate performance.
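
The five steps above can be sketched in base R as follows. Everything here is illustrative: the data are simulated with an assumed effect of BloodSugar and LeadConcentration, so the output is not a finding from KidneyData.csv. AIC-based backward elimination with step() and the 0.5 classification threshold are assumed choices, and the AUC is computed with the rank formula to avoid depending on an extra package.

```r
# Sketch only: simulated data in which BloodSugar and LeadConcentration are
# *assumed* to raise disease risk; this is not a result from KidneyData.csv.
set.seed(123)
n <- 1000
sim <- data.frame(
  Age               = runif(n, 20, 80),
  BloodSugar        = rnorm(n, 100, 20),
  LeadConcentration = rexp(n, rate = 100)
)
linear_predictor <- -6 + 0.05 * sim$BloodSugar + 40 * sim$LeadConcentration
sim$KidneyDisease <- rbinom(n, 1, plogis(linear_predictor))

# 70/30 split
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Model fitting on the training set
full_model <- glm(KidneyDisease ~ Age + BloodSugar + LeadConcentration,
                  data = train, family = binomial)
summary(full_model)

# One possible improvement step: AIC-based backward elimination
final_model <- step(full_model, direction = "backward", trace = 0)

# Interpretation: exp(coefficient) is the multiplicative change in the odds
# of disease per one-unit increase in the predictor
odds_ratios <- exp(coef(final_model))

# Evaluation on the held-out test set at a 0.5 threshold
test_prob <- predict(final_model, newdata = test, type = "response")
test_pred <- factor(as.integer(test_prob > 0.5), levels = c(0, 1))
actual    <- factor(test$KidneyDisease, levels = c(0, 1))
conf_mat  <- table(Predicted = test_pred, Actual = actual)

accuracy    <- sum(diag(conf_mat)) / sum(conf_mat)
sensitivity <- conf_mat["1", "1"] / sum(conf_mat[, "1"])  # true positive rate
specificity <- conf_mat["0", "0"] / sum(conf_mat[, "0"])  # true negative rate

# Area under the ROC curve via the rank (Mann-Whitney) formulation
n_pos <- sum(test$KidneyDisease == 1)
n_neg <- sum(test$KidneyDisease == 0)
auc <- (sum(rank(test_prob)[test$KidneyDisease == 1]) - n_pos * (n_pos + 1) / 2) /
  (n_pos * n_neg)

print(conf_mat)
print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, auc = auc), 3))
```

Reporting sensitivity and specificity alongside accuracy matters when the two outcome classes are unbalanced, because accuracy alone can look good for a model that rarely predicts disease.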

Key Learning:

The student understood that a well-performing model is not just accurate but also interpretable, enabling insights that are actionable in public health decision-making.

Step 5: Writing the Resultant Model and Interpretation

Once the final model was optimized, the mentor helped the student document the model equation and interpret the results clearly.

The final logistic regression model was expressed in a mathematical form linking independent variables (medical and environmental factors) to the probability of kidney disease.
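
For reference, the general form of such a model is shown below. The symbols are generic: p denotes the probability that KidneyDisease = 1, x_1, ..., x_k stand for whichever predictors are retained in the final model, and the coefficients are whatever the fitted model estimates. The source does not report the actual fitted values, so none are shown.

```latex
\[
\log\!\left(\frac{p}{1-p}\right)
  = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k,
\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
\]
```

The left-hand form (log-odds) is linear in the predictors, which is why each coefficient is read as a change in log-odds, and its exponential as an odds ratio.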

The student learned to:

  • Present the model equation concisely.

  • Discuss which predictors are statistically significant (e.g., BloodSugar and LeadConcentration showing a strong association with disease presence).

  • Summarize key insights in simple, non-technical language suitable for public health interpretation.
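
One way to turn a fitted model into a written equation is sketched below. The data and the resulting coefficients are simulated and purely hypothetical; the object names (d, fit, equation) are illustrative and do not come from the original solution.

```r
# Sketch only: simulated data; the printed coefficients are not results
# from KidneyData.csv.
set.seed(7)
n <- 500
d <- data.frame(BloodSugar = rnorm(n, 100, 20), Age = runif(n, 20, 80))
d$KidneyDisease <- rbinom(n, 1, plogis(-5 + 0.04 * d$BloodSugar + 0.01 * d$Age))
fit <- glm(KidneyDisease ~ BloodSugar + Age, data = d, family = binomial)

# Build "logit(p) = b0 + b1 * X1 + ..." from the rounded coefficients
b <- round(coef(fit), 3)
term_strings <- paste0(ifelse(b[-1] < 0, " - ", " + "), abs(b[-1]),
                       " * ", names(b)[-1])
equation <- paste0("logit(p) = ", b[1], paste(term_strings, collapse = ""))
cat(equation, "\n")

# Coefficient table with standard errors and p-values, for discussing significance
coef_table <- summary(fit)$coefficients
print(round(coef_table, 4))
```

Generating the equation from coef() rather than typing it by hand reduces the chance of a transcription error between the R output and the report.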

Step 6: Documentation, Discussion, and Submission

In the final stage, the mentor emphasized the importance of clarity, structure, and professionalism in report writing.

Mentor’s Feedback:

  • Ensure that each question in the assignment is answered under a distinct heading.

  • Present R outputs and visualizations with descriptive captions.

  • Add a logical discussion for every result rather than leaving raw outputs unexplained.

  • Include all R code in the appendix section for transparency and reproducibility.

  • Maintain a cohesive flow from data understanding → exploration → modeling → conclusion.

The mentor also reviewed the Turnitin submission process, reminding the student to ensure originality and avoid plagiarism by properly citing any external references or datasets used.

Final Outcome and Learning Achievements

By the end of the assignment, the student achieved several critical learning objectives of the COMP7006 unit:

1. Applied Data Science Workflow

The student successfully executed the complete data science process — from data preparation to model building and evaluation using R.

2. Analytical and Critical Thinking

Through mentor-guided exploration, the student developed the ability to critically analyze data, detect trends, and identify variables contributing to kidney disease risk.

3. Technical Proficiency in R

The assessment strengthened the student’s R programming skills, particularly in data manipulation, visualization, and logistic regression modeling.

4. Interpretation and Communication Skills

The student learned how to interpret statistical results and communicate findings effectively in a professional, academic format suitable for public health analysis.

5. Academic and Ethical Research Practice

By documenting the R code, maintaining logical reasoning, and ensuring transparency, the student met academic integrity standards and demonstrated sound research ethics.

Conclusion

Through the academic mentor’s structured and stepwise approach, the student effectively met all assessment requirements of the COMP7006 Data Science Computer-Based Assignment. The process not only reinforced technical competencies in data analysis and modeling but also deepened the student’s understanding of how data-driven insights can inform public health decision-making.

The final submission reflected clarity, analytical depth, and strong technical execution, fulfilling the objectives of demonstrating real-world data science application in a healthcare context.
