Lorissa Hughes

Project 7: Student Habits vs Academic Performance Analysis

Overview

This project analyzes the relationship between student lifestyle habits and academic performance using a comprehensive dataset from Kaggle. The dataset contains information about 1,000 students and includes 16 variables covering various aspects of student life, including study habits, social media usage, sleep patterns, diet quality, exercise frequency, and academic outcomes.

The analysis focuses on how different study patterns correlate with exam performance. The project demonstrates fundamental data analysis skills, including data cleaning, statistical calculations, and comparative analysis using Python and pandas.

Code Cell 1

# Install dependencies as needed:
# pip install kagglehub[pandas-datasets]
import kagglehub
from kagglehub import KaggleDatasetAdapter

# Set the path to the file you'd like to load
# Update file_path to point to the specific file within the dataset
file_path = "student_habits_performance.csv"

# Load the latest version
df = kagglehub.load_dataset(
  KaggleDatasetAdapter.PANDAS,
  "jayaantanaath/student-habits-vs-academic-performance",
  file_path,
  # Provide any additional arguments like
  # sql_query or pandas_kwargs. See the
  # documentation for more information:
  # https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpandas
)

print("First 5 records:", df.head())

/tmp/ipykernel_6544/3453289196.py:11: DeprecationWarning: Use dataset_load() instead of load_dataset(). load_dataset() will be removed in a future version.
  df = kagglehub.load_dataset(

Using Colab cache for faster access to the 'student-habits-vs-academic-performance' dataset.
First 5 records:   student_id  age  gender  study_hours_per_day  social_media_hours  \
0      S1000   23  Female                  0.0                 1.2   
1      S1001   20  Female                  6.9                 2.8   
2      S1002   21    Male                  1.4                 3.1   
3      S1003   23  Female                  1.0                 3.9   
4      S1004   19  Female                  5.0                 4.4   

   netflix_hours part_time_job  attendance_percentage  sleep_hours  \
0            1.1            No                   85.0          8.0   
1            2.3            No                   97.3          4.6   
2            1.3            No                   94.8          8.0   
3            1.0            No                   71.0          9.2   
4            0.5            No                   90.9          4.9   

  diet_quality  exercise_frequency parental_education_level internet_quality  \
0         Fair                   6                   Master          Average   
1         Good                   6              High School          Average   
2         Poor                   1              High School             Poor   
3         Poor                   4                   Master             Good   
4         Fair                   3                   Master             Good   

   mental_health_rating extracurricular_participation  exam_score  
0                     8                           Yes        56.2  
1                     8                            No       100.0  
2                     1                            No        34.3  
3                     1                           Yes        26.8  
4                     1                            No        66.4

Code Cell 2

one_col = df['age']

Code Cell 3

type(one_col)

pandas.core.series.Series

Code Cell 4

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   student_id                     1000 non-null   object 
 1   age                            1000 non-null   int64  
 2   gender                         1000 non-null   object 
 3   study_hours_per_day            1000 non-null   float64
 4   social_media_hours             1000 non-null   float64
 5   netflix_hours                  1000 non-null   float64
 6   part_time_job                  1000 non-null   object 
 7   attendance_percentage          1000 non-null   float64
 8   sleep_hours                    1000 non-null   float64
 9   diet_quality                   1000 non-null   object 
 10  exercise_frequency             1000 non-null   int64  
 11  parental_education_level       909 non-null    object 
 12  internet_quality               1000 non-null   object 
 13  mental_health_rating           1000 non-null   int64  
 14  extracurricular_participation  1000 non-null   object 
 15  exam_score                     1000 non-null   float64
dtypes: float64(6), int64(3), object(7)
memory usage: 125.1+ KB

Code Cell 5

df.shape

(1000, 16)

Code Cell 6

df.columns

Index(['student_id', 'age', 'gender', 'study_hours_per_day',
       'social_media_hours', 'netflix_hours', 'part_time_job',
       'attendance_percentage', 'sleep_hours', 'diet_quality',
       'exercise_frequency', 'parental_education_level', 'internet_quality',
       'mental_health_rating', 'extracurricular_participation', 'exam_score'],
      dtype='object')

Clean the Data

Code Cell 7

df.isnull().sum()

	0
student_id	0
age	0
gender	0
study_hours_per_day	0
social_media_hours	0
netflix_hours	0
part_time_job	0
attendance_percentage	0
sleep_hours	0
diet_quality	0
exercise_frequency	0
parental_education_level	91
internet_quality	0
mental_health_rating	0
extracurricular_participation	0
exam_score	0

dtype: int64

This checks how many missing (NaN) values are in each column of the DataFrame. Based on the output:

All columns have 0 missing values except for parental_education_level, which has 91 missing values.

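The dataset is mostly clean, but the missing values in the parental_education_level column need to be handled. There are two common options:

Fill the missing entries with a default or most common value

Drop the rows with missing values

This analysis drops the rows, which leaves 909 of the original 1,000 students.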
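
The two options above can be sketched on a small made-up table. The toy DataFrame and its values are hypothetical and only stand in for the real dataset; the column name matches the one used in this notebook.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Kaggle dataset
toy = pd.DataFrame({
    "student_id": ["S1", "S2", "S3", "S4"],
    "parental_education_level": ["Master", None, "Bachelor", "Master"],
})

# Option 1: fill missing entries with the most common value (the mode)
most_common = toy["parental_education_level"].mode()[0]
filled = toy.fillna({"parental_education_level": most_common})

# Option 2: drop the rows where the column is missing
dropped = toy.dropna(subset=["parental_education_level"])

print("Rows after filling:", len(filled))
print("Rows after dropping:", len(dropped))
```

Filling keeps every row but assigns a guessed category to the missing entries, while dropping keeps only observed values at the cost of a smaller sample.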

Code Cell 8

df.dropna(subset=['parental_education_level'], inplace=True)

Code Cell 9

# checking the data again
df.isnull().sum()

	0
student_id	0
age	0
gender	0
study_hours_per_day	0
social_media_hours	0
netflix_hours	0
part_time_job	0
attendance_percentage	0
sleep_hours	0
diet_quality	0
exercise_frequency	0
parental_education_level	0
internet_quality	0
mental_health_rating	0
extracurricular_participation	0
exam_score	0

dtype: int64

Questions to Answer

Please answer the following questions.

Find the average study hours per day for all students. Please create a code cell below this to answer the question. (0.5 point)

Code Cell 10

# Calculate the average study hours per day for all students.
average_study_hours = df['study_hours_per_day'].mean()
print(f"Average study hours per day for all students: {average_study_hours:.2f}")

Average study hours per day for all students: 3.54

Identify the student who studies the MOST hours per day. Please create a code cell below to answer the question. (0.5 point)

Code Cell 11

# Find the maximum study hours per day among all students
max_study_hours = df['study_hours_per_day'].max()

# Filter the DataFrame to get the student(s) who study the most hours
# And select only the 'student_id' and 'study_hours_per_day' columns
student_most_hours = df[df['study_hours_per_day'] == max_study_hours][['student_id', 'study_hours_per_day']]

# Print a descriptive header and display the resulting DataFrame
print("Student(s) who studies the MOST hours per day:")
display(student_most_hours)

Student(s) who studies the MOST hours per day:

	student_id	study_hours_per_day
455	S1455	8.3
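
As an alternative sketch, pandas can return the top row(s) directly. The toy data below is hypothetical and deliberately contains a tie, to show that `nlargest` with `keep="all"` returns every tied student, whereas `idxmax` returns only the first one.

```python
import pandas as pd

# Hypothetical toy data with a tie for the maximum
toy = pd.DataFrame({
    "student_id": ["S1", "S2", "S3"],
    "study_hours_per_day": [2.5, 8.3, 8.3],
})

# All rows sharing the largest value (ties included)
top = toy.nlargest(1, "study_hours_per_day", keep="all")

# idxmax gives the index label of the first maximum only
first_only = toy.loc[toy["study_hours_per_day"].idxmax(), "student_id"]

print(top)
print("First maximum only:", first_only)
```

The filter used in Code Cell 11 also handles ties, so both approaches should agree on this dataset.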

Count how many students study more than 6 hours per day. Please create a code cell below this to answer the question. (0.5 point)

Code Cell 12

# Count the number of students who study more than 6 hours per day
count_over_6 = (df['study_hours_per_day'] > 6).sum()
print(f"Number of students who study more than 6 hours per day : {count_over_6}")

Number of students who study more than 6 hours per day : 40

What is the percentage of students who study more than 6 hours per day? Please create a code cell below this to answer the question. (0.5 point)

Code Cell 13

# Get the total number of students in the DataFrame
total_students = df.shape[0]
# Calculate the percentage of students who study more than 6 hours per day
# 'count_over_6' is defined in a previous cell
percentage_more_than_6_hours = (count_over_6 / total_students) * 100
# Print the calculated percentage, formatted to two decimal places
print(f"Percentage of students who study more than 6 hours per day: {percentage_more_than_6_hours:.2f}%")

Percentage of students who study more than 6 hours per day: 4.40%
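
The same percentage can also be computed in one step, because the mean of a True/False Series is the fraction of True values. The sketch below uses hypothetical numbers to show that both routes give the same result.

```python
import pandas as pd

# Hypothetical study hours for four students
hours = pd.Series([1.0, 6.5, 7.0, 3.0], name="study_hours_per_day")

mask = hours > 6  # True where a student studies more than 6 hours

# Route 1: count the True values and divide by the number of students
pct_via_count = mask.sum() / len(hours) * 100

# Route 2: the mean of a boolean Series is the share of True values
pct_via_mean = mask.mean() * 100

print(f"{pct_via_count:.2f}% vs {pct_via_mean:.2f}%")
```

Either way, the denominator is the number of rows left after cleaning (909 here), not the original 1,000.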

Calculate what percentage of students study less than 2 hours per day. Please create a code cell below this to answer the question. (0.5 point)

Code Cell 14

# Count the number of students who study less than 2 hours per day
count_less_than_2 = (df['study_hours_per_day'] < 2).sum()
# Calculate the percentage of students who study less than 2 hours per day
percentage_less_than_2_hours = (count_less_than_2 / total_students) * 100
# Print the calculated percentage, formatted to two decimal places
print(f"Percentage of students who study less than 2 hours per day: {percentage_less_than_2_hours:.2f}%")

Percentage of students who study less than 2 hours per day: 13.53%

Do students who study more than 5 hours per day have higher exam scores on average? Please create a code cell below to answer this question. (0.5 point)

Code Cell 15

# Calculate the average exam score for students studying more than 5 hours per day
average_exam_score_more_than_5 = df[df['study_hours_per_day'] > 5]['exam_score'].mean()
# Calculate the average exam score for students studying 5 hours per day or less
average_exam_score_5_or_less = df[df['study_hours_per_day'] <= 5]['exam_score'].mean()

# Print the average exam score for students studying more than 5 hours per day, formatted to two decimal places
print(f"Average exam score for students studying more than 5 hours/day: {average_exam_score_more_than_5:.2f}")
# Print the average exam score for students studying 5 hours per day or less, formatted to two decimal places
print(f"Average exam score for students studying 5 hours/day or less: {average_exam_score_5_or_less:.2f}")

Average exam score for students studying more than 5 hours/day: 91.12
Average exam score for students studying 5 hours/day or less: 65.67

Students who study more than 5 hours per day had higher average exam scores.
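
A comparison like this can also be written as a single grouped summary that reports the size of each group next to its mean, which helps when one group is much smaller than the other. This is a sketch on hypothetical data; the group labels are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "study_hours_per_day": [6.0, 5.5, 2.0, 4.0],
    "exam_score": [90.0, 80.0, 60.0, 70.0],
})

# Label each student by study group (labels are illustrative)
group = (toy["study_hours_per_day"] > 5).map(
    {True: "more than 5", False: "5 or less"}
)

# Mean exam score and number of students per group
summary = toy.groupby(group)["exam_score"].agg(["mean", "count"])
print(summary)
```

A difference in group means describes a correlation in the data; on its own it does not show that studying longer causes higher scores.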

Use "Explain code" for the code you produced for Question 6 and summarize in your own words to show that you understood the code Gemini produced. Please create a text cell below to answer this question. (0.5 point)

For average_exam_score_more_than_5, the DataFrame is filtered for students who study more than 5 hours/day, and then the mean of exam_score is calculated. For average_exam_score_5_or_less, the same chained operation is carried out, but the filter selects students whose study hours per day are less than or equal to 5. Then, two lines are printed to display the findings.

Does the code produced to answer the questions use "vectorization"? Please justify your answer with an example. Please create a text cell below to answer this question. (0.5 point)

The code utilizes vectorization to answer the questions. For example, to filter the DataFrame for question 6, the code df['study_hours_per_day'] > 5 applies the > comparison to every value in the column at once, without an explicit Python loop. The result is a Series of True/False values, which is then used to filter the DataFrame with another vectorized operation.
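
To make the contrast concrete, the sketch below builds the same True/False mask twice on hypothetical data: once with an explicit Python loop and once with the vectorized comparison.

```python
import pandas as pd

# Hypothetical toy data
hours = pd.Series([1.0, 5.5, 7.0])
scores = pd.Series([50.0, 85.0, 95.0])

# Non-vectorized: visit each value in a Python-level loop
loop_mask = [h > 5 for h in hours]

# Vectorized: one expression compares the whole column
vec_mask = hours > 5

# The boolean Series selects the matching rows, again without a loop
high_mean = scores[vec_mask].mean()

print(loop_mask, vec_mask.tolist(), high_mean)
```

Both masks hold the same values; the vectorized form is shorter and pushes the element-by-element work into pandas' compiled code, which is typically much faster on large columns.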