Project 7: Student Habits vs Academic Performance Analysis
Overview
This project analyzes the relationship between student lifestyle habits and academic performance using a comprehensive dataset from Kaggle. The dataset contains information about 1,000 students and includes 16 variables covering various aspects of student life, including study habits, social media usage, sleep patterns, diet quality, exercise frequency, and academic outcomes.
The analysis focuses on how different study patterns correlate with exam performance. The project demonstrates fundamental data analysis skills, including data cleaning, statistical calculations, and comparative analysis, using Python and pandas.
# Install dependencies as needed:
# pip install kagglehub[pandas-datasets]
import kagglehub
from kagglehub import KaggleDatasetAdapter
# Set the path to the file you'd like to load
# Update file_path to point to the specific file within the dataset
file_path = "student_habits_performance.csv"
# Load the latest version
df = kagglehub.load_dataset(
    KaggleDatasetAdapter.PANDAS,
    "jayaantanaath/student-habits-vs-academic-performance",
    file_path,
    # Provide any additional arguments like
    # sql_query or pandas_kwargs. See the
    # documentation for more information:
    # https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpandas
)
print("First 5 records:", df.head())

DeprecationWarning: Use dataset_load() instead of load_dataset(). load_dataset() will be removed in a future version.
Using Colab cache for faster access to the 'student-habits-vs-academic-performance' dataset.
First 5 records:

| | student_id | age | gender | study_hours_per_day | social_media_hours | netflix_hours | part_time_job | attendance_percentage | sleep_hours | diet_quality | exercise_frequency | parental_education_level | internet_quality | mental_health_rating | extracurricular_participation | exam_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | S1000 | 23 | Female | 0.0 | 1.2 | 1.1 | No | 85.0 | 8.0 | Fair | 6 | Master | Average | 8 | Yes | 56.2 |
| 1 | S1001 | 20 | Female | 6.9 | 2.8 | 2.3 | No | 97.3 | 4.6 | Good | 6 | High School | Average | 8 | No | 100.0 |
| 2 | S1002 | 21 | Male | 1.4 | 3.1 | 1.3 | No | 94.8 | 8.0 | Poor | 1 | High School | Poor | 1 | No | 34.3 |
| 3 | S1003 | 23 | Female | 1.0 | 3.9 | 1.0 | No | 71.0 | 9.2 | Poor | 4 | Master | Good | 1 | Yes | 26.8 |
| 4 | S1004 | 19 | Female | 5.0 | 4.4 | 0.5 | No | 90.9 | 4.9 | Fair | 3 | Master | Good | 1 | No | 66.4 |
one_col = df['age']
type(one_col)

pandas.core.series.Series
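Selecting one column with single brackets returns a Series, as the type check above shows. The sketch below contrasts that with double-bracket selection, which returns a one-column DataFrame. The toy frame and the name one_col_df are assumptions for illustration; the values are copied from the first rows of the head output.

```python
import pandas as pd

# Toy frame for illustration only; values copied from the first three rows shown earlier.
toy = pd.DataFrame({"age": [23, 20, 21], "exam_score": [56.2, 100.0, 34.3]})

one_col = toy["age"]       # single brackets: a Series
one_col_df = toy[["age"]]  # a list of column names: a DataFrame with one column

print(type(one_col).__name__)
print(type(one_col_df).__name__)
print(one_col_df.shape)
```

The distinction matters later: methods such as mean() return a single number on a Series but a Series of per-column results on a DataFrame.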
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):

| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | student_id | 1000 non-null | object |
| 1 | age | 1000 non-null | int64 |
| 2 | gender | 1000 non-null | object |
| 3 | study_hours_per_day | 1000 non-null | float64 |
| 4 | social_media_hours | 1000 non-null | float64 |
| 5 | netflix_hours | 1000 non-null | float64 |
| 6 | part_time_job | 1000 non-null | object |
| 7 | attendance_percentage | 1000 non-null | float64 |
| 8 | sleep_hours | 1000 non-null | float64 |
| 9 | diet_quality | 1000 non-null | object |
| 10 | exercise_frequency | 1000 non-null | int64 |
| 11 | parental_education_level | 909 non-null | object |
| 12 | internet_quality | 1000 non-null | object |
| 13 | mental_health_rating | 1000 non-null | int64 |
| 14 | extracurricular_participation | 1000 non-null | object |
| 15 | exam_score | 1000 non-null | float64 |

dtypes: float64(6), int64(3), object(7)
memory usage: 125.1+ KB
df.shape

(1000, 16)
df.columns

Index(['student_id', 'age', 'gender', 'study_hours_per_day',
       'social_media_hours', 'netflix_hours', 'part_time_job',
       'attendance_percentage', 'sleep_hours', 'diet_quality',
       'exercise_frequency', 'parental_education_level', 'internet_quality',
       'mental_health_rating', 'extracurricular_participation', 'exam_score'],
      dtype='object')

Clean the Data
df.isnull().sum()

| Column | Missing values |
|---|---|
| student_id | 0 |
| age | 0 |
| gender | 0 |
| study_hours_per_day | 0 |
| social_media_hours | 0 |
| netflix_hours | 0 |
| part_time_job | 0 |
| attendance_percentage | 0 |
| sleep_hours | 0 |
| diet_quality | 0 |
| exercise_frequency | 0 |
| parental_education_level | 91 |
| internet_quality | 0 |
| mental_health_rating | 0 |
| extracurricular_participation | 0 |
| exam_score | 0 |
This checks how many missing (NaN) values are in each column of the DataFrame. Based on the output:
- All columns have 0 missing values except parental_education_level, which has 91.
- The dataset is mostly clean, but the missing values in parental_education_level need to be handled.

There are two common ways to handle them:
- Fill the missing values with a default or the most common value.
- Drop the rows with missing values.

This analysis drops the rows, which leaves 909 of the original 1,000 students.
df.dropna(subset=['parental_education_level'], inplace=True)
# checking the data again
df.isnull().sum()

| Column | Missing values |
|---|---|
| student_id | 0 |
| age | 0 |
| gender | 0 |
| study_hours_per_day | 0 |
| social_media_hours | 0 |
| netflix_hours | 0 |
| part_time_job | 0 |
| attendance_percentage | 0 |
| sleep_hours | 0 |
| diet_quality | 0 |
| exercise_frequency | 0 |
| parental_education_level | 0 |
| internet_quality | 0 |
| mental_health_rating | 0 |
| extracurricular_participation | 0 |
| exam_score | 0 |
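The two ways of handling missing values, filling and dropping, can be compared on a small example. This is a minimal sketch on made-up data, not the notebook's actual cleaning step; the toy frame and the names most_common, filled, and dropped are assumptions for illustration.

```python
import pandas as pd

# Made-up data with one missing parental_education_level value.
toy = pd.DataFrame({
    "student_id": ["A", "B", "C", "D"],
    "parental_education_level": ["Master", None, "High School", "Master"],
})

# Option 1: fill missing entries with the most common value (the mode).
most_common = toy["parental_education_level"].mode()[0]
filled = toy.fillna({"parental_education_level": most_common})

# Option 2: drop the rows where the column is missing.
dropped = toy.dropna(subset=["parental_education_level"])

print(len(filled), len(dropped))
```

Filling keeps every row but treats the imputed value as if it had been observed, while dropping shrinks the sample. Either choice changes the denominator or the distribution used in later calculations, so it is worth stating which one was applied.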
Questions to Answer
Please answer the following questions.
- Find the average study hours per day for all students. Please create a code cell below this to answer the question. (0.5 point)
# Calculate the average study hours per day for all students.
average_study_hours = df['study_hours_per_day'].mean()
print(f"Average study hours per day for all students: {average_study_hours:.2f}")

Average study hours per day for all students: 3.54
- Identify the student who studies the most hours per day. Please create a code cell below to answer the question. (0.5 point)
# Find the maximum study hours per day among all students
max_study_hours = df['study_hours_per_day'].max()
# Filter the DataFrame to get the student(s) who study the most hours
# And select only the 'student_id' and 'study_hours_per_day' columns
student_most_hours = df[df['study_hours_per_day'] == max_study_hours][['student_id', 'study_hours_per_day']]
# Print a descriptive header and display the resulting DataFrame
print("Student(s) who study the most hours per day:")
display(student_most_hours)

Student(s) who study the most hours per day:

| | student_id | study_hours_per_day |
|---|---|---|
| 455 | S1455 | 8.3 |
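An alternative to filtering on the maximum value is to ask pandas for the row label of the maximum directly. The sketch below uses made-up data; the toy frame and the names top_label, top_student, and top_with_ties are assumptions for illustration.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "student_id": ["A", "B", "C"],
    "study_hours_per_day": [2.0, 8.3, 5.1],
})

# idxmax returns the index label of the first row holding the maximum value.
top_label = toy["study_hours_per_day"].idxmax()
top_student = toy.loc[top_label, ["student_id", "study_hours_per_day"]]

# nlargest with keep="all" returns every row tied for the top value.
top_with_ties = toy.nlargest(1, "study_hours_per_day", keep="all")

print(top_student["student_id"])
print(top_with_ties)
```

Because idxmax reports only the first match, the filter-on-max approach used above (or nlargest with keep="all") is the safer choice when several students might share the maximum.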
- Count how many students study more than 6 hours per day. Please create a code cell below this to answer the question. (0.5 point)
# Count the number of students who study more than 6 hours per day
count_over_6 = (df['study_hours_per_day'] > 6).sum()
print(f"Number of students who study more than 6 hours per day : {count_over_6}")

Number of students who study more than 6 hours per day : 40
- What is the percentage of students who study more than 6 hours per day? Please create a code cell below this to answer the question. (0.5 point)
# Get the total number of students in the DataFrame
total_students = df.shape[0]
# Calculate the percentage of students who study more than 6 hours per day
# 'count_over_6' is defined in a previous cell
percentage_more_than_6_hours = (count_over_6 / total_students) * 100
# Print the calculated percentage, formatted to two decimal places
print(f"Percentage of students who study more than 6 hours per day: {percentage_more_than_6_hours:.2f}%")

Percentage of students who study more than 6 hours per day: 4.40%
- Calculate what percentage of students study less than 2 hours per day. Please create a code cell below this to answer the question. (0.5 point)
# Count the number of students who study less than 2 hours per day
count_less_than_2 = (df['study_hours_per_day'] < 2).sum()
# Calculate the percentage of students who study less than 2 hours per day
percentage_less_than_2_hours = (count_less_than_2 / total_students) * 100
# Print the calculated percentage, formatted to two decimal places
print(f"Percentage of students who study less than 2 hours per day: {percentage_less_than_2_hours:.2f}%")

Percentage of students who study less than 2 hours per day: 13.53%
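The count-then-divide pattern used for both percentages can be shortened, because the mean of a boolean Series is the fraction of True values. The sketch below uses made-up study hours; the names hours, pct_over_6, and pct_under_2 are assumptions for illustration.

```python
import pandas as pd

# Made-up study hours for eight students.
hours = pd.Series([0.5, 1.9, 2.0, 3.5, 6.1, 7.0, 4.2, 5.5])

# Count, then divide by the number of rows.
count_over_6 = (hours > 6).sum()
pct_from_count = count_over_6 / len(hours) * 100

# Equivalent shortcut: the mean of True/False values is the share of True.
pct_over_6 = (hours > 6).mean() * 100
pct_under_2 = (hours < 2).mean() * 100

print(f"{pct_over_6:.2f}% study more than 6 hours")
print(f"{pct_under_2:.2f}% study less than 2 hours")
```

In this notebook the denominator is the 909 rows that remain after dropping missing values, not the original 1,000, which is why 40 students correspond to 4.40%.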
- Do students who study more than 5 hours per day have higher exam scores on average? Please create a code cell below to answer this question. (0.5 point)
# Calculate the average exam score for students studying more than 5 hours per day
average_exam_score_more_than_5 = df[df['study_hours_per_day'] > 5]['exam_score'].mean()
# Calculate the average exam score for students studying 5 hours per day or less
average_exam_score_5_or_less = df[df['study_hours_per_day'] <= 5]['exam_score'].mean()
# Print the average exam score for students studying more than 5 hours per day, formatted to two decimal places
print(f"Average exam score for students studying more than 5 hours/day: {average_exam_score_more_than_5:.2f}")
# Print the average exam score for students studying 5 hours per day or less, formatted to two decimal places
print(f"Average exam score for students studying 5 hours/day or less: {average_exam_score_5_or_less:.2f}")

Average exam score for students studying more than 5 hours/day: 91.12
Average exam score for students studying 5 hours/day or less: 65.67
Yes. Students who study more than 5 hours per day had a higher average exam score (91.12) than students who study 5 hours or less (65.67).
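The same comparison can be written as a single groupby instead of two separate filters. This is a sketch on made-up data, not the code Gemini produced; the toy frame, the group labels, and the names group and group_means are assumptions for illustration.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "study_hours_per_day": [1.0, 6.9, 5.0, 5.5, 2.0],
    "exam_score": [30.0, 100.0, 60.0, 90.0, 50.0],
})

# Label each student by whether they study more than 5 hours per day.
group = (toy["study_hours_per_day"] > 5).map(
    {True: "more than 5 h", False: "5 h or less"}
)

# One pass computes the mean exam score for both groups.
group_means = toy.groupby(group)["exam_score"].mean()

print(group_means)
```

Grouping guarantees that the two groups are complementary, so a row cannot be counted twice or left out by a mismatch between the two filter conditions.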
- Use "Explain code" for the code you produced for Question 6 and summarize in your own words to show that you understood the code Gemini produced. Please create a text cell below to answer this question. (0.5 point)
For average_exam_score_more_than_5, the DataFrame is filtered for students who study more than 5 hours per day, and then the mean of exam_score is calculated. For average_exam_score_5_or_less, the same chained operation is carried out, but the filter keeps students whose study hours per day are less than or equal to 5. Finally, two lines are printed to display the results.
- Does the code produced to answer the questions use "vectorization"? Please justify your answer with an example. Please create a text cell below to answer this question. (0.5 point)
Yes, the code uses vectorization. For example, to filter the DataFrame for Question 6, the expression df['study_hours_per_day'] > 5 applies the > comparison to every value in the column in a single operation, with no explicit Python loop. The result is a Series of True/False values, which is then used to filter the DataFrame in another vectorized operation.
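To make the contrast concrete, the sketch below builds the same True/False mask twice: once with an explicit Python loop and once with a vectorized comparison. The five values are the first study_hours_per_day entries from the head output; the names loop_mask and vector_mask are assumptions for illustration.

```python
import pandas as pd

# First five study_hours_per_day values from the head output shown earlier.
hours = pd.Series([0.0, 6.9, 1.4, 1.0, 5.0])

# Non-vectorized: visit each value in a Python loop.
loop_mask = []
for h in hours:
    loop_mask.append(h > 5)

# Vectorized: one expression compares the whole column at once.
vector_mask = hours > 5

# Both approaches produce the same mask.
print(vector_mask.tolist() == loop_mask)

# The mask then selects the matching rows, again without a loop.
print(hours[vector_mask])
```

Both versions give the same answer; the vectorized form is shorter and hands the per-element work to pandas and NumPy rather than to the Python interpreter.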