Overview
In this project, you will apply data exploration and visualization techniques to a real-world dataset of your choice. You'll practice using pandas for data analysis and create meaningful visualizations to gain insights from your data. This project builds on your Python programming skills while introducing you to the exciting world of data science.
In Lab6, you explore the titanic dataset and you should use it as a sample and pick a dataset of your interest to do a similar data exploration and visualization.
Project Objectives
- Find and select a dataset from a real-world data source
- Formulate a clear inquiry question based on your chosen dataset
- Perform exploratory data analysis using pandas functions
- Create at least two meaningful data visualizations
- Analyze your findings and answer your inquiry question
- Reflect on the insights gained through data exploration
Data Sources
You can choose from various online data sources such as:
- Kaggle - Wide variety of datasets
- data.gov - U.S. government data
- Our World in Data - Global development data
- GitHub - Curated datasets
- Google Dataset Search - Search for datasets
- Or any other reliable data source you find interesting
Part 1: Dataset Selection and Initial Exploration
TODO Tasks:
- Select Your Dataset: Choose a dataset that interests you. Make sure it has at least 100 rows and multiple columns with different data types (e.g., numerical, categorical, dates).
Please create either text or code cells in between the TODO tasks to do the work.
AMERICA’S TOP 100 CHARITIES: https://www.kaggle.com/datasets/gingfrecs03/americas-top-100-charities$0
- Import Libraries: Import the necessary libraries (pandas, matplotlib, seaborn) at the beginning of your notebook by creating code cells below this text cell.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns- Load Your Data: Load your chosen dataset using pandas and save it to a pandas dataframe. Please create a code cell below this text cell to do it.
import kagglehub
# Download latest version
path = kagglehub.dataset_download("gingfrecs03/americas-top-100-charities")
print("Path to dataset files:", path)Using Colab cache for faster access to the 'americas-top-100-charities' dataset. Path to dataset files: /kaggle/input/americas-top-100-charities
import os
# Construct the full path to the CSV file
csv_file_path = os.path.join(path, "us_charity_forbes_data.csv")
# Load the CSV into a pandas DataFrame
df = pd.read_csv(csv_file_path)
# Display the first 5 rows of the DataFrame
print(df.head())rank name category \ 0 1 Feeding America Domestic Needs 1 2 Good 360 International Needs 2 3 United Way Worldwide Domestic Needs 3 4 St. Jude Children's Research Hospital Medical 4 5 Direct Relief International Needs private donations total revenue fundraising efficiency % \ 0 $4.27 B $4.36 B 99 1 $2.67 B $2.69 B 100 2 $2.6 B $3.51 B 88 3 $2.46 B $3.3 B 84 4 $2.26 B $2.27 B 100 charitable commitment % headquarter total_expense \ 0 98 Chicago, Illinois $4.3B 1 100 Alexandria, Virginia $2.7B 2 85 Alexandria, Virginia $3.5B 3 70 Memphis, Tennessee $2.4B 4 99 Santa Barbara, California $2.1B tot_exp_char_service tot_exp_management & general tot_exp_fundraising \ 0 $4.3B $15M $63M 1 $2.7B $3M $2M 2 $3B $234M $308M 3 $1.6B $318M $404M 4 $2.1B $9M $3M surplus_loss Net_Assets Donor_Dependency Highest_Compensation 0 $13M $578M 100% $959,665 1 $19M $73M 99% $257,551 2 Na Na 100%S $3,622,146 3 $926M $10.2B 62% $1,546,372 4 $198M $1.1B 91% $652,610
- Data Summary: Use the following functions to do inital exploration your dataset:
- info() - Display information about the dataset structure
- describe() - Show statistical summary of numerical columns
- head() - Display the first 5 rows (df.head())
- shape - Show the dimensions of the dataset (e.g. df.shape)
- columns - Get the column names (e.g. df.columns)
You may use 5 code cells below this text cell to accomplish it.
print(df.info())<class 'pandas.core.frame.DataFrame'> RangeIndex: 100 entries, 0 to 99 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 rank 100 non-null int64 1 name 100 non-null object 2 category 100 non-null object 3 private donations 100 non-null object 4 total revenue 100 non-null object 5 fundraising efficiency % 100 non-null int64 6 charitable commitment % 100 non-null int64 7 headquarter 100 non-null object 8 total_expense 100 non-null object 9 tot_exp_char_service 99 non-null object 10 tot_exp_management & general 99 non-null object 11 tot_exp_fundraising 98 non-null object 12 surplus_loss 100 non-null object 13 Net_Assets 100 non-null object 14 Donor_Dependency 100 non-null object 15 Highest_Compensation 100 non-null object dtypes: int64(3), object(13) memory usage: 12.6+ KB None
print(df.describe())rank fundraising efficiency % charitable commitment % count 100.000000 100.000000 100.000000 mean 50.500000 91.390000 87.080000 std 29.011492 7.027931 8.662796 min 1.000000 67.000000 58.000000 25% 25.750000 86.000000 81.000000 50% 50.500000 93.000000 88.000000 75% 75.250000 97.000000 94.000000 max 100.000000 100.000000 100.000000
print(df.head())rank name category \ 0 1 Feeding America Domestic Needs 1 2 Good 360 International Needs 2 3 United Way Worldwide Domestic Needs 3 4 St. Jude Children's Research Hospital Medical 4 5 Direct Relief International Needs private donations total revenue fundraising efficiency % \ 0 $4.27 B $4.36 B 99 1 $2.67 B $2.69 B 100 2 $2.6 B $3.51 B 88 3 $2.46 B $3.3 B 84 4 $2.26 B $2.27 B 100 charitable commitment % headquarter total_expense \ 0 98 Chicago, Illinois $4.3B 1 100 Alexandria, Virginia $2.7B 2 85 Alexandria, Virginia $3.5B 3 70 Memphis, Tennessee $2.4B 4 99 Santa Barbara, California $2.1B tot_exp_char_service tot_exp_management & general tot_exp_fundraising \ 0 $4.3B $15M $63M 1 $2.7B $3M $2M 2 $3B $234M $308M 3 $1.6B $318M $404M 4 $2.1B $9M $3M surplus_loss Net_Assets Donor_Dependency Highest_Compensation 0 $13M $578M 100% $959,665 1 $19M $73M 99% $257,551 2 Na Na 100%S $3,622,146 3 $926M $10.2B 62% $1,546,372 4 $198M $1.1B 91% $652,610
print(df.shape)(100, 16)
print(df.columns)Index(['rank', 'name', 'category', 'private donations', 'total revenue',
'fundraising efficiency %', 'charitable commitment %', 'headquarter',
'total_expense', 'tot_exp_char_service', 'tot_exp_management & general',
'tot_exp_fundraising', 'surplus_loss', 'Net_Assets', 'Donor_Dependency',
'Highest_Compensation'],
dtype='object')
- Inquiry Question: Write a clear, specific question that you want to answer using your dataset by creating a text cell below and write the question in bold please.
For example:
- "How has the average temperature changed in major cities over the past decade?"
- "What factors are most strongly correlated with housing prices?"
- "Is there a relationship between education level and income?"
What factors are associated with efficient fundraising?
Part 2: Data Visualization and Analysis
TODO Tasks:
- Data Cleaning (if necessary): Check for and handle any missing values, duplicates, or data quality issues.
NOTE: This could be a daunting task even for seasoned data scientists. My suggestion is to use a dataset that is already clean if you can for the purpose of completing this assignment in a week.
Examples:
- Checking for missing values: df.isnull().sum()
- Dropping rows with any missing values: df.dropna(inplace=True)
- Dropping columns with all missing values: df.dropna(axis=1, how='all', inplace=True)
- Removing duplicate row: df.drop_duplicates(inplace=True)
Data Cleaning: Checking Missing Values and Duplicates
# Check for missing values
print("Missing values before cleaning:")
print(df.isnull().sum())Missing values before cleaning: rank 0 name 0 category 0 private donations 0 total revenue 0 fundraising efficiency % 0 charitable commitment % 0 headquarter 0 total_expense 0 tot_exp_char_service 1 tot_exp_management & general 1 tot_exp_fundraising 2 surplus_loss 0 Net_Assets 0 Donor_Dependency 0 Highest_Compensation 0 dtype: int64
# Check for duplicate rows
print("\nNumber of duplicate rows:", df.duplicated().sum())Number of duplicate rows: 0
Data Cleaning: Converting Financial Columns to Numeric
def clean_and_convert_financial(series):
# Replace 'Na' with NaN
series = series.replace('Na', float('NaN'))
# Remove '$', ',', ' ' and '%' signs
series = series.astype(str).str.replace('$', '', regex=False).str.replace(',', '', regex=False).str.replace(' ', '', regex=False)
# Handle 'B' (billions), 'M' (millions)
def convert_to_numeric(value):
if isinstance(value, str):
value = value.upper()
if 'B' in value: # Billions
return float(value.replace('B', '')) * 1_000_000_000
elif 'M' in value: # Millions
return float(value.replace('M', '')) * 1_000_000
elif '%' in value: # Percentages (for Donor_Dependency and others)
return float(value.replace('%', '')) # Keep as whole number percentage for now
try:
return float(value)
except (ValueError, TypeError):
return float('NaN')
return series.apply(convert_to_numeric)# List of financial columns to clean and convert
financial_columns = [
'private donations',
'total revenue',
'total_expense',
'tot_exp_char_service',
'tot_exp_management & general',
'tot_exp_fundraising',
'surplus_loss',
'Net_Assets',
'Highest_Compensation'
]
# Apply the cleaning function to each financial column
for col in financial_columns:
df[col] = clean_and_convert_financial(df[col])
# The 'Donor_Dependency' column needs special handling for non-numeric characters like '100%S'
# Remove all non-numeric characters except for '.' before converting to float
df['Donor_Dependency'] = df['Donor_Dependency'].astype(str).str.replace(r'[^0-9.]', '', regex=True).replace('', float('NaN')).astype(float)
# Re-check df.info() to see the updated dtypes and missing values after conversion
print("\nDataFrame info after converting financial columns:")
print(df.info())
# Display min and max values for the cleaned financial columns and Donor_Dependency
print("\nMin and Max values for financial columns:")
all_cleaned_numeric_cols = financial_columns + ['Donor_Dependency']
for col in all_cleaned_numeric_cols:
if col in df.columns and pd.api.types.is_numeric_dtype(df[col]):
print(f"{col}: Min={df[col].min():,.2f}, Max={df[col].max():,.2f}")DataFrame info after converting financial columns: <class 'pandas.core.frame.DataFrame'> RangeIndex: 100 entries, 0 to 99 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 rank 100 non-null int64 1 name 100 non-null object 2 category 100 non-null object 3 private donations 100 non-null float64 4 total revenue 100 non-null float64 5 fundraising efficiency % 100 non-null int64 6 charitable commitment % 100 non-null int64 7 headquarter 100 non-null object 8 total_expense 99 non-null float64 9 tot_exp_char_service 99 non-null float64 10 tot_exp_management & general 99 non-null float64 11 tot_exp_fundraising 98 non-null float64 12 surplus_loss 97 non-null float64 13 Net_Assets 97 non-null float64 14 Donor_Dependency 99 non-null float64 15 Highest_Compensation 99 non-null float64 dtypes: float64(10), int64(3), object(3) memory usage: 12.6+ KB None Min and Max values for financial columns: private donations: Min=186,000,000.00, Max=4,270,000,000.00 total revenue: Min=-474,000,000.00, Max=28,560,000,000.00 total_expense: Min=60,000,000.00, Max=28,000,000,000.00 tot_exp_char_service: Min=35,000,000.00, Max=22,700,000,000.00 tot_exp_management & general: Min=1,000,000.00, Max=5,300,000,000.00 tot_exp_fundraising: Min=1,000,000.00, Max=404,000,000.00 surplus_loss: Min=-1,700,000,000.00, Max=2,200,000,000.00 Net_Assets: Min=32,000,000.00, Max=19,900,000,000.00 Highest_Compensation: Min=73,940.00, Max=9,480,164.00 Donor_Dependency: Min=20.00, Max=780.00
Data Cleaning: Handling Remaining Missing Values
# For the purpose of this analysis, we will fill the missing expense values with 0,
# assuming 'NaN' implies no expenditure in that category for that charity.
# Alternatively, we could fill with mean/median or drop rows if more rigorous imputation is needed.
columns_to_fill_na = [
'tot_exp_char_service',
'tot_exp_management & general',
'tot_exp_fundraising'
]
for col in columns_to_fill_na:
df[col].fillna(0, inplace=True)
# For other columns like 'total_expense', 'surplus_loss', 'Net_Assets', 'Highest_Compensation',
# and 'Donor_Dependency' (if any NaNs remain from conversion issues)
# we'll fill with the median for a more robust approach, or 0 if it makes sense.
# Let's check remaining NaNs for these first.
# Fill remaining NaNs in 'total_expense', 'surplus_loss', 'Net_Assets', 'Highest_Compensation' with median
for col in ['total_expense', 'surplus_loss', 'Net_Assets', 'Highest_Compensation', 'Donor_Dependency']:
if df[col].isnull().any():
median_val = df[col].median()
df[col].fillna(median_val, inplace=True)
print("\nMissing values after handling all NaNs:")
print(df.isnull().sum())Missing values after handling all NaNs: rank 0 name 0 category 0 private donations 0 total revenue 0 fundraising efficiency % 0 charitable commitment % 0 headquarter 0 total_expense 0 tot_exp_char_service 0 tot_exp_management & general 0 tot_exp_fundraising 0 surplus_loss 0 Net_Assets 0 Donor_Dependency 0 Highest_Compensation 0 dtype: int64
/tmp/ipykernel_11995/3385659435.py:12: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.
For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
df[col].fillna(0, inplace=True)
/tmp/ipykernel_11995/3385659435.py:23: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.
For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
df[col].fillna(median_val, inplace=True)
Data Cleaning: Final Check
# Display the first few rows of the cleaned DataFrame
print("\nDataFrame head after cleaning:")
print(df.head())
# Display data types of the cleaned DataFrame
print("\nDataFrame dtypes after cleaning:")
print(df.dtypes)DataFrame head after cleaning: rank name category \ 0 1 Feeding America Domestic Needs 1 2 Good 360 International Needs 2 3 United Way Worldwide Domestic Needs 3 4 St. Jude Children's Research Hospital Medical 4 5 Direct Relief International Needs private donations total revenue fundraising efficiency % \ 0 4.270000e+09 4.360000e+09 99 1 2.670000e+09 2.690000e+09 100 2 2.600000e+09 3.510000e+09 88 3 2.460000e+09 3.300000e+09 84 4 2.260000e+09 2.270000e+09 100 charitable commitment % headquarter total_expense \ 0 98 Chicago, Illinois 4.300000e+09 1 100 Alexandria, Virginia 2.700000e+09 2 85 Alexandria, Virginia 3.500000e+09 3 70 Memphis, Tennessee 2.400000e+09 4 99 Santa Barbara, California 2.100000e+09 tot_exp_char_service tot_exp_management & general tot_exp_fundraising \ 0 4.300000e+09 15000000.0 63000000.0 1 2.700000e+09 3000000.0 2000000.0 2 3.000000e+09 234000000.0 308000000.0 3 1.600000e+09 318000000.0 404000000.0 4 2.100000e+09 9000000.0 3000000.0 surplus_loss Net_Assets Donor_Dependency Highest_Compensation 0 13000000.0 5.780000e+08 100.0 959665.0 1 19000000.0 7.300000e+07 99.0 257551.0 2 13000000.0 4.290000e+08 100.0 3622146.0 3 926000000.0 1.020000e+10 62.0 1546372.0 4 198000000.0 1.100000e+09 91.0 652610.0 DataFrame dtypes after cleaning: rank int64 name object category object private donations float64 total revenue float64 fundraising efficiency % int64 charitable commitment % int64 headquarter object total_expense float64 tot_exp_char_service float64 tot_exp_management & general float64 tot_exp_fundraising float64 surplus_loss float64 Net_Assets float64 Donor_Dependency float64 Highest_Compensation float64 dtype: object
- Create Visualizations: Create at least 2 different types of visualizations that help answer your research question. Choose from:
- Bar charts
- Line plots
- Scatter plots
- Histograms
- Box plots
- Heatmaps
- Any other appropriate visualization
- Customize Your Plots: Make sure your visualizations are clear and informative by:
- Adding appropriate titles
- Labeling axes
- Using colors effectively
- Adding legends when necessary
Please create code cells below this to create your visualizations.
plt.figure(figsize=(10, 6))
sns.histplot(df['fundraising efficiency %'], bins=10, kde=True, color='skyblue')
plt.title('Distribution of Fundraising Efficiency %')
plt.xlabel('Fundraising Efficiency (%)')
plt.ylabel('Number of Charities')
plt.grid(axis='y', alpha=0.75)
plt.tight_layout()
plt.show()plt.figure(figsize=(10, 6))
sns.boxplot(x=df['fundraising efficiency %'], color='lightgreen')
plt.title('Box Plot of Fundraising Efficiency %')
plt.xlabel('Fundraising Efficiency (%)')
plt.grid(axis='x', alpha=0.75)
plt.tight_layout()
plt.show()import matplotlib.pyplot as plt
import seaborn as sns
# Create a bar chart using the already calculated category_efficiency
plt.figure(figsize=(12, 7))
sns.barplot(x=category_efficiency.index, y=category_efficiency.values, hue=category_efficiency.index, palette='viridis', legend=False)
plt.title('Average Fundraising Efficiency by Nonprofit Category')
plt.xlabel('Nonprofit Category')
plt.ylabel('Average Fundraising Efficiency (%)')
plt.xticks(rotation=45, ha='right') # Rotate x-axis labels for better readability
plt.tight_layout() # Adjust layout to prevent labels from overlapping
plt.show()# Create a bar chart for average highest compensation by category
plt.figure(figsize=(12, 7))
sns.barplot(x=category_highest_compensation.index, y=category_highest_compensation.values, hue=category_highest_compensation.index, palette='viridis', legend=False)
plt.title('Average Highest Compensation by Nonprofit Category')
plt.xlabel('Nonprofit Category')
plt.ylabel('Average Highest Compensation ($)')
plt.xticks(rotation=45, ha='right')
plt.ticklabel_format(style='plain', axis='y') # Prevent scientific notation on y-axis for compensation
plt.tight_layout()
plt.show()# Create a scatter plot to visualize fundraising efficiency against donor dependency
plt.figure(figsize=(12, 7))
sns.scatterplot(x=df['Donor_Dependency'], y=df['fundraising efficiency %'], hue=df['category'], palette='viridis', s=100)
plt.title('Fundraising Efficiency vs. Donor Dependency by Category')
plt.xlabel('Donor Dependency (%)')
plt.ylabel('Fundraising Efficiency (%)')
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend(title='Category', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()plt.figure(figsize=(12, 7))
sns.scatterplot(x=df['Highest_Compensation'], y=df['fundraising efficiency %'], hue=df['category'], palette='viridis', s=100)
plt.title('Fundraising Efficiency vs. Highest Compensation by Category')
plt.xlabel('Highest Compensation ($)')
plt.ylabel('Fundraising Efficiency (%)')
plt.grid(True, linestyle='--', alpha=0.7)
plt.ticklabel_format(style='plain', axis='x') # Prevent scientific notation on x-axis for compensation
plt.legend(title='Category', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()- Analysis: Write 1-2 paragraphs discussing:
- Whether your research question can be answered with the available data
- What insights you discovered through your visualizations
- Any unexpected or interesting findings
- Limitations of your analysis
- Suggestions for further investigation
Please create a text cell below this cell to write your analysis.
My research question cannot be fully answered with the available data, but my exploration did yield some interesting insights. This data is collected on Forbes top 100 nonprofits, and these organizations had a high rate of fundraising efficiency, with a mean near 90%. The standard deviation was about 7 percentage points, indicating that there is variability, but that most organizations in this study were close to the mean.
I expected that higher donor dependence would be correllated with greater fundraising efficiency, but this did not appear to be the case. I also compared highest compensation to fundraising efficiency, expecting that higher compensation may predict a looser spending policy and reduced efficiency. Surprisingly, some of the organizations with higher-than-average compensation still saw high levels of efficiency, while others with lower compensation dipped below 70% efficiency.
My analysis is limited by the study group, Forbes top 100 nonprofits. These high-performers may have attributes in common that bias fundraising efficiency and obscure important variables. Further research might look at a broader selection of organizations to determine the predictors of fundraising efficiency.
Submission Requirements
NOTE: If you use a local dataset, you MUST submit the dataset along with your notebook to get credit.
- Download your notebook as CS110_Project6.ipynb
- Ensure all TODO tasks are completed
- Make sure your code runs without errors
- Include clear comments explaining your work
- Submit the notebook file
Grading Criteria (5 points)
- Dataset Selection: Appropriate choice of dataset with sufficient complexity (0.5 point)
- Inquiry Question: Clear, specific, and answerable with the chosen data (1 point)
- Data Exploration: Proper use of pandas functions for initial analysis (0.5 point)
- Visualizations: At least 2 clear, well-formatted, and relevant plots (1 point)
- Analysis: Thoughtful interpretation of findings and answer to research question (1 point)
- Code Quality: Clean, commented, and well-organized code (1 point)
Remember: The goal is not just to create visualizations, but to use them to gain meaningful insights about your data. Think like a data scientist and let your curiosity guide your exploration!