Home

Charity Data Exploration - Full Notebook

Full notebook export with code cells and saved outputs.

Lorissa Hughes

Project 6: Data Exploration and Visualization

Overview

In this project, you will apply data exploration and visualization techniques to a real-world dataset of your choice. You'll practice using pandas for data analysis and create meaningful visualizations to gain insights from your data. This project builds on your Python programming skills while introducing you to the exciting world of data science.

In Lab6, you explore the titanic dataset and you should use it as a sample and pick a dataset of your interest to do a similar data exploration and visualization.

Project Objectives

Data Sources

You can choose from various online data sources such as:

Part 1: Dataset Selection and Initial Exploration

TODO Tasks:

  1. Select Your Dataset: Choose a dataset that interests you. Make sure it has at least 100 rows and multiple columns with different data types (e.g., numerical, categorical, dates).

Please create either text or code cells in between the TODO tasks to do the work.

AMERICA’S TOP 100 CHARITIES: https://www.kaggle.com/datasets/gingfrecs03/americas-top-100-charities$0

  1. Import Libraries: Import the necessary libraries (pandas, matplotlib, seaborn) at the beginning of your notebook by creating code cells below this text cell.
Code Cell 1
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
  1. Load Your Data: Load your chosen dataset using pandas and save it to a pandas dataframe. Please create a code cell below this text cell to do it.
Code Cell 2
import kagglehub

# Download latest version
path = kagglehub.dataset_download("gingfrecs03/americas-top-100-charities")

print("Path to dataset files:", path)
Using Colab cache for faster access to the 'americas-top-100-charities' dataset.
Path to dataset files: /kaggle/input/americas-top-100-charities
Code Cell 3
import os

# Construct the full path to the CSV file
csv_file_path = os.path.join(path, "us_charity_forbes_data.csv")

# Load the CSV into a pandas DataFrame
df = pd.read_csv(csv_file_path)

# Display the first 5 rows of the DataFrame
print(df.head())
   rank                                   name             category  \
0     1                        Feeding America       Domestic Needs   
1     2                               Good 360  International Needs   
2     3                   United Way Worldwide       Domestic Needs   
3     4  St. Jude Children's Research Hospital              Medical   
4     5                          Direct Relief  International Needs   

  private donations total revenue  fundraising efficiency %  \
0           $4.27 B       $4.36 B                        99   
1           $2.67 B       $2.69 B                       100   
2            $2.6 B       $3.51 B                        88   
3           $2.46 B        $3.3 B                        84   
4           $2.26 B       $2.27 B                       100   

   charitable commitment %                headquarter total_expense  \
0                       98          Chicago, Illinois         $4.3B   
1                      100       Alexandria, Virginia         $2.7B   
2                       85       Alexandria, Virginia         $3.5B   
3                       70         Memphis, Tennessee         $2.4B   
4                       99  Santa Barbara, California         $2.1B   

  tot_exp_char_service tot_exp_management & general tot_exp_fundraising  \
0                $4.3B                         $15M                $63M   
1                $2.7B                          $3M                 $2M   
2                  $3B                        $234M               $308M   
3                $1.6B                        $318M               $404M   
4                $2.1B                          $9M                 $3M   

  surplus_loss Net_Assets Donor_Dependency Highest_Compensation  
0         $13M      $578M             100%             $959,665  
1         $19M       $73M              99%             $257,551  
2           Na         Na            100%S           $3,622,146  
3        $926M     $10.2B              62%           $1,546,372  
4        $198M      $1.1B              91%             $652,610  
  1. Data Summary: Use the following functions to do inital exploration your dataset:

You may use 5 code cells below this text cell to accomplish it.

Code Cell 4
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 16 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   rank                          100 non-null    int64 
 1   name                          100 non-null    object
 2   category                      100 non-null    object
 3   private donations             100 non-null    object
 4   total revenue                 100 non-null    object
 5   fundraising efficiency %      100 non-null    int64 
 6   charitable commitment %       100 non-null    int64 
 7   headquarter                   100 non-null    object
 8   total_expense                 100 non-null    object
 9   tot_exp_char_service          99 non-null     object
 10  tot_exp_management & general  99 non-null     object
 11  tot_exp_fundraising           98 non-null     object
 12  surplus_loss                  100 non-null    object
 13  Net_Assets                    100 non-null    object
 14  Donor_Dependency              100 non-null    object
 15  Highest_Compensation          100 non-null    object
dtypes: int64(3), object(13)
memory usage: 12.6+ KB
None
Code Cell 5
print(df.describe())
             rank  fundraising efficiency %  charitable commitment %
count  100.000000                100.000000               100.000000
mean    50.500000                 91.390000                87.080000
std     29.011492                  7.027931                 8.662796
min      1.000000                 67.000000                58.000000
25%     25.750000                 86.000000                81.000000
50%     50.500000                 93.000000                88.000000
75%     75.250000                 97.000000                94.000000
max    100.000000                100.000000               100.000000
Code Cell 6
print(df.head())
   rank                                   name             category  \
0     1                        Feeding America       Domestic Needs   
1     2                               Good 360  International Needs   
2     3                   United Way Worldwide       Domestic Needs   
3     4  St. Jude Children's Research Hospital              Medical   
4     5                          Direct Relief  International Needs   

  private donations total revenue  fundraising efficiency %  \
0           $4.27 B       $4.36 B                        99   
1           $2.67 B       $2.69 B                       100   
2            $2.6 B       $3.51 B                        88   
3           $2.46 B        $3.3 B                        84   
4           $2.26 B       $2.27 B                       100   

   charitable commitment %                headquarter total_expense  \
0                       98          Chicago, Illinois         $4.3B   
1                      100       Alexandria, Virginia         $2.7B   
2                       85       Alexandria, Virginia         $3.5B   
3                       70         Memphis, Tennessee         $2.4B   
4                       99  Santa Barbara, California         $2.1B   

  tot_exp_char_service tot_exp_management & general tot_exp_fundraising  \
0                $4.3B                         $15M                $63M   
1                $2.7B                          $3M                 $2M   
2                  $3B                        $234M               $308M   
3                $1.6B                        $318M               $404M   
4                $2.1B                          $9M                 $3M   

  surplus_loss Net_Assets Donor_Dependency Highest_Compensation  
0         $13M      $578M             100%             $959,665  
1         $19M       $73M              99%             $257,551  
2           Na         Na            100%S           $3,622,146  
3        $926M     $10.2B              62%           $1,546,372  
4        $198M      $1.1B              91%             $652,610  
Code Cell 7
print(df.shape)
(100, 16)
Code Cell 8
print(df.columns)
Index(['rank', 'name', 'category', 'private donations', 'total revenue',
       'fundraising efficiency %', 'charitable commitment %', 'headquarter',
       'total_expense', 'tot_exp_char_service', 'tot_exp_management & general',
       'tot_exp_fundraising', 'surplus_loss', 'Net_Assets', 'Donor_Dependency',
       'Highest_Compensation'],
      dtype='object')
  1. Inquiry Question: Write a clear, specific question that you want to answer using your dataset by creating a text cell below and write the question in bold please.

For example:

What factors are associated with efficient fundraising?

Part 2: Data Visualization and Analysis

TODO Tasks:

  1. Data Cleaning (if necessary): Check for and handle any missing values, duplicates, or data quality issues.

NOTE: This could be a daunting task even for seasoned data scientists. My suggestion is to use a dataset that is already clean if you can for the purpose of completing this assignment in a week.

Examples:

Data Cleaning: Checking Missing Values and Duplicates

Code Cell 9
# Check for missing values
print("Missing values before cleaning:")
print(df.isnull().sum())
Missing values before cleaning:
rank                            0
name                            0
category                        0
private donations               0
total revenue                   0
fundraising efficiency %        0
charitable commitment %         0
headquarter                     0
total_expense                   0
tot_exp_char_service            1
tot_exp_management & general    1
tot_exp_fundraising             2
surplus_loss                    0
Net_Assets                      0
Donor_Dependency                0
Highest_Compensation            0
dtype: int64
Code Cell 10
# Check for duplicate rows
print("\nNumber of duplicate rows:", df.duplicated().sum())
Number of duplicate rows: 0

Data Cleaning: Converting Financial Columns to Numeric

Code Cell 11
def clean_and_convert_financial(series):
    # Replace 'Na' with NaN
    series = series.replace('Na', float('NaN'))
    # Remove '$', ',', ' ' and '%' signs
    series = series.astype(str).str.replace('$', '', regex=False).str.replace(',', '', regex=False).str.replace(' ', '', regex=False)

    # Handle 'B' (billions), 'M' (millions)
    def convert_to_numeric(value):
        if isinstance(value, str):
            value = value.upper()
            if 'B' in value: # Billions
                return float(value.replace('B', '')) * 1_000_000_000
            elif 'M' in value: # Millions
                return float(value.replace('M', '')) * 1_000_000
            elif '%' in value: # Percentages (for Donor_Dependency and others)
                return float(value.replace('%', '')) # Keep as whole number percentage for now
        try:
            return float(value)
        except (ValueError, TypeError):
            return float('NaN')

    return series.apply(convert_to_numeric)
Code Cell 12
# List of financial columns to clean and convert
financial_columns = [
    'private donations',
    'total revenue',
    'total_expense',
    'tot_exp_char_service',
    'tot_exp_management & general',
    'tot_exp_fundraising',
    'surplus_loss',
    'Net_Assets',
    'Highest_Compensation'
]

# Apply the cleaning function to each financial column
for col in financial_columns:
    df[col] = clean_and_convert_financial(df[col])

# The 'Donor_Dependency' column needs special handling for non-numeric characters like '100%S'
# Remove all non-numeric characters except for '.' before converting to float
df['Donor_Dependency'] = df['Donor_Dependency'].astype(str).str.replace(r'[^0-9.]', '', regex=True).replace('', float('NaN')).astype(float)

# Re-check df.info() to see the updated dtypes and missing values after conversion
print("\nDataFrame info after converting financial columns:")
print(df.info())

# Display min and max values for the cleaned financial columns and Donor_Dependency
print("\nMin and Max values for financial columns:")
all_cleaned_numeric_cols = financial_columns + ['Donor_Dependency']
for col in all_cleaned_numeric_cols:
    if col in df.columns and pd.api.types.is_numeric_dtype(df[col]):
        print(f"{col}: Min={df[col].min():,.2f}, Max={df[col].max():,.2f}")
DataFrame info after converting financial columns:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 16 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   rank                          100 non-null    int64  
 1   name                          100 non-null    object 
 2   category                      100 non-null    object 
 3   private donations             100 non-null    float64
 4   total revenue                 100 non-null    float64
 5   fundraising efficiency %      100 non-null    int64  
 6   charitable commitment %       100 non-null    int64  
 7   headquarter                   100 non-null    object 
 8   total_expense                 99 non-null     float64
 9   tot_exp_char_service          99 non-null     float64
 10  tot_exp_management & general  99 non-null     float64
 11  tot_exp_fundraising           98 non-null     float64
 12  surplus_loss                  97 non-null     float64
 13  Net_Assets                    97 non-null     float64
 14  Donor_Dependency              99 non-null     float64
 15  Highest_Compensation          99 non-null     float64
dtypes: float64(10), int64(3), object(3)
memory usage: 12.6+ KB
None

Min and Max values for financial columns:
private donations: Min=186,000,000.00, Max=4,270,000,000.00
total revenue: Min=-474,000,000.00, Max=28,560,000,000.00
total_expense: Min=60,000,000.00, Max=28,000,000,000.00
tot_exp_char_service: Min=35,000,000.00, Max=22,700,000,000.00
tot_exp_management & general: Min=1,000,000.00, Max=5,300,000,000.00
tot_exp_fundraising: Min=1,000,000.00, Max=404,000,000.00
surplus_loss: Min=-1,700,000,000.00, Max=2,200,000,000.00
Net_Assets: Min=32,000,000.00, Max=19,900,000,000.00
Highest_Compensation: Min=73,940.00, Max=9,480,164.00
Donor_Dependency: Min=20.00, Max=780.00

Data Cleaning: Handling Remaining Missing Values

Code Cell 13
# For the purpose of this analysis, we will fill the missing expense values with 0,
# assuming 'NaN' implies no expenditure in that category for that charity.
# Alternatively, we could fill with mean/median or drop rows if more rigorous imputation is needed.

columns_to_fill_na = [
    'tot_exp_char_service',
    'tot_exp_management & general',
    'tot_exp_fundraising'
]

for col in columns_to_fill_na:
    df[col].fillna(0, inplace=True)

# For other columns like 'total_expense', 'surplus_loss', 'Net_Assets', 'Highest_Compensation',
# and 'Donor_Dependency' (if any NaNs remain from conversion issues)
# we'll fill with the median for a more robust approach, or 0 if it makes sense.
# Let's check remaining NaNs for these first.

# Fill remaining NaNs in 'total_expense', 'surplus_loss', 'Net_Assets', 'Highest_Compensation' with median
for col in ['total_expense', 'surplus_loss', 'Net_Assets', 'Highest_Compensation', 'Donor_Dependency']:
    if df[col].isnull().any():
        median_val = df[col].median()
        df[col].fillna(median_val, inplace=True)

print("\nMissing values after handling all NaNs:")
print(df.isnull().sum())
Missing values after handling all NaNs:
rank                            0
name                            0
category                        0
private donations               0
total revenue                   0
fundraising efficiency %        0
charitable commitment %         0
headquarter                     0
total_expense                   0
tot_exp_char_service            0
tot_exp_management & general    0
tot_exp_fundraising             0
surplus_loss                    0
Net_Assets                      0
Donor_Dependency                0
Highest_Compensation            0
dtype: int64
/tmp/ipykernel_11995/3385659435.py:12: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[col].fillna(0, inplace=True)
/tmp/ipykernel_11995/3385659435.py:23: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[col].fillna(median_val, inplace=True)

Data Cleaning: Final Check

Code Cell 14
# Display the first few rows of the cleaned DataFrame
print("\nDataFrame head after cleaning:")
print(df.head())

# Display data types of the cleaned DataFrame
print("\nDataFrame dtypes after cleaning:")
print(df.dtypes)
DataFrame head after cleaning:
   rank                                   name             category  \
0     1                        Feeding America       Domestic Needs   
1     2                               Good 360  International Needs   
2     3                   United Way Worldwide       Domestic Needs   
3     4  St. Jude Children's Research Hospital              Medical   
4     5                          Direct Relief  International Needs   

   private donations  total revenue  fundraising efficiency %  \
0       4.270000e+09   4.360000e+09                        99   
1       2.670000e+09   2.690000e+09                       100   
2       2.600000e+09   3.510000e+09                        88   
3       2.460000e+09   3.300000e+09                        84   
4       2.260000e+09   2.270000e+09                       100   

   charitable commitment %                headquarter  total_expense  \
0                       98          Chicago, Illinois   4.300000e+09   
1                      100       Alexandria, Virginia   2.700000e+09   
2                       85       Alexandria, Virginia   3.500000e+09   
3                       70         Memphis, Tennessee   2.400000e+09   
4                       99  Santa Barbara, California   2.100000e+09   

   tot_exp_char_service  tot_exp_management & general  tot_exp_fundraising  \
0          4.300000e+09                    15000000.0           63000000.0   
1          2.700000e+09                     3000000.0            2000000.0   
2          3.000000e+09                   234000000.0          308000000.0   
3          1.600000e+09                   318000000.0          404000000.0   
4          2.100000e+09                     9000000.0            3000000.0   

   surplus_loss    Net_Assets  Donor_Dependency  Highest_Compensation  
0    13000000.0  5.780000e+08             100.0              959665.0  
1    19000000.0  7.300000e+07              99.0              257551.0  
2    13000000.0  4.290000e+08             100.0             3622146.0  
3   926000000.0  1.020000e+10              62.0             1546372.0  
4   198000000.0  1.100000e+09              91.0              652610.0  

DataFrame dtypes after cleaning:
rank                              int64
name                             object
category                         object
private donations               float64
total revenue                   float64
fundraising efficiency %          int64
charitable commitment %           int64
headquarter                      object
total_expense                   float64
tot_exp_char_service            float64
tot_exp_management & general    float64
tot_exp_fundraising             float64
surplus_loss                    float64
Net_Assets                      float64
Donor_Dependency                float64
Highest_Compensation            float64
dtype: object
  1. Create Visualizations: Create at least 2 different types of visualizations that help answer your research question. Choose from:
  1. Customize Your Plots: Make sure your visualizations are clear and informative by:

Please create code cells below this to create your visualizations.

Code Cell 15
plt.figure(figsize=(10, 6))
sns.histplot(df['fundraising efficiency %'], bins=10, kde=True, color='skyblue')
plt.title('Distribution of Fundraising Efficiency %')
plt.xlabel('Fundraising Efficiency (%)')
plt.ylabel('Number of Charities')
plt.grid(axis='y', alpha=0.75)
plt.tight_layout()
plt.show()
Notebook output plot
Code Cell 16
plt.figure(figsize=(10, 6))
sns.boxplot(x=df['fundraising efficiency %'], color='lightgreen')
plt.title('Box Plot of Fundraising Efficiency %')
plt.xlabel('Fundraising Efficiency (%)')
plt.grid(axis='x', alpha=0.75)
plt.tight_layout()
plt.show()
Notebook output plot
Code Cell 17
import matplotlib.pyplot as plt
import seaborn as sns

# Create a bar chart using the already calculated category_efficiency
plt.figure(figsize=(12, 7))
sns.barplot(x=category_efficiency.index, y=category_efficiency.values, hue=category_efficiency.index, palette='viridis', legend=False)
plt.title('Average Fundraising Efficiency by Nonprofit Category')
plt.xlabel('Nonprofit Category')
plt.ylabel('Average Fundraising Efficiency (%)')
plt.xticks(rotation=45, ha='right') # Rotate x-axis labels for better readability
plt.tight_layout() # Adjust layout to prevent labels from overlapping
plt.show()
Notebook output plot
Code Cell 18
# Create a bar chart for average highest compensation by category
plt.figure(figsize=(12, 7))
sns.barplot(x=category_highest_compensation.index, y=category_highest_compensation.values, hue=category_highest_compensation.index, palette='viridis', legend=False)
plt.title('Average Highest Compensation by Nonprofit Category')
plt.xlabel('Nonprofit Category')
plt.ylabel('Average Highest Compensation ($)')
plt.xticks(rotation=45, ha='right')
plt.ticklabel_format(style='plain', axis='y') # Prevent scientific notation on y-axis for compensation
plt.tight_layout()
plt.show()
Notebook output plot
Code Cell 19
# Create a scatter plot to visualize fundraising efficiency against donor dependency
plt.figure(figsize=(12, 7))
sns.scatterplot(x=df['Donor_Dependency'], y=df['fundraising efficiency %'], hue=df['category'], palette='viridis', s=100)
plt.title('Fundraising Efficiency vs. Donor Dependency by Category')
plt.xlabel('Donor Dependency (%)')
plt.ylabel('Fundraising Efficiency (%)')
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend(title='Category', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
Notebook output plot
Code Cell 20
plt.figure(figsize=(12, 7))
sns.scatterplot(x=df['Highest_Compensation'], y=df['fundraising efficiency %'], hue=df['category'], palette='viridis', s=100)
plt.title('Fundraising Efficiency vs. Highest Compensation by Category')
plt.xlabel('Highest Compensation ($)')
plt.ylabel('Fundraising Efficiency (%)')
plt.grid(True, linestyle='--', alpha=0.7)
plt.ticklabel_format(style='plain', axis='x') # Prevent scientific notation on x-axis for compensation
plt.legend(title='Category', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
Notebook output plot
  1. Analysis: Write 1-2 paragraphs discussing:

Please create a text cell below this cell to write your analysis.

My research question cannot be fully answered with the available data, but my exploration did yield some interesting insights. This data is collected on Forbes top 100 nonprofits, and these organizations had a high rate of fundraising efficiency, with a mean near 90%. The standard deviation was about 7 percentage points, indicating that there is variability, but that most organizations in this study were close to the mean.

I expected that higher donor dependence would be correllated with greater fundraising efficiency, but this did not appear to be the case. I also compared highest compensation to fundraising efficiency, expecting that higher compensation may predict a looser spending policy and reduced efficiency. Surprisingly, some of the organizations with higher-than-average compensation still saw high levels of efficiency, while others with lower compensation dipped below 70% efficiency.

My analysis is limited by the study group, Forbes top 100 nonprofits. These high-performers may have attributes in common that bias fundraising efficiency and obscure important variables. Further research might look at a broader selection of organizations to determine the predictors of fundraising efficiency.

Submission Requirements

NOTE: If you use a local dataset, you MUST submit the dataset along with your notebook to get credit.

Grading Criteria (5 points)

Remember: The goal is not just to create visualizations, but to use them to gain meaningful insights about your data. Think like a data scientist and let your curiosity guide your exploration!