Lorissa Hughes

Project 6: Data Exploration and Visualization

Overview

In this project, you will apply data exploration and visualization techniques to a real-world dataset of your choice. You'll practice using pandas for data analysis and create meaningful visualizations to gain insights from your data. This project builds on your Python programming skills while introducing you to the exciting world of data science.

In Lab6, you explore the titanic dataset and you should use it as a sample and pick a dataset of your interest to do a similar data exploration and visualization.

Project Objectives

Find and select a dataset from a real-world data source
Formulate a clear inquiry question based on your chosen dataset

Perform exploratory data analysis using pandas functions
Create at least two meaningful data visualizations

Analyze your findings and answer your inquiry question
Reflect on the insights gained through data exploration

Data Sources

You can choose from various online data sources such as:

Kaggle - Wide variety of datasets
data.gov - U.S. government data
Our World in Data - Global development data
GitHub - Curated datasets
Google Dataset Search - Search for datasets
Or any other reliable data source you find interesting

Part 1: Dataset Selection and Initial Exploration

TODO Tasks:

Select Your Dataset: Choose a dataset that interests you. Make sure it has at least 100 rows and multiple columns with different data types (e.g., numerical, categorical, dates).

Please create either text or code cells in between the TODO tasks to do the work.

AMERICA’S TOP 100 CHARITIES: https://www.kaggle.com/datasets/gingfrecs03/americas-top-100-charities$0

Import Libraries: Import the necessary libraries (pandas, matplotlib, seaborn) at the beginning of your notebook by creating code cells below this text cell.

Code Cell 1

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Load Your Data: Load your chosen dataset using pandas and save it to a pandas dataframe. Please create a code cell below this text cell to do it.

Code Cell 2

import kagglehub

# Download latest version
path = kagglehub.dataset_download("gingfrecs03/americas-top-100-charities")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'americas-top-100-charities' dataset.
Path to dataset files: /kaggle/input/americas-top-100-charities

Code Cell 3

import os

# Construct the full path to the CSV file
csv_file_path = os.path.join(path, "us_charity_forbes_data.csv")

# Load the CSV into a pandas DataFrame
df = pd.read_csv(csv_file_path)

# Display the first 5 rows of the DataFrame
print(df.head())

   rank                                   name             category  \
0     1                        Feeding America       Domestic Needs   
1     2                               Good 360  International Needs   
2     3                   United Way Worldwide       Domestic Needs   
3     4  St. Jude Children's Research Hospital              Medical   
4     5                          Direct Relief  International Needs   

  private donations total revenue  fundraising efficiency %  \
0           $4.27 B       $4.36 B                        99   
1           $2.67 B       $2.69 B                       100   
2            $2.6 B       $3.51 B                        88   
3           $2.46 B        $3.3 B                        84   
4           $2.26 B       $2.27 B                       100   

   charitable commitment %                headquarter total_expense  \
0                       98          Chicago, Illinois         $4.3B   
1                      100       Alexandria, Virginia         $2.7B   
2                       85       Alexandria, Virginia         $3.5B   
3                       70         Memphis, Tennessee         $2.4B   
4                       99  Santa Barbara, California         $2.1B   

  tot_exp_char_service tot_exp_management & general tot_exp_fundraising  \
0                $4.3B                         $15M                $63M   
1                $2.7B                          $3M                 $2M   
2                  $3B                        $234M               $308M   
3                $1.6B                        $318M               $404M   
4                $2.1B                          $9M                 $3M   

  surplus_loss Net_Assets Donor_Dependency Highest_Compensation  
0         $13M      $578M             100%             $959,665  
1         $19M       $73M              99%             $257,551  
2           Na         Na            100%S           $3,622,146  
3        $926M     $10.2B              62%           $1,546,372  
4        $198M      $1.1B              91%             $652,610

Data Summary: Use the following functions to do inital exploration your dataset:

info() - Display information about the dataset structure
describe() - Show statistical summary of numerical columns
head() - Display the first 5 rows (df.head())
shape - Show the dimensions of the dataset (e.g. df.shape)
columns - Get the column names (e.g. df.columns)

You may use 5 code cells below this text cell to accomplish it.

Code Cell 4

print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 16 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   rank                          100 non-null    int64 
 1   name                          100 non-null    object
 2   category                      100 non-null    object
 3   private donations             100 non-null    object
 4   total revenue                 100 non-null    object
 5   fundraising efficiency %      100 non-null    int64 
 6   charitable commitment %       100 non-null    int64 
 7   headquarter                   100 non-null    object
 8   total_expense                 100 non-null    object
 9   tot_exp_char_service          99 non-null     object
 10  tot_exp_management & general  99 non-null     object
 11  tot_exp_fundraising           98 non-null     object
 12  surplus_loss                  100 non-null    object
 13  Net_Assets                    100 non-null    object
 14  Donor_Dependency              100 non-null    object
 15  Highest_Compensation          100 non-null    object
dtypes: int64(3), object(13)
memory usage: 12.6+ KB
None

Code Cell 5

print(df.describe())

             rank  fundraising efficiency %  charitable commitment %
count  100.000000                100.000000               100.000000
mean    50.500000                 91.390000                87.080000
std     29.011492                  7.027931                 8.662796
min      1.000000                 67.000000                58.000000
25%     25.750000                 86.000000                81.000000
50%     50.500000                 93.000000                88.000000
75%     75.250000                 97.000000                94.000000
max    100.000000                100.000000               100.000000

Code Cell 6

print(df.head())

   rank                                   name             category  \
0     1                        Feeding America       Domestic Needs   
1     2                               Good 360  International Needs   
2     3                   United Way Worldwide       Domestic Needs   
3     4  St. Jude Children's Research Hospital              Medical   
4     5                          Direct Relief  International Needs   

  private donations total revenue  fundraising efficiency %  \
0           $4.27 B       $4.36 B                        99   
1           $2.67 B       $2.69 B                       100   
2            $2.6 B       $3.51 B                        88   
3           $2.46 B        $3.3 B                        84   
4           $2.26 B       $2.27 B                       100   

   charitable commitment %                headquarter total_expense  \
0                       98          Chicago, Illinois         $4.3B   
1                      100       Alexandria, Virginia         $2.7B   
2                       85       Alexandria, Virginia         $3.5B   
3                       70         Memphis, Tennessee         $2.4B   
4                       99  Santa Barbara, California         $2.1B   

  tot_exp_char_service tot_exp_management & general tot_exp_fundraising  \
0                $4.3B                         $15M                $63M   
1                $2.7B                          $3M                 $2M   
2                  $3B                        $234M               $308M   
3                $1.6B                        $318M               $404M   
4                $2.1B                          $9M                 $3M   

  surplus_loss Net_Assets Donor_Dependency Highest_Compensation  
0         $13M      $578M             100%             $959,665  
1         $19M       $73M              99%             $257,551  
2           Na         Na            100%S           $3,622,146  
3        $926M     $10.2B              62%           $1,546,372  
4        $198M      $1.1B              91%             $652,610

Code Cell 7

print(df.shape)

(100, 16)

Code Cell 8

print(df.columns)

Index(['rank', 'name', 'category', 'private donations', 'total revenue',
       'fundraising efficiency %', 'charitable commitment %', 'headquarter',
       'total_expense', 'tot_exp_char_service', 'tot_exp_management & general',
       'tot_exp_fundraising', 'surplus_loss', 'Net_Assets', 'Donor_Dependency',
       'Highest_Compensation'],
      dtype='object')

Inquiry Question: Write a clear, specific question that you want to answer using your dataset by creating a text cell below and write the question in bold please.

For example:

"How has the average temperature changed in major cities over the past decade?"

"What factors are most strongly correlated with housing prices?"

"Is there a relationship between education level and income?"

What factors are associated with efficient fundraising?

Part 2: Data Visualization and Analysis

TODO Tasks:

Data Cleaning (if necessary): Check for and handle any missing values, duplicates, or data quality issues.

NOTE: This could be a daunting task even for seasoned data scientists. My suggestion is to use a dataset that is already clean if you can for the purpose of completing this assignment in a week.

Examples:

Checking for missing values: df.isnull().sum()
Dropping rows with any missing values: df.dropna(inplace=True)
Dropping columns with all missing values: df.dropna(axis=1, how='all', inplace=True)
Removing duplicate row: df.drop_duplicates(inplace=True)

Data Cleaning: Checking Missing Values and Duplicates

Code Cell 9

# Check for missing values
print("Missing values before cleaning:")
print(df.isnull().sum())

Missing values before cleaning:
rank                            0
name                            0
category                        0
private donations               0
total revenue                   0
fundraising efficiency %        0
charitable commitment %         0
headquarter                     0
total_expense                   0
tot_exp_char_service            1
tot_exp_management & general    1
tot_exp_fundraising             2
surplus_loss                    0
Net_Assets                      0
Donor_Dependency                0
Highest_Compensation            0
dtype: int64

Code Cell 10

# Check for duplicate rows
print("\nNumber of duplicate rows:", df.duplicated().sum())

Number of duplicate rows: 0

Data Cleaning: Converting Financial Columns to Numeric

Code Cell 11

def clean_and_convert_financial(series):
    # Replace 'Na' with NaN
    series = series.replace('Na', float('NaN'))
    # Remove '$', ',', ' ' and '%' signs
    series = series.astype(str).str.replace('$', '', regex=False).str.replace(',', '', regex=False).str.replace(' ', '', regex=False)

    # Handle 'B' (billions), 'M' (millions)
    def convert_to_numeric(value):
        if isinstance(value, str):
            value = value.upper()
            if 'B' in value: # Billions
                return float(value.replace('B', '')) * 1_000_000_000
            elif 'M' in value: # Millions
                return float(value.replace('M', '')) * 1_000_000
            elif '%' in value: # Percentages (for Donor_Dependency and others)
                return float(value.replace('%', '')) # Keep as whole number percentage for now
        try:
            return float(value)
        except (ValueError, TypeError):
            return float('NaN')

    return series.apply(convert_to_numeric)

Code Cell 12

# List of financial columns to clean and convert
financial_columns = [
    'private donations',
    'total revenue',
    'total_expense',
    'tot_exp_char_service',
    'tot_exp_management & general',
    'tot_exp_fundraising',
    'surplus_loss',
    'Net_Assets',
    'Highest_Compensation'
]

# Apply the cleaning function to each financial column
for col in financial_columns:
    df[col] = clean_and_convert_financial(df[col])

# The 'Donor_Dependency' column needs special handling for non-numeric characters like '100%S'
# Remove all non-numeric characters except for '.' before converting to float
df['Donor_Dependency'] = df['Donor_Dependency'].astype(str).str.replace(r'[^0-9.]', '', regex=True).replace('', float('NaN')).astype(float)

# Re-check df.info() to see the updated dtypes and missing values after conversion
print("\nDataFrame info after converting financial columns:")
print(df.info())

# Display min and max values for the cleaned financial columns and Donor_Dependency
print("\nMin and Max values for financial columns:")
all_cleaned_numeric_cols = financial_columns + ['Donor_Dependency']
for col in all_cleaned_numeric_cols:
    if col in df.columns and pd.api.types.is_numeric_dtype(df[col]):
        print(f"{col}: Min={df[col].min():,.2f}, Max={df[col].max():,.2f}")

DataFrame info after converting financial columns:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 16 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   rank                          100 non-null    int64  
 1   name                          100 non-null    object 
 2   category                      100 non-null    object 
 3   private donations             100 non-null    float64
 4   total revenue                 100 non-null    float64
 5   fundraising efficiency %      100 non-null    int64  
 6   charitable commitment %       100 non-null    int64  
 7   headquarter                   100 non-null    object 
 8   total_expense                 99 non-null     float64
 9   tot_exp_char_service          99 non-null     float64
 10  tot_exp_management & general  99 non-null     float64
 11  tot_exp_fundraising           98 non-null     float64
 12  surplus_loss                  97 non-null     float64
 13  Net_Assets                    97 non-null     float64
 14  Donor_Dependency              99 non-null     float64
 15  Highest_Compensation          99 non-null     float64
dtypes: float64(10), int64(3), object(3)
memory usage: 12.6+ KB
None

Min and Max values for financial columns:
private donations: Min=186,000,000.00, Max=4,270,000,000.00
total revenue: Min=-474,000,000.00, Max=28,560,000,000.00
total_expense: Min=60,000,000.00, Max=28,000,000,000.00
tot_exp_char_service: Min=35,000,000.00, Max=22,700,000,000.00
tot_exp_management & general: Min=1,000,000.00, Max=5,300,000,000.00
tot_exp_fundraising: Min=1,000,000.00, Max=404,000,000.00
surplus_loss: Min=-1,700,000,000.00, Max=2,200,000,000.00
Net_Assets: Min=32,000,000.00, Max=19,900,000,000.00
Highest_Compensation: Min=73,940.00, Max=9,480,164.00
Donor_Dependency: Min=20.00, Max=780.00

Data Cleaning: Handling Remaining Missing Values

Code Cell 13

# For the purpose of this analysis, we will fill the missing expense values with 0,
# assuming 'NaN' implies no expenditure in that category for that charity.
# Alternatively, we could fill with mean/median or drop rows if more rigorous imputation is needed.

columns_to_fill_na = [
    'tot_exp_char_service',
    'tot_exp_management & general',
    'tot_exp_fundraising'
]

for col in columns_to_fill_na:
    df[col].fillna(0, inplace=True)

# For other columns like 'total_expense', 'surplus_loss', 'Net_Assets', 'Highest_Compensation',
# and 'Donor_Dependency' (if any NaNs remain from conversion issues)
# we'll fill with the median for a more robust approach, or 0 if it makes sense.
# Let's check remaining NaNs for these first.

# Fill remaining NaNs in 'total_expense', 'surplus_loss', 'Net_Assets', 'Highest_Compensation' with median
for col in ['total_expense', 'surplus_loss', 'Net_Assets', 'Highest_Compensation', 'Donor_Dependency']:
    if df[col].isnull().any():
        median_val = df[col].median()
        df[col].fillna(median_val, inplace=True)

print("\nMissing values after handling all NaNs:")
print(df.isnull().sum())

Missing values after handling all NaNs:
rank                            0
name                            0
category                        0
private donations               0
total revenue                   0
fundraising efficiency %        0
charitable commitment %         0
headquarter                     0
total_expense                   0
tot_exp_char_service            0
tot_exp_management & general    0
tot_exp_fundraising             0
surplus_loss                    0
Net_Assets                      0
Donor_Dependency                0
Highest_Compensation            0
dtype: int64

/tmp/ipykernel_11995/3385659435.py:12: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[col].fillna(0, inplace=True)
/tmp/ipykernel_11995/3385659435.py:23: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[col].fillna(median_val, inplace=True)

Data Cleaning: Final Check

Code Cell 14

# Display the first few rows of the cleaned DataFrame
print("\nDataFrame head after cleaning:")
print(df.head())

# Display data types of the cleaned DataFrame
print("\nDataFrame dtypes after cleaning:")
print(df.dtypes)

DataFrame head after cleaning:
   rank                                   name             category  \
0     1                        Feeding America       Domestic Needs   
1     2                               Good 360  International Needs   
2     3                   United Way Worldwide       Domestic Needs   
3     4  St. Jude Children's Research Hospital              Medical   
4     5                          Direct Relief  International Needs   

   private donations  total revenue  fundraising efficiency %  \
0       4.270000e+09   4.360000e+09                        99   
1       2.670000e+09   2.690000e+09                       100   
2       2.600000e+09   3.510000e+09                        88   
3       2.460000e+09   3.300000e+09                        84   
4       2.260000e+09   2.270000e+09                       100   

   charitable commitment %                headquarter  total_expense  \
0                       98          Chicago, Illinois   4.300000e+09   
1                      100       Alexandria, Virginia   2.700000e+09   
2                       85       Alexandria, Virginia   3.500000e+09   
3                       70         Memphis, Tennessee   2.400000e+09   
4                       99  Santa Barbara, California   2.100000e+09   

   tot_exp_char_service  tot_exp_management & general  tot_exp_fundraising  \
0          4.300000e+09                    15000000.0           63000000.0   
1          2.700000e+09                     3000000.0            2000000.0   
2          3.000000e+09                   234000000.0          308000000.0   
3          1.600000e+09                   318000000.0          404000000.0   
4          2.100000e+09                     9000000.0            3000000.0   

   surplus_loss    Net_Assets  Donor_Dependency  Highest_Compensation  
0    13000000.0  5.780000e+08             100.0              959665.0  
1    19000000.0  7.300000e+07              99.0              257551.0  
2    13000000.0  4.290000e+08             100.0             3622146.0  
3   926000000.0  1.020000e+10              62.0             1546372.0  
4   198000000.0  1.100000e+09              91.0              652610.0  

DataFrame dtypes after cleaning:
rank                              int64
name                             object
category                         object
private donations               float64
total revenue                   float64
fundraising efficiency %          int64
charitable commitment %           int64
headquarter                      object
total_expense                   float64
tot_exp_char_service            float64
tot_exp_management & general    float64
tot_exp_fundraising             float64
surplus_loss                    float64
Net_Assets                      float64
Donor_Dependency                float64
Highest_Compensation            float64
dtype: object

Create Visualizations: Create at least 2 different types of visualizations that help answer your research question. Choose from:

Bar charts
Line plots
Scatter plots
Histograms
Box plots
Heatmaps
Any other appropriate visualization

Customize Your Plots: Make sure your visualizations are clear and informative by:

Adding appropriate titles
Labeling axes
Using colors effectively
Adding legends when necessary

Please create code cells below this to create your visualizations.

Code Cell 15

plt.figure(figsize=(10, 6))
sns.histplot(df['fundraising efficiency %'], bins=10, kde=True, color='skyblue')
plt.title('Distribution of Fundraising Efficiency %')
plt.xlabel('Fundraising Efficiency (%)')
plt.ylabel('Number of Charities')
plt.grid(axis='y', alpha=0.75)
plt.tight_layout()
plt.show()

Code Cell 16

plt.figure(figsize=(10, 6))
sns.boxplot(x=df['fundraising efficiency %'], color='lightgreen')
plt.title('Box Plot of Fundraising Efficiency %')
plt.xlabel('Fundraising Efficiency (%)')
plt.grid(axis='x', alpha=0.75)
plt.tight_layout()
plt.show()

Code Cell 17

import matplotlib.pyplot as plt
import seaborn as sns

# Create a bar chart using the already calculated category_efficiency
plt.figure(figsize=(12, 7))
sns.barplot(x=category_efficiency.index, y=category_efficiency.values, hue=category_efficiency.index, palette='viridis', legend=False)
plt.title('Average Fundraising Efficiency by Nonprofit Category')
plt.xlabel('Nonprofit Category')
plt.ylabel('Average Fundraising Efficiency (%)')
plt.xticks(rotation=45, ha='right') # Rotate x-axis labels for better readability
plt.tight_layout() # Adjust layout to prevent labels from overlapping
plt.show()

Code Cell 18

# Create a bar chart for average highest compensation by category
plt.figure(figsize=(12, 7))
sns.barplot(x=category_highest_compensation.index, y=category_highest_compensation.values, hue=category_highest_compensation.index, palette='viridis', legend=False)
plt.title('Average Highest Compensation by Nonprofit Category')
plt.xlabel('Nonprofit Category')
plt.ylabel('Average Highest Compensation ($)')
plt.xticks(rotation=45, ha='right')
plt.ticklabel_format(style='plain', axis='y') # Prevent scientific notation on y-axis for compensation
plt.tight_layout()
plt.show()

Code Cell 19

# Create a scatter plot to visualize fundraising efficiency against donor dependency
plt.figure(figsize=(12, 7))
sns.scatterplot(x=df['Donor_Dependency'], y=df['fundraising efficiency %'], hue=df['category'], palette='viridis', s=100)
plt.title('Fundraising Efficiency vs. Donor Dependency by Category')
plt.xlabel('Donor Dependency (%)')
plt.ylabel('Fundraising Efficiency (%)')
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend(title='Category', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

Code Cell 20

plt.figure(figsize=(12, 7))
sns.scatterplot(x=df['Highest_Compensation'], y=df['fundraising efficiency %'], hue=df['category'], palette='viridis', s=100)
plt.title('Fundraising Efficiency vs. Highest Compensation by Category')
plt.xlabel('Highest Compensation ($)')
plt.ylabel('Fundraising Efficiency (%)')
plt.grid(True, linestyle='--', alpha=0.7)
plt.ticklabel_format(style='plain', axis='x') # Prevent scientific notation on x-axis for compensation
plt.legend(title='Category', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

Analysis: Write 1-2 paragraphs discussing:

Whether your research question can be answered with the available data
What insights you discovered through your visualizations
Any unexpected or interesting findings
Limitations of your analysis
Suggestions for further investigation

Please create a text cell below this cell to write your analysis.

My research question cannot be fully answered with the available data, but my exploration did yield some interesting insights. This data is collected on Forbes top 100 nonprofits, and these organizations had a high rate of fundraising efficiency, with a mean near 90%. The standard deviation was about 7 percentage points, indicating that there is variability, but that most organizations in this study were close to the mean.

I expected that higher donor dependence would be correllated with greater fundraising efficiency, but this did not appear to be the case. I also compared highest compensation to fundraising efficiency, expecting that higher compensation may predict a looser spending policy and reduced efficiency. Surprisingly, some of the organizations with higher-than-average compensation still saw high levels of efficiency, while others with lower compensation dipped below 70% efficiency.

My analysis is limited by the study group, Forbes top 100 nonprofits. These high-performers may have attributes in common that bias fundraising efficiency and obscure important variables. Further research might look at a broader selection of organizations to determine the predictors of fundraising efficiency.

Submission Requirements

NOTE: If you use a local dataset, you MUST submit the dataset along with your notebook to get credit.

Download your notebook as CS110_Project6.ipynb
Ensure all TODO tasks are completed
Make sure your code runs without errors
Include clear comments explaining your work
Submit the notebook file

Grading Criteria (5 points)

Dataset Selection: Appropriate choice of dataset with sufficient complexity (0.5 point)
Inquiry Question: Clear, specific, and answerable with the chosen data (1 point)
Data Exploration: Proper use of pandas functions for initial analysis (0.5 point)
Visualizations: At least 2 clear, well-formatted, and relevant plots (1 point)
Analysis: Thoughtful interpretation of findings and answer to research question (1 point)
Code Quality: Clean, commented, and well-organized code (1 point)

Remember: The goal is not just to create visualizations, but to use them to gain meaningful insights about your data. Think like a data scientist and let your curiosity guide your exploration!