In this project, I analyze historical bike trip data from a Chicao bike-share company, using R and create powerful visualizations using Tableau.
A more in-depth breakdown of the scenario is included below, followed
by my full report.
Scenario
Cyclistic is a bike-share company in Chicago with over 5,824 bikes that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. There are 2 types of customers: 1) casual riders who purchase single-ride or full-day memberships and 2) annual members who purchase annual memberships.
I am a junior data analyst working in the marketing analyst team and the director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore,your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
ASK: Defining the Business Task
Three questions will guide the future marketing program:
- How do annual members and casual riders use Cyclistic bikes differently?
- Why would casual riders buy Cyclistic annual memberships?
- How can Cyclistic use digital media to inuence casual riders to become members?
The dierctor of marketing and my manager has assigned me the first question to answer. Therefore the business task can be stated as follows:
Analyze historical bike trip data to identify trends in how annual members and casual riders use Cyclistic bikes differently.
PREPARE: Data sources
We used historical bike trip data from the last full year (12 months): January 2022 – December 2022. This helps us get a full picture of the annual calendar year and the potential effect from seasons factors. The data is from Divvy Bike Trip Data and has been made publicly available by Motivate International Inc. under this license . We downloaed the following CSV files:
1) 2022-01_divvy_trip-data.csv
2) 2022-02_divvy_trip-data.csv
3) 2022-03_divvy_trip-data.csv
4) 2022-04_divvy_trip-data.csv
5) 2022-05_divvy_trip-data.csv
6) 2022-06_divvy_trip-data.csv
7) 2022-07_divvy_trip-data.csv
8) 2022-08_divvy_trip-data.csv
9) 2022-09_divvy_trip-data.csv //renamed to match other filenames.
10) 2022-10_divvy_trip-data.csv
11) 2022-11_divvy_trip-data.csv
12) 2022-12_divvy_trip-data.csv
The data is organized with each row (record) corresponding to a single trip identified by ride_id
and incldues the following columns (fields):
* ride_id #Ride id - unique
* rideable_type #Bike type - Classic, Docked, Electric
* started_at #Trip start day and time
* ended_at #Trip end day and time
* start_station_name #Trip start station
* start_station_id #Trip start station id
* end_station_name #Trip end station
* end_station_id #Trip end station id
* start_lat #Trip start latitude
* start_lng #Trip start longitute
* end_lat #Trip end latitude
* end_lat #Trip end longitute
* member_casual #Rider type - Member or Casual
There are no issues with bias or credibility as all personal identifiable information (PII) has been removed. The data is also credible as it is primary data from the company itself. In other words, it ROCCC’s:
- Reliable and Original: the data is both reliable and original as it is primary source data.
- Comprehensive: the data has all of the relevant fields necessary for our historical analysis.
- Current: the data is current as it is from the specific time frame we need 2022 and is updated monthly.
- Cited: the data is cited as it is verified as a primary source. I have saved a folder of the original data and made copies to manipulate for my analysis.
PROCESS: Data Cleaning & Manipulation
R (Programming Language): Initial Data Cleaning and Manipulation
Due to the large size and number of files, I performed my processing and analysis using R and exported it for visualization using Tableau.
Step 0: Initialize Workspace
Step 1: Collect Data
I downloaded the files and renamed 2022-09_divvy_trip-data.csv
to match the other filenames.
Step 2: Wrangle Data and Combine into a Single File
Note, we will need to compare the column names to see if an rbind is possible. We use the helpful compare_df_cols_same
fcn from our janitor library. This saves space as the alternative would be to call the column name of each file and manually compare them (e.g. colnames(jan22)
).
Then we inspect the dataframes to look for incongruencies.
Lastly, we combine the files into a single dataframe.
Step 3: Clean Up and Add Data
First, we inspect the new dataframe that has been created:
We note that there are a few problems we need to fix:
- The data can only be aggregated at the ride-level, which is too granular. We will want to add some additional columns of data (e.g. day, month, year) that provide additional opportunities to aggregate the data.
- We will want to add a calculated field for length of ride as
ride_length
for helpful analysis and inspect it.
Note, we need to convert ride_length
to numeric and convert it to minutes as it’s easier to understand:
Before we continue, we decide to remove “bad” data that includes missing start/end station id
and/or names or has negative ride_length
values. Then, we will create a trip_type
variable to inspect if there are any insights in how casual riders and annual members use the service differently based on this:
ANALYZE: Summary of Calculations, Trends, and Relationships
Descriptive Analysis
We will begin with a descriptive analysis on our cleaned up data set all_trips_v2
. We will compare the mean, median, max, and min between casual riders and annual members.
We also analyze by DAY, MONTH, and RIDE_TYPE by WEEKDAY
There are a few things to note:
- At first glance, casual riders’ trips are almost 2x as long as annual members’ on average.
- However, we note that the mean for the total trips is 10.60 min while the mean is 17.10 min, indicating a presence of outliers skewing the average.
- This is confirmed as the max for a casual rider is 34,354 minutes, which would indicate a ride length of about 23 days! * Furthermore, the max ride for an annual member is 1,493 minutes(or 24 hours!)
- Upon glancing at the data, we recongize using the median and also the total amount of rides (versus length) may be more helfpul.
Visual Analysis
Next, we will conduct a visual analysis. First, we will note how annual members and casual riders differ on the total amount of rides given the day of the week.
Here is the resulting visualization:
We will break down this visualization further:
Here is the resulting visualization:
INSIGHT: Annual members prefer the weekdays whereas casual riders prefer the weekends.
Next, we will note how annual members and casual riders differ on the average duration of rides given the day of the week.
Here is the resulting visualization:
Last, we will note how annual members and casual riders differ on the average median of rides given the day of the week (due to the skewing mentioned earlier).
Here is the resulting visualization:
As we see, these totals are closer together, and the maximum values are around the 1000 minutes range compared to the 1750 minutes range from the mean graph above. Also, we recognize that in both the mean and median, casual riders have longer ride lengths than annual members
INSIGHT: Casual riders have longer ride lengths than annual members
SHARE: Supporting Visualizations and Key Findings
Lastly, we export the aggregate data for visualizations in Tableau:
Conclusion
Stakeholder presentation and dashboard
I’ve provided links below for my dashboard and shareholder presentation, which includes the following:
- A summary of my analysis
- Supporting visualizations and key findings
- Three recommendations based on my analysis
Tableau Dashboard: Bikeshare in Chicago