Introduction
This case study is my capstone project for the Google Data Analytics Certificate. It involves analysis of historical data for a fictional company, Cyclistic, a bike sharing company located in Chicago, to make recommendations for an upcoming marketing campaign. Although the company and scenario are fictitious, the data used for this project are real data collected between August 2020 – July 2021 from a bike share program in Chicago. In this project I am assuming the role of the junior analyst.
Scenario
Cyclistic is a fictional bike sharing company located in Chicago. It operates a fleet of more than 5,800 bicycles which can be accessed from over 600 docking stations across the city. Bikes can be borrowed from one docking station, ridden, then returned to any docking station in the system Over the years marketing campaigns have been broad and targeted a cross-section of potential users. Data analysis has shown that riders with an annual membership are more profitable than casual riders. The marketing team are interested in creating a campaign to encourage casual riders to convert to members.
The marketing analyst team would like to know how annual members and casual riders differ, why casual riders would buy a membership, and how Cyclistic can use digital media to influence casual riders to become members. The team is interested in analyzing the Cyclistic historical bike trip data to identify trends in the usage of bikes by casual and member riders.
I. ASK
Business Objective
To increase profitability by converting casual riders to annual members via a targeted marketing campaign.
Business Task for Junior Analyst
The junior analyst has been tasked with answering this question: How do annual members and casual riders use Cyclistic bikes differently?
Stakeholders
The stakeholders in this project include:
Lily Moreno, Director of Marketing at Cyclistic, who is responsible for the marketing campaigns at Cyclistic.
The Cyclistic marketing analytics team. This team is responsible for collecting, analyzing and reporting data to be used in marketing campaigns. I am the junior analyst on this team
The Cyclistic executive team. This team makes the final decision on the recommended marketing plan. They are notoriously detail-oriented.
II. PREPARE
Where is Data Located?
The data used for this analysis were obtained from the Motivate, a company employed by the City of Chicago to collect data on bike share usage.
How is the Data Organized?
The data is organized in monthly csv files. The most recent twelve months of data (August, 2020 – July 2021) were used for this project. The files consist of 13 columns containing information related to ride id, ridership type, ride time, start location and end location and geographic coordinates, etc.
Credibility of the Data
The data is collected directly by Motivate, Inc., the company that runs the Cyclistic Bike Share program for the City of Chicago. The data is comprehensive in that it consists of data for all the rides taken on the system and is not just a sample of the data. The data is current. It is released monthly and, as of August 2021, was current to July 2021. The City of Chicago makes the data available to the public.
Licensing, privacy, security, and accessibility
This data is anonymized as it has been stripped of all identifying information. This ensures privacy, but it limits the extent of the analysis possible. There is not enough data to determine if casual riders are repeat riders or if casual riders are residents of the Chicago area. The data is released under this license.
Ability of Data to be used to answer Business Question
One of the fields in the data records the type of rider; casual riders pay for individual or daily rides and member riders purchase annual subscription. This information is crucial to be able to determine differences between how the two groups use the bike share program.
Problems with the data
There are some problems with the data. Most of the the problems (duplicate records, missing fields, etc.) can be dealt with by data cleaning, but a couple of issues require further explanation.
Rideable-type Field
The rideable_type field contains one of three values – Electric bike, Classic bike or Docked bike. Electric and Classic bikes seem self-explanatory, but what exactly a Docked bike is is unclear. From a review of the data it seems that electric bikes were available to both types of users for the entire 12 month period; classic bikes were available to both groups of users but only from December 2, 2020 to July 31, 2021; and Docked Bikes were available to members from August 1, 2020 to January 13, 2021 and to casual users for the entire 12 months. For the purpose of this analysis these rideable types will not be used to segment the data or draw any conclusions about bike preferences as data collection for this variable is not consistent across the time period being analyzed.
Latitude and Longitude
There is also a challenge with the latitude and longitude coordinates for the stations. Each station is represented by a range of lat/long coordinates. The start/end latitude and longitude seem to indicate the location of a particular bike. Creating a list of popular stations is not difficult, but mapping the stations is more problematic. This was remedied by averaging the lat and long values for the stations before mapping. This resulted in the rides counts for a station matching the ride count for one set of lat/long coordinates.
III. PROCESS & CLEAN
What tools are you choosing and why?
For this project I choose to use RStudio Desktop to analyze and clean the data and Tableau to create the visualizations. The data set was too large to be processed in spreadsheets and RStudio Cloud.
Review of Data
Data was reviewed to get an overall understanding of content of fields, data formats, and to ensure its integrity. The review of the data involved, checking column names across the 12 original files and checking for missing values, trailing white spaces, duplicate records, and other data anomalies.
The review of the data revealed several problems:
- Duplicate record ID numbers
- Records with missing start or end stations
- Records with very short or very long ride durations
- Records for trips starting or ending at an administrative station (repair or testing station)
Once the initial review was completed, all twelve files were loaded into one data frame. The resulting amalgamated file consisted of 4.731,081 rows with 13 columns of character and numeric data. This matched the number of records in the twelve monthly files.
Extraction of Data from Existing Fields
To allow for more granular analysis of the data and more insights, several new columns were created and populated with data from the started_at column of date and time. These new columns were day, month, year, time and day of the week.
Another column was created to contain the trip duration (length of each trip). The data for this column was created by calculating the difference in time between the start and end time of the ride. Another version of this column was then created to contain the trip duration in minutes.
Data Cleaning
Duplicate records (based on the RIDE ID field) were removed. (209 records removed)
alltrips_v2 <- distinct(alltrips, ride_id, .keep_all=TRUE)
Records for trips less than 60 seconds (false starts) or longer than 24 hours were removed. Bikes out longer than 24 hours are considered stolen and the rider is charged for a replacement. (82,282 records removed)
alltrips_v2 <- alltrips_v2[!(alltrips_v2$ride_length<60 | alltrips_v2$ride_length>86400),]
Records with missing fields start_station, end_station, start/end lat/long fields were removed. (544,204 records removed)
alltrips_v3 <- alltrips_v2[!(is.na(alltrips_v2$start_station_id) | is.na(alltrips_v2$end_station_id) | is.na(alltrips_v2$ride_id) | is.na(alltrips_v2$rideable_type) | is.na(alltrips_v2$started_at) | is.na(alltrips_v2$ended_at) | is.na(alltrips_v2$end_lat) | is.na(alltrips_v2$end_lng)),]
Records for trips that started or ended at DIVVY CASSETTE REPAIR MOBILE STATION or HUBBARD ST BIKE CHECKING (LBS-WH-TEST) were removed as these are administrative stations. (143 records removed)
alltrips_v3<- alltrips_v3[!(alltrips_v3$start_station_name == "DIVVY CASSETTE REPAIR MOBILE STATION" | alltrips_v3$start_station_name == "HUBBARD ST BIKE CHECKING (LBS-WH-TEST)" | alltrips_v3$start_station_name == "WATSON TESTING DIVVY" | alltrips_v3$end_station_name == "DIVVY CASSETTE REPAIR MOBILE STATION" | alltrips_v3$end_station_name == "HUBBARD ST BIKE CHECKING (LBS-WH-TEST)" | alltrips_v3$end_station_name == "WATSON TESTING DIVVY"),]
Station names were checked for leading and trailing spaces during the analysis phase and there did not appear to be any.
Initially the data set contained 4,731,081 records. Once data was cleaned, 4.101,243 records remained. 13.3% of the records were removed.
IV. ANALYZE
Once the data was cleaned, analysis of the data was undertaken in RStudio to determine the following:
- Mean, median, maximum and minimum ride duration (by rider type)
- Average ride length by day and by rider type
- Count of trips by rider type
- Count of trips by bicycle type
- Count of the number of start stations
The cleaned data set was used to create a csv file that was exported from RStudio and imported into Tableau for further analysis and creation of visualizations.
Tableau was used to further analyze the data and determine:
- Ride duration
- Times of Day for rides
- Days of the week for rides
- Months of the year of the rides
- Top 20 start stations by user type
- Top 20 end stations by user type
Summary of analysis
From the analysis we can see that there are several key differences between casual and member riders.
Number and Length of Rides
Member riders take more trips than casual riders but casual riders take longer rides than member riders (Figure 1). Casual riders average 31.85 minutes per ride as opposed to 14.43 minutes for member riders (Figure 2). The total amount of time ridden by casual riders is greater than by member riders (Figure 3).
Trips by Time/Day/Month
The number of trips made by casual riders increases over the day, peaking at 5 pm. Member trips also peak at 5 pm but there are two smaller peaks earlier in the day at 8 am and lunch time, which corresponds with the work day.
Figure 5 shows that weekend days are popular with casual riders whereas member rider trips are spread out more evenly throughout the week.
The winter months (December, January and February) see very few rides. The summer months are popular with both types of riders. July is the busiest month for casual riders.
Station Usage
Member and casual riders also differ in the stations that are popular for starting and ending their rides.
Top starting and destination stations for casual riders cluster around tourist destinations within about 1 km of the lakefront from Lincoln Park in the north to the Field Museum in the south. Member riders top stations are more spread out and reflect office locations.
V. SHARE
Detailed documentation of R code is available on Kaggle and further, interactive visualizations are available on Tableau Public.
VI. ACT
Top Three Recommendations
Based on an analysis of the data, the following recommendations can be made to the Cyclistic stakeholders:
- The marketing campaign should be targeted at the popular start and end stations for casual riders.
- To reach the most riders, marketing should be targeted for the busiest casual rider days (Friday, Saturday and Sunday), busiest hours (afternoon) and the most popular months (June, July and August).
- Further data should be released or obtained to determine which casual riders are local to the Chicago area (as these riders are more likely to consider an annual membership than a tourist from out of the city) and to determine what changes might need to be made to the existing membership subscription model to make it more appealing to casual riders (casual riders have an average trip length of 32 minutes, longer on weekends, and the annual membership has ride lengths caps of 45 minutes).