Tracking bikeshare use using GBFS feeds

The General Bikeshare Feed Specification (GBFS) is a popular standard, used by bikeshare systems around the globe, for publishing live system data. Its stated purpose is to allow third-party applications to interface with the live data and to allow municipalities to monitor compliance with local regulations. For my purposes, it also lets me monitor system usage in real time. Here I'll go through the steps I use to track bikeshare usage with GBFS feeds. I currently publish live tracking of Vancouver bikeshare systems at @VanBikeShareBot and will soon be adding tracking of Toronto, Hamilton and Montreal systems.

I've bundled the tools I use for GBFS monitoring into a Python package that can be found on GitHub. The Bikedata package is not a full-service GBFS client -- for that, you might prefer GBFS-client. Bikedata provides minimal functionality for querying GBFS feeds and returning Pandas dataframes, and implements some helper functions for persistent monitoring of bikeshare systems.

The GBFS spec

I won't go into detail about the GBFS spec; more information can be found on the project's GitHub page. Suffice it to say that a GBFS-compliant system offers several distinct JSON endpoints that provide information about the system. For example, Mobi Bikes in Vancouver provides:

Mobi doesn't have free-floating bikes, but systems that do also have a free_bike_status.json feed providing the locations of available free-floating bikes.
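As a rough illustration of how these endpoints are discovered, here is a minimal sketch that pulls feed names and URLs out of a GBFS auto-discovery document. It assumes the gbfs.json layout used by versions 1 and 2 of the spec. The sample document and its example.com URLs are made up, and this is not necessarily how Bikedata does it.

```python
import json

# Hypothetical auto-discovery document, shaped like a GBFS v1/v2 gbfs.json.
# The URLs are placeholders, not a real system's endpoints.
SAMPLE_GBFS = """
{
  "last_updated": 1546300800,
  "ttl": 60,
  "data": {
    "en": {
      "feeds": [
        {"name": "system_information", "url": "https://example.com/gbfs/en/system_information.json"},
        {"name": "station_information", "url": "https://example.com/gbfs/en/station_information.json"},
        {"name": "station_status", "url": "https://example.com/gbfs/en/station_status.json"}
      ]
    }
  }
}
"""


def list_feeds(gbfs_doc, language="en"):
    """Return a {feed name: url} mapping for one language of a discovery document."""
    feeds = gbfs_doc["data"][language]["feeds"]
    return {feed["name"]: feed["url"] for feed in feeds}


feeds = list_feeds(json.loads(SAMPLE_GBFS))
for name, url in feeds.items():
    print(name, url)
```

In practice the document would be downloaded from the system's published gbfs.json address rather than embedded as a string.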

Tracking station-based systems

Many bikeshare systems, especially those in dense city centres, only allow trips to begin and end at physical stations. To monitor these systems, I periodically query the station_status.json and record the number of bikes at each station. If the number of available bikes decreases by N, I count that as N departures from the station. If it increases by M, I count that as M bikes returned.
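The counting rule above can be sketched in a few lines. This is a simplified illustration, not the Bikedata implementation: the function name, the snapshot format (a dict of station id to bikes available) and the station ids are all assumptions made for the example.

```python
def count_activity(previous, current):
    """Compare two snapshots of {station_id: bikes_available}.

    Returns {station_id: (departures, returns)} for stations present in both.
    """
    activity = {}
    for station_id, bikes_now in current.items():
        if station_id not in previous:
            continue  # new station: no earlier count to compare against
        change = bikes_now - previous[station_id]
        departures = -change if change < 0 else 0  # count dropped by N
        returns = change if change > 0 else 0      # count rose by M
        activity[station_id] = (departures, returns)
    return activity


# Hypothetical counts from two consecutive queries of station_status.json
before = {"0001": 10, "0002": 4, "0003": 7}
after = {"0001": 7, "0002": 6, "0003": 7}
activity = count_activity(before, after)
print(activity)  # {'0001': (3, 0), '0002': (0, 2), '0003': (0, 0)}
```

One consequence of this rule is that a bike taken and a bike returned at the same station between two queries cancel out, so the counts are a lower bound on real activity.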

Read more…

Optimizing your Nikola blog for Jupyter notebooks

You've installed Nikola. You've also followed some instructions on how to use Jupyter notebooks as posts. Everything works! But after you make your first Jupyter post, the results are... uninspiring. The input prompts are ugly. If you've written a post designed primarily for the content, not the code, there's no way to turn off showing the code cells. This post will describe how I -- someone with almost no experience with JavaScript, CSS, etc -- made my blog work better with Jupyter notebooks.

Read more…

Connecting neighbourhoods

I've been working on how best to look at connections between the various Mobi bikeshare stations throughout Vancouver. One thing that quickly became obvious was that some visualizations are much too noisy with all stations, but work nicely when I group stations into neighbourhoods. Luckily the city makes geometry files of the various neighbourhoods available in its open data collection, so I was able to use those to group the stations according to the city's official definitions. See the distribution of stations below.

Read more…

How Vancouver Uses Mobi Bikes

Vancouver's Mobi bikeshare system has been up and running for over 2 years now, and with two full summers of activity it's time to take a look at how exactly Vancouverites are using their bikeshare system.

For over a year, I've been collecting real-time data about Mobi bike trips by monitoring public information about the number of bikes at each station and inferring trip activity from changes in those numbers. This has led to some fun uses: live figures updating constantly on my website, a Twitter bot tweeting out daily stats at @VanBikeShareBot, and a few blog posts.

As handy as those live trip estimates are, they're very much estimates and only give us information about how often certain stations are used. Luckily, Mobi has started publishing open system data. This data set gives us a registry of every Mobi bikeshare trip since the beginning of 2017, current to the end of 2018 as of this writing. With this we have access to trip start and endpoints, trip duration and distance, membership type and more. In this post, I'll summarize some of the things I've learned after spending some time looking into this data.

Read more…

Storing metadata in Pandas DataFrames

I came across a problem recently where I wanted to store metadata about each column in a pandas DataFrame. This could be done easily enough with a simple dictionary where each column name is a key, but I was hoping for something that would propagate with a DataFrame as it's sliced, copied, expanded, etc, without having to explicitly keep track of the metadata in my scripts.

My solution is to use the pandas MultiIndex. In this post I'll discuss the pros and cons of this approach, show some examples, and show how I've subclassed the pandas DataFrame to make this approach easier.
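To give a flavour of the idea, here is a minimal sketch in which each column's unit is stored as a second level of the column MultiIndex. The level names, column names and numbers are made up for illustration, and it uses a plain DataFrame rather than the subclass described in the full post.

```python
import pandas as pd

# Column metadata lives in an extra level of the column MultiIndex.
# The level names ("name", "unit") and the data are hypothetical.
columns = pd.MultiIndex.from_tuples(
    [("temperature", "degC"), ("rainfall", "mm"), ("trips", "count")],
    names=["name", "unit"],
)
df = pd.DataFrame([[21.5, 0.0, 3100], [14.0, 12.4, 950]], columns=columns)

# Selecting a subset of columns keeps the metadata level attached.
subset = df[["temperature", "trips"]]
units = dict(subset.columns.to_list())
print(units)  # {'temperature': 'degC', 'trips': 'count'}
```

The trade-off is that every column selection now has to deal with a two-level index, which is the kind of friction a DataFrame subclass can smooth over.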

Read more…

Mapping Google's Location History in Python

If you have an Android phone, Google is tracking your location. If you have "Location History" turned on in your settings, Google is really tracking your location. If you have this setting turned off, Google will still record your location when an app that has the proper permission (like Google Maps or Strava) requests your location. This is creepy and probably bad from a privacy standpoint, but it's cool and fun from a data and visualisation standpoint. In the interest of blog content, I've let Google track my location since I got my new phone ~1.5 years ago and it's now time to look at the data.

Google offers ways to look at your location history through their apps, but it's more fun to do it myself. Through myaccount.google.com, you can create an archive of your personal information being stored by Google. You can pick which services you're interested in, but for now I'm just looking for my location history data. It takes some time for Google to create the archive and make it available for download.
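Once the archive is downloaded, reading it might look something like the sketch below. It assumes the location history export is a JSON file with a top-level "locations" list whose records carry "timestampMs", "latitudeE7" and "longitudeE7" fields. That layout is an assumption about the export format and may not match what Google currently produces, and the sample record is made up.

```python
import json
from datetime import datetime, timezone

# Made-up sample in the assumed shape of a location history export.
SAMPLE = (
    '{"locations": [{"timestampMs": "1514764800000", '
    '"latitudeE7": 492827000, "longitudeE7": -1231207000}]}'
)


def parse_locations(raw_json):
    """Yield (UTC datetime, latitude, longitude) tuples from an export string."""
    for record in json.loads(raw_json)["locations"]:
        seconds = int(record["timestampMs"]) / 1000
        when = datetime.fromtimestamp(seconds, tz=timezone.utc)
        # Coordinates are assumed to be stored as integers scaled by 1e7
        yield when, record["latitudeE7"] / 1e7, record["longitudeE7"] / 1e7


points = list(parse_locations(SAMPLE))
print(points[0])
```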

Read more…

What really affects bikeshare use in Vancouver?

City staff recently reported to Council about the status of the Mobi bikeshare system. You can see the slideshow for yourself here. Results generally look positive for the future of bikeshare in Vancouver, but a comment on one slide stuck out for me: "Temperature influences ridership more than precipitation".

Is this really true? In my experience, biking on a cool, dry day is much more enjoyable than biking on a warm rainy day. There was no clue in the council report on how they came to this conclusion, but I suspect they used the eyeball test. But the cooler months in Vancouver are also the rainy months, so a more careful analysis is needed. I'd recently seen an example in Jake Vanderplas' Python Data Science Handbook that looked at the factors influencing bike ridership in Seattle, so I decided to do a similar thing for Vancouver.

I've described a bit more about how I collected the Mobi trip data in a previous post. While the data is unofficial and has some clear sources of error, it should be reliable enough to look at usage trends. For weather data, I wrote a small scraper to grab historical Vancouver weather from weather.gc.ca. All the code I used for this post is available on my GitHub page.

First, let's take a zoomed out look at bike usage from late April to early November 2017. Daily highs and rainfall are plotted on the same scale, in degrees Celsius and millimeters of rain respectively. Weekends are highlighted by grey bars. The first thing we see is the obvious broad trend across the seasons, matching up with the temperature trend. This is probably what city staff noticed. Next, we see that on days with sharp drop-offs from the broader trend, there's almost always some rainfall. So far so good!

(Days with missing data are days I had computer downtime before I moved my scraper to a cloud server.)

A few things to note. Weather data is per day, and for just one Vancouver weather station. There are certainly days where it pours overnight and the day is clear, or it rains more in one part of the city than another. That said, let's take a closer look at a few weeks.

Here are two consecutive weekends where there was substantial rain on Sunday but none on Saturday. The drop-off is clear on both weekends. But later in the second week, there is a day with much less rainfall that has almost the same drop-off in number of trips.

Here's another three week stretch. Again, days with rain clearly show reduced Mobi usage. But usage also follows the temperature line! How much of the variation in bike usage is due to temperature, and how much is due to rain?

First let's look at temperature and rainfall separately.

The relationship between temperature and bike share trips is strong and exactly what you'd expect. More people ride on warm days! I've coloured the data by rainfall to see if there are any interesting outliers, but the rainy days are all well within the trend. We had pretty great weather all summer this year, so there are no examples of really rainy days with warm temperatures.

Rainfall also shows a clear relationship with daily trips. But it's not linear like temperature. There's a band of zero rain days that correlate with temperature, then a linear segment as ridership falls off with increased rainfall, then ridership hits a baseline below which it doesn't decrease. Apparently regardless of the amount of rain there's something like 600-800 users who will take out a bike no matter what. Cool!

So, we need to make a model that incorporates both temperature and rainfall to try to separate their effects. But to be as accurate as possible, we should include any other prominent factors. I showed in an earlier post that over the course of a day, weekdays and weekends show different ridership patterns. But it turns out there's no obvious difference in the total number of trips.

If we're going to think about weekday vs weekend, let's just include each day of the week as a separate factor. Stat holiday vs not holiday should also be included. Since there's such a dramatic difference between days with any rain and days with no rain, let's include "dry" days as a factor. The last factor I'll include is hours of daylight -- when it gets dark before 5pm, it's hard to say whether it's the temperature or the darkness that has more of an effect on someone's decision to ride a Mobi.

Trips ~ Temperature + Rainfall + Dry + Holiday + Daylight + Monday + Tuesday + Wednesday + Thursday + Friday + Saturday + Sunday
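To make the formula concrete, here is a minimal sketch of how one day's observations could be turned into a row of regressors. The column names are borrowed from the regression output further down. The function name, the holiday set and the input values are hypothetical, and "dry" is taken to mean exactly 0 mm of recorded rain.

```python
from datetime import date

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def design_row(day, max_temp, rain_mm, daylight_hrs, holidays=frozenset()):
    """Build one row of regressors for the formula above as a dict."""
    row = {
        "Max Temp": max_temp,
        "Total Rainmm": rain_mm,
        "Dry": 1 if rain_mm == 0 else 0,
        "Holiday": 1 if day in holidays else 0,
        "daylight_hrs": daylight_hrs,
    }
    # One indicator per day of the week; exactly one of them is 1.
    for index, name in enumerate(DAY_NAMES):
        row[name] = 1 if day.weekday() == index else 0
    return row


# Hypothetical inputs: Monday, August 7, 2017, treated as a dry holiday.
row = design_row(date(2017, 8, 7), max_temp=27.0, rain_mm=0.0,
                 daylight_hrs=14.7, holidays={date(2017, 8, 7)})
print(row)
```

Because all seven day-of-week indicators are included, they play the role of the intercept and no separate constant term is needed.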

To fit the model, I'll use the OLS (ordinary least squares) class from the statsmodels Python package.

                            OLS Regression Results
==============================================================================
Dep. Variable:                   Trips   R-squared:                       0.812
Model:                             OLS   Adj. R-squared:                  0.801
Method:                  Least Squares   F-statistic:                     69.65
Date:                 Tue, 21 Nov 2017   Prob (F-statistic):           2.49e-58
Time:                         13:57:32   Log-Likelihood:                -1376.9
No. Observations:                  189   AIC:                             2778.
Df Residuals:                      177   BIC:                             2817.
Df Model:                           11
Covariance Type:             nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Max Temp       117.3302      8.079     14.522      0.000     101.386     133.274
Total Rainmm   -29.8730      6.715     -4.449      0.000     -43.124     -16.622
Mon           -523.1535      209.4     -2.498      0.013    -936.401    -109.906
Tue           -523.9437      206.4     -2.537      0.012    -931.439    -116.448
Wed           -497.6093      209.4     -2.376      0.019    -910.967     -84.252
Thu           -429.2738      213.2     -2.013      0.046    -850.082      -8.465
Fri           -368.8536      207.5     -1.777      0.077    -778.440      40.733
Sat           -487.4666      206.5     -2.360      0.019    -895.072     -79.861
Sun           -755.6349      207.6     -3.640      0.000   -1165.364    -345.906
Dry            362.8911     75.665      4.796      0.000     213.569     512.214
Holiday       -277.1649      181.6     -1.526      0.129    -635.701      81.371
daylight_hrs    37.2471     15.672      2.377      0.019       6.319      68.175
==============================================================================
Omnibus:                         9.879   Durbin-Watson:                   1.087
Prob(Omnibus):                   0.007   Jarque-Bera (JB):               20.264
Skew:                            0.116   Prob(JB):                     3.98e-05
Kurtosis:                        4.587   Cond. No.                         472.
==============================================================================

We can take a few things away from this table. The adjusted R-squared value is 0.801, which means our model explains about 80% of the variance in the data. Pretty good! We can also say that, all other things being equal, there will be
  • 117 more trips for each increase of one degree Celsius
  • 30 fewer trips for each additional 1 mm of rain
  • 277 fewer trips on holidays versus equivalent non-holidays
  • 37 more trips for each additional hour of daylight
The coefficients associated with the days of the week can be interpreted as the intercept for each day of the week, in other words the number of trips expected on that day if all other factors are zero. Taken literally this is nonsense, since we can't have fewer than zero trips. But the daily high temperatures for the dates I have data for are all well above zero, so the model doesn't have any data for temperatures near zero. More on this later.
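As a worked example of reading these coefficients, the sketch below plugs them into the linear model by hand. The coefficients are copied from the table. The function name and the two example days are hypothetical inputs, not days from the data set.

```python
# Coefficients copied from the regression table above.
COEF = {
    "Max Temp": 117.3302, "Total Rainmm": -29.8730, "Dry": 362.8911,
    "Holiday": -277.1649, "daylight_hrs": 37.2471,
    "Mon": -523.1535, "Tue": -523.9437, "Wed": -497.6093, "Thu": -429.2738,
    "Fri": -368.8536, "Sat": -487.4666, "Sun": -755.6349,
}


def predict_trips(day_name, max_temp, rain_mm, daylight_hrs, holiday=False):
    """Evaluate the fitted linear model for one day."""
    return (
        COEF[day_name]
        + COEF["Max Temp"] * max_temp
        + COEF["Total Rainmm"] * rain_mm
        + COEF["Dry"] * (1 if rain_mm == 0 else 0)
        + COEF["Holiday"] * (1 if holiday else 0)
        + COEF["daylight_hrs"] * daylight_hrs
    )


# Hypothetical dry summer Tuesday: 20 degrees, 16 hours of daylight
summer = predict_trips("Tue", max_temp=20, rain_mm=0, daylight_hrs=16)
# Hypothetical wet winter Tuesday: 2 degrees, 10 mm of rain, 8 hours of daylight
winter = predict_trips("Tue", max_temp=2, rain_mm=10, daylight_hrs=8)
print(round(summer), round(winter))  # 2782 -290
```

The negative winter prediction shows the same problem as the day-of-week intercepts: a purely linear model has nothing stopping it from dropping below zero once it is pushed outside the range of the data it was fitted on.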

Given our model, we can plot the expected number of trips for each day on top of the measured number of trips.

The model nails the macroscopic structure, driven by temperature and daylight, and does pretty well with the day-to-day variations which are probably more driven by rainfall and the day of the week.

There are a few big misses, though. Let's look at the days where our model misses the mark by over 1000 trips:

On July 11th, a Tuesday, our model expects a normal summer day but the counted number of trips drops off precipitously. The weather data for that day says 20 degrees and no rain, so I'm not sure what's going on. I've looked at my source data and don't see anything out of the ordinary, but I suspect there was a data acquisition issue that day.

On August 12th, a Saturday, the model understates the number of riders by a large margin. There was 0.2 mm of rain measured on this day, so the model is treating it as a rainy day, but perhaps it only rained a bit overnight and was clear the rest of the day.

Finally, on October 12th, a Thursday, the model again underestimates the number of trips, expecting only ~200 trips when in reality there were over a thousand! There was 35 mm of rain measured that day, so even though the model expects the number of trips to keep decreasing linearly with rain, in practice if someone is going to bike in 20 mm of rain they're probably also going to bike in 30 mm of rain. Our model doesn't account for this.

There are some other potentially nonlinear effects that we're not including here. During the winter in Vancouver, the coldest days are often clear and sunny. I wonder if as we get into December and January, the temperature trend might reverse and we'll see more trips on the coldest days. If that's the case, we may need a more complex model to really describe how weather affects bike share users.

All the source code used for data acquisition and analysis in this post is available on my github page.

To see more posts like this, follow me on twitter @MikeJarrett_.

Mobi station activity

I finally got around to learning how to map data on to maps with Cartopy, so here's some quick maps of Mobi bikeshare station activity.

First, an animation of station activity during a random summer day. The red-blue spectrum represents whether more bikes were taken or returned at a given station, and the brightness represents total station activity during each hour. I could take the time resolution lower than an hour, but I doubt the data is very meaningful at that level.

[video width="704" height="528" mp4="http://mikejarrett.ca/blog/wp-content/uploads/movie_2017-08-18.mp4"][/video]

There's actually less pattern to this than I expected. I thought that in the morning you'd see more bikes being taken from the west end and south False Creek and returned downtown, and vice versa in the afternoon. But I can't really make out that pattern visually.

I've also pulled out total station activity during the time I've been collecting this data, June through October 2017. I've separated it by total bikes taken and total bikes returned. A couple things to note about these images: many of these stations were not active for the whole time period, and some stations have been moved around. I've made no effort to account for this; this is simply the raw usage at each location, so the downtown

The similarity in these maps is striking. Checking the raw data, I'm seeing incredibly similar numbers of bikes being taken and returned at each station. This means either that in aggregate people use Mobis for two-way trips much more often than I expected; that one-way trips are cancelling each other out; or that Mobi is rebalancing the stations to such a degree that any unevenness is being masked*. I hope to look more into whether I can spot artificial station balancing from my data soon, but we may have to wait for official data from Mobi to get around this.

*There's also the possibility that my data is bad, but let's ignore that for now

Instead of just looking at activity, I tried to quantify whether there are different activity patterns at different stations. Like last week, I performed a principal component analysis (PCA), but this time with bike activity for each hour in the columns and each station as a row. I then plotted the top two components, which explain the most variance in the data.
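For readers unfamiliar with the technique, here is a minimal sketch of PCA on a stations-by-hours matrix using a plain NumPy singular value decomposition. It is not necessarily the library or the preprocessing used for the figures in this post, and the activity matrix is random placeholder data rather than Mobi counts.

```python
import numpy as np


def top_two_components(activity):
    """Project each row (station) onto the first two principal components."""
    centred = activity - activity.mean(axis=0)  # centre each hourly column
    # Rows of vt are the principal axes, ordered by explained variance
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    scores = centred @ vt[:2].T
    explained = singular_values[:2] ** 2 / np.sum(singular_values ** 2)
    return scores, explained


# Placeholder data: 20 hypothetical stations by 24 hourly activity counts
rng = np.random.default_rng(0)
activity = rng.poisson(lam=5.0, size=(20, 24)).astype(float)
scores, explained = top_two_components(activity)
print(scores.shape, explained)
```

Each row of `scores` gives one station's X and Y position in a scatter plot like the one described here, and `explained` is the fraction of the total variance captured by each of the two components.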

Like last week, much of the difference in station activity is explained by the total number of trips at each station, here represented on the X axis. There is a single main group of stations with a negative slope, but some outliers that are worth looking at. There are a few stations with higher Y values than expected.

These five stations are all Stanley Park stations. There are another four stations that might be slight outliers.

These are Anderson & 2nd (Granville Island); Aquatic Centre; Coal Harbour Community Centre; and Davie & Beach. All seawall stations at major destinations. So all the outlier stations are stations that we wouldn't expect to show regular commuter patterns, but more tourist-style activity.

I was hoping to see different clusters to represent residential area stations vs employment area stations, but these don't show up. That's not terribly surprising, since the Mobi stations cover an area of the city where there is fairly dense residential development almost everywhere. This fits with our maps of station activity, where we saw that there was no major difference between bikes taken and bikes returned at each station.

All the source code used for data acquisition and analysis in this post is available on my github page.

To see more posts like this, follow me on twitter @mikejarrett_.