Mobi station activity

I finally got around to learning how to plot data on maps with Cartopy, so here are some quick maps of Mobi bikeshare station activity.

First, an animation of station activity during a random summer day. The red-blue spectrum represents whether more bikes were taken or returned at a given station, and the brightness represents total station activity during each hour. I could use a finer time resolution than an hour, but I doubt the data is very meaningful at that level.
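
Here's roughly what one frame of that animation looks like in code. This is a minimal sketch, not the actual animation code: the `stations` DataFrame and its column names are assumptions, and marker size stands in for brightness.

```python
# One frame of station activity on a Cartopy map (sketch).
# 'stations' is a hypothetical DataFrame with columns: lat, lon, taken, returned.
import matplotlib.pyplot as plt
import cartopy.crs as ccrs

fig = plt.figure(figsize=(8, 6))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.set_extent([-123.20, -123.02, 49.25, 49.32])  # roughly central Vancouver

balance = stations['returned'] - stations['taken']   # red-blue spectrum
activity = stations['returned'] + stations['taken']  # total hourly activity

ax.scatter(stations['lon'], stations['lat'],
           c=balance, cmap='coolwarm', s=activity * 10,
           transform=ccrs.PlateCarree())
plt.show()
```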

[video width="704" height="528" mp4="http://mikejarrett.ca/blog/wp-content/uploads/movie_2017-08-18.mp4"][/video]

There's actually less pattern to this than I expected. I thought that in the morning you'd see more bikes being taken from the west end and south False Creek and returned downtown, and vice versa in the afternoon. But I can't really make out that pattern visually.

I've also pulled out total station activity during the time I've been collecting this data, June through October 2017, separated into total bikes taken and total bikes returned. A couple of things to note about these images: many of these stations were not active for the whole time period, and some stations have been moved around. I've made no effort to account for this; these maps simply show the raw usage at each location.

The similarity in these maps is striking. Checking the raw data, I'm seeing incredibly similar numbers of bikes being taken and returned at each station. This means one of three things: on aggregate, people use Mobis for two-way trips much more often than I expected; one-way trips are cancelling each other out; or Mobi is rebalancing the stations to a degree that masks any unevenness.* I hope to look into whether I can spot artificial station balancing in my data soon, but we may have to wait for official data from Mobi to get around this.

*There's also the possibility that my data is bad, but let's ignore that for now.

Instead of just looking at activity, I tried to quantify whether there are different activity patterns at different stations. Like last week, I performed a principal component analysis (PCA), but with bike activity for each hour in the columns and each station as a row. I then plotted the two components that explain the most variance in the data.
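
In code, that looks something like the sketch below. The `activity` DataFrame and its column names are hypothetical; the idea is just one row per station, one column per hour.

```python
# Station-level PCA sketch: one row per station, one column per hour.
# 'activity' is a hypothetical DataFrame with columns: station, hour, events.
from sklearn.decomposition import PCA

station_matrix = activity.pivot_table(index='station', columns='hour',
                                      values='events', fill_value=0)

pca = PCA(n_components=2)
components = pca.fit_transform(station_matrix.values)
# components[:, 0] and components[:, 1] are the X and Y values plotted below
```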

Like last week, much of the difference in station activity is explained by the total number of trips at each station, here represented on the X axis. There is a single main group of stations with a negative slope, but some outliers that are worth looking at. There are a few stations with higher Y values than expected.

These five stations are all Stanley Park stations. There are another four stations that might be slight outliers.

These are Anderson & 2nd (Granville Island); Aquatic Centre; Coal Harbour Community Centre; and Davie & Beach: all seawall stations at major destinations. So the outlier stations are all ones we wouldn't expect to show regular commuter patterns, but rather tourist-style activity.

I was hoping to see distinct clusters representing residential-area stations versus employment-area stations, but these don't show up. That's not terribly surprising, since the Mobi stations cover an area of the city with fairly dense residential development almost everywhere. It also fits with our maps of station activity, where we saw no major differences between bikes taken and bikes returned at each station.

All the source code used for data acquisition and analysis in this post is available on my github page.

To see more posts like this, follow me on twitter @mikejarrett_.

Machine learning with Vancouver bike share data

Six months ago I came across Jake VanderPlas' blog post examining Seattle bike commuting habits through bike trip data. I wanted to recreate it for Vancouver, but the city doesn't publish detailed bike trip data, just monthly numbers. For plan B, I looked into Mobi bike share data. But still no published data! Luckily, Mobi does publish an API with the number of bikes at each station. It doesn't give trip information, but it's a start.

All the code needed to recreate this post is available on my github page.

Data Acquisition

The first problem was to read the API and take a guess at station activity. To do this, I query the API every minute. Whenever the bike count at a station changes, this is counted as bikes being taken out or returned. I don't know exactly how often Mobi updates this API, but I'm certainly undercounting activity -- whenever two people return a bike and take a bike within a minute or so of each other I'll miss the activity. But it's good enough for my purposes, and I'm more interested in trends than total numbers anyway.
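
The polling loop looks roughly like this. The URL is a placeholder and the JSON field names and the `record()` helper are made up for illustration; only the diffing logic reflects what I actually do.

```python
# Poll the station API once a minute; treat any change in a station's
# bike count as bikes taken or returned. (Sketch with made-up field names.)
import time
import requests

API_URL = 'https://example.com/mobi/stations'  # placeholder, not the real endpoint

last_counts = {}
while True:
    stations = requests.get(API_URL).json()['stations']
    for s in stations:
        name, bikes = s['name'], s['bikes']
        if name in last_counts:
            delta = bikes - last_counts[name]
            if delta > 0:
                record(name, returned=delta)   # hypothetical logging helper
            elif delta < 0:
                record(name, taken=-delta)
        last_counts[name] = bikes
    time.sleep(60)
```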

I had two main problems querying the API. First, I started by running the query script on my home machine, so any computer downtime meant missing data; there are a few days missing from when I updated my computer. Eventually I migrated to a Google Cloud server, so downtime is no longer an issue, but this introduced the second problem: time zones. I hadn't set a time zone for my new server, so all the times were recorded as UTC, while earlier data had been recorded in local Vancouver time. It took a long time of staring at the data, wondering why it didn't make sense, before I realized what had happened, but luckily it was an easy fix in Pandas.
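
For the record, the fix was a one-liner along these lines (assuming a DataFrame with a DatetimeIndex recorded in UTC):

```python
# Localize the UTC-recorded timestamps and convert to Vancouver time
df.index = df.index.tz_localize('UTC').tz_convert('America/Vancouver')
```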

Analysis

Our long stretch of good weather this summer is visible in the data. Usage was fairly consistent over July and August, and began to fall off near the end of September when the weather turned. I'll look more into the relationship between weather and bike usage once I have more off-season data, but for now I'm more interested in zooming in on daily usage patterns.

Looking at a typical week in midsummer, weekdays show a typical commuter pattern, with morning and evening peaks and a midday lull. One thing that jumps out is that the afternoon peak is consistently larger than the morning peak. With bike share, people have the option to take the bus to work in the morning and then pick up a bike after work if they're in the mood. Weekends lose that bimodal distribution and show a single roughly normal distribution centred in the afternoon. On most weekend days and some weekdays, there is a shoulder or very minor peak in the late evening, presumably people heading home from a night out.
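
The weekly view comes from something like the sketch below, where `events` is a hypothetical Series with one entry (value 1) per bike taken or returned, indexed by timestamp.

```python
import matplotlib.pyplot as plt

hourly = events.resample('H').sum()        # total activity per hour
week = hourly['2017-07-17':'2017-07-23']   # a typical midsummer Monday-Sunday
week.plot()
plt.ylabel('Bikes taken or returned per hour')
plt.show()
```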

Looking at the next week, Monday immediately jumps out as showing a weekend pattern instead of a weekday one. That Monday, of course, is the Monday of the August long weekend.

So, by eye we can fairly easily distinguish weekday and weekend travel patterns. How can we train a computer to do the same?

First, I pivoted my data so that each row is a day and each column is the hourly bike activity at a given station (number of columns = number of stations × 24). I decided to keep the station information instead of summing across stations, but both give about the same result. This was provided as input to the principal component analysis (PCA) class of the Scikit-Learn Python package. PCA attempts to reduce the dimensionality of a data set (in our case, the number of columns) while preserving the variance. For visualization, we can plot our data based on the two components that explain the most variance in the data. Each point is a single day, colour-labelled by the total number of trips that day.
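
As a sketch (with hypothetical column names), the pivot and PCA look like this:

```python
# One row per day, one column per (station, hour) pair, then a 2-component PCA.
# 'df' is a hypothetical DataFrame with columns: date, station, hour, events.
from sklearn.decomposition import PCA

daily = df.pivot_table(index='date', columns=['station', 'hour'],
                       values='events', fill_value=0)

pca = PCA(n_components=2)
X = pca.fit_transform(daily.values)  # X[:, 0] and X[:, 1] are the plotted axes
```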

PCA coloured by number of daily trips

It's apparent that the first component (along the X axis) corresponds roughly (but not exactly) to total number of trips. But what does the Y axis represent? To investigate further, we label the data points by day of week.

PCA coloured by day of week

The pattern is immediately clear. Weekdays are clustered at the bottom of our plot, and weekends are clustered at the top. A few outliers jump out. There are 3 Mondays clustered in with the weekend group. These turn out to be the Canada Day, BC Day and Labour Day stat holidays.

PCA with notable Mondays labelled

Finally, I wanted to try unsupervised clustering to see if weekday and weekend clusters are separated enough to be distinguished automatically. For this, I used the GaussianMixture class from Scikit-learn. Here, we try to automatically split our data into a given number of groups, in this case two.
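
The clustering step itself is only a few lines; here's a minimal sketch, reusing the `X` array of PCA components from above:

```python
# Fit a two-component Gaussian mixture to the PCA components and
# colour each day by its predicted group.
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt

gmm = GaussianMixture(n_components=2, random_state=0)
labels = gmm.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='coolwarm')
plt.show()
```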

PCA and unsupervised clustering of June-September bike share usage

Not quite. There is a group of low-volume weekend days in the top right corner that can't be automatically distinguished from weekdays. All these days are in June and September. Maybe with more non-summer data this will resolve itself.

Out of curiosity, I re-ran the PCA and unsupervised clustering with only peak season data (July and August). Here, with a more homogeneous dataset, clustering works much better. In fact, only the first component (plotted along the X axis) is needed to distinguish between usage patterns.

PCA and unsupervised clustering of July and August bike share usage

Bike share usage will obviously decline during Vancouver's wet season, but I'm very interested to see how usage patterns will differ during the lower volume months.

All the source code used for data acquisition and analysis in this post is available on my github page.

To see more posts like this, follow me on twitter @MikeJarrett_.

Datetime axis formatting with Pandas and matplotlib

Pandas' DataFrame.plot() function is handy, but sometimes I run up against edge cases and spend too much time trying to fix them.

In one case recently, I wanted to overlay a line plot on top of a bar plot. Easy, right? Not when your dataframe has a datetime axis. The bar plot and line plot functions format the x-axis date labels differently, and cause chaos when you try to use them on the same axes. None of the usual tick label formatting methods got me back anything usable.

The solution was to take a step back and use the basic matplotlib functions to plot instead of the pandas wrappers. Calling ax.plot() and ax.bar() gives sensible output where df.plot() didn't.
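
Here's a minimal sketch of the workaround, with a hypothetical DataFrame `df` that has a DatetimeIndex and columns 'a' and 'b':

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar(df.index, df['a'], width=20)    # bar width is in days on a datetime axis
ax.plot(df.index, df['b'], color='k')  # shares the same datetime x-axis
fig.autofmt_xdate()                    # consistent, readable date labels
plt.show()
```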

See the below notebook for an example of the problem and solution.

Matplotlib on the web

I've been learning a lot of Matplotlib recently at work and through a course I took on data visualization. I've especially had fun with making interactive plots, and I was curious about whether MPL code could be converted into HTML5 and/or javascript for presentation on the web. I'm a researcher, not a developer, so my main goal was to use the datavis library I already know without having to learn a bunch of javascript.

In my googling, I've found a few solutions to this issue. None are perfect for what I was hoping to do (have Django and MPL work together to produce a nice interactive figure) but I did come across some interesting tools.

The 'webagg' backend for matplotlib

This was the solution I was expecting to work. When you specify the webagg backend after importing matplotlib, running f.show() starts up a mini web server and opens the plot in your default browser. As far as I can tell, the plot has the full functionality you'd get with any of the traditional backends.
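
Getting started looks something like this (a sketch; the backend has to be set before pyplot is imported):

```python
import matplotlib
matplotlib.use('webagg')  # must come before importing pyplot

import matplotlib.pyplot as plt

f, ax = plt.subplots()
ax.plot([1, 2, 3], [4, 2, 5])
plt.show()  # starts a mini web server and opens the plot in your browser
```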

The magic of having full interactivity in the browser relies on the figure staying connected to the python kernel, which is non-trivial to set up on a live website. I haven't found any instructions that a non-expert can follow to set this up on a personal website.

MPLD3

mpld3 is a Python package that converts your matplotlib code into d3.js code. It's great! Instead of f.show(), you can run mpld3.save_html(f, 'test.html') and you have an HTML snippet that can be inserted into a webpage. The plot below is just vanilla matplotlib saved with mpld3.save_html.
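
For reference, the whole export is a sketch like this:

```python
import matplotlib.pyplot as plt
import mpld3

f, ax = plt.subplots()
ax.plot([1, 2, 3], [4, 2, 5])
mpld3.save_html(f, 'test.html')  # writes an HTML/d3.js snippet for a webpage
```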

%CODE3%

The plots look great and you get the default panning/zooming functions you expect in MPL. However, widgets don't work, so you don't get the full interactive experience. mpld3 does have some plugins, so for example you can get tooltips when hovering over plot elements. This is great, but requires some non-MPL code. Below I've added a hover tooltip which displays the label of a line plot when you hover over it. Very neat, but still not quite matplotlib's picker.
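
The plugin hookup is only a couple of extra lines; here's a minimal sketch using mpld3's built-in LineLabelTooltip:

```python
import matplotlib.pyplot as plt
import mpld3
from mpld3 import plugins

f, ax = plt.subplots()
line, = ax.plot([1, 2, 3], [4, 2, 5], label='my line')
plugins.connect(f, plugins.LineLabelTooltip(line, label='my line'))
mpld3.save_html(f, 'tooltip.html')
```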

%CODE4%

The mpld3 website has a tutorial on writing your own plugins, and it looks like there's very little stopping you from recreating any feature you want, as long as you know d3.js. I really appreciate all the work that's gone into mpld3, but it's still missing some pretty important features. For example, you might have noticed the years in the x-axis tick labels have a comma in them, which would be easily fixable in vanilla MPL with ax.xaxis.set_ticklabels(). But mpld3 doesn't (yet?) have support for setting tick labels, so you're stuck with whatever shows up. A small price to pay for the ease of use, but still noticeable.

Bokeh

Bokeh is a Python visualization library that targets web browsers. It looks like it's further along in development than mpld3, but it's not trivially convertible from matplotlib. There is a bokeh.mpl module, but it is no longer maintained and is considered deprecated. I haven't tried Bokeh myself yet, but if I were going to make web visuals a major focus and wanted to stick with Python, this is probably where I'd end up.

plot.ly

plot.ly seems to be the enterprise solution for interactive web plots in Python (and other languages); it funnels users towards hosting plots on its own server and has a subscription model. But they've released their code as open source, and it's possible to use the plotly library to convert your code to JavaScript to add to your own site. It's certainly more fully featured, but it's also substantially heavier than mpld3 -- for the same plot, the HTML snippet is 20,000 characters for plot.ly but 2,000 characters for mpld3. To make the plot below, I took the same code as above and just added:

```python
import plotly.plotly as py
import plotly.tools as tls
import plotly.offline as plo

plotly_fig = tls.mpl_to_plotly(f)
plo.plot(plotly_fig, filename='housing_plotly')
```

%CODE5%

The tooltips happen without any extra code. I haven't spent much time learning the ins and outs of plotly, so I'm not sure how customizable it is. I like that it works out of the box and looks professional, but the default tooltips and toolbar do feel a little busy. Hard to argue with free and easy, though!

For me, the likely answer seems to be that I'll use mpld3 when I want simple and clean web visualizations, and plot.ly when I want a bit more interactivity. If I end up spending more of my time doing this, I'll just learn Bokeh instead, but I'd rather not learn a new library if I don't have to. And I'll keep my fingers crossed that smarter people than me make it easy to host fully interactive MPL plots, connected to the Python kernel, on any Django website.

Update: Jake VanderPlas gave a great talk at PyCon that covers all of this, which I wish I'd found while doing this research.

https://www.youtube.com/watch?v=FytuB8nFHPQ