I've started wondering if there are any network effects in Mobi trip data. I don't know much about network analysis, but luckily there are python packages to do the work for me: NetworkX provides general network analysis tools and basic plotting, and Community provides tools for determining clusters in a network. In this post, I'll use these tools to investigate how stations cluster together based on the frequency of trips between pairs of stations

Data prep¶

Data for this post comes from Mobi Bikes' official system data. I've done some minimal cleaning to consolidate the data and drop spurious records. To make the data a bit more manageable, I'll only look at the summer of 2019, May through August (inclusive).

In [1]:

import mobisys
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import community
import numpy as np

from cartopy.io.img_tiles import MapboxStyleTiles,MapboxTiles, GoogleTiles
import cartopy.crs as ccrs
from credentials import MAPBOX_TOKEN
from shapely.geometry import Point
import geopandas

In [2]:

%matplotlib notebook

In [3]:

fulldf = pd.read_csv('/home/msj/mobi-dash/data/Mobi_System_Data_Prepped_full.csv',index_col=0,parse_dates=True)

/home/msj/miniconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3049: DtypeWarning: Columns (2) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

In [4]:

df = fulldf['2019-05':'2019-08']

Next, I sum up the trips based on the Departure Station and Return Station. This gives us a table with one record for each pair of stations, with the number of trips taken between those stations.

In [5]:

cdf = df.groupby(['Departure station','Return station']).size()
cdf = cdf.reset_index()
cdf.columns = ['Departure station','Return station','trips']
cdf = cdf.sort_values('trips',ascending=False)

In [6]:

cdf.head()

Out[6]:

	Departure station	Return station	trips
23855	Stanley Park - Information Booth	Stanley Park - Information Booth	3231
24328	Stanley Park - Totem Poles	Stanley Park - Second Beach North	3177
23859	Stanley Park - Information Booth	Stanley Park - Totem Poles	2893
23856	Stanley Park - Information Booth	Stanley Park - Second Beach North	1742
24331	Stanley Park - Totem Poles	Stanley Park - Totem Poles	1557

Build the network¶

Now it's time to make the network. Luckily, NetworkX provides a convenience function to create a network object froma pandas dataframe. I just need to specify that the Departure station and Return station correspond to the nodes, and trips corresponds to the edges. In the interest of processing time, I'll only use connections that have had at least 10 trips between nodes and drop the rest.

In [7]:

G=nx.from_pandas_edgelist(cdf[cdf.trips>10], 'Departure station', 'Return station', 'trips')

We can try drawing this network, but right now it's not going to show us anything interesting. Most stations have connections to most other stations, so the network just looks like a mass of well-connected nodes.

In [8]:

f,ax = plt.subplots()
nx.draw(G,alpha=0.8, node_size=100)

Partition the network¶

One way to look for internal structure in the network is the Louvain method for community detection, which is implemented by the python package community as the function best_partition(). According to the docs:

Compute the partition of the graph nodes which maximises the modularity
(or try..) using the Louvain heuristices

This is the partition of highest modularity, i.e. the highest partition
of the dendrogram generated by the Louvain algorithm.

This function creates a dictionary that labels each node as part of a community according to the Louvain algorithm.

In [9]:

partition = community.best_partition(G,weight='trips')

In [10]:

set(partition.values())

Out[10]:

{0, 1, 2, 3, 4}

We see that the algorithm has divided our network into 5 communities. Let's visualize these communities on a network graph. I've added some helper functions to position the different communities in clusters. Node that in these visualizations the length of the edge isn't necessarily proportional to the strength of the connection.

In [11]:

#https://stackoverflow.com/questions/43541376/how-to-draw-communities-with-networkx

import numpy as np
import matplotlib.pyplot as plt
import networkx as nx

def community_layout(g, partition):
    """
    Compute the layout for a modular graph.


    Arguments:
    ----------
    g -- networkx.Graph or networkx.DiGraph instance
        graph to plot

    partition -- dict mapping int node -> int community
        graph partitions


    Returns:
    --------
    pos -- dict mapping int node -> (float x, float y)
        node positions

    """

    pos_communities = _position_communities(g, partition, scale=3.)

    pos_nodes = _position_nodes(g, partition, scale=1.)

    # combine positions
    pos = dict()
    for node in g.nodes():
        pos[node] = pos_communities[node] + pos_nodes[node]

    return pos

def _position_communities(g, partition, **kwargs):

    # create a weighted graph, in which each node corresponds to a community,
    # and each edge weight to the number of edges between communities
    between_community_edges = _find_between_community_edges(g, partition)

    communities = set(partition.values())
    hypergraph = nx.DiGraph()
    hypergraph.add_nodes_from(communities)
    for (ci, cj), edges in between_community_edges.items():
        hypergraph.add_edge(ci, cj, weight=len(edges))

    # find layout for communities
    pos_communities = nx.spring_layout(hypergraph, **kwargs)

    # set node positions to position of community
    pos = dict()
    for node, community in partition.items():
        pos[node] = pos_communities[community]

    return pos

def _find_between_community_edges(g, partition):

    edges = dict()

    for (ni, nj) in g.edges():
        ci = partition[ni]
        cj = partition[nj]

        if ci != cj:
            try:
                edges[(ci, cj)] += [(ni, nj)]
            except KeyError:
                edges[(ci, cj)] = [(ni, nj)]

    return edges

def _position_nodes(g, partition, **kwargs):
    """
    Positions nodes within communities.
    """

    communities = dict()
    for node, community in partition.items():
        try:
            communities[community] += [node]
        except KeyError:
            communities[community] = [node]

    pos = dict()
    for ci, nodes in communities.items():
        subgraph = g.subgraph(nodes)
        pos_subgraph = nx.spring_layout(subgraph, **kwargs)
        pos.update(pos_subgraph)

    return pos

In [12]:

pos = community_layout(G, partition)
colors = [partition[x] for x in G.nodes]

f,ax = plt.subplots()
nx.draw_networkx_nodes(G, pos, node_color=colors,node_size=50,alpha=1,cmap='jet')
nx.draw_networkx_edges(G,pos,alpha=0.1)

Out[12]:

<matplotlib.collections.LineCollection at 0x7f03cb032ef0>

This is all too abstract. To make sense of these communities, let's plot the stations on a map and color them according to which community they've been sorted in to. First, I need a list of stations with their lat/long coordinates. For this I recommend my bikedata package for querying bikeshare APIs (see my previous post for details).

In [13]:

import bikedata as bd

bs = bd.BikeShareSystem('mobi_bikes')
sdf = bs.query_station_info()

In [14]:

sdf.head()

Out[14]:

	capacity	lat	lon	name	station_id
0	52	49.262487	-123.114397	10th & Cambie	0001
1	28	49.285871	-123.121050	Burrard Station	0002
2	16	49.274566	-123.121817	Yaletown-Roundhouse Station	0004
3	26	49.279764	-123.110154	Dunsmuir & Beatty	0005
4	26	49.266675	-123.115635	Olympic Village Station	0006

For mapping, I'll convert the pandas dataframe into a geopandas dataframe to better handle coordinate projections, and I'll add the community label created when we partitioned the network.

In [15]:

sdf['geometry'] = [Point(xy) for xy in zip(sdf.lon, sdf.lat)]
sdf = geopandas.GeoDataFrame(sdf)
sdf.crs = {'init' :'epsg:4326'}
sdf = sdf.to_crs({'init': 'epsg:3857'})
sdf['partition'] = sdf.name.map(partition)
sdf.head()

Out[15]:

	capacity	lat	lon	name	station_id	geometry	partition
0	52	49.262487	-123.114397	10th & Cambie	0001	POINT (-13705031.98336093 6319517.879010359)	3.0
1	28	49.285871	-123.121050	Burrard Station	0002	POINT (-13705772.59193318 6323507.663575065)	2.0
2	16	49.274566	-123.121817	Yaletown-Roundhouse Station	0004	POINT (-13705857.97398261 6321578.565172686)	2.0
3	26	49.279764	-123.110154	Dunsmuir & Beatty	0005	POINT (-13704559.65476149 6322465.503044237)	2.0
4	26	49.266675	-123.115635	Olympic Village Station	0006	POINT (-13705169.7711981 6320232.349643161)	1.0

In [16]:

lon_min = -123.185
lon_max = -123.056
lat_min = 49.245
lat_max = 49.315

f,ax = plt.subplots(subplot_kw={'projection': ccrs.epsg(3857)},figsize=(7,7))
tile = MapboxStyleTiles(MAPBOX_TOKEN,'mjarrett','ck3gggjkl03gp1cpfm927yo7c')
extent = [lon_min,lon_max,lat_min,lat_max]
ax.set_extent(extent)
ax.add_image(tile,15,interpolation='spline36')

sdf.plot(ax=ax,column='partition', cmap='jet')

ax.outline_patch.set_visible(False)
ax.background_patch.set_visible(False)

The meanings behind the community partitions are now obvious. I'll name the five communities as:

Broadway corridor
East Van
False Creek
City Center/West End
Stanley Park/Seawall

I think most people who bike in Vancouver will intuitively agree that this is a sensible partition. There are a few outliers, like the station at 4th and Burrard getting grouped in with the City Center stations. But the partitioning algorithm has a bit of randomness built in and each time it's run it gives slightly different results. It's not surprising that stations at the base of the Burrard Bridge are connected to both downtown by commuter cyclists and False Creek by pleasure cyclists.

Data prep¶

Build the network¶

Partition the network¶

Comments