Mobi Station Networks
I've started wondering if there are any network effects in Mobi trip data. I don't know much about network analysis, but luckily there are python packages to do the work for me: NetworkX provides general network analysis tools and basic plotting, and Community provides tools for determining clusters in a network. In this post, I'll use these tools to investigate how stations cluster together based on the frequency of trips between pairs of stations
Data prep¶
Data for this post comes from Mobi Bikes' official system data. I've done some minimal cleaning to consolidate the data and drop spurious records. To make the data a bit more manageable, I'll only look at the summer of 2019, May through August (inclusive).
import mobisys
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import community
import numpy as np
from cartopy.io.img_tiles import MapboxStyleTiles,MapboxTiles, GoogleTiles
import cartopy.crs as ccrs
from credentials import MAPBOX_TOKEN
from shapely.geometry import Point
import geopandas
%matplotlib notebook
fulldf = pd.read_csv('/home/msj/mobi-dash/data/Mobi_System_Data_Prepped_full.csv',index_col=0,parse_dates=True)
df = fulldf['2019-05':'2019-08']
Next, I sum up the trips based on the Departure Station and Return Station. This gives us a table with one record for each pair of stations, with the number of trips taken between those stations.
cdf = df.groupby(['Departure station','Return station']).size()
cdf = cdf.reset_index()
cdf.columns = ['Departure station','Return station','trips']
cdf = cdf.sort_values('trips',ascending=False)
cdf.head()
Build the network¶
Now it's time to make the network. Luckily, NetworkX provides a convenience function to create a network object froma pandas dataframe. I just need to specify that the Departure station
and Return station
correspond to the nodes, and trips
corresponds to the edges. In the interest of processing time, I'll only use connections that have had at least 10 trips between nodes and drop the rest.
G=nx.from_pandas_edgelist(cdf[cdf.trips>10], 'Departure station', 'Return station', 'trips')
We can try drawing this network, but right now it's not going to show us anything interesting. Most stations have connections to most other stations, so the network just looks like a mass of well-connected nodes.
f,ax = plt.subplots()
nx.draw(G,alpha=0.8, node_size=100)
Partition the network¶
One way to look for internal structure in the network is the Louvain method for community detection, which is implemented by the python package community
as the function best_partition()
. According to the docs:
Compute the partition of the graph nodes which maximises the modularity
(or try..) using the Louvain heuristices
This is the partition of highest modularity, i.e. the highest partition
of the dendrogram generated by the Louvain algorithm.
This function creates a dictionary that labels each node as part of a community according to the Louvain algorithm.
partition = community.best_partition(G,weight='trips')
set(partition.values())
We see that the algorithm has divided our network into 5 communities. Let's visualize these communities on a network graph. I've added some helper functions to position the different communities in clusters. Node that in these visualizations the length of the edge isn't necessarily proportional to the strength of the connection.
#https://stackoverflow.com/questions/43541376/how-to-draw-communities-with-networkx
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
def community_layout(g, partition):
"""
Compute the layout for a modular graph.
Arguments:
----------
g -- networkx.Graph or networkx.DiGraph instance
graph to plot
partition -- dict mapping int node -> int community
graph partitions
Returns:
--------
pos -- dict mapping int node -> (float x, float y)
node positions
"""
pos_communities = _position_communities(g, partition, scale=3.)
pos_nodes = _position_nodes(g, partition, scale=1.)
# combine positions
pos = dict()
for node in g.nodes():
pos[node] = pos_communities[node] + pos_nodes[node]
return pos
def _position_communities(g, partition, **kwargs):
# create a weighted graph, in which each node corresponds to a community,
# and each edge weight to the number of edges between communities
between_community_edges = _find_between_community_edges(g, partition)
communities = set(partition.values())
hypergraph = nx.DiGraph()
hypergraph.add_nodes_from(communities)
for (ci, cj), edges in between_community_edges.items():
hypergraph.add_edge(ci, cj, weight=len(edges))
# find layout for communities
pos_communities = nx.spring_layout(hypergraph, **kwargs)
# set node positions to position of community
pos = dict()
for node, community in partition.items():
pos[node] = pos_communities[community]
return pos
def _find_between_community_edges(g, partition):
edges = dict()
for (ni, nj) in g.edges():
ci = partition[ni]
cj = partition[nj]
if ci != cj:
try:
edges[(ci, cj)] += [(ni, nj)]
except KeyError:
edges[(ci, cj)] = [(ni, nj)]
return edges
def _position_nodes(g, partition, **kwargs):
"""
Positions nodes within communities.
"""
communities = dict()
for node, community in partition.items():
try:
communities[community] += [node]
except KeyError:
communities[community] = [node]
pos = dict()
for ci, nodes in communities.items():
subgraph = g.subgraph(nodes)
pos_subgraph = nx.spring_layout(subgraph, **kwargs)
pos.update(pos_subgraph)
return pos
pos = community_layout(G, partition)
colors = [partition[x] for x in G.nodes]
f,ax = plt.subplots()
nx.draw_networkx_nodes(G, pos, node_color=colors,node_size=50,alpha=1,cmap='jet')
nx.draw_networkx_edges(G,pos,alpha=0.1)
This is all too abstract. To make sense of these communities, let's plot the stations on a map and color them according to which community they've been sorted in to. First, I need a list of stations with their lat/long coordinates. For this I recommend my bikedata package for querying bikeshare APIs (see my previous post for details).
import bikedata as bd
bs = bd.BikeShareSystem('mobi_bikes')
sdf = bs.query_station_info()
sdf.head()
For mapping, I'll convert the pandas dataframe into a geopandas dataframe to better handle coordinate projections, and I'll add the community label created when we partitioned the network.
sdf['geometry'] = [Point(xy) for xy in zip(sdf.lon, sdf.lat)]
sdf = geopandas.GeoDataFrame(sdf)
sdf.crs = {'init' :'epsg:4326'}
sdf = sdf.to_crs({'init': 'epsg:3857'})
sdf['partition'] = sdf.name.map(partition)
sdf.head()
lon_min = -123.185
lon_max = -123.056
lat_min = 49.245
lat_max = 49.315
f,ax = plt.subplots(subplot_kw={'projection': ccrs.epsg(3857)},figsize=(7,7))
tile = MapboxStyleTiles(MAPBOX_TOKEN,'mjarrett','ck3gggjkl03gp1cpfm927yo7c')
extent = [lon_min,lon_max,lat_min,lat_max]
ax.set_extent(extent)
ax.add_image(tile,15,interpolation='spline36')
sdf.plot(ax=ax,column='partition', cmap='jet')
ax.outline_patch.set_visible(False)
ax.background_patch.set_visible(False)
The meanings behind the community partitions are now obvious. I'll name the five communities as:
- Broadway corridor
- East Van
- False Creek
- City Center/West End
- Stanley Park/Seawall
I think most people who bike in Vancouver will intuitively agree that this is a sensible partition. There are a few outliers, like the station at 4th and Burrard getting grouped in with the City Center stations. But the partitioning algorithm has a bit of randomness built in and each time it's run it gives slightly different results. It's not surprising that stations at the base of the Burrard Bridge are connected to both downtown by commuter cyclists and False Creek by pleasure cyclists.
Comments
Comments powered by Disqus