Storing metadata in Pandas DataFrames
I came across a problem recently where I wanted to store metadata about each column in a pandas DataFrame. This could be done easily enough with a simple dictionary where each column name is a key, but I was hoping for something that would propogate with a DataFrame as it's sliced, copied, expanded, etc, without having to explicitely keep track of the metadata in my scripts.
My solution is to use the pandas MultiIndex. In this post I'll discuss the pros and cons of this approach, show some examples, a show how I've subclassed the pandas DataFrame to make this approach easier.
The Pandas MultiIndex¶
In my usage case, I want each column to carry three pieces of metadata: a field name, a field width, and a field description. The field name is a short unique label, the width is an integer that describes the number of characters the field will occupy when saved to a flat file, and the description is a longer string explaining the field.
To start, I'll build a simple dataframe with these attributes.
import pandas as pd
df = pd.DataFrame({('A',1,'desc A'):[1,2,3],('B',2,'desc B'):[4,5,6]})
df.columns.names=['NAME','LENGTH','DESCRIPTION']
df
Now I have a DataFrame with 2 columns, with three levels to each column name: NAME
, LENGTH
and DESCRIPTION
. Each column can be uniquely identified by a tuple with the three levels:
df[('A',1,'desc A')]
The above returns a pandas Series, the 1D foundation of the DataFrame.
I can also call a column with only the top level label:
df['A']
This, however, returns a DataFrame, and that's a problem for me. There are methods available to Series that aren't available to single-column DataFrames, so this breaks the natural Pandas workflow. I'd like to refer to a column only by the NAME
level of the MultiIndex and have it behave as if the column labels only had a single level. This isn't at all the usage that the pandas designers expected when they defined the MultiIndex.
With plain pandas I'm stuck now: to refer to a column I'd need to pass the tuple with the full metadata, which is too anoying. I could also write a helper function that returns a tuple if given a NAME
string, but this is still a lot of extra typing and makes my code basically unreadable to someone who is used to pandas but doesn't understand my personal workflow.
Subclassing¶
The answer is to subclass the pandas DataFrame and override the indexing operator so that in certain cases a dataframe with a MultiIndex can have columns called only by a single level label.
In order not to loose the normal MultiIndex functionality, I require that one of the column levels is called NAME
and that labels on this level are unique. If there is a MultiIndex without a NAME
level, standard Pandas behaviour is defaulted to.
class MetaDataFrame(pd.DataFrame):
@property # Required so that our new class returns an instance of itself
def _constructor(self):
return MetaDataFrame
# __getitem__ is the special method called by the indexing operator
def __getitem__(self, key):
# Check whether the current instance is using a MultiIndex,
# there is a level called NAME, and the input key is a string
# (not a list)
is_mi_columns = isinstance(self.columns, pd.MultiIndex)
if is_mi_columns and 'NAME' in self.columns.names and type(key)==str:
# get the full tuple corresponding to the specified key
tupkey = [x for x in self.columns if x[0]==key]
# Check that the key is unique. If it is, set 'key' to the tuple
if len(tupkey) == 1:
key = tupkey[0]
# If it's not unique, raise and error
elif len(tupkey) > 1:
raise ValueError('NAME level column labels must be unique')
# Now if the above conditions are met, 'key' is the tuple, or
# else 'key' is unchanged. Either way, we pass the key
# to the __getitem__ method of the parent class
return super().__getitem__(key)
Now let's make the same dataframe as before, but using the new MetaDataFrame class.
mdf = MetaDataFrame({('A',1,'desc A'):[1,2,3],('B',2,'desc B'):[4,5,6]})
mdf.columns.names=['NAME','LENGTH','DESCRIPTION']
mdf
Calling the dataframe by a NAME level label now returns a Series, as if we weren't using a MultiIndex.
mdf['A']
If I pass anything other than a string corresponding to a NAME level label, it defaults to the normal DataFrame behavious.
mdf[['A','B']]
mdf[('A',1)]
try:
mdf['desc A']
except KeyError:
print('Key not found')
There are other benefits to these changes to __getitem__
. For example, when performing a merge between two dataframes I only need to specify the NAME level labels to merge on, instead of the full tuple as you would with the default class.
mdf2 = MetaDataFrame({('A',1,'desc A'):[1,2,3],('C',3,'desc C'):[7,8,9]})
mdf2.columns.names=['NAME','LENGTH','DESCRIPTION']
pd.merge(mdf,mdf2,left_on='A',right_on='A',how='inner')
Subclassing further...¶
In my useage case, the metadata is useless to me until I'm saving the file. I want to clean up how the dataframes are displayed so that all that extra information isn't printed out every time, so I'll override the __unicode__
method. (This applies to Python3. In Python2, you want the __str__
method). Here I basically tell the dataframe that in my specific usage case (MultiIndex with a NAME level) to display a version of the dataframe with a single level column index.
class MetaDataFrame(pd.DataFrame):
@property
def _constructor(self):
return MetaDataFrame
def __getitem__(self, key):
is_mi_columns = isinstance(self.columns, pd.MultiIndex)
if is_mi_columns and 'NAME' in self.columns.names and type(key)==str:
tupkey = [x for x in self.columns if x[0]==key]
if len(tupkey) == 1:
key = tupkey[0]
elif len(tupkey) > 1:
raise ValueError('NAME level column labels must be unique')
return super().__getitem__(key)
def __unicode__(self):
is_mi_columns = isinstance(self.columns, pd.MultiIndex)
if is_mi_columns and 'NAME' in self.columns.names:
newdf = self.copy()
newdf.columns = self.columns.get_level_values('NAME')
return newdf.__unicode__()
return super().__unicode__()
def _repr_html_(self):
is_mi_columns = isinstance(self.columns, pd.MultiIndex)
if is_mi_columns and 'NAME' in self.columns.names:
newdf = self.copy()
newdf.columns = self.columns.get_level_values('NAME')
return newdf._repr_html_()
return super()._repr_html_()
mdf3 = MetaDataFrame({('A',1,'desc A'):[1,2,3],('B',2,'desc B'):[4,5,6]})
mdf3.columns.names=['NAME','LENGTH','DESCRIPTION']
mdf3
Note that I do the same for both the __unicode__
and _repr_html_
methods. The former is the case I'm most interested since I'm usually working in a terminal, while the later defines how to format a DataFrame to html which is primarily used for Jupyter notebooks (like this blogpost). There's lots of great information on further formating DataFrames on the Pandas website.
At this point, you can write functions to generate a MetaDataFrame, manipulate it with all the great Pandas functionality, and always have access to your metadata. In my case, I use the metadata for formatting the text file that I save my DataFrame to as well as generating a layout file that describes the fields.
Comments
Comments powered by Disqus