Slice & Dice Data Analysis using Pandas

Guido Kollerie (@guidok), PyGrunn, May 9th, 2014

$ who am i
    gkoller ttys001 May 09 14:35

$ who am i
    Freelance Software Developer
    Python whenever I can, though I've done my share of Perl, Java & C#
    Living in Amsterdam (for now)

Pandas

What is Pandas?
    A data analysis library for Python that provides rich data structures and functions designed to make working with structured data fast, easy, and expressive.
    and...
    combines the high-performance array-computing features of NumPy with the flexible data manipulation capabilities of spreadsheets and relational databases
    Python for Data Analysis - Wes McKinney

What is NumPy?
    NumPy is an extension to the Python programming language, adding support for large, multidimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays.
    https://en.wikipedia.org/wiki/numpy

Installing Pandas

Installing Pandas in 8 minutes
    $ pyvenv-3.4 env
    $ source env/bin/activate
    $ pip -v install pandas \
        ipython[all] \
        matplotlib

Why IPython?
    IPython Notebooks
    Web-based interactive computational environment
    Combines code execution, text, mathematics, plots and rich media into a single document
    A must-have for Pandas development

Starting IPython
    $ ipython notebook --pylab=inline

IPython Notebook

Pandas Data Structures
    Series
    DataFrame
    Panel (won't cover)

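Common Imports
    import pandas as pd
    from pandas import Series, DataFrame
    # np and randn are injected by --pylab=inline; imported explicitly
    # here so the later slides also run outside that mode
    import numpy as np
    from numpy.random import randn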

Series
    Array-like data structure
    Backed by a NumPy array

Series - Creation
    s = Series(randn(5))

    0    1.523850
    1   -1.013846
    2    0.844459
    3   -0.316547
    4   -0.476972
    dtype: float64

Series - Creation
With a specified index
    s = Series(randn(5), index=list('abcde'))

    a   -0.190127
    b   -1.349079
    c    1.294381
    d    0.045708
    e    1.630447
    dtype: float64

Series - Creation
Using a dictionary
    s = Series(dict(one=1, two=2, three=3, four=4, five=5))

    five     5
    four     4
    one      1
    three    3
    two      2
    dtype: int64

Series - Selection
    s['one']  # as a dictionary
    1

    s[0]  # as a list
    5

    s[:2]  # slice operation
    five    5
    four    4
    dtype: int64

    s[s > 3]  # boolean based indexing
    five    5
    four    4
    dtype: int64
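The selection styles above mix label-based and position-based access through the same square brackets. As a minimal sketch, the same lookups can be written with the explicit `.loc` (label) and `.iloc` (position) accessors, which more recent pandas releases tend to prefer for a label-indexed Series. The explicit index order below is an assumption chosen to match the slide output.

```python
import pandas as pd

# Toy Series mirroring the slide; the index order is set explicitly
# (an assumption) so the positions match the output shown above.
s = pd.Series([5, 4, 1, 3, 2], index=["five", "four", "one", "three", "two"])

by_label = s.loc["one"]    # label-based lookup, like a dictionary
by_position = s.iloc[0]    # position-based lookup, like a list
first_two = s.iloc[:2]     # positional slice
large = s[s > 3]           # boolean mask keeps entries where the test is True

print(by_label, by_position)
print(first_two)
print(large)
```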

Series - Operations
    s.min(), s.max(), s.mean(), s.sum()
    (1, 5, 3.0, 15)

    s * 2  # vector based operations
    five     10
    four      8
    one       2
    three     6
    two       4
    dtype: int64

Series - Operations
User defined
    import string
    s.apply(lambda x: string.ascii_letters[x])

    five     f
    four     e
    one      b
    three    d
    two      c
    dtype: object

Series - Operations
Vector based
    u = Series(randn(5))
    v = Series(randn(7))
    u + v

    0   -1.505072
    1    3.716130
    2   -0.294903
    3   -1.323626
    4   -1.517751
    5         NaN
    6         NaN
    dtype: float64
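The NaN rows above come from index alignment: labels present in only one operand have no partner to add to. A small sketch with fixed values (chosen here for illustration instead of `randn`) makes this reproducible and shows `Series.add` with `fill_value` as one way to treat the missing side as zero.

```python
import pandas as pd

# Fixed values instead of randn so the result is reproducible (illustrative data).
u = pd.Series([1.0, 2.0, 3.0])
v = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

aligned = u + v                  # labels 3 and 4 exist only in v, so they become NaN
filled = u.add(v, fill_value=0)  # missing entries are treated as 0 instead

print(aligned)
print(filled)
```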

DataFrame
    2D array-like
    Labelled rows and columns
    Heterogeneously typed

DF - Creation
    df = DataFrame(dict(foo=[1,2,3,4], bar=[5.,6.,8.,9.]))
    df.dtypes

    bar    float64
    foo      int64
    dtype: object

DF - Creation
With a specified index
    df1 = DataFrame(dict(foo=[1,2,3,4], bar=[5.,6.,8.,9.]),
                    index=['one','two','three','four'])

DF - Creation
Using dictionaries
    df2 = DataFrame(dict(u=u, v=v))  # dict of Series

DF - Creation
From a CSV file
    df3 = pd.read_csv('04. Inschrijvingen wo_tcm33-32296.csv',
                      sep=';', encoding='latin-1')
    df3.head()
    http://data.duo.nl/organisatie/open_onderwijsdata/databestanden/ho/ingeschreven/ingeschrevenen_wo/wo_inschrijvingen.asp
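To try the `sep` and `encoding` arguments without downloading the DUO file, the following sketch reads a tiny semicolon-separated, Latin-1 encoded table from memory. The rows, the institution names and the upper-case header spelling are made up for illustration; only the general shape of the header follows the column list shown later in the talk.

```python
import io
import pandas as pd

# Hypothetical stand-in for the DUO file: two invented rows.
csv_text = (
    "PROVINCIE;INSTELLINGSNAAM ACTUEEL;2012 MAN;2012 VROUW\n"
    "Noord-Holland;Universiteit A;10;12\n"
    "Fryslân;Universiteit B;7;9\n"
)
raw = csv_text.encode("latin-1")  # simulate a Latin-1 encoded file on disk

# sep=';' handles the semicolon delimiter; encoding decodes the non-ASCII 'â'.
df = pd.read_csv(io.BytesIO(raw), sep=";", encoding="latin-1")
print(df.head())
```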

DF - Selection
    df1['bar']  # column selection
    df1.bar     # column selection via attribute

    one      5
    two      6
    three    8
    four     9
    Name: bar, dtype: float64

DF - Selection
    df1.loc['one']  # row selection by label
    df1.iloc[0]     # row selection by integer

    bar    5
    foo    1
    Name: one, dtype: float64

    df1[1:3]  # slice rows

DF - Selection
    df1[df1['bar'] > 6]  # select rows by boolean vector
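The selection slides can be replayed on the `df1` frame defined earlier. This is a small self-contained sketch; the variable names other than `df1` are introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"foo": [1, 2, 3, 4], "bar": [5.0, 6.0, 8.0, 9.0]},
    index=["one", "two", "three", "four"],
)

col = df1["bar"]              # a single column comes back as a Series
row_by_label = df1.loc["one"] # row selected by its index label
row_by_pos = df1.iloc[0]      # the same row selected by position
middle = df1.iloc[1:3]        # positional row slice (rows 'two' and 'three')
big = df1[df1["bar"] > 6]     # boolean mask keeps rows where bar > 6

print(middle)
print(big)
```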

DF - Addition/Deletion
    df1['foobar'] = df1['foo'] + df1['bar']

DF - Addition/Deletion
    df1['zero'] = 0  # assign a scalar value to column
    del df1['foobar']
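Column assignment and `del` modify the frame in place. As an alternative sketch, `DataFrame.assign` and `DataFrame.drop` return new frames and leave the original untouched, which can be convenient in a notebook where cells are re-run. The names `with_cols` and `trimmed` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"foo": [1, 2, 3, 4], "bar": [5.0, 6.0, 8.0, 9.0]})

# assign() returns a copy with the extra columns; df itself is unchanged.
with_cols = df.assign(foobar=df["foo"] + df["bar"], zero=0)

# drop(columns=...) likewise returns a new frame without the named column.
trimmed = with_cols.drop(columns=["foobar"])

print(with_cols)
print(trimmed)
```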

DF - Merging/Joining
    # straight from the pandas documentation
    left = DataFrame({'key1': ['foo', 'foo', 'bar'],
                      'key2': ['one', 'two', 'one'],
                      'lval': [1, 2, 3]})

    right = DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                       'key2': ['one', 'one', 'one', 'two'],
                       'rval': [4, 5, 6, 7]})

DF - Merging/Joining left right

DF - Merging/Joining
    pd.merge(left, right, how='outer')
    pd.merge(left, right, how='inner')
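To see what the two `how` values do with the `left` and `right` frames, the sketch below runs both merges and uses the `indicator` option to mark where each row of the outer join came from. Spelling out `on=` is a choice made here for clarity; without it, `merge` joins on the columns the two frames share.

```python
import pandas as pd

left = pd.DataFrame({"key1": ["foo", "foo", "bar"],
                     "key2": ["one", "two", "one"],
                     "lval": [1, 2, 3]})
right = pd.DataFrame({"key1": ["foo", "foo", "bar", "bar"],
                      "key2": ["one", "one", "one", "two"],
                      "rval": [4, 5, 6, 7]})

# inner: only key pairs present in both frames survive
inner = pd.merge(left, right, on=["key1", "key2"], how="inner")

# outer: unmatched rows from either side are kept and padded with NaN;
# indicator=True adds a '_merge' column naming the source of each row
outer = pd.merge(left, right, on=["key1", "key2"], how="outer", indicator=True)

print(inner)
print(outer)
```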

Some Simple Data Analysis

Number of Students
    wo = df3  # wo -> Wetenschappelijk Onderwijs (academic university education)
    wo.rename(columns=str.lower, inplace=True)
    wo.columns

    Index(['provincie', 'gemeentenummer', 'gemeentenaam',
           'brin nummer actueel', 'instellingsnaam actueel',
           'croho onderdeel', 'croho subonderdeel',
           'opleidingscode actueel', 'opleidingsnaam actueel',
           'opleidingsvorm', 'opleidingsfase actueel',
           '2008 man', '2008 vrouw', '2009 man', '2009 vrouw',
           '2010 man', '2010 vrouw', '2011 man', '2011 vrouw',
           '2012 man', '2012 vrouw'], dtype=object)

Select Columns
    men = [c for c in wo.columns.tolist() if c.endswith('man')]
    women = [c for c in wo.columns.tolist() if c.endswith('vrouw')]

    # croho -> Centraal Register Opleidingen Hoger Onderwijs
    # (Central Register of Higher Education Programmes)
    uni = wo[['instellingsnaam actueel', 'croho onderdeel'] + men + women]

Slice & Dice

Group By & Sum
    # math operations ignore nuisance columns (non-numeric cols)
    uni_size = uni.groupby('instellingsnaam actueel').sum()
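How non-numeric columns are treated by `groupby(...).sum()` has varied between pandas versions, so a cautious sketch asks for numeric columns explicitly with `numeric_only=True`. The frame below is invented stand-in data with the same column names as `uni`; the institution names are placeholders.

```python
import pandas as pd

# Hypothetical miniature of the 'uni' frame (values are made up).
uni = pd.DataFrame({
    "instellingsnaam actueel": ["Universiteit A", "Universiteit A", "Universiteit B"],
    "croho onderdeel": ["economie", "recht", "economie"],
    "2012 man": [10, 5, 7],
    "2012 vrouw": [12, 6, 9],
})

# Sum per institution; numeric_only=True leaves the text column out explicitly.
uni_size = uni.groupby("instellingsnaam actueel").sum(numeric_only=True)
print(uni_size)
```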

(Men + Women) / Year
    uni_size.index.name = 'Instelling'

    uni_size['2008'] = uni_size['2008 man'] + uni_size['2008 vrouw']
    uni_size['2009'] = uni_size['2009 man'] + uni_size['2009 vrouw']
    uni_size['2010'] = uni_size['2010 man'] + uni_size['2010 vrouw']
    uni_size['2011'] = uni_size['2011 man'] + uni_size['2011 vrouw']
    uni_size['2012'] = uni_size['2012 man'] + uni_size['2012 vrouw']

    # select only relevant columns
    uni_size = uni_size[['2008','2009','2010','2011','2012']]
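The five near-identical assignments follow one pattern per year, so they can also be sketched as a loop over the year labels. The data below is invented and covers only two years; `totals` and `years` are names introduced for this example.

```python
import pandas as pd

# Hypothetical per-institution sums (values are made up).
uni_size = pd.DataFrame(
    {"2011 man": [10, 7], "2011 vrouw": [12, 9],
     "2012 man": [11, 8], "2012 vrouw": [13, 10]},
    index=pd.Index(["Universiteit A", "Universiteit B"], name="Instelling"),
)

years = ["2011", "2012"]

# One total column per year: men plus women, built from the column-name pattern.
totals = pd.DataFrame(
    {year: uni_size[f"{year} man"] + uni_size[f"{year} vrouw"] for year in years}
)
print(totals)
```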

Sort
    uni_size.sort(columns='2012', ascending=False).head()
    # newer pandas: uni_size.sort_values(by='2012', ascending=False).head()

Sort
    # axis=0 -> by row, axis=1 -> by column
    uni_size.sort(axis=1, ascending=False)
    # newer pandas: uni_size.sort_index(axis=1, ascending=False)
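`DataFrame.sort` as used on these slides is no longer available in current pandas. A minimal sketch of the two sorts with `sort_values` (by the values in a column) and `sort_index` (by the row or column labels) follows; the numbers are invented.

```python
import pandas as pd

# Hypothetical yearly totals (values are made up).
uni_size = pd.DataFrame(
    {"2011": [22, 16, 30], "2012": [24, 18, 28]},
    index=["Universiteit A", "Universiteit B", "Universiteit C"],
)

# Rows ordered by the 2012 totals, largest first.
by_size = uni_size.sort_values(by="2012", ascending=False)

# Columns ordered by their labels, newest year first.
newest_first = uni_size.sort_index(axis=1, ascending=False)

print(by_size)
print(newest_first)
```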

    uni_size.sort(
        axis=1, ascending=False).sort(
        columns='2012', ascending=False).plot(
        kind='barh', figsize=[10,10])

Quick One

Men/Women Diffs
    opl = wo.groupby('opleidingsnaam actueel').sum()
    opl = opl[['2012 man', '2012 vrouw']]
    opl['diff'] = opl['2012 man'] - opl['2012 vrouw']
    sorted_opl = opl.sort(columns='diff', ascending=False)

    top5_max = sorted_opl[:5]
    top5_min = sorted_opl[-5:]
    top5 = pd.concat([top5_max, top5_min])

    top5['diff'].plot(kind='barh')
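Picking the extremes of the `diff` column can also be sketched with `nlargest` and `nsmallest`, which avoids sorting the whole frame by hand. The programme names and counts below are placeholders, and the example takes the top and bottom two rather than five to keep it small.

```python
import pandas as pd

# Hypothetical programme totals (names and values are made up).
opl = pd.DataFrame(
    {"2012 man": [100, 20, 50, 5], "2012 vrouw": [10, 80, 50, 40]},
    index=["Opleiding A", "Opleiding B", "Opleiding C", "Opleiding D"],
)
opl["diff"] = opl["2012 man"] - opl["2012 vrouw"]

most_male = opl.nlargest(2, "diff")     # largest surplus of men
most_female = opl.nsmallest(2, "diff")  # largest surplus of women, most negative first

# Reverse the second part so the combined frame runs from high to low diff.
extremes = pd.concat([most_male, most_female.iloc[::-1]])
print(extremes["diff"])
```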

CS-like studies
    # str.contains is case-sensitive, so the pattern is capitalised
    # to match names such as 'B Informatica'
    cs_crit = wo['opleidingsnaam actueel'].\
        str.contains('Informatica')
    cs = wo[cs_crit]
    cs['opleidingsnaam actueel'].value_counts()

    B Informatica                      12
    B Technische Informatica            8
    Technische Informatica              6
    Informatica                         4
    B Economie en Informatica           2
    M Informatica                       2
    M Lerarenopleiding Informatica      1
    dtype: int64
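String matching with `str.contains` is case-sensitive by default and yields missing values for missing names. The sketch below shows the `case=False` and `na=False` options on a handful of programme names; the list is a short illustrative sample, not the real data.

```python
import pandas as pd

# Small illustrative sample of programme names, including a missing value.
names = pd.Series(
    ["B Informatica", "M Informatica", "B Technische Informatica",
     "B Rechtsgeleerdheid", None],
    name="opleidingsnaam actueel",
)

# case=False ignores capitalisation; na=False treats missing names as no match.
mask = names.str.contains("informatica", case=False, na=False)
counts = names[mask].value_counts()

print(mask)
print(counts)
```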

Pivot Tables
    pv = pd.pivot_table(wo,
                        values=['2012 man', '2012 vrouw'],
                        rows=['instellingsnaam actueel'],
                        cols=['croho onderdeel'],
                        fill_value=0,
                        aggfunc=np.sum)
    # newer pandas: rows= is spelled index=, cols= is spelled columns=
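With the current keyword names, the same pivot can be sketched on a few invented rows. `index=` takes the place of `rows=` and `columns=` the place of `cols=`; the data and institution names are placeholders.

```python
import pandas as pd

# Hypothetical miniature of the 'wo' frame (values are made up).
wo = pd.DataFrame({
    "instellingsnaam actueel": ["Universiteit A", "Universiteit A",
                                "Universiteit A", "Universiteit B"],
    "croho onderdeel": ["economie", "economie", "recht", "economie"],
    "2012 man": [10, 5, 3, 7],
    "2012 vrouw": [12, 6, 4, 9],
})

pv = pd.pivot_table(
    wo,
    values=["2012 man", "2012 vrouw"],
    index="instellingsnaam actueel",   # one row per institution
    columns="croho onderdeel",         # one column per sector, under each value
    aggfunc="sum",
    fill_value=0,                      # combinations without data become 0
)
print(pv)
```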

Multi-level index/columns Pivot Tables

Stack
    pv.stack(1)

Unstack
Multi-level index
    pv.loc['Universiteit van Amsterdam']

                croho onderdeel
    2012 man    economie                            2650
                gedrag en maatschappij              2727
                gezondheidszorg                     1248
                landbouw en natuurlijke omgeving       0
                natuur                              2152
                onderwijs                            125
                recht                               1573
                sectoroverstijgend                   365
                taal en cultuur                     2795
    2012 vrouw  economie                            1448
                gedrag en maatschappij              5703
                gezondheidszorg                     2077
                landbouw en natuurlijke omgeving       0
                natuur                              1344
                onderwijs                            124
                recht                               2305
                sectoroverstijgend                   497
                taal en cultuur                     4725
    Name: Universiteit van Amsterdam, dtype: int64

Unstack
    uva = pv.loc['Universiteit van Amsterdam'].unstack()

Transpose
    uva.T
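The unstack and transpose steps can be tried on a cut-down version of the Universiteit van Amsterdam row. The four figures are taken from the output shown earlier (economie and recht only); building the Series by hand with `MultiIndex.from_product` is an assumption made to keep the sketch self-contained.

```python
import pandas as pd

# Two sectors from the slide output, indexed by (value column, sector).
idx = pd.MultiIndex.from_product(
    [["2012 man", "2012 vrouw"], ["economie", "recht"]],
    names=[None, "croho onderdeel"],
)
row = pd.Series([2650, 1573, 1448, 2305], index=idx,
                name="Universiteit van Amsterdam")

uva = row.unstack()   # inner index level (sector) moves to the columns
uva_t = uva.T         # transpose: sectors become rows, man/vrouw become columns
back = uva.stack()    # stack moves the columns back into the index

print(uva)
print(uva_t)
```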

Lots more
    Reordering levels within multi-level indices
    Time Series (with up- & downsampling)
    Linear Regression (via statsmodels)

Want to learn more?