Welcome to my Open Notebook

This is an Open Notebook with Selected Content - Delayed. All content is licenced with CC-BY. Find out more Here.


librarians-and-python

I stumbled on these posts about Python and the IPython Notebook by “Data Scientist Training for Librarians”.

I’ve been wanting to learn more Python. I don’t think it’ll be ready for statistical modelling for a while, but I want to be ready when it is. You can get my IPython notebook file for this here: olive.ipynb, but first run the following R code snippet to write the replication dataset ‘olive.csv’ to your home directory.

R Code: get the olive oil dataset

install.packages("pgmm")
require(pgmm)
data(olive)
names(olive) <- tolower(names(olive))
str(olive)
write.csv(olive, "olive.csv", row.names = FALSE)
# actually the home directory might be easier for ipython to access
write.csv(olive, "~/olive.csv", row.names = FALSE)

OK now reproduce the example

I quite like the histograms from the second example. Here is the raw code extracted from the ipynb

Code:

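%pylab inline
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.colors as colors

# df is never defined in the extracted code; load the file written by the R snippet above
df=pd.read_csv(os.path.expanduser('~/olive.csv'))

acidlist=['palmitic', 'palmitoleic', 'stearic', 'oleic', 'linoleic', 'linolenic', 'arachidic', 'eicosenoic']
dfsub=df[acidlist].apply(lambda x: x/100.0) # rescale the acid columns by 1/100
dfsub.head()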
 
rkeys=[1,2,3]
rvals=['South','Sardinia','North']
rmap={e[0]:e[1] for e in zip(rkeys,rvals)}
rmap
 
# Set up the framework for the graphs: one subplot row per acid in acidlist (defined above).
fig, axes=plt.subplots(figsize=(10,20), nrows=len(acidlist), ncols=1)

for ax, acid in zip(axes.flatten(), acidlist): # pairs each subplot with its acid, so no counter is needed
    seriesacid=df[acid] # a Series holding this acid's value for each olive oil

    minmax=[seriesacid.min(), seriesacid.max()] # the x-axis limits are the minimum and maximum values found in the data

    for k,g in df.groupby('region'):  # inner loop: overlay one semi-transparent histogram per region
        style = {'histtype':'stepfilled', 'alpha':0.5, 'label':rmap[k], 'ax':ax}
        g[acid].hist(**style)

    # per-subplot formatting, applied once after all regions are drawn
    ax.set_xlim(minmax)
    ax.set_title(acid)
    ax.grid(False)
    ax.legend() # construct legend from the region labels

fig.tight_layout() # tidy the spacing once every subplot is drawn
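
The region lookup `rmap` in the code above is what turns the numeric region codes into legend labels. As a minimal sketch, the same dictionary can be built with `dict(zip(...))`; the `codes` list here is hypothetical and only stands in for the 'region' column of the data.

```python
rkeys = [1, 2, 3]
rvals = ['South', 'Sardinia', 'North']

# Pair each region code with its name; equivalent to the dict comprehension used above.
rmap = dict(zip(rkeys, rvals))

# Hypothetical region codes, standing in for the 'region' column of the data.
codes = [1, 1, 3, 2]
labels = [rmap[c] for c in codes]
print(labels)
```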

acid.png
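
One thing to watch with overlaid histograms like these: each call to `hist` picks its own bins from that region's values, so the bars for different regions need not line up. A rough sketch of one way around this, assuming it matters for the comparison, is to compute a single set of bin edges over all regions and count each region against it. The data below are synthetic stand-ins, not the olive oil values, and the choice of 20 bins is arbitrary.

```python
import numpy as np

# Synthetic stand-in for one acid column, grouped by region code (not the real olive data).
rng = np.random.default_rng(0)
groups = {
    1: rng.normal(13.0, 1.0, size=200),
    2: rng.normal(11.0, 0.5, size=100),
    3: rng.normal(10.5, 0.8, size=150),
}

# One set of bin edges spanning every group, so the bars line up across regions.
all_values = np.concatenate(list(groups.values()))
edges = np.linspace(all_values.min(), all_values.max(), 21)  # 20 equal-width bins

# Count each region against the same edges.
counts = {k: np.histogram(v, bins=edges)[0] for k, v in groups.items()}
```

In the plotting loop, the same idea would amount to passing `bins=edges` in the `style` dictionary, with `edges` computed from `seriesacid`.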

Conclusions

  • I am not sure how to do the transparency in R, but the rest of it would make more sense to me in R.
  • Will try to reproduce in R for a head-to-head shoot-out.
  • I also plan to look into the new shinyAce browser-based R editor for similar work.
  • I’m still really enjoying Emacs Org-mode for this kind of functionality though, and thoroughly recommend KJ Healy’s starter kit.
  • (I’ve installed and configured this on Ubuntu, Windoze and Mac, but LaTeX and exporting R code can be tricky.)

Posted in  Data Documentation

