Welcome to my Open Notebook

This is an Open Notebook with Selected Content - Delayed. All content is licenced with CC-BY. Find out more Here.

ONS-SCD.png

Organising Graph Nodes And Edges In A Dataframe

I use the R package DiagrammeR for creating graphs (the formal kind, connecting nodes with edges)

  • This package is great and I like how it interacts with the Graphviz program
  • One thing that I like to do in planning and organising data analysis projects is to make graphs and lists of the methods steps, inputs and Outputs
  • A simple way to organise these things is in a dataframe (table) with a column for each step (node) and two others for inputs and outputs (edges)
  • In my utilities R package github.com/ivanhanigan/disentangle I have written functions that turn this table into a graphiviz DOT language script
  • Recently I have needed to unpack the list for a more itemized view
  • Both these functions are showcased below

Code: newnode

# First create some test data, each step is a collection of edges 
# with inputs or outputs simple comma seperated lists
dat <- read.csv(textConnection('
cluster ,  step    , inputs                    , outputs                                , description                      
A  ,  siteIDs      , "GPS, helicopter"          , "spatial, site doco"                 , latitude and longitude of sites  
A  ,  weather      , BoM                       , exposures                              , weather data from BoM            
B  ,  trapped      , spatial                   , trapped_no                             , counts of species caught in trap 
B  ,  biomass      , spatial                   , biomass_g                              ,                                  
B  ,  correlations , "exposures,trapped_no,biomass_g" , report1                         , A study we published             
C  ,  paper1       , report1                   , "open access repository, data package" ,                                  
D  ,  biomass revision, new estimates          , biomass_g                              , this came late
'), stringsAsFactors = F, strip.white = T)    
str(dat)
# dat
      
# Now run the function and create a graph
nodes <- newnode(
  indat = dat,
  names_col = "step",
  in_col = "inputs",
  out_col = "outputs",
  desc_col = "description",
  clusters_col = "cluster",
  nchar_to_snip = 40
  )  
sink("Transformations.dot")
cat(nodes)
sink()
#DiagrammeR::grViz("Transformations.dot")
system("dot -Tpng Transformations.dot -o Transformations.png")
browseURL("Transformations.png")

That creates this diagram

/images/Transformations.png

Now to showcase the tool that itemizes this list of inputs and outputs

  • The original table has no capacity to add detail about each node as they are held as a list of inputs and outputs
  • To add detail for each we need to unpack each list and create a new table with one row per node
  • I decided to make this a long table with an identifier for each node about which step (edge) the node is an input or an output

Code: newnode_csv

nodes_graphy <- newnode_csv(
  indat = dat,
  names_col = "step",
  in_col = "inputs",
  out_col = "outputs",
  clusters_col = 'cluster'
  ) 
# which creates this table
knitr::kable(nodes_graphy)
|cluster |name             |in_or_out |value                  |
|:-------|:----------------|:---------|:----------------------|
|A       |siteIDs          |input     |GPS                    |
|A       |siteIDs          |input     |helicopter             |
|A       |siteIDs          |output    |spatial                |
|A       |siteIDs          |output    |site doco              |
|A       |weather          |input     |BoM                    |
|A       |weather          |output    |exposures              |
|B       |trapped          |input     |spatial                |
|B       |trapped          |output    |trapped_no             |
|B       |biomass          |input     |spatial                |
|B       |biomass          |output    |biomass_g              |
|B       |correlations     |input     |exposures              |
|B       |correlations     |input     |trapped_no             |
|B       |correlations     |input     |biomass_g              |
|B       |correlations     |output    |report1                |
|C       |paper1           |input     |report1                |
|C       |paper1           |output    |open access repository |
|C       |paper1           |output    |data package           |
|D       |biomass revision |input     |new estimates          |
|D       |biomass revision |output    |biomass_g              |

This can now be useful for making a ‘shopping list’ of the data to aquire, transform, analyse or archive.

Posted in  disentangle


blog comments powered by Disqus