Disentangle Things by Ivan Hanigan

Welcome to my Open Notebook

This is an Open Notebook with Selected Content - Delayed. All content is licenced with CC-BY. Find out more Here.

Organising Graph Nodes And Edges In A Dataframe

I use the R package DiagrammeR for creating graphs (the formal kind, connecting nodes with edges)

This package is great and I like how it interacts with the Graphviz program
One thing that I like to do in planning and organising data analysis projects is to make graphs and lists of the methods steps, inputs and Outputs
A simple way to organise these things is in a dataframe (table) with a column for each step (node) and two others for inputs and outputs (edges)
In my utilities R package github.com/ivanhanigan/disentangle I have written functions that turn this table into a graphiviz DOT language script
Recently I have needed to unpack the list for a more itemized view
Both these functions are showcased below

Code: newnode

# First create some test data, each step is a collection of edges 
# with inputs or outputs simple comma seperated lists
dat <- read.csv(textConnection('
cluster ,  step    , inputs                    , outputs                                , description                      
A  ,  siteIDs      , "GPS, helicopter"          , "spatial, site doco"                 , latitude and longitude of sites  
A  ,  weather      , BoM                       , exposures                              , weather data from BoM            
B  ,  trapped      , spatial                   , trapped_no                             , counts of species caught in trap 
B  ,  biomass      , spatial                   , biomass_g                              ,                                  
B  ,  correlations , "exposures,trapped_no,biomass_g" , report1                         , A study we published             
C  ,  paper1       , report1                   , "open access repository, data package" ,                                  
D  ,  biomass revision, new estimates          , biomass_g                              , this came late
'), stringsAsFactors = F, strip.white = T)    
str(dat)
# dat
      
# Now run the function and create a graph
nodes <- newnode(
  indat = dat,
  names_col = "step",
  in_col = "inputs",
  out_col = "outputs",
  desc_col = "description",
  clusters_col = "cluster",
  nchar_to_snip = 40
  )  
sink("Transformations.dot")
cat(nodes)
sink()
#DiagrammeR::grViz("Transformations.dot")
system("dot -Tpng Transformations.dot -o Transformations.png")
browseURL("Transformations.png")

That creates this diagram

/images/Transformations.png

Now to showcase the tool that itemizes this list of inputs and outputs

The original table has no capacity to add detail about each node as they are held as a list of inputs and outputs
To add detail for each we need to unpack each list and create a new table with one row per node
I decided to make this a long table with an identifier for each node about which step (edge) the node is an input or an output

Code: newnode_csv

nodes_graphy <- newnode_csv(
  indat = dat,
  names_col = "step",
  in_col = "inputs",
  out_col = "outputs",
  clusters_col = 'cluster'
  ) 
# which creates this table
knitr::kable(nodes_graphy)
|cluster |name             |in_or_out |value                  |
|:-------|:----------------|:---------|:----------------------|
|A       |siteIDs          |input     |GPS                    |
|A       |siteIDs          |input     |helicopter             |
|A       |siteIDs          |output    |spatial                |
|A       |siteIDs          |output    |site doco              |
|A       |weather          |input     |BoM                    |
|A       |weather          |output    |exposures              |
|B       |trapped          |input     |spatial                |
|B       |trapped          |output    |trapped_no             |
|B       |biomass          |input     |spatial                |
|B       |biomass          |output    |biomass_g              |
|B       |correlations     |input     |exposures              |
|B       |correlations     |input     |trapped_no             |
|B       |correlations     |input     |biomass_g              |
|B       |correlations     |output    |report1                |
|C       |paper1           |input     |report1                |
|C       |paper1           |output    |open access repository |
|C       |paper1           |output    |data package           |
|D       |biomass revision |input     |new estimates          |
|D       |biomass revision |output    |biomass_g              |

This can now be useful for making a ‘shopping list’ of the data to aquire, transform, analyse or archive.

Posted in disentangle