Welcome to my Open Notebook

This is an Open Notebook with Selected Content - Delayed. All content is licensed under CC-BY. Find out more here.

ONS-SCD.png

Workflow flowcharts - Update

A while back I posted about my work with the Rgraphviz toolbox toward a wrapper function that will allow me to track the connections between chunks of my code as I write it. This update includes notes from a discussion I had with Keith about this.

The basic use case is described by this code:

library(disentangle)
# Decide on the first step and create the starting node of the flowchart
nodes <- newnode(name="NAME", inputs="INPUT", newgraph=TRUE)
# Do the first step (a bit of work)
NAME <- myWorkFunction(INPUT)
# Decide on the next step
nodes <- newnode(name="OUTPUT", inputs=c("NAME","ANOTHER THING"))
# Do the second step
OUTPUT <- mySecondFunction(NAME, ANOTHER_THING)

workflow-flowchart1.png

As the scripted workflow develops, so does the flowchart. The alternative I described previously is to make a list of all the steps in a Balzerian Filelist Table.

Keith’s comments:

Keith gave me some great comments on that post. I’ll just jot them down here for now but will have to come back and re-write parts of my function and the descriptive blog post to address these at another time.

First section:

  • I said that multiple datasets used in an analysis could be broken down into discrete data packages. Keith asked if I thought they “should be”.
  • Keith questioned whether the data warehouse at my work really is “Big Data”.
  • I need to cite the Graphviz and Rgraphviz software better.
  • The function currently requires users to specify newgraph = T to start, and does nothing if the “nodes” object doesn’t exist. This is unnecessary: the function should create “nodes” if it doesn’t exist. Then the newgraph argument is only needed to delete an old graph and start again.
  • Is it required that any of these actually be new? If not, then “newnode” seems a poor name for the function.
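The point about creating “nodes” automatically could be handled roughly like this. This is only a minimal sketch in base R, not the actual disentangle code: the function name add_step and the private environment .flow are hypothetical names invented for illustration. The idea is that the edge list is created on first use, so newgraph is only needed for a reset.

```r
# Hypothetical sketch: NOT the disentangle::newnode implementation.
# The edge list lives in a private environment, so the caller never
# has to create or reassign a "nodes" object.
.flow <- new.env()

add_step <- function(name, inputs, newgraph = FALSE) {
  # Start a fresh edge list on first use, or when a reset is requested
  if (newgraph || !exists("edges", envir = .flow, inherits = FALSE)) {
    .flow$edges <- data.frame(from = character(0), to = character(0),
                              stringsAsFactors = FALSE)
  }
  # One row per input -> name link
  new_edges <- data.frame(from = inputs, to = name,
                          stringsAsFactors = FALSE)
  .flow$edges <- rbind(.flow$edges, new_edges)
  invisible(.flow$edges)
}

# No newgraph argument is needed to get started
add_step(name = "NAME", inputs = "INPUT")
add_step(name = "OUTPUT", inputs = c("NAME", "ANOTHER THING"))
print(.flow$edges)
```

Keeping the state inside an environment also means there is nothing to assign on each call, at the cost of hiding where the graph is stored.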

From section “Code:adding nodes”:

  • The line where output is the name is confusing. Keith felt that output was not the name of the new node (but it really is).
  • The name of the nodes object is hardcoded to be “nodes”, so there is no need to give it as an argument.
  • Keith proposed a better syntax: newnode(name = "ANOTHER THING", output = "OUTPUT"), which I think will also work?
  • Wouldn’t it be great if links could be labelled? Maybe it is better to focus on links rather than nodes? This seems more logical because links are transitions, and that is where the action is, not at the nodes.
  • It might be helpful to review common actions to see what graph structures each implies.
  • RE the lab-book from Chemistry: most people would not be familiar with how lab-books are written.
  • Can newnode build circuits/cycles? Can it add links between existing nodes?
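To explore the idea of focusing on links rather than nodes, here is a rough sketch of an edge-first design in base R. It is an assumption about how this could work, not how newnode behaves: add_link and to_dot are hypothetical helper names, and the link labels are made up. Each link carries a label for the action, and the result is written out as Graphviz DOT text. Because the structure is just a list of links, adding a link between existing nodes (including one that closes a cycle) needs no special handling.

```r
# Hypothetical sketch: an edge-first design, independent of disentangle.
add_link <- function(edges, from, to, label = "") {
  # Append one labelled link; rbind() ignores the initial NULL
  rbind(edges, data.frame(from = from, to = to, label = label,
                          stringsAsFactors = FALSE))
}

to_dot <- function(edges) {
  # One DOT statement per link, with the action as the edge label
  stmts <- sprintf('  "%s" -> "%s" [label="%s"];',
                   edges$from, edges$to, edges$label)
  paste(c("digraph workflow {", stmts, "}"), collapse = "\n")
}

edges <- NULL
edges <- add_link(edges, "INPUT", "NAME", "myWorkFunction")
edges <- add_link(edges, "NAME", "OUTPUT", "mySecondFunction")
edges <- add_link(edges, "ANOTHER THING", "OUTPUT", "mySecondFunction")
# A link between two existing nodes, which here also closes a cycle
edges <- add_link(edges, "OUTPUT", "INPUT", "revise inputs")

dot <- to_dot(edges)
cat(dot, "\n")
```

The DOT text can then be saved to a file and rendered with the Graphviz dot command.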

Section after “shows this result”

  • tablexyz -> FileA -> Input1 should be one step because FileA is not used again. Ditto FileB.
  • analysisResults is a different kind of thing. This is confusing.
  • Why do I have to type “nodes <-” every time? If it has to have this name, it could be hidden.

Under example two, “Step three”

  • huh? does input1 have “id”?

  • “The result” graph is a bit dense and hard to follow. Perhaps simpler labels, i.e. just “A” instead of FileA?

Posted in  research methods


Introduction to ONS theory and practice

The reasons why ONS is so appealing to me can be broken into two parts. The first is about the problems which it solves, the second is about the benefits it might bring.

The Problems

  • Identifying errors (either from miscalculations or from methodological mistakes)
  • Uncovering fraud

The Benefits

  • Sharing interests and skills
  • Quickly finding out about new discoveries and failures

These benefits come from the enhanced potential for theoretical discussions and sharing ideas. This is especially valuable around difficult theories, unknown issues or esoteric ideas that are known only to specialists in a field.

Posted in  research methods


Using Orgmode and Jekyll for Open Notebook

Orgmode is a great notebook tool because it combines coding, evaluation and documentation all in one place. I also want to use it to send the documentation to my blog as an Open Notebook.

If starting again I’d look into this:

But as it is, I have already put a lot of work into configuring a Jekyll blog that I cloned from Scott Chamberlain over at rOpenSci, so I will just use Orgmode to publish the posts related to each project, tagged as ‘categories’.

But here is a problem I just found out how to solve. For a long time I thought that, because GitHub disabled Ruby plugins, the automatic generation of category index pages was broken. Luckily Charlie Park has written up the following solution and this seems to have worked for me today:

Cheers!

Posted in  research methods


Toward a unified ecology dataset

Ecology datasets

These are my notes from a meeting that Kathryn and I had with a group of ecologists at the ANU (primarily Luciana Porforio and Nasreen Khan). We asked them to discuss how they search for and use ecology datasets, especially how best to package up the parts of an ecological field data collection (i.e. weather, vegetation, biodiversity, soils, topography etc).

Lu started off the discussion by stating that the most important thing to acknowledge is that every ecologist will start off with a main research question and then search for data that will address their specific research question. It is difficult to work from a ‘top-down’ perspective that hopes to pre-empt the range of possible questions. Lu felt that it may therefore be best to just keep all the data together in the biggest bundle that is possible and the end user can pick it apart once downloaded.

We explained that LTERN datasets can be quite expansive with many dimensions and it seemed preferable to at least untangle the main ‘themes’ for packaging up.

Nasreen pointed out that there is always a protocol for how data are collected and this should give the data collection its structure. However, I felt that ecology collections are so diverse that they have been made (by necessity) very flexible and specific to the needs of the individual plot network. Therefore generalisations across data collections are very hard to make (apart from easy things like “weather” or “aboveground dead biomass”).

Toward a Unified Ecology

I always fall back on the textbook “Toward a Unified Ecology” (Timothy F. H. Allen and Thomas W. Hoekstra, 1992). I wondered if it can guide us? On pages 42-53 they describe the following framework and use the image below (the letters in the middle disc correspond to the criteria, i.e. O = Organism). In this framework it is possible to summarise ANY ecological study, as they ALWAYS incorporate these scale-independent criteria:

  • Organism: genetic integrity, discrete body, autonomy from other organisms
  • Population: relative similarity within the group
  • Community: inter-species competition, interference, mutualism
  • Ecosystem: biotic and abiotic interactions
  • Landscape: spatial structure/contiguity
  • Biome: characteristic physiognomy, disturbance and climate

I thought that if these dimensions were identified in a data collection first then they might become the discrete packages by which each plot network publishes their collection?
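As a rough illustration of what that might look like in practice, the sketch below tags each variable in a collection with one of the six criteria and then splits the catalogue into one package per criterion. The variable names and the criteria assigned to them are invented for illustration only; they are not taken from any real plot network.

```r
# Hypothetical variable catalogue for one plot network; the variable
# names and their assigned criteria are made up for this example.
criteria <- c("Organism", "Population", "Community",
              "Ecosystem", "Landscape", "Biome")

catalogue <- data.frame(
  variable  = c("tree_height", "stem_count", "species_cover",
                "soil_nitrogen", "litter_mass",
                "patch_connectivity", "mean_annual_rainfall"),
  criterion = c("Organism", "Population", "Community",
                "Ecosystem", "Ecosystem",
                "Landscape", "Biome"),
  stringsAsFactors = FALSE
)

# Fix the order of the criteria so every package appears, in order
catalogue$criterion <- factor(catalogue$criterion, levels = criteria)

# One discrete package (a vector of variable names) per criterion
packages <- split(catalogue$variable, catalogue$criterion)
str(packages)
```

In a real collection the hard part would be agreeing on the criterion for each variable, which is exactly the judgement the Data Custodians would need to make.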

datadoco-layercake.png

AEKOS

Over at AEKOS they have a similar conceptual framework:

Observations can range from that of individual organisms and
interactions, through to populations, communities, ecosystems and
across broad global landscapes.

Conclusions

This is an open issue. More discussions are needed internally for the Data Custodians.

Lu also pointed out that the end users are the key stakeholders and perhaps more input from them (via surveys and workshops?) is needed?

Posted in  Data Documentation


Starting my Open Notebook Science Blog

Many examples are emerging of scientists who are transitioning to a much more open model of research. This is in part externally driven by funding bodies (such as the Australian Research Council asking for deposit of funded data and papers) and journals (e.g. Nature journals removing length restrictions on Methods sections). Also, the increased value being placed on the transparency of reproducible analysis to safeguard against error and fraud is becoming an internal driver within science communities.

The Open Notebook Science (ONS) style is at the extreme end of transparent approaches to research. According to the Wikipedia page it is the “practice of making the entire primary record of a research project publicly available online as it is recorded”.

That’s pretty extreme! In my view a lot of stuff in the research project should probably be archived quickly and left to rot.

I like the range of options available. I think I’ll go for SCD or “Selected Content / Delayed” and show their image below. In this model a portion of the open notebook and associated supporting raw data are available after some delay. I’ll try to use this blog for weekly updates on progress for each project, and provide links off my ‘Open Notebook’ and ‘Software’ tabs.

ONS-SCD.png

Posted in  research methods