Welcome to my Open Notebook

This is an Open Notebook with Selected Content - Delayed. All content is licensed under CC-BY. Find out more here.

ONS-SCD.png

If You Don't Find A Solution In R, Keep Googling!

I’ve learnt this lesson multiple times. It happens like this: a solution is not immediately obvious in R, so you think of writing your own function, when generally there is a solution you just did not google hard enough for. This time I was tricked a little, because R’s GIS functions were weak for a long time but have been improving very rapidly recently. A while ago I had a very successful outcome using the raster::extract function on a large raster file to get the attributes for a set of points. I needed to do the same thing, but this time with a shapefile and points. I looked at the raster package and saw that raster::intersect could be used here; it worked on the small sample data I tested with, but failed on the big dataset when it ran out of memory. I assumed that R had not caught up with the GIS world yet, so I came up with the workaround below, which splits the points data layer into chunks.

I then got access to ArcMap and wondered whether it could do it, and it DID! So I googled a bit more and found the solution was simple:

Code:

sp::over()
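For the record, sp::over() takes the points layer and the polygons layer (in the same CRS) and returns the polygon attributes for each point. A minimal self-contained sketch with made-up toy data (all object names here are hypothetical):

```r
library(sp)

# one unit-square polygon carrying an attribute
sq  <- SpatialPolygons(list(Polygons(list(Polygon(
         cbind(c(0, 1, 1, 0), c(0, 0, 1, 1)))), ID = "1")))
ply <- SpatialPolygonsDataFrame(sq, data.frame(zone = "A", row.names = "1"))

# two points: one inside the square, one outside
pts <- SpatialPoints(cbind(x = c(0.5, 2), y = c(0.5, 2)))

# over() returns a data frame of polygon attributes, one row per point,
# with NA where a point falls outside every polygon
res <- sp::over(pts, ply)
res$zone
```

No looping, no chunking, and no intermediate geometry is built, which is why it copes with inputs that exhaust memory under raster::intersect.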

Here is my hack in case I ever need to pull out the bit that does the splitting up of the points file, or the tryCatch():

Code:

big_pt_intersect <- function(pts, ply, chunks = 100){
  # split the attribute rows into chunks; the first column holds the IDs
  idx <- split(pts@data, 1:chunks)
  chunk_out <- NULL
  for(i in seq_along(idx)){
    print(i)
    ids <- idx[[i]][, 1]
    qc <- pts[pts@data[, 1] %in% ids, ]
    # intersect this chunk; on error report it and move on to the next
    chunk <- tryCatch(
      raster::intersect(qc, ply),
      error = function(err){ print(err); NULL })
    if(is.null(chunk)) next
    if(is.null(chunk_out)){
      chunk_out <- chunk@data
    } else {
      chunk_out <- rbind(chunk_out, chunk@data)
    }
  }
  return(chunk_out)
}
# NB the warning from split() about data length not being a multiple of the
# split variable is not fatal, just due to non-equal chunks
# (ie the geocodes are 2009/100)

Posted in  disentangle


Templates are Needed for Reproducible Research Reports (that Look Good)

I read with interest the Transparency and Openness Promotion (TOP) Committee's template guidelines for enhancing transparency in the science that journals publish.

Citation

Supplementary Materials for Nosek, B. A., Alter, G., Banks, G. C.,
Borsboom, D., Bowman, S. D., Breckler, S. J., … Yarkoni,
T. (2015). Promoting an open research culture. Science, 348(6242),
1422–1425. doi:10.1126/science.aab2374

I think, though, that guidelines like the suggestion to copy-paste bits of the manuscript leave a bit to be desired:

Quote:

Authors document compliance by copy-pasting the relevant passages
in the paper that address the question into the form. For example,
when indicating how sample size was determined, authors copy paste
into the form the text in the paper that describes how sample size
was determined.

Reproducible Research Reports solve this problem by ensuring that the data preparation and analysis are executed in the same script that produces the manuscript, making it a one-stop shop for documentation of the entire study.

There is a need for Templates of Reproducible Research Reports (that look good!)

RStudio provides very easy support for these documents if you use R. In particular, a menu button creates a new report populated with the required header information and some example script to work from. But the easiest option does not look so good. This is the R Markdown option, which is very user friendly in terms of the markup needed to write the descriptive text around your analysis (mostly plain text, with a few simple options for heading styles etc.), as opposed to the Sweave option, which leads to full-blown LaTeX, a markup language that is a lot more complicated.

Boilerplate Rmarkdown header from Rstudio:

---
title: "Untitled"
author: "Ivan C. Hanigan"
date: "16 September 2015"
output: html_document
---

This is great for quick reporting of work as you go, but I primarily write for output that will be printed (e.g. pdf docs). More specifically, I need the concept of a page, and to have full control over the placement of table and figure ‘environments’, stuff that is easy in LaTeX (once you figure out some of the esoteric parts of that language).

To achieve a simple writing environment in Markdown with the powerful layout options of LaTeX, I reviewed this guy's work, but I think it takes things to an unnecessary level of complication: https://github.com/jhollist/manuscriptPackage.

So I went back to some of the old Sweave/LaTeX templates I had put together and ported them into a markdown header.

Boilerplate Rmarkdown header for pretty report

---
title: "Untitled"
author: "Ivan C. Hanigan"
date: "16 September 2015"
header-includes:
  - \usepackage{graphicx}
  - \usepackage{fancyhdr} 
  - \pagestyle{fancy} 
  - \usepackage{lastpage}
  - \usepackage{float} 
  - \floatstyle{boxed} 
  - \restylefloat{figure} 
  - \usepackage{url} 
  - \usepackage{color}
  - \lhead{Left Header}
  - \chead{Rmarkdown Rocks}
  - \rhead{\today}
  - \lfoot{Left Footer}
  - \cfoot{Centre Footer}
  - \rfoot{\thepage\ of \pageref{LastPage}}  
output: 
  pdf_document:
    toc: false
documentclass: article
classoption: a4paper
bibliography: references.bib
---

Now the layout of tables and figures is done with LaTeX.

Code

Using the xtable package allows results to be displayed in tables, and it has
built-in support for some R objects; so, summarising the linear fit above in
Table~\ref{ATable}:
  
```{r, results='asis'}
library(xtable)
print(xtable(fit, caption = "Example Table", digits = 4, label = "ATable"),
      table.placement = "ht", comment = FALSE)
```
   
## A Plot
   
Plots integrate most easily if made separately, as can be seen in Figure~\ref{test}.
```{r}
png("Rmarkdownfig.png")
plot(x,y,main="Example Plot",xlab="X Variable",ylab="Y Variable")
abline(fit,col="Red")
dev.off()
```
\begin{figure}[H]
\begin{center}
\includegraphics[width=.5\textwidth]{Rmarkdownfig.png}
\end{center}
\caption{Some Plot}
\label{test}
\end{figure}
\clearpage

I also realised that if this was to be a full report of a scientific study it would need to include some of the machinery needed for bibliographies.

Stuff for bibliographies

```{r, echo=F, results = 'hide', message = F, warning=F}
library("knitcitations")
library("bibtex")
cleanbib()
cite_options(citation_format = "pandoc", check.entries = FALSE)
 
bib <- read.bibtex("C:/Users/Ivan/Dropbox/references/library.bib")
 
```
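With the bibliography loaded, works can then be cited inline in the running text; the key below is a hypothetical example. Anything cited this way is recorded by knitcitations, so that the write.bibtex() call at the end of the document can collect the entries into references.bib:

```
Open practices improve credibility `r citep(bib[["Nosek_2015"]])`.
```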

<!--Put data analysis and reporting here, then at the end of the doc-->

```{r, echo=F, message=F, eval=T}
write.bibtex(file="references.bib")
```
      
# References

<!--The bib will then be written following the final subheading-->

Conclusion

I hope this might help others develop their own templates for RRR that look great.

Posted in  disentangle reproducible research reports


task-management-like-an-open-science-hacker

I just read this impressive paper, and it has really given me a push toward making this open lab notebook.

Citation

Nosek, B. A., et al. (2015). Promoting an open research
culture. Science, 348(6242), 1422–1425. doi:10.1126/science.aab2374

Quote

The situation is a classic collective action problem. Many individual researchers lack
strong incentives to be more transparent, even though the credibility of science would 
benefit if everyone were more transparent.

So I think I’ll try to step up the pace of logging my daily scientific work. One super easy thing to do is to publish the daily log from my task management in orgmode. I am also currently reading this guy, who says

Quote

The core of your documentation is the research log.
   
Long, S. (2015). Reproducible Results and the Workflow of Data Analysis. 
Retrieved from http://www.indiana.edu/~jslsoc/ftp/WIM/wf wim 2015 2015-08-21@3.pdf

Finally, I was struck by this reference (http://rich-iannone.github.io/about/2014/10/28/introduction.html) to 365+ day GitHub streaks. The idea was covered earlier by Geoff Greer, and by Dirk Eddelbuettel.

It seems the basic concept is that you can leverage an obsessive tendency by making sure you do something toward ticking items off the task list every day. The impulse not to break the chain is supposed to give you the inspiration to keep going. I think this might work well for my temperament.

Emacs and orgmode

The set-up of my daily log is pretty simple, after being set up with kjhealy's starter kit. I modified the org-agenda-files variable, which was set in the main el file that kjhealy provided, and then with the command C-c a a emacs displays my calendar.

When I open emacs in the morning I open the agenda, which also opens the research-log file. I move to that buffer, then use this key command to insert a new entry for today's date:

CODE

 (define-skeleton org-journalentry
   "Template for a journal entry."
   "project:"
   "*** " (format-time-string "%Y-%m-%d %a") " \n"
   "**** TODO-list \n"
   "***** TODO \n"
   "**** timesheet\n"
   "#+begin_src txt :tangle work-log.csv :eval no :padline no\n"
   (format-time-string "%Y-%m-%d %a") ", " str ", 50\n" 
   "#+end_src\n"
 )
 (global-set-key [C-S-f5] 'org-journalentry)

This creates a new date heading, a stub TODO for anything ad hoc, and an entry in my work-log.csv timesheet file.

I then select TODO items from a global list that I keep at the top of the file, and cut/paste them into today's list.


Great so I just moved this research-log orgmode file into my blog github repo, and with the help of charlie park’s bash script I am good to go

CODE

alias build_blog="cd ~/projects/ivanhanigan.github.com.raw; jekyll b;
cp -r ~/projects/ivanhanigan.github.com.raw/_site/* ~/projects/ivanhanigan.github.com;
cd ~/projects/ivanhanigan.github.com; git add .; git commit -am 'Latest build.'; git push"
alias bb="build_blog"

So this will put the resulting changes onto my open lab book website here https://raw.githubusercontent.com/ivanhanigan/ivanhanigan.github.com/master/work-log.org

Things to note:

  • I found this list of tips http://natashatherobot.com/streak-github-mistakes/
  • In particular I realise I need to make my daily push by 4:50 PM in Canberra ACT as this is 11:50 PM the previous day for Github, Pacific Time (PT)
  • I also will need to ensure I don’t publish sensitive (or embarrassing) entries.
  • I’ll try to keep the identity of my collaborators private as well, so just use their initials rather than names.

Posted in  disentangle


how-to-say-why-before-what

I have discovered a flaw in my writing style.
I often say what it is before I say why it is important.

Example

Disentangling health effects of environmental from social factors
is difficult for a variety of reasons. The effort to examine and
to separate environmental and social causes is nevertheless
valuable. [WHY IS IT VALUABLE?] This is especially important to
policy makers and to others who seek to maximise the public
good. A greater understanding of their respective contributions
will lead to more rational, deep-seated, lasting and effective
interventions.

The caps question is from someone reading my draft. I need to start with the why. Perhaps just turn the paragraph on its head?

Example

A greater understanding of the respective contributions from
environmental and social factors will lead to more rational,
deep-seated, lasting and effective interventions by policy makers
and by others who seek to maximise the public good. Disentangling
health effects of environmental from social factors is difficult
for a variety of reasons. The effort to examine and to separate
environmental and social causes is nevertheless valuable.

Posted in  disentangle


tracking-a-data-analysis-pipeline

I have just uploaded a new version of the Windows build of my ‘disentangle’ package. The blurb of the draft vignette is below.

Introduction

It can be much easier to understand a complicated data analysis pipeline conceptually than it is to implement that pipeline effectively. This report outlines the use of the ‘disentangle’ R package, available from http://ivanhanigan.github.io/projects.html. The package contains functions developed to help data analysts map out all aspects of their work when planning and conducting complicated data analyses using the pipeline concept. There are often many steps in the design and analysis of a study, and putting these together as a data analysis pipeline addresses the challenge of reproducibility (Peng 2006). The credibility of a data analysis requires that every step can be scrutinised (Leek 2015).

Motivating scientific questions

The type of data analysis that is the focus of this work is more complicated than simply loading some already-cleaned data, fitting some models and reporting some output. Typically, the projects these tools are aimed at involve attempts to control for a large number of inter-relationships and associations between variables. It is especially problematic that these variables have to be selected by the scientists from a multitude of possible variables and a plethora of possible data sources, during a long process of data collection, cleaning, exploration and decision making in preparation for analysis. There are also many steps and decision points in the process of model building and model checking. Statistical models involving many entangled environmental and social variables can easily produce spurious associations that may be mistakenly interpreted as causation. Projects that the author has been involved in include explorations of hypotheses about health effects of droughts, bushfire smoke, heat-waves and dust-storms, which produced novel findings and informed controversial debates about the implications of climate change. The requirement to adequately convey the methods and results of this research was problematic, and motivated the work on effective use of reproducible research techniques and data analysis pipelines.

Posted in  disentangle