Welcome to my Open Notebook

This is an Open Notebook with Selected Content - Delayed. All content is licensed under CC-BY. Find out more here.

ONS-SCD.png

Using scholar rankings to provide weights in systematic literature reviews part 1

  • I’ve been thinking a lot recently about an approach used in this recent systematic review:
Vins, H., Bell, J., Saha, S., & Hess, J. (2015). The mental health
outcomes of drought: A systematic review and causal process
diagram. International Journal of Environmental Research and Public
Health, 12(10), 13251–13275. doi:10.3390/ijerph121013251

  • They identify causal pathways from papers and ascribe supporting evidentiary weight based on the number of published papers whose findings support each cause-effect pathway
  • The raw number of papers is probably not a good metric and is prone to bias, so I have been thinking of ways to ascribe weight based on the quality of the journal or the authors (a rough sketch of the idea follows this list)
  • This is not supposed to replace the need to actually read the papers; it is purely an additional source of information
  • This recent post on scholar metrics provided some impetus via h-indices http://datascienceplus.com/hindex-gindex-pubmed-rismed/
  • I also think this approach of text mining the abstracts could be useful http://tuxette.nathalievilla.org/?p=1682
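
As a rough illustration of the idea (entirely hypothetical data and a naive weighting scheme, not the method used by Vins et al.), each pathway's evidentiary weight could be the sum of, say, the supporting papers' first-author h-indices rather than the raw paper count:

# hypothetical illustration: weight each cause-effect pathway by the
# h-indices of the supporting papers' first authors, rather than a raw count
papers <- data.frame(
  pathway      = c("drought -> income loss", "drought -> income loss",
                   "income loss -> distress"),
  first_author = c("AuthorA", "AuthorB", "AuthorC"),  # made-up
  h_index      = c(12, 3, 25)                         # made-up
)

# raw count of supporting papers per pathway
table(papers$pathway)

# h-index-weighted evidence score per pathway
aggregate(h_index ~ pathway, data = papers, FUN = sum)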

Sanity check of the two options using myself as a guinea pig

library(RISmed)

## search PubMed for my publications
x <- "hanigan+ic[Author]"
res <- EUtilsSummary(x, type="esearch", db="pubmed", datetype='pdat', mindate=1900,
  maxdate=2015, retmax=500)
str(res)

## citation counts for each record returned by the search
citations1 <- Cited(res)

## rank the papers by citation count (descending); the h-index is the
## largest rank r such that the r-th paper has at least r citations
citations <- data.frame(citations = sort(citations1, decreasing = TRUE))
citations$id <- seq_len(nrow(citations))
str(citations)
hindex <- max(which(citations$id <= citations$citations))

hindex
# 5

library(scholar)

## Google Scholar profile id (from the profile page URL)
myid <- "cGN1P0wAAAAJ"
y <- scholar::get_publications(myid)
str(y)
y[, c("author", "cites")]

## rank publications by citation count and compute the h-index the same way
y <- y[order(y$cites, decreasing = TRUE), ]
y$id <- seq_len(nrow(y))
hindex2 <- max(which(y$id <= y$cites))
hindex2
# 15

Clearly the choice of search engine, PubMed versus Google Scholar, makes a big difference to my score!
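
As a quick cross-check of the Google Scholar figure, the scholar package can also return the summary statistics reported on the profile page itself (a minimal sketch, assuming the same profile id as above):

## cross-check (assumes scholar::get_profile is available):
## h-index as reported on the Google Scholar profile page
profile <- scholar::get_profile(myid)
profile$h_index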

Posted in  disentangle


My project inventory

Auditing and inventorising

  • I just completed an audit of my project files and updated the list on my website http://ivanhanigan.github.com/projects.html
  • This was enabled by the work I have been doing on a data inventory web2py app https://github.com/ivanhanigan/data_inventory
  • The list shows some of the project data collections I have amassed during my research over the last 15 years
  • Some of these are data collections I have developed, others are derivatives of collections originated by others
  • Many of these are areas of active research, but others are dormant
  • This list will be updated as time permits.

Projects

  • 1 Air pollution
  • 2 Australian health
  • 3 Australian population
  • 4 Biodiversity and environmental change
  • 5 Bioregions
  • 6 Cardio-respiratory disease, biomass smoke, dust and heatwaves
  • 7 Climate Change
  • 8 Eco-social observatories
  • 9 Extreme Weather Events
  • 10 GIS
  • 11 Infectious diseases and local habitat
  • 12 Mental health and drought
  • 13 Mortality and morbidity effects from weather
  • 14 Reproducible research pipelines
  • 15 Medical geography theory and tools
  • 16 Roads and places
  • 17 Transformational adaptation
  • 18 Ultraviolet radiation
  • 19 Water
  • 20 Weather

Posted in  blog home


Show missingness in large dataframes with ggplot, thanks to an R blogger

Let’s try it out!

library(devtools)
# dependency of neato
install.packages("gbm")
install_github("tierneyn/neato")
library(neato)

# small example: a few locations with TRUE/FALSE flags;
# recode FALSE as NA so there is some missingness to display
locs <- c("Australia", "India", "New Zealand", "Sri Lanka", "Uruguay", "Somalia")
f1 <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
f2 <- c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
f3 <- c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE)
atable <- data.frame(locs, f1, f2, f3)
atable[atable == FALSE] <- NA
atable

png("ggplotmissing.png")
ggplot_missing(atable)
dev.off()

/images/ggplotmissing.png

  • The one I had problems with, because it was too large, was:
# Cool but what about a big one?
dat <- read.csv("~/path/to/file.csv")
str(dat)
png("ggplotmissing2.png", height=1800, width = 3000, res = 200)
ggplot_missing(dat)
dev.off()

/images/ggplotmissing2.png
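
One workaround for a data frame too big to plot comfortably (a sketch, not part of the neato package; the output file name is just an example) is to plot a random sample of rows so the tile plot stays legible:

# sketch: take a random sample of rows before plotting the missingness map
set.seed(123)
dat_sample <- dat[sample(nrow(dat), min(1000, nrow(dat))), ]
png("ggplotmissing_sample.png", height = 1800, width = 3000, res = 200)
ggplot_missing(dat_sample)
dev.off()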

Posted in  disentangle


Notes from Dr Climate on data reference syntax models for file organisation and naming

The example in the Dr Climate post organises files in a directory structure like this:

<computer>/<project>/<organisation>/<collection>/<facility>/<data-type>/<site-code>/<year>/

The data type has a sub-DRS of its own, which tells us that the data
represents the 1-hourly average surface current for a single month
(October 2012), and that it is archived on a regularly spaced spatial
grid and has not been quality controlled.

Just in case the file gets separated from this informative directory
structure, much of the information is repeated in the file name
itself, along with some more detailed information about the start and
end time of the data, and the last time the file was modified:

<project>_<facility>_V_<time-start>_<site-code>_FV00_<data-type>_<time-end>_<modified>.nc.gz

In the first instance this level of detail seems like a bit of
overkill... 

Since the data are so well labelled,
locating all monthly timescale ACORN data from the Turquoise Coast and
Rottnest Shelf sites (which represents hundreds of files) would be as
simple as typing the following at the command line:

$ ls */ACORN/monthly_*/{TURQ,ROT}/*/*.nc
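
The same lookup could be done from within R (a sketch: Sys.glob does not do the shell's brace expansion, so the two site codes are passed as separate patterns):

# sketch: the equivalent lookup from R, one glob pattern per site code
nc_files <- Sys.glob(c("*/ACORN/monthly_*/TURQ/*/*.nc",
                       "*/ACORN/monthly_*/ROT/*/*.nc"))
length(nc_files)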

Damien’s personalised DRS

Basic data files

<var>_<dataset>_<level>_<time>_<spatial>.nc

Sub-categories:

  • <time>: <tstep>-<aggregation>-<season>
  • <spatial>: <grid>-<region>-<bounds>-<np>

Where:

  • <tstep>: daily, monthly
  • <aggregation>: 030day-runmean, anom-wrt-1979-2011, anom-wrt-all
  • <season>: JJA, MJJASO
  • <grid>: native or something like y181x360, which describes the number of latitude (181) and longitude (360) points (in this case it is a 1 by 1 degree horizontal grid).
  • <region>: Region names are defined in netcdf_io.py
  • <bounds>: e.g. lon225E335E-lat10S10N or mermax, zonal-anom
  • <np>: North pole location, e.g. np20N260E

Examples include:
psl_Merra_surface_daily_y181x360.nc
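
A file name built this way can also be split back into its DRS fields mechanically; a minimal sketch (field names taken from the template above):

# sketch: split a basic-DRS file name back into its fields
fname <- "psl_Merra_surface_daily_y181x360.nc"
fields <- strsplit(sub("\\.nc$", "", fname), "_")[[1]]
names(fields) <- c("var", "dataset", "level", "time", "spatial")
fields
#      var   dataset     level      time    spatial
#    "psl"   "Merra" "surface"   "daily" "y181x360"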

More complex file names

<inside>_<filters>_<prev-var>_<dataset>_<level>_<time>_<spatial>.nc

Sub-categories:

  • <inside>: The variable inside the file. e.g. tas-composite, datelist
  • <filters>: e.g. samgt90pct (gt and lt are used for greater than and less than, pct for percentile)
  • <prev-var>: if it’s not obvious what variable <inside> was created from, include the previous variable/s

Examples:
tas-composite_pwigt90pct_ERAInterim_500hPa_030day-runmean-anom-wrt-all_native-sh.png

Principles of Tidy Data

In the words of Hadley Wickham, the order in which data should be arranged follows some generic principles:

'A good ordering makes it easier to scan the raw values. One way of
organizing variables is by their role in the analysis: are values
fixed by the design of the data collection, or are they measured
during the course of the experiment? Fixed variables describe the
experimental design and are known in advance. Computer scientists
often call fixed variables dimensions, and statisticians usually
denote them with subscripts on random variables. Measured variables
are what we actually measure in the study. Fixed variables should come
first, followed by measured variables, each ordered so that related
variables are contiguous. Rows can then be ordered by the first
variable, breaking ties with the second and subsequent (fixed)
variables.'
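
A small made-up illustration of that ordering: the fixed (design) variables site and year come first, the measured variables after them, and the rows are sorted by the first variable, breaking ties with the second:

## sketch with made-up data: fixed variables first, then measured variables
survey <- data.frame(
  site          = c("B", "A", "A", "B"),      # fixed by the sampling design
  year          = c(2014, 2015, 2014, 2015),  # fixed by the sampling design
  rainfall_mm   = c(512, 630, 598, 470),      # measured
  species_count = c(23, 31, 28, 19)           # measured
)
survey[order(survey$site, survey$year), ]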

An exemplar

In my last project the protocol we developed (for an ecology and biodiversity database) had a naming convention that relied heavily on a sequence of information used to order the names of folders, subfolders and files (a hypothetical example follows the list). The sequence is:

  1. The project name (and optional sub-project name)
  2. Data type (such as experimental unit, observational unit, and/or measurement methods)
  3. Geographic location (locality name, State, Country)
  4. Temporal frequency and coverage (such as annual or seasonal tranches).
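
For example (hypothetical project, data type, site and file names), a file stored under this convention might sit at:

MyEcologyProject-SubProjectA/observational-unit_vegetation-survey/SiteX-NSW-Australia/2013-annual.csv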

The concepts of slow moving dimensions and fast moving variables

The concept of dimensions and variables can be useful here, especially for deciding on filenames. Dimensions are fixed or change slowly, while variables change more quickly. By ‘change’ I mean that there are more distinct values of them across the collection. For example, the project name is ‘fixed’: it does not change across the files. The sub-project name does change, just slowly (there may be, say, 2-3 different sub-projects within a project). Then there may be a set of data types, and these ‘change’ more quickly than the sub-project name. The geographic and temporal variables might change quickest of all.

So a general rule for the order of things can be stated. The fixed and slowly changing variables should come first (those things that don’t change, or don’t change much), followed by the more fluid variables (or things that change more across the project). List elements can then be ordered so that the groups of things that are similar will always be contiguous, and vary sequentially within clusters.

So the only thing I disagree with Damien about is his decision to put the spatial field after the time field:

<var>_<dataset>_<level>_<time>_<spatial>.nc

This is because I think the geography of a data collection is more stable than its time period, and because most of my studies look at changes in variables measured at a location over time, I generally want to compare the same spot at multiple times. There are pros and cons to each approach: if the analyst wants to map a variable measured at several locations at a single point in time, then arranging the data by time first and then location may make that job simpler.
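
For example (hypothetical site name; the other fields are borrowed from the examples above), putting the spatial field before the time field keeps each site's files together, so a single wildcard picks up the whole time series for that site:

tas_Merra_surface_SiteA_2013-monthly.nc
tas_Merra_surface_SiteA_2014-monthly.nc

$ ls tas_Merra_surface_SiteA_*.nc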

I also notice, however, that the IMOS syntax puts the site spatial location before the year.

Posted in  disentangle


Visualisation tools for communicating data management concepts

Hyperlinked table of contents that looks like a filing system

This looks like it could be useful for displaying information about filing systems: a clickable table of contents laid out like the filing system itself.

Source: to be added (include a link to the PDF).

Posted in  disentangle