Modelling in Opasnet



The Opasnet modelling environment is an open, web-based platform for collaboratively developing numerical models.

[Figure: Opasnet wiki-R modelling platform]

Introduction

Opasnet has integrated tools for building and running easily accessible statistical models in the wiki. The platform is modular, and individual variables are designed to be reusable.

List of main concepts:

  • Universal variable structure and related principles
  • Variable wiki page format
  • Wiki-R platform
    • OpasnetUtils
    • Modelling practices
  • Data format and uploading

Main technical features

  • Wiki - pages provide a natural analogy to variables in a statistical model. They contain descriptive information as well as the necessary metadata (e.g. scope).
  • R - is an open-source statistical programming language akin to, for example, Matlab. The Opasnet wiki has an extension (R-tools) for including executable R scripts on any page. The output is displayed in HTML format as an applet or a separate tab.
  • Database - MongoDB is used to store variable-related data.
  • Interfaces - All these components need to work together, and we have built interface solutions for each combination: R-tools for running wiki-integrated R scripts, Table2Base (PHP) for uploading wiki tables to the database, the OpasnetBase wiki extension (PHP) for showing database entries in the wiki, and the opbase script family for communication between R and the database.
  • OpasnetUtils - is an R package (library) that contains tools for building mathematical models within our modelling framework, which is described below in detail. The OpasnetUtils platform is completely modular, recursive for ease of use (very simple commands can trigger the execution of a complex model) and fully customizable. A knowledgeable user can take over any automated model part and resume automation for the rest of the model. Use of the package revolves around ovariables.
  • All aspects of the Opasnet Modelling environment are open source and provided free of charge.

Usage

Mathematical models consist of variables, which may be known or unknown and which can be derived from other variables using further models. Modelling in Opasnet is variable-centric. Since variables are defined universally, they should be reusable, partly or wholly, in other models; naturally, more complex models with extremely large data sets need more customized and static definitions to run efficiently. In practice, known variables can be defined by writing parseable tables (t2b) on wiki pages, by uploading data sets directly to the Opasnet Base and downloading them in the variable-defining R code, or by using existing data tools within packages installed on the Opasnet server, for example ScraperWiki.
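For example, a known variable could be defined from a data set already uploaded to Opasnet Base along the following lines. This is a minimal sketch: the page identifier "Op_en1949" is hypothetical, and opbase.data() is assumed to return the stored table as a data.frame (see Uploading to Opasnet Base).

library(OpasnetUtils)

pop_data <- opbase.data("Op_en1949")      # download a data table from Opasnet Base (hypothetical ident)
pop <- Ovariable("pop", data = pop_data)  # wrap the data as an ovariable
pop <- EvalOutput(pop)                    # evaluate to produce the output slot
summary(pop)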

...

As discussed above, variables should be considered in isolation and defined to be as reusable as possible. However, multiple versions of the same variable may be desirable when the scopes of the models using the variable differ: for example, the population of the world by age, country, sex etc. versus the population of a single country defined on a spatial grid. While high data resolution is generally good, less specific data is sufficient for some models. If multiple versions of a variable exist, they should be listed under a specific category.

The bulk of any model is data. The main tool for storing data is the Opasnet Base. For information about the different ways of uploading data to the database please check Uploading to Opasnet Base. Any files uploaded to the Opasnet server can be accessed as well, though the preferred method is to upload straight to the database.

Formulas for latent variables depend on other variables, which in most cases eventually lead to data. Defining data-based variables is very straightforward: get all relevant information into one table and make sure there are no unnecessary columns to clutter the latent variables that use it. For well-defined and well-formatted tables (those that need no R code to clean them) the process is streamlined: a single function call creates an ovariable, downloads the relevant data from the database and saves the R object on the server.
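A hedged sketch of this streamlined definition, assuming that the ddata slot takes an Opasnet Base page identifier (here the hypothetical "Op_en5917") and that objects.store() saves the object on the server:

library(OpasnetUtils)

pop <- Ovariable("pop", ddata = "Op_en5917")  # data are downloaded when the ovariable is evaluated
objects.store(pop)                            # save the R object on the server for later reuse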

Formulas should be described and defined on the wiki page as well as in any R code that uses them. Sometimes a standard method is used in the formula, in which case the method page should be linked and a short description given of how the variables, the method and its parameters match up.

Formulas are essentially functions that may return an ovariable or a data.frame. For robustness, formulas should take '...' as a parameter, which prevents any excess parameters from triggering errors. Formula dependencies are defined in an ovariable slot and are handled automatically. The dependencies can then be operated on mathematically in the formula function like any other simple variable.
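A minimal sketch of such a formula; the dependency names "concentration" and "intake" are hypothetical:

library(OpasnetUtils)

exposure <- Ovariable("exposure",
    dependencies = data.frame(Name = c("concentration", "intake")),
    formula = function(...) {          # '...' keeps excess parameters from triggering errors
        out <- concentration * intake  # dependencies are used like any simple variable
        return(out)
    }
)
exposure <- EvalOutput(exposure)       # dependencies are fetched and the formula applied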

Scenarios, fictitious values and the like should be defined separately from the variables themselves. Since scenarios and custom inputs are often model-specific, they are defined at the model level and supplied upstream to be incorporated into the evaluated results of variables.
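A hedged sketch of the idea: the variable page defines a generic baseline, and a model-level script overrides the input before evaluation to create the deliberate deviation (the variable "emission" and its values are invented for illustration):

library(OpasnetUtils)

emission <- Ovariable("emission", data = data.frame(Result = 100))  # generic baseline from the variable page
emission@data$Result <- 150       # model-level scenario: a deliberate deviation
emission <- EvalOutput(emission)  # downstream variables now see the scenario value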

Project management in Opasnet

Opasnet aims to offer a systematic workspace with tools needed to manage data all the way from original observations to public discussions and decision support. Some related pages are also listed here.

List of related pages

This is a plan for an improved page structure for pages related to modelling, databases, and codes in Opasnet.

Portal:Modelling with Opasnet - Main page. Contains a brief introduction and links to the content.

Practices
Tools

Question

How should modelling be done in Opasnet in practice? This page should be a general guidance on principles, not a technical manual for using different tools.

What should be the main functionalities of Opasnet modelling environment such that

  • it supports decision analysis,
  • it supports BBNs and Bayesian inference,
  • it mainly contains modelling functionalities for numerically describing reality, but
  • it is also possible to numerically describe scenarios (i.e., deliberate deviations from the truth in order to be able to compare two alternative worlds that are the same in all respects other than the deliberate deviation).

Answer

For a general instruction about contributing, see Contributing to Opasnet.

Guidance for modelling in Opasnet:

  1. Structure: Answer should be a data table either on the page or uploaded to Opasnet Base using R code.
  2. Structure: The indices should logically match those of parent objects.
  3. Applicability: The question of an object should primarily be tailored according to the particular needs of the assessment under work, and only secondarily to general use.
  4. Coherence: The Answer of an object should be coherent with all information used in the object. In addition, it should be coherent with all other objects. If some information in another object affects the answer of this object, a link to the other object should be placed under Rationale, and specifically under Dependencies if there is a causal connection.
  5. Coherence: Ensuring coherence is a huge task. Therefore, simple things should be done first and more laborious ones only if there is a need. The order in which things should be done is usually this: a) Search for similar objects and topics in Opasnet. b) If found, make links to them in both directions. c) Discuss the related info in Rationale. d) Include the info in calculations of the Answer. e) Merge the two related objects into one larger object that contains all information from the two objects and that is internally coherent.
  6. Coherence: When you find two (or more) pieces of information about one topic, but the pieces are inconsistent, describe the Answer in this way (from simple to complex): a) Describe qualitatively what was found. b) Describe the Answer quantitatively as a list of possible hypotheses, one hypothesis for each piece of information. c) Describe the hypotheses probabilistically by giving the same probability to each hypothesis. d) Using expert judgement and/or open critical discussion, adjust probabilities to give less weight to less convincing hypotheses. e) Develop a probabilistic model that explicitly describes the hypotheses based on our understanding of the topic itself and the quality of the info, and use the info as input data.
  7. Multi-site assessment: When several similar assessments are to be performed for several sites, the structure of the assessments should contain a) a single page for the multi-site assessment, including code that has the site name as input, b) a single summary page containing a list of all sites and their individual pages, structured as a data table, c) an individual page for each site containing a data table with all site-specific parameter values needed in the assessment.
  8. Formula: Whenever possible, all computing code should be written in R.
  9. Formula: The same code should not be copied to several different pages. R code should provide results relevant to the question on the page. Relevant descriptions of other variables should mostly be retrieved using the dependencies hook (an automatically handled ovariable slot) or objects.latest().
  10. Formula: Some procedures repeat themselves over and over again in impact assessments. These can be written as functions. Common or important functions can be included in libraries that are available in R-tools. Search the R-tools libraries so that you learn to use the same functions as others do.
  11. Formula: When you develop your own functions with a general purpose, you should suggest them for addition to an R-tools library.
  12. Preferred R code: Variables should be described as ovariables where possible.
  13. Preferred R code: Probabilistic information is incorporated in a data.frame using the Iteration column, which contains the Monte Carlo iteration number from 1 to n (the sample size).
  14. Preferred R code: The ggplot2 graphics package is recommended for drawing graphs and diagrams.
  15. Preferred R code: Uploading to and downloading from Opasnet Base is done using the opbase family of functions (see ?opbase).
  16. Preferred R code: The summary function can be used to acquire standard summaries of variable distributions.
  17. Preferred R code: Ovariables can be operated on mathematically.
  18. Preferred R code: Ovariables can be merged (joined on common columns) using merge() and rbound (appended) using orbind() or combine() (see the sketch after this list).
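A small sketch of items 17 and 18; the ovariables a and b are invented for illustration:

library(OpasnetUtils)

a <- EvalOutput(Ovariable("a", data = data.frame(City = c("Espoo", "Oulu"), Result = 1:2)))
b <- EvalOutput(Ovariable("b", data = data.frame(City = c("Espoo", "Oulu"), Result = 3:4)))

a + b        # mathematical operation; outputs are joined on the common column City
merge(a, b)  # explicit join on common columns
orbind(a, b) # append the rows of b to those of a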

Links related to the answer: Data table · Opasnet Base · R · Parent object · Child object · R-tools · OpasnetBaseUtils · ggplot2 · tapply · merge · data.frame · rbind · cbind

Note! The text above talks about objects, meaning any kind of information object. The most common objects are variables.

Relationship of Answer and Rationale

All variable pages should have a clear question and a clear answer. The answer should typically be in a form of a data table that has all indices (explanatory columns) needed to make the answer unambiguous and detailed enough. If the answer table is very large, it might be a bad idea to show it on the page; instead, a description is shown about how to calculate the answer based on Dependencies and Rationale, and only a summary of the result is shown on the page; the full answer is saved into Opasnet Base.

The answer should be a clear and concise answer to the specific question, not a general description or discussion of the topic. The answer should be understandable to anyone who has general knowledge and has read the question.

In addition, the answer should be convincing to a critical reader who reads the following data and believes it is correct:

  • The Rationale section of the page.
  • The Answer sections of all upstream variables listed in the Dependencies section.
  • In some cases, also downstream variables may be used in inference (e.g. in hierarchical Bayes models).

It should be noted that the data mentioned above should itself be backed up by original research from several independent sources, good rationale etc. It should also be noted that ALL information needed to convince the reader should be put into the places mentioned and not somewhere else. In other words, when the reader has read the rationale and the relevant results, (s)he should be able to trust that (s)he is now aware of all the major points related to the specific topic that have been described in Opasnet.

This results in guidance for info producers: if there is a relevant piece of information that you are aware of but it is not mentioned, you should add it.

R help

This section contains some basic functionalities of R software as it is applied in Opasnet.

library(OpasnetUtils) # Loads the OpasnetUtils package. This is needed to use ovariables.

# Define a data frame with five rows and three columns.
dat <- data.frame( 
	A = 1:5, 
	B = c("a", "b", "c", "d", "e"),
	Result = c(6, 45, 2, 4.5, 2)
)

dat$A  				# The vector that forms column A
dat[["A"]]			# The vector that forms column A
dat[[1]]			# The vector that forms the first column (with name A).
dat["A"] 			# The column with name A (this is a data frame with one column).
dat[1] 				# The first column  (with name A; this is a data frame)
dat[c(1,3)]			# Data frame with the first and third columns.
dat[c("A", "Result")] 		# Data frame with the first and third columns.
dat[2:4, 1:2] 			# Data frame with rows 2 to 4 and columns 1 to 2.
dat[2:4, "Result"] 		# The vector that is formed from rows 2 to 4 of column Result.
dat$Result[2:4] 		# The same as previous.
dat[2:4, ]["Result"] 		# Data frame that first takes rows 2 to 4 and then column Result.
dat[2:4, "Result", drop = FALSE] # The same as previous. drop = TRUE (the default) turns one-column data.frames into vectors.
dat[ , colnames(dat) != "B"]	# Data frame without column B.
dat[ , !colnames(dat) %in% c("A", "B")] # Data frame without columns A and B.
dat$B <- NULL			# Permanently removes column B from data frame.
odat <- Ovariable("odat", data = dat) # Ovariable that has dat as data.
odat@data 			# Data slot of odat (equals the data.frame dat).
odat 				# All slots of odat.
odat <- EvalOutput(odat)	# Evaluate odat (i.e., calculate the output)
odat@output			# Output slot of odat.
result(odat)			# The result column (as vector) of the output slot of odat.
summary(odat) 			# Summary of odat output. If odat is probabilistic, summary includes mean and other statistics.
odat@marginal 			# Which columns in output are marginals?
colnames(odat@output) 		# Names of columns of the output of odat
colnames(odat@output)[odat@marginal] # Names of marginal columns.

Indices of the data table

The indices, i.e. explanatory columns, should match in variables that are causally connected in a causal diagram (i.e., mentioned in Dependencies). This does not mean that they must be the same (as not all explanations are relevant for all variables), but it should be possible to see which parts of the results of two variables belong together. An example is a geographical grid for two connected variables, such as the concentration field of a pollutant and an exposure variable for a population. If the concentration and population use the same grid, the exposure is easy to compute. The variables can also be used together with different grids, but then it must be explained how the data can be converted from one grid to the other for calculating exposures.
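A sketch of matching indices; the grid cells and values are invented for illustration:

library(OpasnetUtils)

concentration <- EvalOutput(Ovariable("concentration",
    data = data.frame(Grid = c("A1", "A2"), Result = c(10, 20))))
population <- EvalOutput(Ovariable("population",
    data = data.frame(Grid = c("A1", "A2"), Result = c(1000, 500))))

exposure <- concentration * population  # rows are matched on the shared Grid index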

Increasing the precision of the answer

This is a rough order of emphasis that could guide the work when starting from scratch and proceeding to highly sophisticated and precise answers. The first step always requires careful thinking, but if there are lots of easily available data, you may proceed through steps 2 - 4 quickly; with very little data it might be impossible to get beyond step 3.

  1. Describe the variables, their dependencies and their indices (explanations) to develop a coherent and understandable structure and causal diagram.
  2. Describe the variables as point estimates and simple (typically linear) relations to make the first runnable model. Check that all parts make sense. Check that all units are consistent. Whether all values and results are plausible is desirable but not yet critical.
  3. Describe the variables as ranges in such a way that the true value is most likely within the range. This is more important than having a very precise range (and thus a higher probability of not covering the truth). This may result in vague conclusions (like: it might be a good idea to do this, but on the other hand, it might be a bad idea). But that is exactly how it should be: in the beginning, we should be uncertain about conclusions. Only later, when we collect more data and things become more precise, do the conclusions become clearer as well. At this step, you can use sensitivity analyses to see where the most critical parts of your model are.
  4. The purpose of an assessment model is to find recommendations for actions. Except for the clearest cases, this is not possible using variable ranges. Instead, probability distributions are needed (see the sketch after this list). Then the model can be used in optimising, i.e., finding optimal decision combinations.
  5. When you have developed your model this far, you can use value of information analysis (VOI analysis) to find the critical parts of your model. The difference from a sensitivity analysis is that a VOI analysis tests which parts would change your recommendation, not which parts would change your estimate of the outcome. Often the two analyses point in the same direction, but a VOI analysis is more about what you care about, while a sensitivity analysis can be performed even if no explicit decision has yet been clarified.
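For step 4, a variable can be described probabilistically with an Iteration column, as recommended under Preferred R code above. A minimal sketch with invented distribution parameters:

library(OpasnetUtils)

n <- 1000
conc <- Ovariable("conc", data = data.frame(
    Iteration = 1:n,                              # Monte Carlo iteration number from 1 to n
    Result = rlnorm(n, meanlog = 0, sdlog = 0.5)  # uncertain concentration
))
conc <- EvalOutput(conc)
summary(conc)  # mean and other statistics across iterations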

Memory usage in R

  • Hadley Wickham's introduction to memory use in R [2] and guidance on how to use it effectively.

Spatial data in R

Bayes in Opasnet

In R-tools, the Bayesian inference package rjags with the JAGS engine is available. You can create and run hierarchical Bayesian models with it.
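A minimal rjags sketch; the model and data are invented for illustration and simply estimate the mean and precision of normally distributed observations:

library(rjags)

model_string <- "
model {
  for (i in 1:N) {
    x[i] ~ dnorm(mu, tau)
  }
  mu ~ dnorm(0, 1.0E-4)    # vague prior for the mean
  tau ~ dgamma(0.01, 0.01) # vague prior for the precision
}
"

dat <- list(x = c(4.1, 5.2, 3.8, 4.9, 5.5), N = 5)
jm <- jags.model(textConnection(model_string), data = dat, n.chains = 2)
samp <- coda.samples(jm, variable.names = c("mu", "tau"), n.iter = 2000)
summary(samp)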

Bayesian modelling

rjags package

R and Hugin (a BBN software)

Other

Examples

Life tables in R

  • Building a life table [9]
  • Calculating life years lost [10]

Regression

Lognormal and normal distributions

Rationale

A draft based on own thinking. Not even the topics are clear yet.

How many high-school students are needed to replace one expert? Calculate the effective study time of an expert and the share of it that is needed to solve the problem at hand.

A guess: 10. Experts look down on superficial knowledge; theirs is deep. What is the difference? Connections. If two things are possible but not at the same time, an expert recognises this but a layperson does not. High-school students can be turned into experts by teaching them a method for describing connections. After that, all the knowledge no longer needs to be in one person's head.

It is hard for people to grasp that numerous problems can be solved at once with the same method. By contrast, large crowds can be motivated to solve a single problem if the topic is important to them. Should we therefore find that one important issue? The others would then start to get solved by accident.

It is also difficult to see meta-level questions, i.e. the system or oneself as part of a larger structure, within which the possible worlds also lie and from inside which the solutions are to be found.

The noblest kind of imagination is to imagine good things that could exist but do not, and the path between their non-existence and their existence.

Scientific science is like the American dream: the methods of science produce enough breakthroughs that every generation gets its own success stories and idols, but in practice the scientific method is too far removed from a researcher's everyday work to really influence it. So researchers live on an illusion, just like Americans do, and toil without the means to reach their real goals, which are greater and more ambitious than the system of science allows. Researchers' time and resources go into thinking about two things: where do I get funding, and how do I get my ideas published. There is always too little time for developing the ideas themselves. So not only the goals but also the abilities are greater than what the system bends to. Those who fare best are the CEO types, who know how to organise fundraising, publishing and institutions around their own interests.

A meta-analysis produces an estimate (with confidence intervals) for the average exposure-response. However, we are interested in the distribution of the individual exposure-response functions. If we know the distribution of the individual distributions, we can always produce the distribution for the average. In addition, if we assume that our sample of individuals is not a random sample from the whole population of interest, we can try to assess the subgroup's bias relative to a random sample, and in this way produce sub-population-specific exposure-response distributions. In other words, we can mimic a real study using bootstrapping (biased bootstrapping, if we have an estimate of how the study is biased) from our individual-level whole-population distribution; in this way, we can test whether it is plausible that the study data actually came from such a whole-population distribution in the way we think. This is a difficult task, but if it works, it offers a method to systematically include any studies in the same examination of individual-level (not average) exposure-response functions.
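A hedged sketch of the bootstrapping idea in base R; all distributions and numbers are invented for illustration:

pop_slopes <- rlnorm(1e5, meanlog = log(0.02), sdlog = 0.6)     # assumed individual-level slopes

study_n <- 200
sim_means <- replicate(1000, mean(sample(pop_slopes, study_n))) # simulated study averages

reported <- 0.025  # hypothetical study estimate of the average slope
# Two-sided check: how often does a simulated study deviate from the simulated
# mean at least as much as the reported estimate does?
mean(abs(sim_means - mean(sim_means)) >= abs(reported - mean(sim_means)))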

See also

Pages related to Opasnet Base

Opasnet Base · Uploading to Opasnet Base · Data structures in Opasnet · Opasnet Base UI · Modelling in Opasnet · Special:Opasnet Base Import · Opasnet Base Connection for R (needs updating) · Converting KOPRA data into Opasnet Base · Poll · Working with sensitive data · Saved R objects

Pages related to the 2008-2011 version of Opasnet Base

Opasnet base connection for Analytica · Opasnet base structure · Related Analytica file (old version File:Transferring to result database.ANA) · Analytica Web Player · Removed pages and other links · Standard run · OpasnetBaseUtils

Keywords

Modelling, Opasnet Base, scenario
