# Computer-based work (mostly statistics and programming)

**Guestimating GHG break-even point for biomass gasification**

Wood gas generation, using wood, manure, compost and the like, produces CH4 under anaerobic conditions. Since renewable resources are used, biomass gas is considered sustainable. However, CH4 is a potent greenhouse gas, with a global warming potential 84 times that of CO2 over 20 years (https://en.wikipedia.org/wiki/Greenhouse_gas), and all biomass gasifiers leak. In industrial settings, leakage is under 5% (https://www.umweltbundesamt.de/themen/biogasanlagen-muessen-sicherer-emissionsaermer). In developing countries, particularly with self-made biomass generators and manual methane transport (https://www.deutschlandfunk.de/mini-biogasanlagen-fuer-afrika-wirtschaftsfoerderung-statt.1773.de.html?dram:article_id=459738), leakage can easily be expected to reach 20-30%. The aim of this project is to compute the break-even point of biomass gasification, given the difference in global warming potential between CO2 and CH4. How much leakage is acceptable before gasification causes more problems than it solves? The key points are to (a) establish a transparent derivation of the balance; and (b) consider different time horizons of GHG activity of the two gases.
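
To give a flavour of such a balance, here is a deliberately crude sketch. It assumes that the CO2 from burning biogas is climate-neutral, that each kg of burned biogas CH4 replaces 1 kg of fossil CH4, and that nothing else matters. The 100-year warming potential of 28 is an assumed value (IPCC AR5), and the function name is made up for illustration. A real derivation would need to revisit every one of these assumptions.

```python
# Burning 1 kg of CH4 (16 g/mol) yields 44/16 = 2.75 kg of CO2 (44 g/mol).
CO2_PER_KG_CH4_BURNED = 44.0 / 16.0


def break_even_leakage(gwp_ch4: float) -> float:
    """Leakage fraction f at which leaked CH4 cancels the avoided fossil CO2.

    Assumed balance per kg of CH4 produced (hypothetical, simplified):
      warming from leakage = f * gwp_ch4      (kg CO2-equivalents)
      avoided fossil CO2   = (1 - f) * 2.75   (kg CO2)
    Setting both equal gives f = 2.75 / (gwp_ch4 + 2.75).
    """
    return CO2_PER_KG_CH4_BURNED / (gwp_ch4 + CO2_PER_KG_CH4_BURNED)


# 84 over 20 years is the value quoted above; 28 over 100 years is an assumption.
for horizon, gwp in {"20 years": 84.0, "100 years": 28.0}.items():
    print(f"{horizon}: break-even leakage = {break_even_leakage(gwp):.1%}")
```

Under these assumptions the break-even leakage is roughly 3% for the 20-year horizon and roughly 9% for the 100-year horizon, which shows how strongly the answer depends on the time horizon chosen.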

**Suitable as:** MSc project, in collaboration with Prof. Stefan Pauliuk (Industrial Ecology).

**Contact:** **Carsten Dormann**, carsten.dormann@biom.uni-freiburg.de

**Fitting the elephant Integral Projection Model to observed data from Amboseli, Kenya**

African elephant populations have been studied extensively. Local censuses date back to the 19th century and yet, historical estimates of the continental elephant population are scarce and uncertain. This project aims to estimate the population dynamics and spatial distribution of the African elephant from 1900 until today. Population size estimates can be derived from census reports and other published material. This project will be part of a larger demographic analysis of the continental African elephant population, which provides the opportunity to work alongside field and theoretical ecologists.

For his PhD, Severin Hauenstein has developed a population model, akin to, but more advanced than, a structured matrix population model. So far, this model is parameterised from literature data, yielding nice predictions for a population in Kenya.

The next step, taken here, is to actually parameterise the model with the data of Amboseli, i.e. to fit the model to data. This requires a Bayesian model calibration approach, which is an intellectual hurdle, but also really cool. Apart from overall population size, the actual number of elephants in each age or size group will yield valuable information for the model parameters. The data are partially from elephant researchers in Africa, partially from the literature. Data and model code are available, and tutorials for using Bayesian calibration are provided e.g. by R's BayesianTools package.
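
For readers new to Bayesian calibration, the principle can be sketched on a toy problem. The example below is not the elephant IPM: it assumes a simple exponential-growth model, simulated census data with log-normal observation error, a flat prior, and a hand-written random-walk Metropolis sampler. All names and values are hypothetical.

```python
import math
import random

random.seed(1)


# Toy stand-in for the population model (assumption): exponential growth at rate r.
def predict_log_n(r, years, log_n0=math.log(500.0)):
    return [log_n0 + r * t for t in years]


# Simulated "census" data with log-normal observation error (hypothetical values).
years = list(range(20))
true_r, obs_sd = 0.03, 0.1
log_counts = [mu + random.gauss(0.0, obs_sd) for mu in predict_log_n(true_r, years)]


def log_posterior(r):
    """Gaussian log-likelihood on the log scale plus a flat prior on (-0.2, 0.2)."""
    if not -0.2 < r < 0.2:
        return -math.inf
    residuals = [obs - mu for obs, mu in zip(log_counts, predict_log_n(r, years))]
    return -0.5 * sum(res ** 2 for res in residuals) / obs_sd ** 2


def metropolis(n_iter=5000, start=0.0, step=0.005):
    """Random-walk Metropolis sampler for the single parameter r."""
    chain, current, current_lp = [], start, log_posterior(start)
    for _ in range(n_iter):
        proposal = current + random.gauss(0.0, step)
        proposal_lp = log_posterior(proposal)
        # Accept with probability min(1, posterior ratio).
        if math.log(1.0 - random.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        chain.append(current)
    return chain


chain = metropolis()
posterior = chain[1000:]  # discard burn-in
posterior_mean = sum(posterior) / len(posterior)
print(f"posterior mean of r: {posterior_mean:.4f} (simulated with r = {true_r})")
```

In the actual project, the toy model would be replaced by the IPM, the single parameter by a vector of demographic parameters, and the hand-written sampler by an established tool.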

**Suitable as:** MSc project, requiring an interest in statistics and optimization.

**Requirements:** Willingness to engage in computer-intensive, statistical work.

**Time:** The project can start anytime.

**Contact:** **Carsten Dormann**, carsten.dormann@biom.uni-freiburg.de

**Literature:** White, J. W., Nickols, K. J., Malone, D., Carr, M. H., Starr, R. M., Cordoleani, F., Baskett, M. L., Hastings, A., & Botsford, L. W. (2016). Fitting state-space integral projection models to size-structured time series data to estimate unknown parameters. *Ecological Applications*, *26*(8), 2677–2694. https://doi.org/10.1002/eap.1398

**An Individual-Based Model of African elephant demography**

In his PhD thesis, Severin Hauenstein developed a population model for the African elephant. It describes survival and fecundity as a function of elephant size, rather than age (hence called an “integral projection model”, IPM). It thereby allows accommodating environmentally-driven variations in growth rates, e.g. slower growth during droughts. Also, the carrying capacity, and hence the density-dependence of demographic rates, is integrated in this model. So far, this is a so-called “mean-field approach”, in which no consideration is paid to variability between individuals: given their size, all individuals have the same model parameters.

An alternative approach to modelling population dynamics, “individual-based models” (a.k.a. agent-based models), allows representing variability among individuals. This is relevant only when the feature of interest, e.g. population size or population growth rate, is a non-linear function of the model parameters. That is the case in this demographic model. What is unclear is how much the representation of individual variability will affect model predictions.

The idea of this project is hence to re-implement the demographic model as an IBM, and compare the simulations with those of the original. One advantage of the IBM is that it is relatively easy to add further details and features. One disadvantage is that an IBM is much slower and hence more time-consuming to run repeatedly (this disadvantage should not be relevant for such a simple model).
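
Why individual variability can matter is easiest to see in a toy case. The sketch below is not the elephant model: it assumes a cohort followed for ten years, a mean annual survival of 0.9, and individual survival probabilities drawn from a Beta(9, 1) distribution with that same mean (an arbitrary choice). Survival to year ten is a non-linear function of annual survival, so the two approaches disagree.

```python
import random

random.seed(42)

n_individuals, n_years = 10_000, 10
mean_survival = 0.9

# Mean-field: every individual shares the same annual survival probability.
mean_field_fraction = mean_survival ** n_years

# IBM: each individual gets its own annual survival probability, drawn from a
# Beta(9, 1) distribution with the same mean of 0.9 (assumed distribution).
survivors = 0
for _ in range(n_individuals):
    survival = random.betavariate(9.0, 1.0)
    if all(random.random() < survival for _ in range(n_years)):
        survivors += 1
ibm_fraction = survivors / n_individuals

print(f"mean-field: {mean_field_fraction:.3f}, individual-based: {ibm_fraction:.3f}")
```

The mean-field fraction is 0.9^10, about 0.35, while the expected individual-based fraction is 9/19, about 0.47: robust individuals are over-represented among the survivors. How large such effects are in the full demographic model is exactly the open question.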

**Suitable as:** MSc project, requiring an interest in programming, preferably in Python (or Julia) or NetLogo or C/C++ or, if need be, in R.

**Contact:** **Carsten Dormann**, carsten.dormann@biom.uni-freiburg.de

**Literature:** Boult, V. L., Quaife, T., Fishlock, V., Moss, C. J., Lee, P. C., & Sibly, R. M. (2018). Individual-based modelling of elephant population dynamics using remote sensing to estimate food availability. *Ecological Modelling*, *387*, 187–195. https://doi.org/10.1016/j.ecolmodel.2018.09.010

**How does overdispersion of count data (non-independent events) affect quantitative network analysis?**

Network analysis is a popular tool for understanding the complexity of ecosystems with respect to species interactions, for example those between plants and their pollinators. Quantitative networks are supposed to be more meaningful for ecosystem functions and more robust to sampling effects. However, many methods for quantitative networks assume that network data (interaction frequency) are based on independent events. Just like in regular Poisson regression, this assumption may often be violated: multiple visits by the same individual, social behavior or spatiotemporal heterogeneity may lead to non-independence of interaction events, potentially strongly influencing network patterns and compromising inference. Cases where such effects are particularly severe are counts from pollen or fecal analysis, which are thus often not analysed in a fully quantitative way. This project has the potential to challenge conclusions of hundreds of published research papers.
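
How repeated visits by the same individual create overdispersion can be sketched with a small simulation. The set-up is an assumption made for illustration: the number of individuals arriving is Poisson, and each individual makes one visit plus a Poisson number of extra visits. The function names are hypothetical, and the Poisson sampler uses Knuth's method to keep the example free of dependencies.

```python
import math
import random
import statistics

random.seed(7)


def poisson(lam):
    """Draw a Poisson variate with Knuth's multiplication method (fine for small lam)."""
    threshold, k, product = math.exp(-lam), 0, random.random()
    while product > threshold:
        k += 1
        product *= random.random()
    return k


def visits(mean_individuals, extra_visits_per_individual):
    """Visit count when each arriving individual makes 1 + Poisson(extra) visits."""
    individuals = poisson(mean_individuals)
    return sum(1 + poisson(extra_visits_per_individual) for _ in range(individuals))


def dispersion_index(counts):
    """Variance-to-mean ratio; about 1 for independent (Poisson) events."""
    return statistics.variance(counts) / statistics.mean(counts)


n_samples = 5000
independent = [visits(8.0, 0.0) for _ in range(n_samples)]  # one visit per individual
clustered = [visits(2.0, 3.0) for _ in range(n_samples)]    # same mean of 8 visits

index_independent = dispersion_index(independent)
index_clustered = dispersion_index(clustered)
print(f"independent: {index_independent:.2f}, clustered: {index_clustered:.2f}")
```

Both scenarios have the same expected number of visits, but the clustered one has a variance several times its mean. Network indices and null models that treat each visit as an independent event would be fed such data unchanged.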

**Methods:** This thesis will explore the influence of this effect on the estimation of specialization and on the significance of patterns inferred from null models. It will combine:

- data simulation using statistical models or (optionally) simple process-based models

- analysis of existing datasets (for which e.g. number of individuals interacting can be compared to the number of visits)

- exploration of solutions to the problem (e.g. log-transformation, using prevalence instead of full quantities, hierarchical models, or newly developed methods that explicitly account for overdispersion)

**Suitable as:** BSc or MSc thesis project

**Requirements:** strong dedication to work with R, basic programming and statistics skills using R

**Time:** can start anytime.

**Contact: Dr. Jochen Fründ**, jochen.fruend@biom.uni-freiburg.de, 0761/203-3747

**Automatising statistical analyses**

Why does every data set require the analyst to start over with all the things she has learned during her studies? Surely much of this can be automatised!

Apart from attempts to make human-readable output from statistical analyses, efforts to automatise even simple analyses have not made it onto the market. But some parts of a statistical analysis can surely be automatised, in a supportive way. For example, after fitting a model, model diagnostics should be relatively straightforward to carry out and report automatically. Or a comparison of the fitted model with some hyperflexible algorithm, to see whether the model could be improved in principle. Or automatic proposals for the type of distribution to use, for dealing with correlated predictors, or for plotting main effects.

Here is your chance to have a go! In addition to the fun of inventing and implementing algorithms to automatically do something, you will realise why some things are not yet automatised.

This project has many potential dimensions. It could focus on traditional model diagnostics, or on automatised plotting, or on comparisons of GLMs with machine learning approaches to improve model structure, or ...

If you prefer, you can look at this project differently, in the context of "analyst degrees of freedom". The idea is that in any statistical analysis the analyst faces many decisions. Some are influential, others less so. As a consequence, the final p-value of a hypothesis test may be as reported, or may be distorted by the choices made. Implementing an "automatic statistician" as an interactive pipeline allows us to go through all combinations of decisions, in a factorial design, and evaluate which steps have large (bad) and which have small (good) effects on the correctness (nominal coverage) of the final p-value.

**Suitable as:** BSc/MSc project

**Requirements:** Willingness to engage in R programming and abstract thinking. Frustration tolerance to error messages.

**Time:** The project can start anytime.

**Contact:** **Carsten Dormann**, carsten.dormann@biom.uni-freiburg.de

**Extinction scenarios for interaction networks**

Part of the interest in ecological interaction networks is due to the idea that network structure actually matters for the functioning of the system, or for its robustness to change. Simulating extinctions from a network is thus a frequently used way to assess extinction consequences, even though the assumptions behind such simulations are not ecologically realistic. For example, one would expect a pollinator to change its preferences when the preferred flower species is absent, rather than simply go extinct.

Extinction sequences can thus be based on different assumptions: no adaptation, replacement by the most similar flower type, etc. Simulating such different extinction scenarios and quantifying the network robustness will show how different they actually are, and whether it is important to consider shifts in interactions.
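
A minimal sketch of such a comparison follows. The four-pollinator network, the function name and the rewiring rule (a pollinator that has lost its last plant switches to a randomly chosen surviving plant with a given probability) are assumptions for illustration. Robustness is measured here as the mean fraction of surviving pollinators along the removal sequence.

```python
import random


def robustness(network, removal_order, rewire_prob=0.0, seed=0):
    """Mean fraction of pollinators surviving along a plant-removal sequence.

    network: dict mapping each pollinator to the set of plants it visits.
    rewire_prob: chance that a pollinator losing its last plant switches to a
    randomly chosen surviving plant (0.0 is the no-adaptation scenario).
    """
    rng = random.Random(seed)
    links = {pol: set(plants) for pol, plants in network.items()}
    n_pollinators = len(links)
    remaining = list(removal_order)
    surviving_fractions = []
    for plant in removal_order:
        remaining.remove(plant)
        for pol in list(links):
            links[pol].discard(plant)
            if not links[pol]:
                if remaining and rng.random() < rewire_prob:
                    links[pol].add(rng.choice(remaining))  # rewiring
                else:
                    del links[pol]  # secondary extinction
        surviving_fractions.append(len(links) / n_pollinators)
    return sum(surviving_fractions) / len(surviving_fractions)


# Hypothetical toy network.
network = {"bee": {"A", "B"}, "fly": {"A"}, "moth": {"C"}, "beetle": {"B", "C"}}
order = ["A", "B", "C"]

no_adaptation = robustness(network, order, rewire_prob=0.0)
full_rewiring = robustness(network, order, rewire_prob=1.0)
print(f"no adaptation: {no_adaptation:.2f}, full rewiring: {full_rewiring:.2f}")
```

Trait-based rules, such as replacement by the most similar flower type, would slot in where the random choice is made.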

So far, only one study has looked at such shifts in preferences, a.k.a. “rewiring” (Vizentin-Bugoni et al. 2019). It shows that there is indeed an effect, but it fails to identify the causes: is it due to abundance distributions, or to the way traits are distributed across species? Why does combining different traits have no effect? Such questions can only be addressed by simulating interaction networks and running them through different types of rewiring scenarios. That is exactly the idea for this project.

**Suitable as:** MSc project, requiring comfortable use of R and willingness to think abstractly about networks, without any obvious non-academic application.

**Literature:** Vizentin‐Bugoni, J., Debastiani, V. J., Bastazini, V. A. G., Maruyama, P. K., & Sperry, J. H. (2019). Including rewiring in the estimation of the robustness of mutualistic networks. *Methods in Ecology and Evolution*, *in press*. https://doi.org/10.1111/2041-210X.13306

**Contact:** **Carsten Dormann**, carsten.dormann@biom.uni-freiburg.de

**Identify underlying processes in multi-state environmental data using exploratory statistics and deep learning**

In an environmental system, system states are causally linked in complex ways. For example, soil moisture affects sap flow and photosynthesis, but more rain does not mean more sap flow. Such non-linear interrelationships can be represented, in principle, by deep neural networks. Since the monitored data comprise drivers (radiation, rainfall) as well as responses (sap flow, soil moisture), and since the relevant processes act at potentially very different time scales (minutes to weeks), it is unclear (a) what potential deep learning offers, and (b) how to efficiently construct such networks for maximal information gain. The aim would then be to inspect the represented relationships in order to improve our understanding of the system.

In a first step, data will be simulated using an ecosystem model (e.g. Landscape-DNDC or similar), so as to be sure that the linkages between processes and scales are known.

Two approaches seem to be interesting starting points: autoencoders (AE) and reservoir computing. An AE is akin to a non-linear PCA and tries to reduce the dimensionality of the data by finding a simple, if non-linear, representation. It consists of an encoder and a decoder step, where the former leads to a latent description, while the latter links this back to the data. Copula?

Reservoir computing (e.g. echo state networks), in contrast, targets dynamic systems and works by representing the input (including lagged versions of the input) in a fixed but large set of possible interactions (the reservoir). Being “fixed” means here that weights are assigned randomly. Only the output (or rather the “readout” layer) is then linked to the response variable through linear regression.
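
The mechanics of an echo state network fit into a few lines of NumPy. The sketch below uses a made-up driver and response, an arbitrary reservoir size and spectral radius, and a ridge-regression readout, so it only illustrates the principle and says nothing about how well the approach works on real monitoring data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy driver and response (assumption): the response depends non-linearly
# on the driver five time steps earlier.
n_steps, lag = 1000, 5
driver = np.sin(0.1 * np.arange(n_steps))
response = np.roll(driver, lag) ** 2

# Fixed random reservoir: these weights are drawn once and never trained.
n_units = 100
w_in = rng.uniform(-1.0, 1.0, n_units)
bias = rng.uniform(-0.5, 0.5, n_units)
w = rng.uniform(-0.5, 0.5, (n_units, n_units))
w *= 0.9 / np.max(np.abs(np.linalg.eigvals(w)))  # rescale to spectral radius 0.9

# Drive the reservoir with the input and record its states.
states = np.zeros((n_steps, n_units))
state = np.zeros(n_units)
for t in range(n_steps):
    state = np.tanh(w_in * driver[t] + w @ state + bias)
    states[t] = state

# Train only the linear readout (ridge regression with an intercept column).
washout, n_train, ridge = 100, 700, 1e-6
features = np.hstack([states, np.ones((n_steps, 1))])
x_train, y_train = features[washout:n_train], response[washout:n_train]
readout = np.linalg.solve(
    x_train.T @ x_train + ridge * np.eye(n_units + 1), x_train.T @ y_train
)

# Evaluate on the time steps held back from training.
prediction = features[n_train:] @ readout
rmse = float(np.sqrt(np.mean((prediction - response[n_train:]) ** 2)))
print(f"test RMSE: {rmse:.4f}")
```

The bias term matters in this toy case: without it, the tanh reservoir is an odd function of the input and a linear readout cannot represent the squared response.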

Regrettably, it is unclear which approach is particularly suitable for the problem at hand, and to what extent the combined fitting of several system states actually confers an advantage over separate state-wise modelling (i.e. building a model for each Y separately using some ML algorithm).

Data are provided by the CAOS project from hydrology: multiple years of 12 system states at 40 sites, recorded at hourly intervals. (Also WSL data for only 1 year, or anything from EcoSense coming up.)

**Contact:** **Carsten Dormann**, carsten.dormann@biom.uni-freiburg.de

**Process-integration into neural networks: using 3-PGN for PROFOUND**

Neural networks are all the rage. They require representative data, however, i.e. data that describe the underlying processes well. For many environmental systems, we have a rather good process understanding, particularly in forest growth, forest C-fluxes, but also in hydrology. In such cases, it would be silly to ignore this knowledge when fitting a flashy neural network to observed data.

This project shall implement and compare different ways to integrate a process model into neural networks. The basic approach has been implemented and tested for C-fluxes in a boreal forest and a simple ecophysiological model. Now, the next step is to use a somewhat more flexible forest growth model, which in principle also represents mixed stands, N-dynamics and management (3-PGN).

**Suitable as:** MSc project, requires interest in “deep learning” and Python. Python code and data are available for the previous process model.

**Contact:** **Carsten Dormann**, carsten.dormann@biom.uni-freiburg.de

**Literature:** Willard, J., Jia, X., Xu, S., Steinbach, M., & Kumar, V. (2021). Integrating scientific knowledge with machine learning for engineering and environmental systems. *ArXiv*, *2003.04919 [physics, stat]*. http://arxiv.org/abs/2003.04919

**State-space model for tree-ring growth**

Analysis of tree-ring width is a very standardised statistical approach, but it is neither intuitive, nor would it be what I would do based on how we teach GLMs and mixed-effect models.

Actually, this kind of data is surprisingly messy: it features temporal autocorrelation, non-linear growth that depends on both age and the previous year’s growth, and environmental/stand conditions around the trees.

The approach would thus be to (1) analyse some data in the way “everybody” does, and (2) compare that to an incrementally more complicated analysis, more in line with non-linear state-space models. Ideally, and dependent on the skills and progress, data should be simulated with a specific growth model in mind, and then both approaches should be compared as to whether they recover the parameters used.

Data will be available from international data bases, but also from the Forest Growth & Dendrochronology lab.

**Contact:** **Carsten Dormann**, carsten.dormann@biom.uni-freiburg.de

**Literature:**

Bowman, D. M. J. S., Brienen, R. J. W., Gloor, E., Phillips, O. L., & Prior, L. D. (2013). Detecting trends in tree growth: Not so simple. *Trends in Plant Science*, *18*(1), 11–17. https://doi.org/10.1016/j.tplants.2012.08.005

Lundqvist, S.-O., Seifert, S., Grahn, T., Olsson, L., García-Gil, M. R., Karlsson, B., & Seifert, T. (2018). Age and weather effects on between and within ring variations of number, width and coarseness of tracheids and radial growth of young Norway spruce. *European Journal of Forest Research*, *137*(5), 719–743. https://doi.org/10.1007/s10342-018-1136-x

Schofield, M. R., Barker, R. J., Gelman, A., Cook, E. R., & Briffa, K. R. (2016). A model-based approach to climate reconstruction using tree-ring data. *Journal of the American Statistical Association*, *111*(513), 93–106. https://doi.org/10.1080/01621459.2015.1110524

Zhao, S., Pederson, N., D’Orangeville, L., HilleRisLambers, J., Boose, E., Penone, C., Bauer, B., Jiang, Y., & Manzanedo, R. D. (2019). The International Tree-Ring Data Bank (ITRDB) revisited: Data availability and global ecological representativity. *Journal of Biogeography*, *46*(2), 355–368. https://doi.org/10.1111/jbi.13488

**Confidence intervals for subsampled regression models**

Sometimes a data set is so big that you can't process it completely with one method (e.g. a spatial model that can't handle over 5000 data points). If we then take a subsample, about 5000 points of the maybe 1 million data points in total, then we totally overestimate the error bar of the regression. Is there a way to correct this? Yes, there is. There is a nice statistical theory about this in the book "Subsampling" from 1999, but you have to estimate a parameter for it from the data (by trying subsamples of different sizes). According to my survey, no one has ever done this, at least not in ecology, but I think it is very practical, precisely because there are often models that we can only run on a subsample of the data.
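
The gist can be tried out on simulated data. The sketch below assumes independent observations and an ordinary least-squares slope, for which the spread of the estimate shrinks with subsample size b as b^(-0.5). It estimates that exponent from subsamples of several sizes and then extrapolates the spread to the full sample. All sizes and names are arbitrary choices, and a spatial GLS may well behave differently.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated "big" data set (assumption): y = 2x + noise, independent errors.
n = 200_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)


def slope(idx):
    """Ordinary least-squares slope of y on x for the rows in idx."""
    xs, ys = x[idx], y[idx]
    return np.cov(xs, ys)[0, 1] / np.var(xs, ddof=1)


# Spread of the slope estimate across repeated subsamples of several sizes.
sizes = np.array([500, 1000, 2000, 5000])
spread = np.array([
    np.std([slope(rng.choice(n, size=b, replace=False)) for _ in range(200)], ddof=1)
    for b in sizes
])

# Estimate the rate: spread ~ b**(-beta), so log(spread) is linear in log(b).
beta = -np.polyfit(np.log(sizes), np.log(spread), 1)[0]

# Extrapolate the spread seen at b = 5000 to the full sample size n.
se_full_extrapolated = spread[-1] * (sizes[-1] / n) ** beta
se_full_theory = 1.0 / np.sqrt(n)  # known here only because the data are simulated

print(f"estimated rate exponent: {beta:.2f}")
print(f"extrapolated SE: {se_full_extrapolated:.5f}, theory: {se_full_theory:.5f}")
```

For real data the theoretical value is unknown, which is why the rate has to be estimated from the subsamples themselves.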

The aim of the work would be to get this approach working on the basis of simulated data and to present it for a large data set from a *Science* paper (here a spatial GLS is to be approximated for >600000 data points). Pretty cool, probably a bit of fiddling, but in my opinion easy to complete, including a manuscript. A first step would be to review the literature and confirm the absence of this idea from it, as the opposite step has been taken for other purposes (“data cloning”).

**Suitable as:** BSc or MSc project, requiring some interest in statistical analysis and R-coding. Allergy to mathematical equations would be problematic, even though the majority of the work would be demonstration-by-simulation, not through maths.

**Contact:** **Carsten Dormann**, carsten.dormann@biom.uni-freiburg.de

**Literature:**

Absent from the ecological literature, but for maths see, e.g., Wang, H., Zhu, R., & Ma, P. (2018). Optimal subsampling for large sample logistic regression. *Journal of the American Statistical Association*, *113*(522), 829–844. https://doi.org/10.1080/01621459.2017.1292914

**The impact of diel vertical migration on ocean carbon flux**

The daily (“diel”) migration of zooplankton (and accompanying fish) from the sea surface to the deep dark during the day is the largest movement of biomass on earth. It is triggered by light, but the evolutionary cause is for zooplankton to avoid being consumed by their visually hunting predators. The consequence of DVM is that phytoplankton can reproduce largely unharmed during the day, thus assimilating more CO2 than if it were constantly grazed upon. Thus, it seems that DVM is actually not only optimizing survival of zooplankton, but also maximizing energy import into the pelagic sea. Or is it?

This theoretical ecology study aims at producing a simple predator-prey model to allow investigating the consequences of (a) switching DVM on/off, and (b) comparing tropical and polar regions with obviously very different day/night lengths. In polar regions, no DVM is observed: is this still maximizing energy import?
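
A possible skeleton for such a model, written as hourly difference equations, is sketched below. The model structure (logistic phytoplankton growth in daylight, mass-action grazing, zooplankton at depth while it is light when DVM is switched on) and every parameter value are placeholders invented for illustration, not calibrated rates.

```python
def simulate_carbon_fixation(dvm, day_length_h, n_days=60, grazing_rate=0.02):
    """Total phytoplankton production over n_days, stepped hourly.

    All parameter values are hypothetical placeholders, not calibrated rates.
    """
    growth_rate, capacity = 0.05, 10.0   # per hour; arbitrary biomass units
    efficiency, mortality = 0.3, 0.005   # zooplankton conversion and loss
    phyto, zoo, fixed = 1.0, 0.5, 0.0
    for hour in range(n_days * 24):
        daylight = (hour % 24) < day_length_h
        # With DVM, zooplankton stay at depth while it is light and graze only at night.
        grazing_on = not (dvm and daylight)
        growth = growth_rate * phyto * (1.0 - phyto / capacity) if daylight else 0.0
        grazing = grazing_rate * phyto * zoo if grazing_on else 0.0
        grazing = min(grazing, phyto + growth)  # cannot eat more than is there
        phyto = phyto + growth - grazing
        zoo = max(zoo + efficiency * grazing - mortality * zoo, 0.0)
        fixed += growth
    return fixed


# Factorial comparison: DVM on/off for a tropical-like and a polar-summer-like day.
for day_length in (12, 22):
    for dvm in (True, False):
        total = simulate_carbon_fixation(dvm, day_length)
        print(f"day length {day_length} h, DVM {'on' if dvm else 'off'}: {total:.2f}")
```

Which scenario fixes more carbon depends entirely on the placeholder values, so the printed numbers carry no ecological meaning. The point of the project would be to replace them with defensible rates and see whether an effect of DVM survives.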

Models on DVM exist in the literature, but they are largely integrated into complex biogeochemical models of the ocean. This is not the aim of this “strategic” model, which should be parameterized for some processes (e.g. photosynthetic rate, foraging efficiency, migration rate), but aims to identify whether there is a detectable effect of DVM on C-fluxes.

**Suitable as:** BSc or MSc project, requiring interest in programming either differential or difference equations in R or Python.

**Contact:** **Carsten Dormann**, carsten.dormann@biom.uni-freiburg.de

**Literature:** Stock, C., & Dunne, J. (2010). Controls on the ratio of mesozooplankton production to primary production in marine ecosystems. Deep Sea Research Part I: Oceanographic Research Papers, 57(1), 95–112. https://doi.org/10.1016/j.dsr.2009.10.006