MICOM workflows

The MICOM workflow API provides prebuilt solutions for the most common MICOM analyses across several samples. This will manage most of the workload for you and uses an efficient parallelization scheme to make use of several CPU cores. The workflow API is probably the best entry point for you if the following is true:

  1. You have at least 2 samples with taxonomy assignments and abundances for each

  2. You have chosen to use one of the preused model databases or have already built your own

  3. You want to run a set of standard analyses and visualization on the models

In that case the prebuilt workflows will make your analysis much simpler and faster and will take care of parallelizing your analyses. The worflow API can be mixed with the MICOM API at any point. So you could do a few steps with the workflows and then run you own analyses downstream from that. Additionally, you can directly import you start data from Qiime 2. See the Loading Qiime 2 data.

Building and models and simulating growth

Input formats

To start building community models for all your samples you will need to provide your data to MICOM. MICOM prefers to have the taxonomy and abundances for all samples in a single tidy DataFrame. Here each taxon in each sample is a row which provides its taxonomy and abundance. This may sound a bit confusing but should become pretty clear when looking at an example. MICOM can generate a simple example DataFrame which we can use as guidance.

[1]:
from micom.data import test_data

data = test_data()
data
[1]:
id genus species reactions metabolites file sample_id abundance
0 Escherichia_coli_1 Escherichia Escherichia coli 0 95 72 /home/cdiener/code/micom/micom/data/e_coli_cor... sample_1 882
1 Escherichia_coli_2 Escherichia Escherichia coli 1 95 72 /home/cdiener/code/micom/micom/data/e_coli_cor... sample_1 718
2 Escherichia_coli_3 Escherichia Escherichia coli 2 95 72 /home/cdiener/code/micom/micom/data/e_coli_cor... sample_1 817
3 Escherichia_coli_4 Escherichia Escherichia coli 3 95 72 /home/cdiener/code/micom/micom/data/e_coli_cor... sample_1 850
0 Escherichia_coli_1 Escherichia Escherichia coli 0 95 72 /home/cdiener/code/micom/micom/data/e_coli_cor... sample_2 423
1 Escherichia_coli_2 Escherichia Escherichia coli 1 95 72 /home/cdiener/code/micom/micom/data/e_coli_cor... sample_2 765
2 Escherichia_coli_3 Escherichia Escherichia coli 2 95 72 /home/cdiener/code/micom/micom/data/e_coli_cor... sample_2 694
3 Escherichia_coli_4 Escherichia Escherichia coli 3 95 72 /home/cdiener/code/micom/micom/data/e_coli_cor... sample_2 129
0 Escherichia_coli_1 Escherichia Escherichia coli 0 95 72 /home/cdiener/code/micom/micom/data/e_coli_cor... sample_3 823
1 Escherichia_coli_2 Escherichia Escherichia coli 1 95 72 /home/cdiener/code/micom/micom/data/e_coli_cor... sample_3 110
2 Escherichia_coli_3 Escherichia Escherichia coli 2 95 72 /home/cdiener/code/micom/micom/data/e_coli_cor... sample_3 260
3 Escherichia_coli_4 Escherichia Escherichia coli 3 95 72 /home/cdiener/code/micom/micom/data/e_coli_cor... sample_3 807
0 Escherichia_coli_1 Escherichia Escherichia coli 0 95 72 /home/cdiener/code/micom/micom/data/e_coli_cor... sample_4 436
1 Escherichia_coli_2 Escherichia Escherichia coli 1 95 72 /home/cdiener/code/micom/micom/data/e_coli_cor... sample_4 62
2 Escherichia_coli_3 Escherichia Escherichia coli 2 95 72 /home/cdiener/code/micom/micom/data/e_coli_cor... sample_4 631
3 Escherichia_coli_4 Escherichia Escherichia coli 3 95 72 /home/cdiener/code/micom/micom/data/e_coli_cor... sample_4 479

This is very simple example where each sample contains 4 different E. coli species in random abundances. Thus, every sample has 4 rows in this DataFrame. The DataFrame also contains additional columns, the only required columns are “id”, “sample_id”, “abundance” and one column that provides the summary rank, here “species”.

Note that we also have an additional column “genus” here. The minimal taxonomic information you have to provide is only the name of the taxonomy rank matching the database you are using. So if you are using a genus-level database you will need a column “genus”. In this case we mill use a species-level database so we had to provide a column “species”. If there any additional columns from the set {"kingdom", "phylum", "class", "order", "family", "genus", "species"} those will be used to make the mapping with the database more stringent. For instance, here we provided a column “genus” which means models will only be counted as a “match” if the taxon has the same genus and species in the data and the model database.

Thus, the more taxonomic rank columns you include in the data you pass to MICOM, the more stringent MICOM will become matching to the reference database. This can be used to circumvent poorly matching ranks as well. For instance, if you know your data matches well by genus and phylum names but families are named differently even for the same taxa you can omit the “family” column from your data.

Building community models

To build a community sample for each of your sample you will need the abundance table as provided above and a model database. Usually we recommend to use one of the prebuilt MICOM database from https://doi.org/10.5281/zenodo.3755182. Additionally, you can also `create your own database <>`__.

For our example we have a custom species-level database that is bundled with MICOM. With the abundance table and database you can now start building your models by providing a folder where the assembled community models should be stored.

[2]:
from micom.data import test_db
from micom.workflows import build

manifest = build(data, out_folder="models", model_db=test_db, cutoff=0.0001, threads=2)

This will also allow you to specify a relative abundance cutoff for a taxon to be included with the cutoff argument. The default is to include only taxa that constitute at least 0.01% of the sample. Model building will be automatically parallelized over multiple CPUs and the number of cores/threads to use for should be set with the threads argument. The workflows will also warn you if for any samples less than 50% of the abundance was matched to the database. Since our data was random this may have happened here.

The build workflow will return a model manifest:

[3]:
manifest
[3]:
reactions metabolites file sample_id found_taxa total_taxa found_fraction found_abundance_fraction
0 95 72 sample_1.pickle sample_1 3.0 4.0 0.75 0.730028
1 95 72 sample_2.pickle sample_2 3.0 4.0 0.75 0.789657
2 95 72 sample_3.pickle sample_3 3.0 4.0 0.75 0.588500
3 95 72 sample_4.pickle sample_4 3.0 4.0 0.75 0.728856

This will propagate information from your input table as well as give you metrics on how well the samples were matched to the database. Our database only include models for the first 3 E. coli species so you see the workflow could only match 3/4 taxa for each sample. Probably the most important column is the found_abundance_fraction. This one tells which fraction of the sample abundance was matched to the database. You usually want this column to be above 0.5 so the majority of the sample is matched. A value of 1.0 would be perfect but is usually hard to achieve. Values around 0.8 are usually pretty good.

The file column denotes the filename for the built community within the folder specified as out_folder before. You can use the load_pickle function to read individual models and run custom analyses.

[4]:
from micom import load_pickle

com = load_pickle("models/sample_1.pickle")
print(len(com.reactions))
305

Simulating growth

With our built models we can now advance to simulating growth with MICOMs cooperative tradeoff algorithm. This will use the manifest we just generated but will also require a growth medium to be specified. A growth medium provided information which metabolites are available to the microbes for consumption and also provided an upper bound on the flux it is added to the system. This can be obtained from fluxomics data or approximated from growth media (for cultivation settings) or diet data (for gut microbiota models). Obtaining a correct media composition can be challenging. We will show some helper functions for that later on but for now we will use a pres-specified medium saved in Qiime 2 format. We also provide a growth medium describing an average Western diet for the AGORA model database at https://doi.org/10.5281/zenodo.3755182.

Growth media in MICOM are pretty simple DataFrames and can be read from a variety of formats. Here we will use the Qiime 2 Artifact format.

[5]:
from micom.data import test_medium
from micom.qiime_formats import load_qiime_medium

medium = load_qiime_medium(test_medium)
medium
[5]:
reaction flux metabolite
reaction
EX_glc__D_m EX_glc__D_m 10.000000 glc__D_m
EX_nh4_m EX_nh4_m 4.362240 nh4_m
EX_o2_m EX_o2_m 18.579253 o2_m
EX_pi_m EX_pi_m 2.942960 pi_m

As we can see a medium is simply a DataFrame with columns reaction and flux. Where reaction is the name of external exchange reaction in the model and flux is the upper bound (usually in mmol/gDW/h).

The last thing we need to choose is the tradeoff parameter for the growth simulation. This is explained in detail in the [Methods used by MICOM] section and expresses what fraction of maximum community growth is to be maintained while trying to maximize individual growth rates. The tradeoff takes values between 0 and 1 where zero denotes no community growth and and 1 denotes maximum community growth. We will use a vlue of 0.5 here.

[6]:
from micom.workflows import grow

res = grow(manifest, model_folder="models", medium=medium, tradeoff=0.5, threads=2)

This gives us a results tuple with three entries: growth_rates, exchanges, and annotations providing the growth rates, exchange fluxes and metabolite annotations, respectively. This could be passed on to the visualization workflows or you could run your own analyses on those DataFrames. But for now we will go back and look at some helper workflows to choose a tradeoff parameter and get media.

Choosing a tradeoff parameter

Results depend strongly on the tradeoff parameter. Even though values between 0.3-0.6 usually work well we recommend to run a tradeoff analysis to choose the best parameters for your data set and protocol. If you have already analyzed many samples in your lab and found a particular value to work well in general you may just use that but you should at least run this analysis once. In our paper we found that tradeoff best reproducing in vivo growth rates is the largest tradeoff that allows the majority of the bacteria to grow. Thus, the best tradeoff value is the value providing the best compromise between individual and cooperative growth.

The tradeoff workflow will run growth simulations with several tradeoff values and return the results.

[7]:
from micom.workflows import tradeoff

tradeoff_rates = tradeoff(manifest, model_folder="models", medium=medium, threads=2)
tradeoff_rates.head()

[7]:
abundance growth_rate reactions metabolites taxon tradeoff sample_id
compartments
Escherichia_coli_2 0.301048 0.000000 95 72 Escherichia_coli_2 NaN sample_1
Escherichia_coli_3 0.342558 1.433863 95 72 Escherichia_coli_3 NaN sample_1
Escherichia_coli_4 0.356394 0.866510 95 72 Escherichia_coli_4 NaN sample_1
Escherichia_coli_2 0.301048 0.718937 95 72 Escherichia_coli_2 1.0 sample_1
Escherichia_coli_3 0.342558 0.818066 95 72 Escherichia_coli_3 1.0 sample_1

As we see it returns growth rates for each taxon in each sample for various tradeoff values. There is also a tradeoff value of NaN which means optimization of the pure community growth rate without regularization which usually has very bad performance and is provided as a reference. To choose a good value we can count how many of the taxa can grow (growth rate > 1e-6) on average for each of the tradeoff values across all samples.

[8]:
tradeoff_rates.groupby("tradeoff").apply(
    lambda df: (df.growth_rate > 1e-6).sum()).reset_index()
[8]:
tradeoff 0
0 0.1 12
1 0.2 12
2 0.3 12
3 0.4 12
4 0.5 12
5 0.6 12
6 0.7 12
7 0.8 12
8 0.9 12
9 1.0 12

In that case all taxa (3 taxa for 4 samples each) can grow for all tradeoff values since we provided an excess of nutrients in the medium. So a tradeoff of 1.0 would have been the best here. For real data you will usually see those numbers decline for larger tradeoff values. A more detailed analysis can be performed with the `plot_tradeoff visualization <viz.html>`__.

Fixing growth media

Providing a growth medium may be complicated since you often only have some intuitions about a few components of the medium but lack information on others. Even when supplying putatively complete descriptions you will often observe that the models will predict the absence of growth since you are lacking an essential cofactor. To help with this MICOM provides a workflow that can complete any predefined growth medium with the minimal additional substrates to allow growth for all taxa in the database.

For instance let us assume we know that our E. coli samples consume Glucose and Oxygen. The respective exchange reactions are EX_glc__D_m and EX_o2_m. So we can start by building our candidate medium assuming we can import twice as much oxygen as glucose.

[9]:
import pandas as pd

candidate_medium = pd.DataFrame({"reaction": ["EX_glc__D_m", "EX_o2_m"], "flux": [10, 20]})
candidate_medium
[9]:
reaction flux
0 EX_glc__D_m 10
1 EX_o2_m 20

We can now ask MICOM to complete this medium by adding the smallest amount of overall flux so that all taxa in the database can grow with a growth rate off at least 0.1 1/h.

[10]:
from micom.workflows import complete_community_medium

medium = complete_community_medium(manifest, model_folder="models", medium=candidate_medium,
                    community_growth=0.1, min_growth=0.01,
                    max_import=10, threads=2)
medium

[10]:
reaction metabolite description flux
0 EX_glc__D_m glc__D_m D-Glucose 10.00000
1 EX_gln__L_m gln__L_m L-Glutamine 0.27264
2 EX_o2_m o2_m O2 20.00000
3 EX_pi_m pi_m Phosphate 0.36787

So we see that we can achieve growth by adding import for phosphate and glutamine.