MICOM workflows¶
The MICOM workflow API provides prebuilt solutions for the most common MICOM analyses across several samples. This will manage most of the workload for you and uses an efficient parallelization scheme to make use of several CPU cores. The workflow API is probably the best entry point for you if the following is true:
You have at least 2 samples with taxonomy assignments and abundances for each
You have chosen to use one of the preused model databases or have already built your own
You want to run a set of standard analyses and visualization on the models
In that case the prebuilt workflows will make your analysis much simpler and faster and will take care of parallelizing your analyses. The worflow API can be mixed with the MICOM API at any point. So you could do a few steps with the workflows and then run you own analyses downstream from that. Additionally, you can directly import you start data from Qiime 2. See the Loading Qiime 2 data.
Building and models and simulating growth¶
Input formats¶
To start building community models for all your samples you will need to provide your data to MICOM. MICOM prefers to have the taxonomy and abundances for all samples in a single tidy DataFrame. Here each taxon in each sample is a row which provides its taxonomy and abundance. This may sound a bit confusing but should become pretty clear when looking at an example. MICOM can generate a simple example DataFrame which we can use as guidance.
[1]:
from micom.data import test_data
data = test_data()
data
[1]:
id | genus | species | reactions | metabolites | file | sample_id | abundance | |
---|---|---|---|---|---|---|---|---|
0 | Escherichia_coli_1 | Escherichia | Escherichia coli 0 | 95 | 72 | /home/cdiener/code/micom/micom/data/e_coli_cor... | sample_1 | 882 |
1 | Escherichia_coli_2 | Escherichia | Escherichia coli 1 | 95 | 72 | /home/cdiener/code/micom/micom/data/e_coli_cor... | sample_1 | 718 |
2 | Escherichia_coli_3 | Escherichia | Escherichia coli 2 | 95 | 72 | /home/cdiener/code/micom/micom/data/e_coli_cor... | sample_1 | 817 |
3 | Escherichia_coli_4 | Escherichia | Escherichia coli 3 | 95 | 72 | /home/cdiener/code/micom/micom/data/e_coli_cor... | sample_1 | 850 |
0 | Escherichia_coli_1 | Escherichia | Escherichia coli 0 | 95 | 72 | /home/cdiener/code/micom/micom/data/e_coli_cor... | sample_2 | 423 |
1 | Escherichia_coli_2 | Escherichia | Escherichia coli 1 | 95 | 72 | /home/cdiener/code/micom/micom/data/e_coli_cor... | sample_2 | 765 |
2 | Escherichia_coli_3 | Escherichia | Escherichia coli 2 | 95 | 72 | /home/cdiener/code/micom/micom/data/e_coli_cor... | sample_2 | 694 |
3 | Escherichia_coli_4 | Escherichia | Escherichia coli 3 | 95 | 72 | /home/cdiener/code/micom/micom/data/e_coli_cor... | sample_2 | 129 |
0 | Escherichia_coli_1 | Escherichia | Escherichia coli 0 | 95 | 72 | /home/cdiener/code/micom/micom/data/e_coli_cor... | sample_3 | 823 |
1 | Escherichia_coli_2 | Escherichia | Escherichia coli 1 | 95 | 72 | /home/cdiener/code/micom/micom/data/e_coli_cor... | sample_3 | 110 |
2 | Escherichia_coli_3 | Escherichia | Escherichia coli 2 | 95 | 72 | /home/cdiener/code/micom/micom/data/e_coli_cor... | sample_3 | 260 |
3 | Escherichia_coli_4 | Escherichia | Escherichia coli 3 | 95 | 72 | /home/cdiener/code/micom/micom/data/e_coli_cor... | sample_3 | 807 |
0 | Escherichia_coli_1 | Escherichia | Escherichia coli 0 | 95 | 72 | /home/cdiener/code/micom/micom/data/e_coli_cor... | sample_4 | 436 |
1 | Escherichia_coli_2 | Escherichia | Escherichia coli 1 | 95 | 72 | /home/cdiener/code/micom/micom/data/e_coli_cor... | sample_4 | 62 |
2 | Escherichia_coli_3 | Escherichia | Escherichia coli 2 | 95 | 72 | /home/cdiener/code/micom/micom/data/e_coli_cor... | sample_4 | 631 |
3 | Escherichia_coli_4 | Escherichia | Escherichia coli 3 | 95 | 72 | /home/cdiener/code/micom/micom/data/e_coli_cor... | sample_4 | 479 |
This is very simple example where each sample contains 4 different E. coli species in random abundances. Thus, every sample has 4 rows in this DataFrame. The DataFrame also contains additional columns, the only required columns are “id”, “sample_id”, “abundance” and one column that provides the summary rank, here “species”.
Note that we also have an additional column “genus” here. The minimal taxonomic information you have to provide is only the name of the taxonomy rank matching the database you are using. So if you are using a genus-level database you will need a column “genus”. In this case we mill use a species-level database so we had to provide a column “species”. If there any additional columns from the set {"kingdom", "phylum", "class", "order", "family", "genus", "species"}
those will be used to make
the mapping with the database more stringent. For instance, here we provided a column “genus” which means models will only be counted as a “match” if the taxon has the same genus and species in the data and the model database.
Thus, the more taxonomic rank columns you include in the data you pass to MICOM, the more stringent MICOM will become matching to the reference database. This can be used to circumvent poorly matching ranks as well. For instance, if you know your data matches well by genus and phylum names but families are named differently even for the same taxa you can omit the “family” column from your data.
Building community models¶
To build a community sample for each of your sample you will need the abundance table as provided above and a model database. Usually we recommend to use one of the prebuilt MICOM database from https://doi.org/10.5281/zenodo.3755182. Additionally, you can also `create your own database <>`__.
For our example we have a custom species-level database that is bundled with MICOM. With the abundance table and database you can now start building your models by providing a folder where the assembled community models should be stored.
[2]:
from micom.data import test_db
from micom.workflows import build
manifest = build(data, out_folder="models", model_db=test_db, cutoff=0.0001, threads=2)
This will also allow you to specify a relative abundance cutoff for a taxon to be included with the cutoff
argument. The default is to include only taxa that constitute at least 0.01% of the sample. Model building will be automatically parallelized over multiple CPUs and the number of cores/threads to use for should be set with the threads
argument. The workflows will also warn you if for any samples less than 50% of the abundance was matched to the database. Since our data was random
this may have happened here.
The build
workflow will return a model manifest:
[3]:
manifest
[3]:
reactions | metabolites | file | sample_id | found_taxa | total_taxa | found_fraction | found_abundance_fraction | |
---|---|---|---|---|---|---|---|---|
0 | 95 | 72 | sample_1.pickle | sample_1 | 3.0 | 4.0 | 0.75 | 0.730028 |
1 | 95 | 72 | sample_2.pickle | sample_2 | 3.0 | 4.0 | 0.75 | 0.789657 |
2 | 95 | 72 | sample_3.pickle | sample_3 | 3.0 | 4.0 | 0.75 | 0.588500 |
3 | 95 | 72 | sample_4.pickle | sample_4 | 3.0 | 4.0 | 0.75 | 0.728856 |
This will propagate information from your input table as well as give you metrics on how well the samples were matched to the database. Our database only include models for the first 3 E. coli species so you see the workflow could only match 3/4 taxa for each sample. Probably the most important column is the found_abundance_fraction
. This one tells which fraction of the sample abundance was matched to the database. You usually want this column to be above 0.5 so the majority of the sample
is matched. A value of 1.0 would be perfect but is usually hard to achieve. Values around 0.8 are usually pretty good.
The file
column denotes the filename for the built community within the folder specified as out_folder
before. You can use the load_pickle
function to read individual models and run custom analyses.
[4]:
from micom import load_pickle
com = load_pickle("models/sample_1.pickle")
print(len(com.reactions))
305
Simulating growth¶
With our built models we can now advance to simulating growth with MICOMs cooperative tradeoff
algorithm. This will use the manifest we just generated but will also require a growth medium to be specified. A growth medium provided information which metabolites are available to the microbes for consumption and also provided an upper bound on the flux it is added to the system. This can be obtained from fluxomics data or approximated from growth media (for cultivation settings) or diet data
(for gut microbiota models). Obtaining a correct media composition can be challenging. We will show some helper functions for that later on but for now we will use a pres-specified medium saved in Qiime 2 format. We also provide a growth medium describing an average Western diet for the AGORA model database at https://doi.org/10.5281/zenodo.3755182.
Growth media in MICOM are pretty simple DataFrames and can be read from a variety of formats. Here we will use the Qiime 2 Artifact format.
[5]:
from micom.data import test_medium
from micom.qiime_formats import load_qiime_medium
medium = load_qiime_medium(test_medium)
medium
[5]:
reaction | flux | metabolite | |
---|---|---|---|
reaction | |||
EX_glc__D_m | EX_glc__D_m | 10.000000 | glc__D_m |
EX_nh4_m | EX_nh4_m | 4.362240 | nh4_m |
EX_o2_m | EX_o2_m | 18.579253 | o2_m |
EX_pi_m | EX_pi_m | 2.942960 | pi_m |
As we can see a medium is simply a DataFrame with columns reaction
and flux
. Where reaction is the name of external exchange reaction in the model and flux is the upper bound (usually in mmol/gDW/h).
The last thing we need to choose is the tradeoff
parameter for the growth simulation. This is explained in detail in the [Methods used by MICOM] section and expresses what fraction of maximum community growth is to be maintained while trying to maximize individual growth rates. The tradeoff
takes values between 0 and 1 where zero denotes no community growth and and 1 denotes maximum community growth. We will use a vlue of 0.5 here.
[6]:
from micom.workflows import grow
res = grow(manifest, model_folder="models", medium=medium, tradeoff=0.5, threads=2)
This gives us a results tuple with three entries: growth_rates
, exchanges
, and annotations
providing the growth rates, exchange fluxes and metabolite annotations, respectively. This could be passed on to the visualization workflows or you could run your own analyses on those DataFrames. But for now we will go back and look at some helper workflows to choose a tradeoff parameter and get media.
Choosing a tradeoff parameter¶
Results depend strongly on the tradeoff parameter. Even though values between 0.3-0.6 usually work well we recommend to run a tradeoff analysis to choose the best parameters for your data set and protocol. If you have already analyzed many samples in your lab and found a particular value to work well in general you may just use that but you should at least run this analysis once. In our paper we found that tradeoff best reproducing in vivo growth rates is the largest tradeoff that allows the majority of the bacteria to grow. Thus, the best tradeoff value is the value providing the best compromise between individual and cooperative growth.
The tradeoff
workflow will run growth simulations with several tradeoff values and return the results.
[7]:
from micom.workflows import tradeoff
tradeoff_rates = tradeoff(manifest, model_folder="models", medium=medium, threads=2)
tradeoff_rates.head()
[7]:
abundance | growth_rate | reactions | metabolites | taxon | tradeoff | sample_id | |
---|---|---|---|---|---|---|---|
compartments | |||||||
Escherichia_coli_2 | 0.301048 | 0.000000 | 95 | 72 | Escherichia_coli_2 | NaN | sample_1 |
Escherichia_coli_3 | 0.342558 | 1.433863 | 95 | 72 | Escherichia_coli_3 | NaN | sample_1 |
Escherichia_coli_4 | 0.356394 | 0.866510 | 95 | 72 | Escherichia_coli_4 | NaN | sample_1 |
Escherichia_coli_2 | 0.301048 | 0.718937 | 95 | 72 | Escherichia_coli_2 | 1.0 | sample_1 |
Escherichia_coli_3 | 0.342558 | 0.818066 | 95 | 72 | Escherichia_coli_3 | 1.0 | sample_1 |
As we see it returns growth rates for each taxon in each sample for various tradeoff values. There is also a tradeoff value of NaN
which means optimization of the pure community growth rate without regularization which usually has very bad performance and is provided as a reference. To choose a good value we can count how many of the taxa can grow (growth rate > 1e-6) on average for each of the tradeoff values across all samples.
[8]:
tradeoff_rates.groupby("tradeoff").apply(
lambda df: (df.growth_rate > 1e-6).sum()).reset_index()
[8]:
tradeoff | 0 | |
---|---|---|
0 | 0.1 | 12 |
1 | 0.2 | 12 |
2 | 0.3 | 12 |
3 | 0.4 | 12 |
4 | 0.5 | 12 |
5 | 0.6 | 12 |
6 | 0.7 | 12 |
7 | 0.8 | 12 |
8 | 0.9 | 12 |
9 | 1.0 | 12 |
In that case all taxa (3 taxa for 4 samples each) can grow for all tradeoff values since we provided an excess of nutrients in the medium. So a tradeoff of 1.0 would have been the best here. For real data you will usually see those numbers decline for larger tradeoff values. A more detailed analysis can be performed with the `plot_tradeoff
visualization <viz.html>`__.
Fixing growth media¶
Providing a growth medium may be complicated since you often only have some intuitions about a few components of the medium but lack information on others. Even when supplying putatively complete descriptions you will often observe that the models will predict the absence of growth since you are lacking an essential cofactor. To help with this MICOM provides a workflow that can complete any predefined growth medium with the minimal additional substrates to allow growth for all taxa in the database.
For instance let us assume we know that our E. coli samples consume Glucose and Oxygen. The respective exchange reactions are EX_glc__D_m
and EX_o2_m
. So we can start by building our candidate medium assuming we can import twice as much oxygen as glucose.
[9]:
import pandas as pd
candidate_medium = pd.DataFrame({"reaction": ["EX_glc__D_m", "EX_o2_m"], "flux": [10, 20]})
candidate_medium
[9]:
reaction | flux | |
---|---|---|
0 | EX_glc__D_m | 10 |
1 | EX_o2_m | 20 |
We can now ask MICOM to complete this medium by adding the smallest amount of overall flux so that all taxa in the database can grow with a growth rate off at least 0.1 1/h.
[10]:
from micom.workflows import complete_community_medium
medium = complete_community_medium(manifest, model_folder="models", medium=candidate_medium,
community_growth=0.1, min_growth=0.01,
max_import=10, threads=2)
medium
[10]:
reaction | metabolite | description | flux | |
---|---|---|---|---|
0 | EX_glc__D_m | glc__D_m | D-Glucose | 10.00000 |
1 | EX_gln__L_m | gln__L_m | L-Glutamine | 0.27264 |
2 | EX_o2_m | o2_m | O2 | 20.00000 |
3 | EX_pi_m | pi_m | Phosphate | 0.36787 |
So we see that we can achieve growth by adding import for phosphate and glutamine.