MICOM workflows¶

The MICOM workflow API provides prebuilt solutions for the most common MICOM analyses across several samples. This will manage most of the workload for you and uses an efficient parallelization scheme to make use of several CPU cores. The workflow API is probably the best entry point for you if the following is true:

You have at least 2 samples with taxonomy assignments and abundances for each
You have chosen to use one of the preused model databases or have already built your own
You want to run a set of standard analyses and visualization on the models

In that case the prebuilt workflows will make your analysis much simpler and faster and will take care of parallelizing your analyses. The worflow API can be mixed with the MICOM API at any point. So you could do a few steps with the workflows and then run you own analyses downstream from that. Additionally, you can directly import you start data from Qiime 2. See the Loading Qiime 2 data.

Building and models and simulating growth¶

Input formats¶

To start building community models for all your samples you will need to provide your data to MICOM. MICOM prefers to have the taxonomy and abundances for all samples in a single tidy DataFrame. Here each taxon in each sample is a row which provides its taxonomy and abundance. This may sound a bit confusing but should become pretty clear when looking at an example. MICOM can generate a simple example DataFrame which we can use as guidance.

[1]:

from micom.data import test_data

data = test_data()
data

[1]:

	id	genus	species	reactions	metabolites	file	sample_id	abundance
0	Escherichia_coli_1	Escherichia	Escherichia coli 0	95	72	/home/cdiener/code/micom/micom/data/e_coli_cor...	sample_1	882
1	Escherichia_coli_2	Escherichia	Escherichia coli 1	95	72	/home/cdiener/code/micom/micom/data/e_coli_cor...	sample_1	718
2	Escherichia_coli_3	Escherichia	Escherichia coli 2	95	72	/home/cdiener/code/micom/micom/data/e_coli_cor...	sample_1	817
3	Escherichia_coli_4	Escherichia	Escherichia coli 3	95	72	/home/cdiener/code/micom/micom/data/e_coli_cor...	sample_1	850
0	Escherichia_coli_1	Escherichia	Escherichia coli 0	95	72	/home/cdiener/code/micom/micom/data/e_coli_cor...	sample_2	423
1	Escherichia_coli_2	Escherichia	Escherichia coli 1	95	72	/home/cdiener/code/micom/micom/data/e_coli_cor...	sample_2	765
2	Escherichia_coli_3	Escherichia	Escherichia coli 2	95	72	/home/cdiener/code/micom/micom/data/e_coli_cor...	sample_2	694
3	Escherichia_coli_4	Escherichia	Escherichia coli 3	95	72	/home/cdiener/code/micom/micom/data/e_coli_cor...	sample_2	129
0	Escherichia_coli_1	Escherichia	Escherichia coli 0	95	72	/home/cdiener/code/micom/micom/data/e_coli_cor...	sample_3	823
1	Escherichia_coli_2	Escherichia	Escherichia coli 1	95	72	/home/cdiener/code/micom/micom/data/e_coli_cor...	sample_3	110
2	Escherichia_coli_3	Escherichia	Escherichia coli 2	95	72	/home/cdiener/code/micom/micom/data/e_coli_cor...	sample_3	260
3	Escherichia_coli_4	Escherichia	Escherichia coli 3	95	72	/home/cdiener/code/micom/micom/data/e_coli_cor...	sample_3	807
0	Escherichia_coli_1	Escherichia	Escherichia coli 0	95	72	/home/cdiener/code/micom/micom/data/e_coli_cor...	sample_4	436
1	Escherichia_coli_2	Escherichia	Escherichia coli 1	95	72	/home/cdiener/code/micom/micom/data/e_coli_cor...	sample_4	62
2	Escherichia_coli_3	Escherichia	Escherichia coli 2	95	72	/home/cdiener/code/micom/micom/data/e_coli_cor...	sample_4	631
3	Escherichia_coli_4	Escherichia	Escherichia coli 3	95	72	/home/cdiener/code/micom/micom/data/e_coli_cor...	sample_4	479

This is very simple example where each sample contains 4 different E. coli species in random abundances. Thus, every sample has 4 rows in this DataFrame. The DataFrame also contains additional columns, the only required columns are “id”, “sample_id”, “abundance” and one column that provides the summary rank, here “species”.

Note that we also have an additional column “genus” here. The minimal taxonomic information you have to provide is only the name of the taxonomy rank matching the database you are using. So if you are using a genus-level database you will need a column “genus”. In this case we mill use a species-level database so we had to provide a column “species”. If there any additional columns from the set {"kingdom", "phylum", "class", "order", "family", "genus", "species"} those will be used to make the mapping with the database more stringent. For instance, here we provided a column “genus” which means models will only be counted as a “match” if the taxon has the same genus and species in the data and the model database.

Thus, the more taxonomic rank columns you include in the data you pass to MICOM, the more stringent MICOM will become matching to the reference database. This can be used to circumvent poorly matching ranks as well. For instance, if you know your data matches well by genus and phylum names but families are named differently even for the same taxa you can omit the “family” column from your data.

Building community models¶

To build a community sample for each of your sample you will need the abundance table as provided above and a model database. Usually we recommend to use one of the prebuilt MICOM database from https://doi.org/10.5281/zenodo.3755182. Additionally, you can also `create your own database <>`__.

For our example we have a custom species-level database that is bundled with MICOM. With the abundance table and database you can now start building your models by providing a folder where the assembled community models should be stored.

[2]:

from micom.data import test_db
from micom.workflows import build

manifest = build(data, out_folder="models", model_db=test_db, cutoff=0.0001, threads=2)

This will also allow you to specify a relative abundance cutoff for a taxon to be included with the cutoff argument. The default is to include only taxa that constitute at least 0.01% of the sample. Model building will be automatically parallelized over multiple CPUs and the number of cores/threads to use for should be set with the threads argument. The workflows will also warn you if for any samples less than 50% of the abundance was matched to the database. Since our data was random this may have happened here.

The build workflow will return a model manifest:

[3]:

manifest

[3]:

	reactions	metabolites	file	sample_id	found_taxa	total_taxa	found_fraction	found_abundance_fraction
0	95	72	sample_1.pickle	sample_1	3.0	4.0	0.75	0.730028
1	95	72	sample_2.pickle	sample_2	3.0	4.0	0.75	0.789657
2	95	72	sample_3.pickle	sample_3	3.0	4.0	0.75	0.588500
3	95	72	sample_4.pickle	sample_4	3.0	4.0	0.75	0.728856

This will propagate information from your input table as well as give you metrics on how well the samples were matched to the database. Our database only include models for the first 3 E. coli species so you see the workflow could only match 3/4 taxa for each sample. Probably the most important column is the found_abundance_fraction. This one tells which fraction of the sample abundance was matched to the database. You usually want this column to be above 0.5 so the majority of the sample is matched. A value of 1.0 would be perfect but is usually hard to achieve. Values around 0.8 are usually pretty good.

The file column denotes the filename for the built community within the folder specified as out_folder before. You can use the load_pickle function to read individual models and run custom analyses.

[4]:

from micom import load_pickle

com = load_pickle("models/sample_1.pickle")
print(len(com.reactions))

Simulating growth¶

With our built models we can now advance to simulating growth with MICOMs cooperative tradeoff algorithm. This will use the manifest we just generated but will also require a growth medium to be specified. A growth medium provided information which metabolites are available to the microbes for consumption and also provided an upper bound on the flux it is added to the system. This can be obtained from fluxomics data or approximated from growth media (for cultivation settings) or diet data (for gut microbiota models). Obtaining a correct media composition can be challenging. We will show some helper functions for that later on but for now we will use a pres-specified medium saved in Qiime 2 format. We also provide a growth medium describing an average Western diet for the AGORA model database at https://doi.org/10.5281/zenodo.3755182.

Growth media in MICOM are pretty simple DataFrames and can be read from a variety of formats. Here we will use the Qiime 2 Artifact format.

[5]:

from micom.data import test_medium
from micom.qiime_formats import load_qiime_medium

medium = load_qiime_medium(test_medium)
medium

[5]:

	reaction	flux	metabolite
reaction
EX_glc__D_m	EX_glc__D_m	10.000000	glc__D_m
EX_nh4_m	EX_nh4_m	4.362240	nh4_m
EX_o2_m	EX_o2_m	18.579253	o2_m
EX_pi_m	EX_pi_m	2.942960	pi_m

As we can see a medium is simply a DataFrame with columns reaction and flux. Where reaction is the name of external exchange reaction in the model and flux is the upper bound (usually in mmol/gDW/h).

The last thing we need to choose is the tradeoff parameter for the growth simulation. This is explained in detail in the [Methods used by MICOM] section and expresses what fraction of maximum community growth is to be maintained while trying to maximize individual growth rates. The tradeoff takes values between 0 and 1 where zero denotes no community growth and and 1 denotes maximum community growth. We will use a vlue of 0.5 here.

[6]:

from micom.workflows import grow

res = grow(manifest, model_folder="models", medium=medium, tradeoff=0.5, threads=2)

This gives us a results tuple with three entries: growth_rates, exchanges, and annotations providing the growth rates, exchange fluxes and metabolite annotations, respectively. This could be passed on to the visualization workflows or you could run your own analyses on those DataFrames. But for now we will go back and look at some helper workflows to choose a tradeoff parameter and get media.

Choosing a tradeoff parameter¶

Results depend strongly on the tradeoff parameter. Even though values between 0.3-0.6 usually work well we recommend to run a tradeoff analysis to choose the best parameters for your data set and protocol. If you have already analyzed many samples in your lab and found a particular value to work well in general you may just use that but you should at least run this analysis once. In our paper we found that tradeoff best reproducing in vivo growth rates is the largest tradeoff that allows the majority of the bacteria to grow. Thus, the best tradeoff value is the value providing the best compromise between individual and cooperative growth.

The tradeoff workflow will run growth simulations with several tradeoff values and return the results.

[7]:

from micom.workflows import tradeoff

tradeoff_rates = tradeoff(manifest, model_folder="models", medium=medium, threads=2)
tradeoff_rates.head()

[7]:

	abundance	growth_rate	reactions	metabolites	taxon	tradeoff	sample_id
compartments
Escherichia_coli_2	0.301048	0.000000	95	72	Escherichia_coli_2	NaN	sample_1
Escherichia_coli_3	0.342558	1.433863	95	72	Escherichia_coli_3	NaN	sample_1
Escherichia_coli_4	0.356394	0.866510	95	72	Escherichia_coli_4	NaN	sample_1
Escherichia_coli_2	0.301048	0.718937	95	72	Escherichia_coli_2	1.0	sample_1
Escherichia_coli_3	0.342558	0.818066	95	72	Escherichia_coli_3	1.0	sample_1

As we see it returns growth rates for each taxon in each sample for various tradeoff values. There is also a tradeoff value of NaN which means optimization of the pure community growth rate without regularization which usually has very bad performance and is provided as a reference. To choose a good value we can count how many of the taxa can grow (growth rate > 1e-6) on average for each of the tradeoff values across all samples.

[8]:

tradeoff_rates.groupby("tradeoff").apply(
    lambda df: (df.growth_rate > 1e-6).sum()).reset_index()

[8]:

	tradeoff	0
0	0.1	12
1	0.2	12
2	0.3	12
3	0.4	12
4	0.5	12
5	0.6	12
6	0.7	12
7	0.8	12
8	0.9	12
9	1.0	12

In that case all taxa (3 taxa for 4 samples each) can grow for all tradeoff values since we provided an excess of nutrients in the medium. So a tradeoff of 1.0 would have been the best here. For real data you will usually see those numbers decline for larger tradeoff values. A more detailed analysis can be performed with the `plot_tradeoff visualization <viz.html>`__.

Fixing growth media¶

Providing a growth medium may be complicated since you often only have some intuitions about a few components of the medium but lack information on others. Even when supplying putatively complete descriptions you will often observe that the models will predict the absence of growth since you are lacking an essential cofactor. To help with this MICOM provides a workflow that can complete any predefined growth medium with the minimal additional substrates to allow growth for all taxa in the database.

For instance let us assume we know that our E. coli samples consume Glucose and Oxygen. The respective exchange reactions are EX_glc__D_m and EX_o2_m. So we can start by building our candidate medium assuming we can import twice as much oxygen as glucose.

[9]:

import pandas as pd

candidate_medium = pd.DataFrame({"reaction": ["EX_glc__D_m", "EX_o2_m"], "flux": [10, 20]})
candidate_medium

[9]:

	reaction	flux
0	EX_glc__D_m	10
1	EX_o2_m	20

We can now ask MICOM to complete this medium by adding the smallest amount of overall flux so that all taxa in the database can grow with a growth rate off at least 0.1 1/h.

[10]:

from micom.workflows import complete_community_medium

medium = complete_community_medium(manifest, model_folder="models", medium=candidate_medium,
                    community_growth=0.1, min_growth=0.01,
                    max_import=10, threads=2)
medium

[10]:

	reaction	metabolite	description	flux
0	EX_glc__D_m	glc__D_m	D-Glucose	10.00000
1	EX_gln__L_m	gln__L_m	L-Glutamine	0.27264
2	EX_o2_m	o2_m	O2	20.00000
3	EX_pi_m	pi_m	Phosphate	0.36787

So we see that we can achieve growth by adding import for phosphate and glutamine.

	tradeoff	0
0	0.1	12
1	0.2	12
2	0.3	12
3	0.4	12
4	0.5	12
5	0.6	12
6	0.7	12
7	0.8	12
8	0.9	12
9	1.0	12

	tradeoff	0
0	0.1	12
1	0.2	12
2	0.3	12
3	0.4	12
4	0.5	12
5	0.6	12
6	0.7	12
7	0.8	12
8	0.9	12
9	1.0	12

	tradeoff	0
0	0.1	12
1	0.2	12
2	0.3	12
3	0.4	12
4	0.5	12
5	0.6	12
6	0.7	12
7	0.8	12
8	0.9	12
9	1.0	12