hypothesis - high error rate in genes in flye assembly

`checkm`

Find genes which should be (highly) conserved using checkm

web site
repo
wiki
data
- no need to download if installing via conda

Having second thoughts about that... Not sure whether all genes used by checkm are "essential" and conserved.

`barrnap`

Identifying and comparing rRNA genes

repo
supported organisms: Bacteria, Archaea, Eukaryota, Metazoan Mitochondria

find . -name "GDB_*.fna" | sort | while read f; do for k in mito bac arc euk; do echo "${f}" && barrnap --threads 10 --kingdom ${k} --outseq barrnap/$(basename -s '.fna' ${f}).${k}.fa < ${f} > barrnap/$(basename -s '.fna' ${f}).${k}.gff; done; done

find barrnap/ -type f -name "*.bac.gff" | sort | while read f; do echo -e "${f}\n$(grep -v "^#" ${f} | cut -f9 | cut -d";" -f1,2 | sed 's/;/\t/g' | sed 's/(partial)/\tpartial/' | sort | uniq -c | sort -k2)"; done

Problem of multiple copies

`GToTree`

hmmsearch --cut_ga --cpu 10 --tblout bac.tblout -A bac.align Bacteria.hmm prot.faa > /dev/null

Not straightforward to extract query and subject coverage from the output of hmmsearch.
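One possible route, sketched under assumptions: `hmmsearch` can also write a per-domain table with `--domtblout`, which lists the sequence length (`tlen`), the HMM length (`qlen`) and the alignment coordinates on both. The helper name `domtbl_coverage` and the file name `bac.domtblout` are made up for illustration. Coverage is reported per domain row, not summed over several domains of the same protein.

```shell
# Per-domain coverage from an `hmmsearch --domtblout` table (sketch).
# Whitespace-separated columns used here:
#   1 target name (protein), 3 tlen, 4 query name (HMM), 6 qlen,
#   16-17 hmm from/to, 18-19 ali from/to
# Reads the files given as arguments, or stdin if none are given.
domtbl_coverage() {
    awk '/^#/ { next }                            # skip comment lines
         {
             hmm_cov = ($17 - $16 + 1) / $6       # fraction of the HMM aligned
             seq_cov = ($19 - $18 + 1) / $3       # fraction of the protein aligned
             printf "%s\t%s\t%.3f\t%.3f\n", $1, $4, hmm_cov, seq_cov
         }' "$@"
}

# Hypothetical usage (not run here):
#   hmmsearch --cut_ga --cpu 10 --domtblout bac.domtblout Bacteria.hmm prot.faa > /dev/null
#   domtbl_coverage bac.domtblout
```

The output columns are protein ID, HMM name, HMM coverage and protein coverage, so a filter such as `awk '$3 >= 0.95 && $4 >= 0.95'` would mimic the coverage cut-offs considered for DIAMOND.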

marked this issue as related to #69 (closed)

`DIAMOND`

command line options

parameters to try:

- `--sensitive`, `--more-sensitive`
  - w/o any sensitivity option: hits of >70% identity and short read alignment
- `--query-cover` and `--subject-cover`, e.g. 95 or higher
- `--id`, e.g. 95
- `--unal 1` to output unaligned queries

in an interactive session:

srun -p bigmem --time=1:00:0 -N1 -n1 -c5 --pty bash -i

marked this issue as related to #73 (closed)

SMALLER GENES DETECTED IN FLYE - possible reasons (lest I forget)

the median size of small genes detected in flye is ~400 bp, and there are others even smaller
given that SR sequencing is typically 2x150 bp, it is plausible that we miss these “smaller” genes with SR sequencing and subsequent assembly
this is also a possible explanation for the disparity with the “UniProt/TrEMBL” database, i.e. the 0.5-0.7 ratios when mapping to TrEMBL
explanation: in the past, the TrEMBL database was built from SR-sequencing data, i.e. “short genes” may not have been recorded then, whereas `flye` seems to capture them, possibly because of the longer sequences produced by LR sequencing.

New findings: In the flye assembly, many genes with a low query/subject length ratio (<= 0.5) appear in clusters. This is not the case for the megahit assembly.

filter diamond hits by query/subject length ratio (<= 0.5)
sort by query ID (contains contig and gene number)
find clusters of consecutive gene numbers (and same subject ID)
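The three steps above could be sketched roughly as follows. This is not the script that produced the numbers below. It assumes a headerless TSV with the columns `qseqid`, `sseqid`, `qlen`, `slen`, one hit per query, and Prodigal-style query IDs of the form `<contig>_<gene number>`. The function name `find_gene_clusters` is hypothetical.

```shell
# Count clusters of consecutive low-ratio genes (sketch).
# $1: TSV without header: qseqid, sseqid, qlen, slen (assumed layout)
# $2: 1 to also require the same subject ID within a cluster (default 0)
find_gene_clusters() {
    # 1) keep hits with query/subject length ratio <= 0.5,
    #    split the query ID into contig and gene number
    awk -F'\t' -v OFS='\t' '$4 > 0 && $3 / $4 <= 0.5 {
            contig = $1; sub(/_[0-9]+$/, "", contig)
            gene = $1;   sub(/^.*_/, "", gene)
            print contig, gene, $2
        }' "$1" |
    # 2) sort by contig, then numerically by gene number
    sort -t "$(printf '\t')" -k1,1 -k2,2n |
    # 3) walk through the sorted hits and count runs of consecutive genes
    awk -F'\t' -v same="${2:-0}" '
        function close_run() { if (run > 1) { clusters++; clustered += run } }
        {
            total++
            if ($1 == contig && $2 == gene + 1 && (!same || $3 == subject)) run++
            else { close_run(); run = 1 }
            contig = $1; gene = $2; subject = $3
        }
        END {
            close_run()
            printf "%d total hits\n%d queries in clusters\n%d clusters\n", total, clustered, clusters
        }'
}
```

Calling it once with `0` and once with `1` as the second argument corresponds to the "subject ID does not matter" and "subject ID does matter" modes.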

/scratch/users/vgalata/GDB/results/annotation/diamond/lr/flye/proteins.filtered.annot.tsv:
looking for consecutive genes (clusters) where subject ID does not matter
75460 total hits
49632 (65.77%) queries are in clusters
14770 clusters w/ average length of 3.36 genes

/scratch/users/vgalata/GDB/results/annotation/diamond/lr/flye/proteins.filtered.annot.tsv:
looking for consecutive genes (clusters) where subject ID does matter
75460 total hits
24253 (32.14%) queries are in clusters
9810 clusters w/ average length of 2.47 genes

/scratch/users/vgalata/GDB/results/annotation/diamond/sr/megahit/proteins.filtered.annot.tsv:
looking for consecutive genes (clusters) where subject ID does not matter
20397 total hits
3016 (14.79%) queries are in clusters
1459 clusters w/ average length of 2.07 genes

/scratch/users/vgalata/GDB/results/annotation/diamond/sr/megahit/proteins.filtered.annot.tsv:
looking for consecutive genes (clusters) where subject ID does matter
20397 total hits
250 (1.23%) queries are in clusters
124 clusters w/ average length of 2.02 genes

Quick check how many proteins in flye have low q/s length ratio and how many were not found in megahit:

cd /scratch/users/vgalata/GDB/results

# set comparisons between the query IDs in the filtered DIAMOND table (file 1)
# and the protein IDs in the cd-hit FASTA (file 2)
# only in file 2
comm -13 <(tail -n +2 annotation/diamond/lr/flye/proteins.filtered.annot.tsv | cut -f 1 | sort) <(grep "^>" analysis/cdhit/sr_megahit__lr_flye__test.faa | sed "s/^>//" | cut -d" " -f1 | sort) | wc -l
# 29721
# only in file 1
comm -23 <(tail -n +2 annotation/diamond/lr/flye/proteins.filtered.annot.tsv | cut -f 1 | sort) <(grep "^>" analysis/cdhit/sr_megahit__lr_flye__test.faa | sed "s/^>//" | cut -d" " -f1 | sort) | wc -l
# 40897
# in both files
comm -12 <(tail -n +2 annotation/diamond/lr/flye/proteins.filtered.annot.tsv | cut -f 1 | sort) <(grep "^>" analysis/cdhit/sr_megahit__lr_flye__test.faa | sed "s/^>//" | cut -d" " -f1 | sort) | wc -l
# 34563

# total counts
grep "^>" analysis/cdhit/sr_megahit__lr_flye__test.faa | wc -l
# 64284
wc -l annotation/diamond/lr/flye/proteins.filtered.annot.tsv
# 75461 annotation/diamond/lr/flye/proteins.filtered.annot.tsv # - 1 for the header

So what are the ~30K that are unique to megahit? Can we do the same analyses using taxonomy (proteins)?

i.e. which taxon/protein names overlap and which are unique?

mentioned in commit 14d7afdf

| From:    | Mikhail Kolmogorov <fenderglass@gmail.com>                                           |
|----------|--------------------------------------------------------------------------------------|
| To:      | Cedric Christian LACZNY <cedric.laczny@uni.lu>                                       |
| Cc:      | Susheel Bhanu BUSI <susheel.busi@uni.lu>, Valentina GALATA <valentina.galata@uni.lu> |
| Subject: | Re: Question about discrepancy in assemblies using ONT and ONT + Illumina            |
| Date:    | Wed, 19 Aug 2020 14:09:53 -0700 (19/08/20 23:09:53)                                  |

I think it is very likely that the effect is caused by the remaining indels -
something that we have also seen in our analysis.
My guess is that an indel can potentially split a gene into parts, thus artificially increasing the protein counts. 
Checking the predicted ORF length distribution might help to find out if this is indeed the case (e.g. I expect Flye to have more short proteins and fewer long proteins, as compared to the other assemblers in this case).
Also, it makes sense to polish Flye assembly with Illumina and check if it changes the statistics.

We used both Prodigal and MetaGeneMark (incorporated into QUAST) for the analysis, but both tools seem to be in agreement at large.
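The suggested comparison of predicted protein length distributions could be done along these lines. This is a minimal sketch, assuming protein FASTA files as written by Prodigal (a trailing `*` marks the stop codon and is not counted). The function name `protein_lengths` and the file names in the usage comment are hypothetical.

```shell
# Print "<protein ID><TAB><length in aa>" for every record of a protein FASTA (sketch).
# Reads the files given as arguments, or stdin if none are given.
protein_lengths() {
    awk '/^>/ {
             if (id != "") print id "\t" len      # emit the previous record
             id = substr($1, 2); len = 0          # ID = header up to first space
             next
         }
         { gsub(/[* \t\r]/, ""); len += length($0) }   # drop stop symbol and whitespace
         END { if (id != "") print id "\t" len }' "$@"
}

# Hypothetical usage: number of proteins shorter than 100 aa per assembly
#   protein_lengths flye.faa    | awk '$2 < 100' | wc -l
#   protein_lengths megahit.faa | awk '$2 < 100' | wc -l
```

The per-protein lengths can be fed into any plotting tool to compare the distributions before and after polishing.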

Closing as we have seen that polishing

- reduces the number of short genes
- reduces the number of unique genes
- improves the query/subject length ratio of hits to UniProt/TrEMBL

closed
