SMALLER GENES DETECTED IN FLYE - possible reasons (lest I forget)
the median size of small genes detected in flye is ~400 bp, and there are others even smaller
given the SR sequencing is typically 2x150 bp, it is plausible that we miss these “smaller” genes with the SR sequencing and subsequent assembly
this is also a possible explanation for the disparity with the “UniProt/TrEMBL” database, i.e. the 0.5-0.7 ratios when mapping to TrEMBL
explanation: the TrEMBL database was in the past created using SR-sequencing, i.e. “short genes” may not have been recorded then, whereas flye seems to capture them, possibly because of the longer sequences being fed into the sequencer itself.
New findings: In the flye assembly, many genes with a low query/subject length ratio (<= 0.5) appear in clusters. This is not the case for the megahit assembly.
filter diamond hits by query/subject length ratio (<= 0.5)
sort by query ID (contains contig and gene number)
find clusters of consecutive gene numbers (and same subject ID)
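The three steps above can be sketched roughly as follows. This is a minimal illustration, not the script that produced the numbers in these notes. It assumes Prodigal-style query IDs of the form `<contig>_<gene number>`, that each hit is available as a `(query_id, subject_id, qlen, slen)` tuple, and that only the first hit per query is kept. The function names and the `same_subject` switch are hypothetical.

```python
def parse_query_id(query_id):
    """Split e.g. 'contig_7_12' into ('contig_7', 12).

    Assumes the gene number is the last '_'-separated field (Prodigal-style).
    """
    contig, gene_no = query_id.rsplit("_", 1)
    return contig, int(gene_no)


def find_clusters(hits, max_ratio=0.5, same_subject=False):
    """Return clusters (lists of query IDs, length >= 2) of consecutive genes.

    hits: iterable of (query_id, subject_id, qlen, slen) tuples.
    same_subject: if True, neighbouring genes must also share the subject ID.
    """
    # Step 1: keep hits with query/subject length ratio <= max_ratio
    # (assumption: first hit per query wins).
    kept = {}
    for qid, sid, qlen, slen in hits:
        if qlen / slen <= max_ratio:
            kept.setdefault(qid, sid)

    # Step 2: sort by contig and gene number.
    rows = sorted(parse_query_id(q) + (s, q) for q, s in kept.items())

    # Step 3: walk the sorted rows and group consecutive gene numbers.
    clusters, current = [], []
    for row in rows:
        contig, gene_no, sid, _ = row
        if current:
            p_contig, p_no, p_sid, _ = current[-1]
            consecutive = contig == p_contig and gene_no == p_no + 1
            if consecutive and (not same_subject or sid == p_sid):
                current.append(row)
                continue
            if len(current) >= 2:
                clusters.append([r[3] for r in current])
        current = [row]
    if len(current) >= 2:
        clusters.append([r[3] for r in current])
    return clusters


def summarize(clusters, total_hits):
    """Number of clustered queries, their share of all hits, mean cluster size."""
    n = sum(len(c) for c in clusters)
    mean = n / len(clusters) if clusters else 0.0
    return n, 100.0 * n / total_hits, mean
```

With `same_subject=False` this corresponds to the "subject ID does not matter" case, and with `same_subject=True` to the "subject ID does matter" case.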
/scratch/users/vgalata/GDB/results/annotation/diamond/lr/flye/proteins.filtered.annot.tsv:
looking for consecutive genes (clusters) where subject ID does not matter
75460 total hits
49632 (65.77%) queries are in clusters
14770 clusters w/ average length of 3.36 genes

/scratch/users/vgalata/GDB/results/annotation/diamond/lr/flye/proteins.filtered.annot.tsv:
looking for consecutive genes (clusters) where subject ID does matter
75460 total hits
24253 (32.14%) queries are in clusters
9810 clusters w/ average length of 2.47 genes

/scratch/users/vgalata/GDB/results/annotation/diamond/sr/megahit/proteins.filtered.annot.tsv:
looking for consecutive genes (clusters) where subject ID does not matter
20397 total hits
3016 (14.79%) queries are in clusters
1459 clusters w/ average length of 2.07 genes

/scratch/users/vgalata/GDB/results/annotation/diamond/sr/megahit/proteins.filtered.annot.tsv:
looking for consecutive genes (clusters) where subject ID does matter
20397 total hits
250 (1.23%) queries are in clusters
124 clusters w/ average length of 2.02 genes

| From: | Mikhail Kolmogorov <fenderglass@gmail.com> |
|----------|--------------------------------------------------------------------------------------|
| To: | Cedric Christian LACZNY <cedric.laczny@uni.lu> |
| Cc: | Susheel Bhanu BUSI <susheel.busi@uni.lu>, Valentina GALATA <valentina.galata@uni.lu> |
| Subject: | Re: Question about discrepancy in assemblies using ONT and ONT + Illumina |
| Date: | Wed, 19 Aug 2020 14:09:53 -0700 (19/08/20 23:09:53) |

I think it is very likely that the effect is caused by the remaining indels - something that we have also seen in our analysis.

My guess is that an indel can potentially split a gene into parts, thus artificially increasing the protein counts. You might try to check the predicted ORF length distribution might help to find out if this is indeed the case (e.g. I expect Flye to have more short proteins and less long proteins, as compared to the other assemblers in this case).

Also, it makes sense to polish Flye assembly with Illumina and check if it changes the statistics.

We used both Prodigal and MetaGeneMark (incorporated into QUAST) for the analysis, but both tools seem to be in agreement at large.
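
The ORF length check suggested in the email could be sketched along these lines. This is only an illustration: it assumes the predicted proteins are available as plain FASTA text (e.g. Prodigal protein output, where a trailing `*` marks the stop codon), and the helper names and the `short_cutoff` value are hypothetical, not something taken from these notes.

```python
import statistics


def protein_lengths(fasta_lines):
    """Return the length (in amino acids) of every record in a protein FASTA."""
    lengths, current = [], None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            # New record: store the previous one, if any.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            # Do not count the trailing '*' stop symbol.
            current += len(line.rstrip("*"))
    if current is not None:
        lengths.append(current)
    return lengths


def length_summary(lengths, short_cutoff=100):
    """Count, median length, and fraction of proteins with length <= short_cutoff.

    short_cutoff is an arbitrary illustrative threshold in amino acids.
    """
    n_short = sum(1 for x in lengths if x <= short_cutoff)
    return {
        "n": len(lengths),
        "median": statistics.median(lengths),
        "frac_short": n_short / len(lengths),
    }
```

Running `length_summary(protein_lengths(open(path)))` on the flye and megahit protein files (and again after polishing flye with Illumina reads) and comparing `median` and `frac_short` would show whether flye has an excess of short proteins, as the email expects.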