However, I do have an interest in Bioinformatics/Computational Biology. Unfortunately, I find that most textbooks that cover these topics are very theoretical, e.g., going over how the BLAST algorithm works, or explaining NGS assembly and annotation.
I'm curious whether any peeps in the field have simple learning projects/practical papers for a programmer interested in Bioinformatics?
Things I'm thinking of are below (I would like to hear any hints, tips or links from the experts):
1) Downloading a GWAS (Genome Wide Association) data-set for a psychiatric or cancer disease and replicating the statistical techniques that scientists used to determine the clusters of responsible genes (e.g., http://www.nature.com/nature/journal/v511/n7510/full/nature1...)
2) Downloading draft genome assemblies of different Cannabis Indica and Sativa strains; building an initial phylogeny tree and running genome annotation pipelines to identify known genes; and using RNASeq data to isolate the potencies of different varieties of Cannabis and the metabolic pathways to produce hemp (e.g., http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3359589/)
3) Downloading a data-set of RNASeq data from synthetic biology experiments involving yeast and replicating the statistical technique that the authors used to determine the gene regulators/promoters used to program the cells to produce plastic, fuel and drugs for them! (http://www.nature.com/nature/journal/v440/n7086/full/nature0...)
[1] https://liorpachter.wordpress.com/2014/12/30/the-two-culture...
I mean, if you just want to replicate what they did, that's fine, but it kind of misses the point of it all. Science is about finding something new.
Why spin your wheels redoing someone else's work, when you can push into new territory? Science is all about having a question, and then just "learning by doing" until you find an answer.
>Things I'm thinking of are
GWAS data is messy, and usually has already been picked clean by the time it has been published. They do get pretty sophisticated with the stats analysis though.
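For a feel of the simplest building block before the sophisticated stuff, here is a minimal sketch of a per-SNP allelic association test: a chi-square test on a 2x2 table of allele counts in cases vs. controls. The function name and the counts are made up for illustration, and real GWAS pipelines layer quality control, covariates and multiple-testing correction on top of this.

```python
import math

def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Rows are cases/controls, columns are alternate/reference allele counts.
    Assumes every row and column total is non-zero.
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = sum(sum(r) for r in table)
    row_totals = [sum(r) for r in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if allele frequency were the same in both groups
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Survival function of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: alternate allele seen 30/100 times in cases, 10/100 in controls
chi2, p = allelic_chi_square(30, 70, 10, 90)
print(f"chi2={chi2:.2f}, p={p:.2e}")
```

Running this once per SNP across the genome is the naive version of the analysis; with that many tests, a single small p-value like this one means very little on its own, which is part of why the published stats get complicated.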
The Cannabis reference sequence is awful (I should know, I was the first person to make an assembly). Plants have really really complex genomes, and there isn't the research interest to improve the references. This, combined with the fact that there is almost zero available RNAseq data, would make this project difficult. Yes, there is some RNAseq data, but to separate signal from noise you'd need several technical replicates, which we don't have. As for phylogeny, the macro phylogeny is pretty well known from 16S sequence, and we don't have enough individuals to tease out interesting population structure. Genome annotation, in the absence of molecular work, is always boring because it can only find what is already known from other organisms.
Almost all of that experiment you linked was molecular work. They PCR'd up some genes and transformed them into yeast. They didn't use stats to identify genes because the genes were already well known and sequenced. After that it was just a matter of putting the genes in front of a high output promoter and that's that. If you had RNAseq data for that experiment, which we don't, you wouldn't be able to see the difference in regulation because the transgene isn't part of the yeast reference genome. You could do a de novo transcriptome assembly, and then it would probably turn up, but it would be hidden among 5,000 other novel/incomplete transcripts. This is all beside the point, because if you know the gene you transformed, then why do RNAseq at all? A Q-PCR would give you a more accurate result at 1/100th the cost.
Sorry to shoot down all your ideas, I think your initiative is great. There is A LOT of sequencing data out there, I am sure you can find a set with some fruit still on it.
Maybe start thinking about questions that are important to you, personally. What organisms are important to you? What resources are available on the Sequence Read Archive? What question has yet to be answered about the organism?
If you want someone to bounce ideas off of, I'm happy to help.
I will take your suggestion to "practice science" and follow up on the NCBI SRA website, look through some of the raw data uploads, and see what piques my interest.
A readily available data source I see on NCBI is a genetic surveillance project of Mycobacterium tuberculosis around the world.
There are three separate metagenomics/sequencing projects of M. tuberculosis samples taking place in Japan, Nepal and Malawi. http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=studies&...
A (very vague) question I'd like to answer is: Are the SNPs identified in these separate geographical studies distinct or similar? Do different regions give rise to distinct strains of TB? Do the metabolic pathways identified from these different strains of TB explain different adaptations to the host and its environment? I'm sure my posed questions are either already answered in the literature, not posed properly, or not feasible. But that'll be my starting point.
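As a first pass at the "distinct or similar" part, something like the sketch below could work once SNP calls exist for each study. It just compares SNP sets with a Jaccard index and lists SNPs private to one region. The SNP positions and alleles here are made up for illustration (real calls would come out of a variant-calling pipeline run against a common reference), and the function names are my own.

```python
from itertools import combinations

# Hypothetical SNP calls per study: (genome position, reference base, alternate base).
# These values are invented for illustration only.
snps_by_region = {
    "Japan": {(1000, "A", "G"), (2500, "C", "T"), (7362, "G", "C")},
    "Nepal": {(1000, "A", "G"), (7362, "G", "C"), (9100, "T", "C")},
    "Malawi": {(2500, "C", "T"), (4400, "G", "A")},
}

def jaccard(a, b):
    """Shared SNPs divided by all SNPs in either set (0 = disjoint, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pairwise_overlap(snp_sets):
    """Jaccard index for every pair of regions, keyed by (region, region)."""
    return {
        (x, y): jaccard(snp_sets[x], snp_sets[y])
        for x, y in combinations(sorted(snp_sets), 2)
    }

def private_snps(snp_sets, region):
    """SNPs seen in one region and in none of the others."""
    others = set().union(*(s for r, s in snp_sets.items() if r != region))
    return snp_sets[region] - others

for pair, score in pairwise_overlap(snps_by_region).items():
    print(pair, round(score, 2))
print("Private to Malawi:", private_snps(snps_by_region, "Malawi"))
```

This only says whether the call sets overlap. It says nothing about why, so differences in sequencing depth or variant-calling settings between the studies would have to be ruled out before reading anything biological into it.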
Much appreciated again!
Verification of results?
Is it unfair to assume they're having trouble paying rent at the Topos house?
Of course we're trying not to operate at a loss, but we're willing to accept that if necessary to accomplish the stated goals.
From the CFP I got the impression that the residency is hosted in the same place you all are currently living:
>live for four months in a 10,000 ft² mansion in San Francisco (of which the organizers are long-term residents)
Is that not the case?