A collaborative residency program in mathematical biology and deep learning

(topos.house)

52 points · szany · 10y ago · 11 comments

noname123 10y ago

Hi, I unfortunately cannot apply to a FT residency program as I have a FT job (as I suspect most people here do).

However, I do have an interest in Bioinformatics/Computational Biology. Unfortunately, I find that most textbooks that cover these topics are very theoretical, e.g., going over how the BLAST algorithm works, or explaining NGS assembly and annotation.

I'm curious whether any peeps in the field have any simple learning projects/practical papers for a programmer interested in Bioinformatics?

Things I'm thinking of are, (would like to hear from the experts any hints, tips or links):

1) Downloading a GWAS (genome-wide association study) dataset for a psychiatric disease or a cancer and replicating the statistical techniques that scientists used to determine the clusters of responsible genes (e.g., http://www.nature.com/nature/journal/v511/n7510/full/nature1...)

2) Downloading draft genome assemblies for different strains of Cannabis indica and sativa; building an initial phylogenetic tree and running genome annotation pipelines to identify known genes; and using RNA-seq data to isolate the potencies of different varieties of Cannabis and the metabolic pathways to produce hemp (e.g., http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3359589/)

3) Downloading a dataset of RNA-seq data from synthetic biology experiments involving yeast and replicating the statistical technique that the authors used to determine the gene regulators/promoters used to program the cells to produce plastic, fuel and drugs! (http://www.nature.com/nature/journal/v440/n7086/full/nature0...)
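To make idea 1 a bit more concrete, here is a minimal sketch of the simplest kind of test a GWAS applies to each SNP: a chi-square test on a 2x2 table of allele counts in cases versus controls. This is not the method of the linked paper, which is not reproduced here; published studies typically use regression models with covariates. The function name and the counts are made up for illustration.

```python
import math

def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Rows are cases/controls, columns are alternate/reference allele counts.
    Returns (chi2 statistic, p-value). Hypothetical helper, not from any library.
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_sums = [sum(row) for row in table]
    col_sums = [case_alt + ctrl_alt, case_ref + ctrl_ref]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Made-up counts for a single SNP: 100 case alleles, 100 control alleles.
chi2, p = allelic_chi_square(60, 40, 40, 60)
print(f"chi2={chi2:.2f}, p={p:.4g}")
```

A real analysis repeats this for every SNP and then corrects for the number of tests; 5e-8 is the conventional genome-wide significance threshold.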

joe_the_user 10y ago

I think Bioinformatics is a challenging field because, as the residency alludes to, it is aiming to marry different fields that don't necessarily talk to each other. Lior Pachter has blogged about this [1]. Pachter's book, Algebraic Statistics for Computational Biology, is fairly accessible.

[1] https://liorpachter.wordpress.com/2014/12/30/the-two-culture...

szany (OP) 10y ago

This isn't full-time — one of the goals is explicitly "to complement rather than conflict with the tenure of a full-time position elsewhere".

searine 10y ago

>Things I'm thinking of are, (would like to hear from the experts any hints, tips or links):

I mean, if you just want to replicate what they did, that's fine, but it kind of misses the point of it all. Science is about finding something new.

Why spin your wheels redoing someone else's work, when you can push into new territory? Science is all about having a question, and then just "learn by doing" until you find an answer.

>Things I'm thinking of are

GWAS data is messy, and usually has already been picked clean by the time it has been published. They do get pretty sophisticated with the stats analysis though.

The Cannabis reference sequence is awful (I should know, I was the first person to make an assembly). Plants have really, really complex genomes, and there isn't the research interest to improve the references. This, combined with the fact that there is almost zero available RNA-seq data, would make this project difficult. Yes, there is some RNA-seq data, but to separate signal from noise you'd need several technical replicates, which we don't have. As for phylogeny, the macro phylogeny is pretty well known from 16S sequence, and we don't have enough individuals to tease out interesting population structure. Genome annotation, in the absence of molecular work, is always boring because it can only find what is already known from other organisms.

Almost all of that experiment you linked was molecular work. They PCR'd up some genes and transformed them into yeast. They didn't use stats to identify genes because the genes were already well known and sequenced. After that it was just a matter of putting the genes in front of a high-output promoter, and that's that. If you had RNA-seq data for that experiment, which we don't, you wouldn't be able to see the difference in regulation because the transgene isn't part of the yeast reference genome. You could do a de novo transcriptome assembly, and then it would probably turn up, but it would be hidden among 5,000 other novel/incomplete transcripts. This is all beside the point, because if you know the gene you transformed, then why do RNA-seq at all? A Q-PCR would give you a more accurate result at 1/100th the cost.
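For readers who haven't seen how a Q-PCR result turns into an expression estimate, here is a rough sketch of the common delta-delta-Ct calculation. It assumes roughly 100% amplification efficiency (one doubling per cycle) and a stable reference gene. The function name and the Ct values are hypothetical and are not taken from the linked experiment.

```python
def fold_change_ddct(ct_target_sample, ct_ref_sample,
                     ct_target_control, ct_ref_control):
    """Relative expression via the delta-delta-Ct method.

    Each Ct is the PCR cycle at which fluorescence crosses the threshold.
    Assumes the amount of product doubles every cycle.
    """
    # Normalize the target gene to the reference gene within each condition.
    delta_sample = ct_target_sample - ct_ref_sample
    delta_control = ct_target_control - ct_ref_control
    # A lower Ct means more starting template, hence the negative exponent.
    return 2.0 ** -(delta_sample - delta_control)

# Made-up Ct values: the same transgene behind a strong promoter (sample)
# versus a weak promoter (control), with one shared reference gene.
print(fold_change_ddct(20.0, 18.0, 24.0, 18.0))
```

The point of the sketch is only that a handful of Ct measurements for one known gene is enough; no alignment or assembly is involved.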

Sorry to shoot down all your ideas, I think your initiative is great. There is A LOT of sequencing data out there, I am sure you can find a set with some fruit still on it.

Maybe start thinking about questions that are important to you, personally. What organisms are important to you? What resources are available on the Sequence Read Archive? What question has yet to be answered about the organism?

If you want someone to bounce ideas off of, I'm happy to help.

noname123 10y ago

Thank you searine for your thoughtful and helpful reply to my post.

I will take your suggestion to "practice science" and follow up on the NCBI SRA website, look up some raw data uploads and see what piques my interest.

A readily available data source I see on NCBI is a genetic surveillance project of Mycobacterium tuberculosis around the world.

There are three separate metagenomics/sequencing projects of M. tuberculosis samples taking place in Japan, Nepal and Malawi. http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=studies&...

A (very vague) question I'd like to answer is: Are the SNPs identified in these separate geographical studies distinct or similar? Do different regions give way to distinct strains of TB? Do the metabolic pathways identified from these different strains of TB explain different bacterial adaptations to the host and its environment? I'm sure my posed questions are either already answered in the literature, not posed properly or not feasible. But that'll be my starting point.
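As a first pass at the "distinct or similar" question, one option is to treat each study's SNP calls as a set and compute the pairwise Jaccard similarity. This is only a sketch: it assumes variants have already been called against the same reference and reduced to (position, ref, alt) tuples, and the SNPs below are toy values, not real data from the Japan, Nepal or Malawi projects.

```python
from itertools import combinations

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def pairwise_snp_overlap(snps_by_study):
    """Jaccard similarity of SNP sets for every pair of studies.

    snps_by_study maps a study label to a set of (position, ref, alt)
    tuples. Hypothetical helper, not part of any bioinformatics library.
    """
    return {
        (x, y): jaccard(snps_by_study[x], snps_by_study[y])
        for x, y in combinations(sorted(snps_by_study), 2)
    }

# Toy SNP calls, invented for illustration only.
toy_calls = {
    "Japan": {(100, "A", "G"), (200, "C", "T"), (300, "G", "A")},
    "Nepal": {(100, "A", "G"), (200, "C", "T")},
    "Malawi": {(400, "T", "C")},
}
for pair, score in pairwise_snp_overlap(toy_calls).items():
    print(pair, round(score, 3))
```

A score near 1 means two studies share most of their SNPs and a score near 0 means they share almost none. Pooling per-study sets like this ignores sample size and allele frequency, so it is a starting point rather than a population-structure analysis.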

Much appreciated again!

heurist 10y ago

> Why spin your wheels redoing someone else's work, when you can push into new territory?

Verification of results?

iskander 10y ago

It seems a little unusual to see a "residency" which makes you pay for housing ($2400 / month).

Is it unfair to assume they're having trouble paying rent at the Topos house?

szany (OP) 10y ago

Thanks for raising that question. Actually the house was rented for the purpose of hosting the program, not the other way around.

Of course we're trying not to operate at a loss, but we're willing to accept that if necessary to accomplish the stated goals.

iskander 10y ago

>Actually the house was rented for the purpose of hosting the program, not the other way around.

From the CFP I got the impression that the residency is hosted in the same place you all are currently living:

>live for four months in a 10,000 ft² mansion in San Francisco (of which the organizers are long-term residents)

Is that not the case?
