A couple of general tips:
1. For second-wave bioinformatic analysis, stick to organisms that have a (semi) decent reference sequence. It removes untold amounts of noise from your eventual analysis. M. tuberculosis, for example, is a good candidate as it has an excellent reference.
2. Human diseases always have tons of data!
>Are the SNPs identified in these separate geographical studies distinct or similar?
Human pathogens almost always show strong population structure that correlates with human migration.
That means distinct Old World/New World/Oceania varieties characterized by differing patterns of variation. They also show recent population bottlenecks that correspond to the out-of-Africa migration and the Columbian expansion. (http://www.nature.com/ng/journal/v45/n10/full/ng.2744.html)
These kinds of studies usually sequence a bunch of individuals, then run RAxML to make a maximum likelihood tree, and run a principal component analysis to further characterize the population structure. This is usually followed by some first-pass selection scans to identify genes of interest.
The problem is that this is all "low hanging fruit", and it is always picked by the people who did (and paid for) the sequencing.
> Do the metabolic pathways identified from these different strains of TB explain different viral adaptations to the host and its environment?
Metabolism is essential to fitness, so it rarely changes at the population level. If metabolic genes do change, they are usually in secondary metabolism. Central metabolism is extremely conserved.
In human pathogens the things evolving are (in order of magnitude):
1. Resistance to antibiotics/antivirals/antiparasitics. This is huge. It creates MASSIVE selective sweeps.
2. Host/pathogen interface. Classic red-queen hypothesis. Antigen genes are rapidly evolving to evade host defence and maintain ability for invasion.
3. Transmission. Virulence factors which enhance the fitness of the pathogen.
> I'm sure my posed questions are either already answered in literature, not posed properly and not feasible. But that'll be my starting point.
I think that with a second-wave analysis you need to gear your questions to the datasets you can get your hands on.
Start by picking an organism. Then research how many individual genomes from that species you can get. Then design an experiment around that.
A classic "second wave" bioinformatic analysis uses two reference genomes and a sequenced population of individuals from one of the two species. The experiment performs a simple statistical test: it compares the ratio of non-synonymous (NS) to synonymous polymorphisms within a species with the ratio of NS to synonymous fixed differences between species. From these ratios you can infer evolutionary adaptation/neutrality/conservation. It's called the McDonald-Kreitman test; here is a web implementation for a single-gene example: http://mkt.uab.es/mkt/help_mkt.asp <-- try doing this by hand by pulling sequences from online databases (use the coding nucleotide sequences, since synonymous changes are invisible at the protein level).
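To make the test concrete, here is a minimal sketch of the single-gene calculation, assuming you already have the four counts (Dn, Ds = NS and synonymous fixed differences between species; Pn, Ps = NS and synonymous polymorphisms within the species). The function names are my own, the counts in the example are purely illustrative, and I'm using a two-sided Fisher's exact test on the 2x2 table, which is one common choice (the original paper used a G-test).

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one
    total = sum(p for p in map(prob, range(lo, hi + 1))
                if p <= observed * (1 + 1e-9))
    return min(1.0, total)


def mk_test(dn, ds, pn, ps):
    """Return (alpha, p_value) for one gene's MK table."""
    p_value = fisher_exact_two_sided(dn, ds, pn, ps)
    # alpha = 1 - (Ds * Pn) / (Dn * Ps); undefined if Dn or Ps is zero
    alpha = None if dn == 0 or ps == 0 else 1 - (ds * pn) / (dn * ps)
    return alpha, p_value


# Illustrative counts only, not real data
alpha, p_value = mk_test(dn=7, ds=17, pn=2, ps=42)
print(f"alpha = {alpha:.3f}, p = {p_value:.4f}")
```

If SciPy is available, `scipy.stats.fisher_exact` does the same job as the hand-rolled function and is a good cross-check.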
It's a simple algorithm that can easily be scaled to a genome-wide scan for adaptation (followed by an FDR correction for multiple testing!!!!!!).
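For the FDR part, a rough sketch of the Benjamini-Hochberg procedure, which is the usual choice (I'm assuming BH here; other FDR methods exist). The function name and the example p-values are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values), in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for four genes
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(q_values)
```

Genes whose adjusted value falls below your chosen FDR threshold (0.05 is typical) are the ones to carry forward.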
The experiment for M. tuberculosis would go like this:
1. Either find an existing set of ortholog predictions between M. tuberculosis and some other closely related species with a reference genome (preferably same genus), or calculate them yourself with OrthoMCL. Filter this to exclude everything BUT one-to-one orthologs. You can only run the MK test on one-to-one orthologs.
2. Extract the protein sequence (gene->mRNA->translated protein) for each annotated gene in each of the two reference files. A GFF file (General Feature Format) should be available for each reference, telling you the coordinates of each gene so you can get the exons and translate them to make a protein. Keep the nucleotide coding sequence as well, because the MK test counts synonymous changes and those do not show up in the protein. If there isn't a GFF, you can predict genes using genome annotation software. Match each protein from genome A to its ortholog in genome B using the data from step 1, and put the two sequences into a single fasta file (one file per gene). Do this for all single-copy genes.
3. Pull a population of 10 or so individual M. tuberculosis genomes from SRA. Align the reads to the M. tuberculosis reference, call variants, and pull out proteins like before. Match them to the ortholog sequences from steps 1 and 2, so everything is all in one file.
4. Take your fasta file, which contains the M. tuberculosis reference, the second reference, and the 10 or so individuals, and align each protein file using MUSCLE (an alignment program, look it up).
5. Implement an MK test that can run on your aligned fasta file. Spot check it against the online single-gene implementation I linked.
6. Take your resulting p-values and perform an FDR correction to account for testing thousands of genes within the genome.
7. Calculate alpha (it's on the MKT wiki article), and determine which of your significant genes have a positive value of alpha (because positive alpha means more NS substitutions fixed between species than the polymorphism within the species predicts == positive selection == interesting).
8. Congrats, you have done the science and learned about what's changing within the M. tuberculosis species.
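For the "implement an MK test" step, the fiddly part is turning an alignment into the four counts. Below is a rough sketch, assuming an in-frame, gap-aligned codon (nucleotide) alignment with the population sequences as the ingroup and the second reference as the outgroup. It is deliberately simplified: each codon column is counted at most once, codons with gaps, ambiguous bases, or stops are skipped, and multiple hits within a codon are not resolved. The function name and the toy sequences are made up; the standard genetic code is assumed (bacterial table 11 has the same codon assignments).

```python
from itertools import product

# Standard genetic code, codons ordered by T, C, A, G at each position
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def mk_counts(ingroup, outgroup):
    """Count Dn, Ds, Pn, Ps from aligned, in-frame coding sequences."""
    counts = {"Dn": 0, "Ds": 0, "Pn": 0, "Ps": 0}
    for pos in range(0, len(outgroup) - 2, 3):
        out_codon = outgroup[pos:pos + 3].upper()
        in_codons = {seq[pos:pos + 3].upper() for seq in ingroup}
        # Skip columns with gaps, ambiguous bases, or stop codons
        if any(CODON_TABLE.get(c, "*") == "*" for c in in_codons | {out_codon}):
            continue
        in_aas = {CODON_TABLE[c] for c in in_codons}
        if len(in_codons) > 1:
            # Variable within the species: a polymorphism
            counts["Pn" if len(in_aas) > 1 else "Ps"] += 1
        elif out_codon not in in_codons:
            # Fixed within the species but different from the outgroup
            counts["Dn" if CODON_TABLE[out_codon] not in in_aas else "Ds"] += 1
    return counts


# Toy alignment, not real data
ingroup = ["ATGGCTAAAGGTCTG", "ATGGCCAAAGGTCTG", "ATGGCTAGAGGTCTG"]
outgroup = "ATGGCTAAAGATCTA"
print(mk_counts(ingroup, outgroup))
```

The four counts per gene are what go into the 2x2 test and the alpha calculation; comparing a handful of genes against the web implementation is a sensible sanity check on the simplifications above.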