This paper is just a first step - what we'd really like to use this for is designing recipes for synthesizing new molecules.
I would also be remiss if I didn't link to a closely-related paper from another group that came out at the same time: http://pubs.acs.org/doi/abs/10.1021/ci5006614
It looks a bit like what you see in bad teaching materials: chemistry that is almost correct, but won't work well for some reason we are not telling the kids about. Please alleviate my concerns! :-)
We hope in the future to have access to real experimental data sets for real chemistry, complete with accurate temperatures, pressures, solvents, and reaction yields. Then our algorithm would be able to use all the information and predict reactions accurately. Right now, there just aren't many well-curated data sets with this kind of detail that would work for this kind of training. Happy to receive feedback from any experimental chemists out there with data from their research that they'd like to train on.
On the other hand, the key difficulty in synthetic chemistry, and the one that occupies the majority of a chemist's time, is identifying the correct reagent(s), the correct solvent, and the correct time, temperature, and concentration, such that the desired reaction proceeds in a convenient amount of time and with the correct chemo- and regio-selectivity, the reaction conditions are tolerated by the rest of the molecule, and the product can be easily isolated from the reaction byproducts.
In my opinion, as long as these problems remain, being able to turn retrosynthetic analysis over to a machine provides little benefit.
I agree that reaction outcomes depend on many other factors besides the reagents. In the future, I'm sure we'll create reaction prediction frameworks that also take these other factors as inputs. The problem right now is that there aren't many datasets that include these extra factors.
We're not advocating turning retrosynthetic analysis over to machines yet. These are just baby steps.
How do you overcome this? Can you predict yield percentages of each product? What about chirality?
Can your system design synthesis pathways? Can it optimize for final product yield? How does it handle the thermodynamics and kinetics of reactions?
In any case, cool project. It's a very difficult domain.
(Accounting is the same way though for very different reasons. You could justify recording some transactions in about 5 different ways--but FASB says only a particular one is OK.)
Extending the system to (try to) predict yield percentages or chirality is straightforward. The hard part, in my mind, is that there isn't a fixed number of reaction types. As molecules get bigger, we'll have to move away from predicting one of a fixed set of reaction types to directly predicting products - but this is a much harder problem.
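To make "predicting one of a fixed set of reaction types" concrete, here is a minimal sketch of that last classification step. It is not the code from the paper: the label list and the scores are made up for illustration, and a softmax output is assumed as the usual way to turn network outputs into class probabilities.

```python
import math

# Hypothetical label set, for illustration only; the paper's list may differ.
REACTION_TYPES = [
    "null reaction",
    "Markovnikov addition",
    "anti-Markovnikov addition",
    "elimination",
]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_reaction_type(scores):
    """Return (label, probability) for the highest-scoring reaction type."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return REACTION_TYPES[best], probs[best]

# Made-up scores standing in for the output layer of a trained network.
label, prob = predict_reaction_type([0.1, 2.3, 0.4, -1.0])
print(label, round(prob, 3))
```

Whatever the input, the answer is always one of the entries in REACTION_TYPES, which is exactly the limitation described above: a reaction that is not on the list can never be predicted.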
Our system probably isn't good enough to design synthesis pathways yet, but that is the eventual goal. A system that also predicted yield would of course help with that, and that would be another straightforward extension.
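As a rough illustration of why a yield predictor would help with pathway design: for a linear route, the overall yield is the product of the per-step yields, so predicted step yields are enough to rank candidate routes. The route names and numbers below are invented, and nothing here is something our system does today.

```python
from math import prod

def overall_yield(step_yields):
    """Overall fractional yield of a linear route: the product of its step yields."""
    return prod(step_yields)

# Hypothetical predicted yields for two candidate routes to the same product.
routes = {
    "route A (3 steps)": [0.90, 0.85, 0.80],
    "route B (2 steps)": [0.70, 0.75],
}

# Pick the route with the highest predicted overall yield.
best = max(routes, key=lambda name: overall_yield(routes[name]))
for name, steps in routes.items():
    print(name, round(overall_yield(steps), 3))
print("best:", best)
```

Note that the shorter route is not automatically the better one; it depends on how lossy each step is.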
I'm a grad student in computational chemistry. I am fascinated by the idea that our imagination, or limits of our chemical intuition, is the limiting factor for all kinds of cool advances. Through that I have recently been studying machine learning and I am interested in using it for catalyst optimization and design.
What is your opinion on the state of computationally assisted inverse design of molecules and the role of machine learning in it? The problem is a bit more open-ended compared to reaction optimization, but I could imagine that after proper formulation of the design guidelines the computer could help a lot.
I honestly don't know much about synthesis or the state of computationally assisted inverse design of molecules, except that it seems like a great idea, and that it is still early days. As you say, proper formulation of the guidelines is still necessary - as far as I understand, right now the synthesis people still use a lot of judgement at each step, and all these little choices will have to be recorded to build a useful dataset.
Manipulation of 2D-connectivity, as seen in the paper, is new and not for humans.
We chose to use fingerprints, i.e. a vector representation of the graphical features of a molecule, as the inputs for our reactions. Fingerprints are often used when machine-learning the properties of molecules, or for classifying the entire reaction, as David mentioned before. There's one paper from the Baldi group that uses inputs that are more like orbitals, and tries to predict the mechanisms of reactions directly: http://pubs.acs.org/doi/abs/10.1021/ci3003039
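For anyone unfamiliar with fingerprints, here is a toy sketch of the idea. It is not the fingerprinting method from the paper: real fingerprints hash substructures of the molecular graph, whereas this toy version just hashes short substrings of a SMILES string into a fixed-length bit vector. The function names, the 64-bit length, and the choice to build the reaction input by concatenating per-molecule vectors are all assumptions made for illustration.

```python
import hashlib

def toy_fingerprint(smiles, n_bits=64, max_len=3):
    """Hash every substring of length 1..max_len into a fixed-length bit vector."""
    bits = [0] * n_bits
    for size in range(1, max_len + 1):
        for start in range(len(smiles) - size + 1):
            fragment = smiles[start:start + size]
            # md5 gives a hash that is stable across runs, unlike built-in hash().
            digest = hashlib.md5(fragment.encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "big") % n_bits
            bits[index] = 1
    return bits

def reaction_fingerprint(reactants, reagent, n_bits=64):
    """Concatenate per-molecule fingerprints into one input vector (an assumption)."""
    vector = []
    for smiles in list(reactants) + [reagent]:
        vector.extend(toy_fingerprint(smiles, n_bits))
    return vector

# Propene and HBr as reactants, with no separate reagent (empty string).
fp = reaction_fingerprint(["CC=C", "Br"], "", n_bits=64)
print(len(fp), sum(fp))
```

The point is only that every reaction, whatever the size of the molecules involved, becomes a vector of the same length that a neural network can take as input.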
This is a good example of providing that kind of transparency.
"Reaction prediction remains one of the major challenges for organic chemistry and is a prerequisite for efficient synthetic planning. It is desirable to develop algorithms that, like humans, “learn” from being exposed to examples of the application of the rules of organic chemistry. We explore the use of neural networks for predicting reaction types, using a new reaction fingerprinting method. We combine this predictor with SMARTS transformations to build a system which, given a set of reagents and reactants, predicts the likely products. We test this method on problems from a popular organic chemistry textbook."
I am only afraid that the datasets you have used might not be of sufficient quality for a neural network application. There are old recipes from when the state of the art in chemistry was at an earlier stage, e.g. before the discovery of specific mechanisms, molecule classes, analytics and general concepts. Also, as mentioned in this thread, there are aspects of the synthetic chemist's work and experience that might not be taken into consideration in this approach.
What actually makes it hard to predict a chemical reaction? Can't we empirically deduce them from quantum mechanics?
Another example would be protein folding. Even within the same "level" of chemistry, predicting the three-dimensional structure of a protein molecule based purely on the chemistry we understand and the protein sequence is a hard problem. We're getting better at it, but it's still hard.