Given that unique and relevant data sets are often hard to come by (for obvious reasons), I'm wondering if there are any good rules of thumb for judging the feasibility of different ideas.
Let me give you a concrete example: build a system that looks at medical records and approximates the risk for a certain illness. It is possible to fairly easily get overall data on how common the illness is, which symptoms are relevant, and even whether those symptoms are commonly recorded in medical records. But the granular data in the actual medical records is fairly hard to come by and would require a significant effort to collect. In this situation it would be preferable to do some approximations, e.g. on how many medical records are needed to reach a certain precision, before pursuing the idea and starting to collect data.
A less well-defined example would be: build an application that identifies whether a picture contains a golden retriever with a red scarf around its neck. Here too it would be relevant to have rough numbers on the number of data points needed, etc. (even if the actual data in this case is probably much easier to come by).
In the first case I could probably get OK approximations using statistics assuming normal distributions, but that's less straightforward in the second example.
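To illustrate the kind of back-of-the-envelope calculation I have in mind for the first example, here is a minimal sketch using the standard normal-approximation sample-size formula for estimating a proportion; the prevalence and margin values are made up for illustration:

```python
import math

def sample_size_for_proportion(p_est, margin, z=1.96):
    """Approximate number of records needed to estimate a prevalence
    of about p_est to within +/- margin, at the confidence level
    implied by z (1.96 ~ 95%), via the normal approximation:
    n = z^2 * p * (1 - p) / margin^2."""
    return math.ceil(z**2 * p_est * (1 - p_est) / margin**2)

# Hypothetical numbers: illness prevalence ~5%, and we want the
# estimate to be within +/- 1 percentage point at 95% confidence.
n = sample_size_for_proportion(0.05, 0.01)
print(n)  # 1825
```

This only sizes the data for estimating prevalence, not for training a risk model with many features, but it gives a lower bound on how much data is worth collecting.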