- researchers conduct experiments that may fail
- they often produce PoC instead of final products
- it's tempting to produce tonnes of low-quality code (since it's just an experiment/PoC!)
- often researchers are not software engineers so they don't really care about code quality / tests
How do you find a good trade-off between high coding standards and not getting in the way of research? Is it possible to move smoothly from PoC to a production solution without rewriting everything from scratch? And how do you share code between experiments/PoCs?
Addressing the specific points:
> - researchers conduct experiments that may fail
Very true! Often many, many, many experiments...
> - they often produce PoC instead of final products
This is going to vary a lot depending on where you are working. We were responsible for developing, deploying, and for some time supporting whatever we developed.
> - it's tempting to produce tonnes of low quality code (since it just an experiment/PoC!)
Yep, especially in early stages.
> - often researchers are not software engineers so they don't really care about code quality / tests
This really depends on what stage in the lifecycle of a research project we were in. We were responsible for deploying the final code, so at the end of the day it had to be of the same quality as something someone with the title of software engineer would generate.
Code quality is another problem entirely. I agree code quality can get out of control as soon as the PoC is promoted to something resembling "production."
My suggestions are:
- First, if you get frustrated at researchers for code quality, let them calmly know why you are upset. If they are being inefficient, many would love to hear tips to keep it from happening. Let them know when the things they are doing might affect large groups of people.
- Don't try to write tests for everything. This just slows you down, getting away from the good things above. Write tests for things that are frequently broken, and absolutely required to work, such as core functionality. If something gets broken 2 or 3 times, you should definitely have a test.
- Make your tests as high level as possible. Compute power is cheap, and despite what you might hear from the TDD/unit testing crowd, your tests don't need to run in 2 seconds to be useful. I like to have tests that emulate users, because as you change the logic of how you're doing things, you still have tests to back you up.
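To make the "emulate users" point concrete, here is a minimal sketch of a high-level test. `run_analysis` is a hypothetical stand-in for your whole pipeline entry point, not a real API; the idea is that the test drives it the way a user would, so it survives internal refactors:

```python
def run_analysis(values):
    """Toy stand-in for a whole analysis pipeline: clean, then summarize."""
    cleaned = [v for v in values if v is not None]
    return {"n": len(cleaned), "mean": sum(cleaned) / len(cleaned)}

def test_end_to_end():
    # Drive the pipeline like a user: raw (messy) input in, summary out.
    # No mocking of internals, so rewriting the guts won't break the test.
    result = run_analysis([1.0, None, 2.0, 3.0])
    assert result["n"] == 3
    assert abs(result["mean"] - 2.0) < 1e-9
```

Even if the whole pipeline later becomes a different algorithm, this test still expresses "given this input, the user sees roughly this output", which is the contract that matters.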
- Add lots of additional logging. This helps document the code (since the messages should be useful and say what is going on), and provides great info for debugging issues after they've already occurred. I've been saved by good logging more times than I can remember, especially on different OS/environments that aren't the test environment.
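In Python, the standard `logging` module is enough for this; a sketch of the style I mean, where the messages narrate what the run is doing (the `load_dataset` function and its contents are hypothetical):

```python
import logging

# Configure once at program start; INFO-level messages double as
# lightweight documentation of what the run actually did.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
log = logging.getLogger("experiment")

def load_dataset(path):
    # Hypothetical loader; the point is the narrating log lines, which
    # later let you reconstruct a failed run from its output alone.
    log.info("loading dataset from %s", path)
    rows = [[1, 2], [3, 4]]  # stand-in for real file I/O
    log.info("loaded %d rows", len(rows))
    return rows
```

When something breaks on a machine you can't reproduce locally, these lines are often the only evidence you get.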
- Don't worry too much about edge cases. Just print a log line or crash out if it's something ridiculous you've gotten yourself into, which is a lot more friendly than figuring out some horrendous bug mired in retry logic that has masked the original issue.
- Insist on version control, but not code reviews. Code reviews can really slow you down. Instead, fix problems after they come up. You haven't shipped, right?
- Run the build and tests in a simple CI loop that runs overnight. Don't worry about testing each commit, just know if it works or doesn't work. Fix the problems.
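A nightly loop like that doesn't need a CI product; a cron entry invoking a small runner script is enough. A sketch, assuming your tests run under a command like `pytest` (the command list is yours to substitute):

```python
import datetime
import subprocess

def nightly_build(commands=(("pytest", "-q"),)):
    """Run the build/test commands once and report pass/fail.

    Schedule this from cron (e.g. `0 2 * * *`) instead of wiring up
    per-commit CI; all you want to know in the morning is works/broken.
    """
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    results = {}
    for cmd in commands:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[" ".join(cmd)] = (proc.returncode == 0)
    ok = all(results.values())
    print(f"[{stamp}] nightly build {'PASSED' if ok else 'FAILED'}: {results}")
    return ok
```

If it fails, you bisect the day's commits by hand; that is usually cheap enough for a research-sized team.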
These last two are related:
- Feel free to just start over. Delete huge amounts of code, and try a different approach.
- If you have gone past the point of no return (you don't want to start over), then start production-izing the code. Again, aim at the problems to start, not some coverage metric. Look over all the code and reduce redundancy. It's a lot easier to review code once it's all there, rather than bit by bit.
- I've seen many cases where researchers refused to share their code because they knew it wasn't up to any reasonable standard. This is a red flag. If they are embarrassed by their code, I tend to discount their alleged results entirely.
- Even in research, people should be required by the organization to follow some kind of process. Use version control (git, or even svn; this basic step is still not universal), put in pull requests, get code reviewed by someone else.
- For that purpose, every research organization should have someone on staff who can do a competent review. They do not need to specialize in the researcher's field. They just need to know a code smell when they see it.
- Every researcher I have known will resist this strenuously. That is a sign of how much they need it.
- When publishing research results, code and data should always be required. Otherwise, the results cannot be judged. (A lot of people like it that way. They should not be accommodated).
I could go on but I'll be nice and stop here.
Your researcher got a result. Great. What is their objective evidence that the result is real rather than an artifact of a bug in their code? If the code is garbage, you can't trust the result, no matter how much of a breakthrough the result would be if true.
That doesn't mean that the code needs to be production-ready. It does mean that the code needs to be clean enough to be trustworthy. (Tests can be included in this evaluation.)
If the code's going to be product-ized... maybe ask the researcher which parts of the code they think are the most troublesome. Start by re-writing those pieces, from scratch, with production levels of rigor. Then, as other parts prove troublesome, rewrite those too. Don't band-aid them, rewrite them. Keep the interfaces, unless the interface itself is part of the problem.
The core environment is still the Jupyter notebook, so it should remain familiar to most data scientists.
Zero to JupyterHub with Kubernetes
Something small but meaningful that I believe in are tools like versioneer (https://github.com/warner/python-versioneer/), which bump the version of your code on every commit.
Then, embed this version string in all output. Figures, serialized data, whatever.
It is very powerful to be able to point at a figure and say "this graph was produced by precisely this code". If you're feeling particularly anal, include the hashes of the datasets that generated it too.
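versioneer generates this machinery for you at install time; for a bare-bones equivalent, you can read the git hash directly and hash the input data yourself. A sketch (the function names and the matplotlib stamping idea are illustrative, not part of versioneer):

```python
import hashlib
import subprocess

def code_version():
    """Short git commit hash of the current checkout, or 'unknown'.

    versioneer automates a richer version of this (tags, dirty flags);
    this is just the minimal idea.
    """
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"

def dataset_hash(path):
    """SHA-256 prefix of an input file, so outputs can name their data."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()[:12]

# Stamp every figure/serialized output with both, e.g. in matplotlib:
# fig.text(0.99, 0.01, f"code {code_version()} / data {dataset_hash(p)}",
#          ha="right", fontsize=6)
```

With that stamp in the corner of every figure, "which code produced this graph?" becomes a lookup instead of an argument.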
Are you a mathematician simulating a dynamical system? A theoretical computer scientist exploring the effects of parameters that are difficult to nail down analytically?
Are you a computer scientist working on a new sort of system? Is the point of that system to support a long-running research agenda, or to demonstrate the feasibility of a general notion/idea?
Or are you a software engineer supporting a natural scientist (e.g., in a large bio/neuro/chem/physics lab)?
Are you the PhD student, the research scientist, the supporting engineer, or the PI?
But in any case, the correct answer will start with interrogating the purpose/role of the software in your research project. And that answer could range from "hack out the MATLAB and sanity check" all the way to "lives are on the line; practice extreme rigor". And certainly not excluding "convince your funding agency/PI that it's time to hire a professional"!
We also made good use of tagging features in our project management toolset to make report writing easier at the end of the project.
- Scientists try their best to be good programmers, but are scientists first.
- Someone the scientist knows, or someone on the team with more programming knowledge, turns what the scientist produced into something maintainable at some point.
- If they're lucky, the grant will have resources for a script/software maintainer.
Scientists are scientists. I know a few who can do things with awk that probably should never be done, but they use the tools they know to get the data to look the way they need.
Other researchers care about the final code which is used to generate the results. So, in my book it's OK if there's a large gap between the code that led to the initial idea and the code that was used to show the idea in practice (i.e. the code used to generate all graphs/tables in a submitted publication).
I think you can have both. Unit tests and good coding practices should make you faster once you have more than one screen's worth of code and would otherwise be relying on human memory to navigate and maintain it.
I'm not a "unit test all the things" kind of person though.
- use version control
- write tests (high level, keep it simple)
- pull in data as if it were a dependency, versioned and stable
- use a CI server/service
- when publishing, code and data goes with the paper