The best papers to read start with the T5 paper, which introduced instruction-style training.
BERT showed that training on two tasks (next-sentence prediction and masked-token filling) was more effective than training on either one alone.
T5 showed that many instructions could be folded into one task (token prediction) — not just translating, but also summarizing — by prefixing the input with the instruction. They suggested this could generalize to new instructions (it did).
GPT-2 showed that with just token prediction and no instructions you could generate good text; GPT-3 showed this was coherent at scale, and also that sufficient context was reliably continued by the model — and shaped by the format of the training data. For example, Stack Overflow-style Q&A appeared with "Q:" and "A:" markers in the training data, so prompts using "Q:" and "A:" worked very well for conversation-mimicking.
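To make the continuation trick concrete, here is a minimal sketch of how such a "Q:/A:" prompt would be assembled (the helper name `build_qa_prompt` is mine, not from any paper): a plain continuation model has no chat training, so you mimic the Q&A layout it saw during pretraining and let it continue the pattern after the final "A:".

```python
# Sketch of the GPT-3-era "Q:/A:" trick: format prior turns the way
# Q&A text appeared in the training data, so a pure continuation
# model naturally answers after the trailing "A:".
def build_qa_prompt(history, question):
    """history: list of (question, answer) pairs; question: new question."""
    lines = []
    for q, a in history:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
    lines.append(f"Q: {question}")
    lines.append("A:")  # the model continues generating from here
    return "\n".join(lines)

prompt = build_qa_prompt(
    [("What is the capital of France?", "Paris.")],
    "What is the capital of Italy?",
)
print(prompt)
```

The few-shot pairs in `history` do double duty: they demonstrate the format and nudge the style of the answer.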
Davinci-instruct essentially made GPT-3's outputs reliable: the fine-tuning "corrected model outputs" so the model followed not just the implicit continuation of the context but explicit plain-English instructions in the user's prompt. This could have been trained to always follow a chat format (e.g. using pronouns and addressing the user as "you"), which feels more natural, but the original instruct models responded to simple commands without chat framing — no "I am sorry", no "I believe the book you are looking for is:" — just the answer.
Nowadays most instruct models do use conversational prompt formats and training datasets anyway (check out the various chat templates in LM Studio), so the distinction has largely disappeared.
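As an illustration of what those conversational formats look like, here is a sketch loosely modeled on the ChatML-style role markers many modern instruct models use — the exact special tokens vary per model, so treat `<|im_start|>`/`<|im_end|>` and the `render_chat` helper as illustrative assumptions, not any particular model's spec:

```python
# Sketch of a conversational prompt template: each message is wrapped
# in role markers, and generation begins after an open assistant turn.
def render_chat(messages):
    """messages: list of (role, content) pairs, e.g. ("user", "...")."""
    parts = []
    for role, content in messages:
        parts.append(f"<|im_start|>{role}\n{content}<|im_end|>")
    parts.append("<|im_start|>assistant\n")  # model generates from here
    return "\n".join(parts)

prompt = render_chat([
    ("system", "You are a helpful assistant."),
    ("user", "Recommend a book about compilers."),
])
print(prompt)
```

Structurally this is the same continuation trick as "Q:/A:" — only the delimiters changed, and the model was fine-tuned to treat them as turn boundaries.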