Yes, you can easily use AutoModel.from_pretrained('bert-base-uncased') to convert some text into a vector of floats. What then?
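To make the "easy" part concrete, here's a minimal sketch of text-to-vector with the Hugging Face `transformers` library (assumes `transformers` and `torch` are installed; mean pooling over non-padding tokens is one common choice, not the only one):

```python
# Sketch: "text in, vector of floats out" with bert-base-uncased.
# Mean-pools the last hidden states over non-padding tokens.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)    # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, 768)

vec = embed("I hate to say it but this product solves all my problems.")
print(vec.shape)  # torch.Size([1, 768])
```

That's the whole "easy" part. Everything below is what's left.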
What are the properties of downstream (aka actually useful) datasets that might make few-shot transfer difficult or easy? How much data do your users need to provide to get a useful classifier/tagger/etc. for their problem domain?
Why do seemingly minor perturbations like typos or appending a few numbers result in major differences in representations, and how do you detect, test for, and mitigate this to ensure model behavior doesn't result in weird downstream system behavior?
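One cheap way to start detecting this is a perturbation test: embed a text and a minimally edited variant, and measure how far the representation moves. The harness below is a sketch; `embed` here is a toy character-trigram hashing embedder standing in for whatever real encoder you use (swap in your model's embedding function), and the `perturbation_gap` helper is a name I'm inventing for illustration:

```python
# Sketch: perturbation test for embedding stability. A robust encoder
# should keep the gap small for "harmless" edits like typos.
import hashlib
import math

DIM = 256

def embed(text: str) -> list[float]:
    # Toy stand-in encoder: hashed character trigram counts.
    vec = [0.0] * DIM
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        h = int(hashlib.md5(padded[i:i + 3].encode()).hexdigest(), 16)
        vec[h % DIM] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def perturbation_gap(text: str, perturbed: str) -> float:
    """How far does a 'minor' edit move the representation? 0 = no change."""
    return 1.0 - cosine(embed(text), embed(perturbed))

base = "please cancel my subscription"
for variant in ["please cancell my subscription",
                "please cancel my subscription 12345"]:
    print(f"{variant!r}: gap={perturbation_gap(base, variant):.3f}")
```

In practice you'd run this over a sampled slice of production traffic with a catalog of realistic perturbations (keyboard typos, casing, trailing IDs) and alert when the gap distribution shifts.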
How do you train a dialog system to map 'I'm good, thanks' to 'no'? How do you get a sentiment classifier to learn from contextual/pragmatic cues rather than purely lexical ones (example: 'I hate to say it, but this product solves all my problems.' - positive or negative sentiment?)
How bad is the user experience of your Arabic-speaking customers compared to that of your English-speaking customers, and what can you do to measure this and fix it?
My linguistics background really helps me think through a lot of these 'applied' NLP problems. Knowing how to make matmuls fast on GPUs and knowing exactly how multihead self-attention works is definitely useful too, but that's only one piece of building systems with NLP components.