I published part one of my free NLP course. The course is intended to help anyone who knows Python and a bit of math go from the very basics all the way to today's mainstream models and frameworks.
I strive to balance theory and practice, so every module consists of detailed explanations and slides, along with a Colab notebook (in most modules) putting the theory into practice.
In part one, we cover text preprocessing, how to turn text into numbers, and multiple ways to classify and search text using "classical" approaches. And along the way, we'll pick up useful bits on how to use tools such as spaCy and scikit-learn.
No registration required: https://www.nlpdemystified.org/
But what saddens me is that too many people try to dive into NLP without first trying to understand language and linguistics. For example, you can run a part-of-speech (POS) tagger in three lines of Python, but you will still not know much about what parts of speech are, which languages have which ones, or what function they serve in linguistic theory or in practical applications.
What are the advantages of using the C7 tagset over the C5 or PENN tagsets?
Why is AT sometimes called DET?
etc.
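To make the "three lines of Python" point concrete: the usual version is spaCy's `nlp = spacy.load("en_core_web_sm")` followed by iterating over `token.pos_`. Since that requires a downloaded model, here is a toy lookup-based sketch instead (the mini-lexicon is made up); it shows the kind of output a tagger produces, and why that output alone teaches you nothing about what DET, NOUN, or VERB actually mean:

```python
# Toy lookup tagger: real taggers (spaCy, NLTK) use statistical models,
# but the interface -- tokens in, (token, tag) pairs out -- looks the same.
LEXICON = {
    "the": "DET",    # determiner (roughly the 'AT' of older tagsets like C5/C7)
    "cat": "NOUN",
    "sat": "VERB",
    "on": "ADP",     # adposition (preposition)
    "mat": "NOUN",
}

def tag(sentence):
    # Unknown words fall back to 'X' (the Universal POS tag for "other").
    return [(w, LEXICON.get(w.lower(), "X")) for w in sentence.split()]

print(tag("The cat sat on the mat"))
# [('The', 'DET'), ('cat', 'NOUN'), ('sat', 'VERB'), ('on', 'ADP'), ('the', 'DET'), ('mat', 'NOUN')]
```

Running it is trivial; knowing why "sat" is VERB rather than, say, a participle subcategory, or why some tagsets split DET into finer classes, is the linguistics part.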
I recommend people spend a bit of time reading an(y) introduction-to-linguistics textbook before diving into NLP; the second investment will then be worth so much more.
I don't think you necessarily need a linguistics background for NLP, but I think you need either a strong linguistics OR ML background so that you know what's going on under the hood and can make connections. Anyone can call into Hugging Face; you don't need a course for that.
(also an NLP researcher. Knows nothing about linguistics)
Transformers and scaling laws have made it such that the only thing that truly matters is your ability to build a model that can scale computationally and parametrically. The second is figuring out how to make more data viable for use within such a hungry model's encoding.
Look at the authors of the last 20 seminal papers in NLP: almost none of them have a strong background in linguistics. Vision went through a similar period of forced obsolescence during the 2012-2016 AlexNet -> VGG -> Inception -> ResNet transition.
It is unfortunate. But, time is limited and most researchers can only spare enough time to learn a few new things. Unfortunately for linguistics, it does not rank that high.
You are right that a lack of fundamental knowledge is problematic, especially since tools allow you to produce a greater quantity of solutions and therefore also a greater quantity of mistakes.
However, at least the problem is still being solved.
For example, a few months ago I wanted to organize my media collection by tagging files with artist names. I had a list of artist names, but it wasn't comprehensive, so I wired together a bunch of Python NLP libraries to automatically pull proper nouns out of filenames, recognize English names, and then annotate the files.
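The commenter's exact pipeline isn't shown; a simplified standard-library sketch of the idea (pull capitalized token runs out of a filename as proper-noun candidates, then match them against a hypothetical, incomplete artist list) might look like:

```python
import re

# Hypothetical, incomplete artist list; the real list would be much larger.
KNOWN_ARTISTS = {"Miles Davis", "Nina Simone"}

def candidate_names(filename):
    """Extract runs of capitalized words (proper-noun candidates) from a filename."""
    stem = re.sub(r"\.\w+$", "", filename)   # drop the file extension
    text = re.sub(r"[_\-]+", " ", stem)      # normalize separators to spaces
    return re.findall(r"(?:[A-Z][a-z]+)(?:\s[A-Z][a-z]+)*", text)

def tag_artist(filename):
    """Return the first known artist found inside a candidate run, else None."""
    for run in candidate_names(filename):
        for artist in KNOWN_ARTISTS:
            if artist in run:
                return artist
    return None

print(tag_artist("Miles_Davis-So_What.mp3"))  # Miles Davis
print(tag_artist("unknown_track_01.mp3"))     # None
```

A real NLP pipeline would replace the capitalization regex with a proper named-entity recognizer, which is exactly where the ~10% error rate mentioned below comes from: heuristics like these break on edge cases.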
I know almost nothing about parts of speech or anything else, so I made mistakes. About 10% of the results were errors in the first run, but after tuning it was down to about 1% which was good enough to run over the entire media library.
If not for the tools, I would have never been able to finish that chore in a single day. To me, it was worth it despite my amateur mistakes.
I view the library just like any other tool: a screwdriver, a hammer, a wrench. I'm not a plumber, a carpenter, or an NLP researcher, but I still want to use tools to fix my leaky faucets, remount my leaning cabinet doors, and organize my media collection as weekend projects.
Language models are currently the best solution for many problems, but it's hard to predict how we will move forward from here. Maybe the inclusion of linguistic information, or linguistics-inspired knowledge, or whatever, will be the key to getting better results or saving training time/resources. With no linguistics background, I imagine it's hard to get ideas going in that direction (and to test whether it's actually a good direction).
Yes, you can easily use AutoModel.from_pretrained('bert-base-uncased') to convert some text into a vector of floats. What then?
What are the properties of downstream (aka actually useful) datasets that might make few-shot transfer difficult or easy? How much data do your users need to provide to get a useful classifier/tagger/etc. for their problem domain?
Why do seemingly minor perturbations like typos or concatenating a few numbers result in major differences in representations, and how do you detect/test/mitigate this to ensure model behavior doesn't result in weird downstream system behavior?
How do you train a dialog system to map 'I'm good, thanks' to 'no'? How do you train a sentiment classifier to learn from contextual/pragmatic cues rather than purely lexical ones (example: 'I hate to say it, but this product solves all my problems.' - positive or negative sentiment?)
How bad is the user experience of your Arabic-speaking customers compared to that of your English-speaking customers, and what can you do to measure this and fix it?
My linguistics background really helps me think through a lot of these 'applied' NLP problems. Knowing how to make matmuls fast on GPUs and knowing exactly how multihead self-attention works is definitely useful too, but that's only one piece of building systems with NLP components.
Linguistics is a broad area of study. Can you be more specific, such as grammar and syntax?
"Every time I fire a linguist, the performance of the speech recognizer goes up."

- Frederick Jelinek
Do you have a favorite you can recommend?
Jurafsky and Martin, and Manning and Schütze, are great books for computer scientists, but they do not teach about language itself.
Bender's book is NOT an end-to-end text though, imo. It's more of a central jumping-off point: you read about a concept and, if it sounds interesting, search for more about it.
Which are the toughest NLP problems you know of that aren't being solved satisfactorily?
Think extractive QA, but where the answer size is configurable, the answer can potentially be multiple spans, and the spans may not need to be contiguous.
If you've got a solution, I'd love to see it - you could even beat the baselines for the only dataset that exists for it: https://paperswithcode.com/sota/extractive-document-summariz...