I published part one of my free NLP course. The course is intended to help anyone who knows Python and a bit of math go from the very basics all the way to today's mainstream models and frameworks.
I strive to balance theory and practice, so every module consists of detailed explanations and slides, along with a Colab notebook (in most modules) putting the theory into practice.
In part one, we cover text preprocessing, how to turn text into numbers, and multiple ways to classify and search text using "classical" approaches. And along the way, we'll pick up useful bits on how to use tools such as spaCy and scikit-learn.
No registration required: https://www.nlpdemystified.org/
But what saddens me is that too many people try to dive into NLP without first trying to understand language and linguistics. For example, you can run a part-of-speech (POS) tagger in three lines of Python, but you will still not know much about what parts of speech are, which languages have which ones, or what function they serve in linguistic theory or in practical applications.
What are the advantages of using the C7 tagset over the C5 or PENN tagsets?
Why is AT sometimes called DET?
etc.
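To make the "three lines of Python" point concrete: the usual version is spaCy's `nlp = spacy.load("en_core_web_sm")` followed by iterating over `token.pos_`. Since that requires a downloaded model, here is a toy lookup-based sketch instead (the mini-lexicon is made up); it shows the kind of output a tagger produces, and why that output alone teaches you nothing about what DET, NOUN, or VERB actually mean:

```python
# Toy lookup tagger: real taggers (spaCy, NLTK) use statistical models,
# but the interface -- tokens in, (token, tag) pairs out -- looks the same.
LEXICON = {
    "the": "DET",    # determiner (roughly the 'AT' of older tagsets like C5/C7)
    "cat": "NOUN",
    "sat": "VERB",
    "on": "ADP",     # adposition (preposition)
    "mat": "NOUN",
}

def tag(sentence):
    # Unknown words fall back to 'X' (the Universal POS tag for "other").
    return [(w, LEXICON.get(w.lower(), "X")) for w in sentence.split()]

print(tag("The cat sat on the mat"))
# [('The', 'DET'), ('cat', 'NOUN'), ('sat', 'VERB'), ('on', 'ADP'), ('the', 'DET'), ('mat', 'NOUN')]
```

Running it is trivial; knowing why "sat" is VERB rather than, say, a participle subcategory, or why some tagsets split DET into finer classes, is the linguistics part.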
I recommend people spend a bit of time reading an(y) introduction-to-linguistics textbook before diving into NLP; the second investment will then be worth so much more.
I don't think you necessarily need a linguistics background for NLP, but I think you need either a strong linguistics OR ML background so that you know what's going on under the hood and can make connections. Anyone can call into Hugging Face; you don't need a course for that.
(also an NLP researcher. Knows nothing about linguistics)
Transformers and scaling laws have made it such that the only thing that truly matters is your ability to build a model that can scale computationally and parametrically. The second is figuring out how to make more data viable for use within such a hungry model's encoding.
Look at the authors of the last 20 seminal papers in NLP: almost none of them have a strong background in linguistics. Vision went through a similar period of forced obsolescence during the 2012-2016 AlexNet -> VGG -> Inception -> ResNet transition.
It is unfortunate. But, time is limited and most researchers can only spare enough time to learn a few new things. Unfortunately for linguistics, it does not rank that high.
You are right that a lack of fundamental knowledge is problematic, especially since tools allow you to produce a greater quantity of solutions and therefore also a greater quantity of mistakes.
However, at least the problem is still being solved.
For example, a few months ago I wanted to organize my media collection by tagging files with artist names. I had a list of artist names, but it wasn't comprehensive, so I wired together a bunch of Python NLP libraries to automatically pull proper nouns out of filenames, recognize English names, and then annotate the files.
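The commenter's exact pipeline isn't shown; a simplified standard-library sketch of the idea (pull capitalized token runs out of a filename as proper-noun candidates, then match them against a hypothetical, incomplete artist list) might look like:

```python
import re

# Hypothetical, incomplete artist list; the real list would be much larger.
KNOWN_ARTISTS = {"Miles Davis", "Nina Simone"}

def candidate_names(filename):
    """Extract runs of capitalized words (proper-noun candidates) from a filename."""
    stem = re.sub(r"\.\w+$", "", filename)   # drop the file extension
    text = re.sub(r"[_\-]+", " ", stem)      # normalize separators to spaces
    return re.findall(r"(?:[A-Z][a-z]+)(?:\s[A-Z][a-z]+)*", text)

def tag_artist(filename):
    """Return the first known artist found inside a candidate run, else None."""
    for run in candidate_names(filename):
        for artist in KNOWN_ARTISTS:
            if artist in run:
                return artist
    return None

print(tag_artist("Miles_Davis-So_What.mp3"))  # Miles Davis
print(tag_artist("unknown_track_01.mp3"))     # None
```

A real NLP pipeline would replace the capitalization regex with a proper named-entity recognizer, which is exactly where the ~10% error rate mentioned below comes from: heuristics like these break on edge cases.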
I know almost nothing about parts of speech or anything else, so I made mistakes. About 10% of the results were errors in the first run, but after tuning it was down to about 1% which was good enough to run over the entire media library.
If not for the tools, I would have never been able to finish that chore in a single day. To me, it was worth it despite my amateur mistakes.
I view the library just like any other tool: a screwdriver, a hammer, a wrench. I'm not a plumber, a carpenter, or an NLP researcher, but I still want to use tools to fix my leaky faucets, remount my leaning cabinet doors, and organize my media collection as weekend projects.
Language models are currently the best solution for many problems, but it's hard to predict how we will move forward from here. Maybe the inclusion of linguistic information, or linguistics-inspired knowledge, or whatever, will be the key to getting better results or saving training time/resources. With no linguistics background, I imagine it's hard to get ideas going in that direction (and to test whether it's actually a good direction).
Yes, you can easily use AutoModel.from_pretrained('bert-base-uncased') to convert some text into a vector of floats. What then?
What are the properties of downstream (aka actually useful) datasets that might make few-shot transfer difficult or easy? How much data do your users need to provide to get a useful classifier/tagger/etc. for their problem domain?
Why do seemingly minor perturbations like typos or concatenating a few numbers result in major differences in representations, and how do you detect/test/mitigate this to ensure model behavior doesn't result in weird downstream system behavior?
How do you train a dialog system to map 'I'm good, thanks' to 'no'? How do you train a sentiment classifier to learn from contextual/pragmatic cues rather than purely lexical ones (example: 'I hate to say it, but this product solves all my problems.' - positive or negative sentiment?)
How bad is the user experience of your Arabic-speaking customers compared to that of your English-speaking customers, and what can you do to measure this and fix it?
My linguistics background really helps me think through a lot of these 'applied' NLP problems. Knowing how to make matmuls fast on GPUs and knowing exactly how multihead self-attention works is definitely useful too, but that's only one piece of building systems with NLP components.
Linguistics is a broad area of study. Can you be more specific, such as grammar and syntax?
"Every time I fire a linguist, the performance of the speech recognizer goes up."

- Frederick Jelinek
Do you have a favorite you can recommend?
Jurafsky and Martin, and Manning and Schütze, are great books for computer scientists, but they do not teach about language itself.
Bender's book is NOT an end-to-end text though, imo. It's more of a central jumping-off point: you read about a concept and, if it sounds interesting, search for more about it.
Which are the toughest NLP problems you know of that aren't being solved satisfactorily?
Think extractive QA, but where the answer size is configurable, the answer can potentially be multiple spans, and the spans may not need to be contiguous.
If you've got a solution, I'd love to see it - you could even beat the baselines for the only dataset that exists for it: https://paperswithcode.com/sota/extractive-document-summariz...