But also ... what do you need to know to recognize that the concept of a 'toxicity classifier' is likely broken? We can do _profanity_ detection pretty well, and without a huge amount of data. But with 1000 example comments, can you actually get at 'toxicity'? Can you judge toxicity purely from a comment in isolation, or does it need to be considered in the context in which that comment is made?
Maybe you don't need to know Python, but if you're building this, you should probably have spent some time thinking about and grappling with ML problems in context, right? You'd want to know, for example, that the pipeline Copilot is suggesting (word counts, TF-IDF, naive Bayes) doesn't understand word order. Or to wonder whether it's tokenizing on just whitespace, and whether `'eat sh!t'` will fail to get flagged because `'shit'` and `'sh!t'` are literally orthogonal to the model?
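To make the orthogonality point concrete, here's a toy sketch (the `bag_of_words` helper is hypothetical, not anything Copilot suggested) of how a whitespace-tokenized word-count model sees the two spellings:

```python
def bag_of_words(text, vocab):
    # Naive whitespace tokenization, as a simple word-count pipeline would do
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocab]

vocab = ["eat", "shit", "sh!t"]

# The two spellings of the insult land in completely different dimensions:
print(bag_of_words("eat shit", vocab))  # [1, 1, 0]
print(bag_of_words("eat sh!t", vocab))  # [1, 0, 1]
```

On the insult feature itself the two vectors share nothing, so a model trained only on comments containing `'shit'` learns nothing at all about `'sh!t'`.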
More people should be able to create digital stuff that _does_ things, and maybe copilot is a tool to help us move in that direction. Great! But writing a bad "toxicity classifier" by not really engaging with the problem or thinking about how the solution works and where it fails seems potentially net harmful. More people should be able to make physical stuff too, but 3d-printed high-capacity magazines don't really get most of us where we want to go.
See, this tells me you may not have even used Copilot. While tutorials such as this one (and the OpenAI Codex tools) have you write comments explicitly to generate code, the reality is that you're not hammering out plain-English requirements for Copilot to act on. You just code, and sometimes it finishes your thought, sometimes it doesn't. You hit Tab to accept the suggestion, just like you would for any other autocomplete. So you are generally reading and evaluating what Copilot thinks is a good completion and choosing, with the Tab key, whether it goes into the program.
The leading question is this:
>But as helpful as it is for coders, what if it enabled non-engineers to program too – by merely talking to an AI about their goals?
and it answers this, in my opinion deceptively, by presenting what amounts to a parlor trick. Whether Copilot in general is any good is, in my mind, totally separate from this.
Just because someone uses one of those ways doesn't mean they're unaware of the other.
This part from the article made me chuckle, because IMO the author fell for some of the most basic language processing smoke & mirrors:
> …so we’ll give it some examples. When generating the array, it even creates the ideal variable name and escapes the quotations.
Here, it generates `toxic_comments` as a variable name, when the instructions were: `# create an array with the following toxic comments: [etc]`
This is pretty basic language parsing that may have been kicking around for a while. I think even a rudimentary English parser, given an understanding of what valid Python should look like, could output something along the lines of what was suggested. While impressive, it's not nearly as interesting or good as the rest of the work being done.

Copilot appears no different from most ML models out there. Poor and incomplete training data will yield OK results for popular things, but as soon as you ask for edge cases it will fall apart, like Siri trying to understand a Scottish accent.
Eventually it might get there with enough good representative training data but it's unclear to me how long that will take. If it tracks with speech processing models it might take decades plus.
Another consideration: because the training is done on public GitHub repos (at least the last I read), it's likely ripe for abuse. If that's still how they're doing it, I'm looking forward to the TED Talk in two years from a researcher who "hacked" Copilot by polluting its training data.
OK, I am waiting for you to propose a basic language parser that can do it. There's a reason we're only now having this debate: it was inconceivable five years ago, in the era of basic language parsers.
One newspaper was left-leaning, another had a reputation for right-wing trolls in its comments, and the third sat somewhere in the middle, with an audience reputed to be pseudo-intellectual neoliberals.
The 'center' (most typical) comment for each of these three sites was totally in line with those reputations. Perfect proof (or confirmation bias).
But the classification didn't work. While there were clear-cut cases (one has to love stereotypes), most comments were just neutral, meaning they could have been posted on any of the three sites. Either they were too short or just not extreme enough.
I feel (used deliberately here) that toxicity is not easily classifiable without a deeper understanding of context. Otherwise, if "feeling that a comment is toxic" were the measure, one would need to survey all walks of life, from the extreme left to the extreme right, and would probably end up with a lot of "toxicity" that tells us little, except that different people find different things toxic.
From simple letter substitution (sh!t) to completely different words/concepts ("unalive") to "layer 2 sarcasm", where someone adopts the persona of a supporter of the world view they oppose, in a non-obvious attempt to rally people against that persona.
People have been getting away with being toxic in public for a long time. ML cannot keep up. Humans can’t even keep up.
Similarly, a lot of the training data/features ML engineers use ignore context -- for example, a Reddit comment may seem hateful in isolation, until you realize the subreddit it's in changes the meaning entirely (https://www.surgehq.ai/blog/why-context-aware-datasets-are-c...).
Regarding your point, we actually do a lot of "adversarial labeling" to try to make ML models robust to countermeasures (e.g., making sure that the ML models train on word letter substitutions), but it's pretty tricky!
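One simple form of that augmentation (a sketch, not Surge's actual tooling; the substitution table here is made up) is expanding each labeled example into common character-substitution variants before training, so the model sees `sh!t` at train time:

```python
import itertools

# Hypothetical substitution table mapping characters to their common disguises
SUBS = {"i": ["i", "!", "1"], "s": ["s", "$"], "a": ["a", "@"]}

def substitution_variants(word, limit=20):
    # Build the cross-product of possible spellings for each character,
    # capped at `limit` to keep the expansion bounded
    choices = [SUBS.get(ch, [ch]) for ch in word.lower()]
    variants = ("".join(combo) for combo in itertools.product(*choices))
    return list(itertools.islice(variants, limit))

print(substitution_variants("shit"))
# ['shit', 'sh!t', 'sh1t', '$hit', '$h!t', '$h1t']
```

Each variant inherits the original label, which is exactly the kind of countermeasure-hardening the parent comment is describing.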
I'm now imagining a very frustrated junior developer a few years from now trying to argue with Copilot to write code for a classifier for chemical compounds, but it just spits out code for classifying text.
That's ages ago in AI time.
So just the future version of a junior developer not knowing how to use their tools? Yeah, that scans. Still sounds incredibly useful however. The alternative is, of course, a junior developer fumbling as they try to write said program entirely from their learned skills and experience.
There’s not always a quick fix or easy path. You can’t always patch existing stuff together or just wait until the problem goes away.
And when a tool helps you too much, then is there really a point in what you’re doing? It’s not even a learning experience anymore.
You still need to be able to code and understand what you're doing. You can't just ask simple questions and get complex answers. You still have to be capable of asking complex questions.
A common scenario I can think of is where I struggle to remember the name or API of the exact thing I want, but I know exactly how it works. Typing that in and getting a result would improve my workflow, but it's just saving a trip to Google; we're not talking about the difference between doing and not doing, just saving a minute.
I would rate the value of this as interesting rather than useful, simply because, as another commenter highlighted, it's often just easier to write the code. It could be useful incrementally, but not for everything.
Copilot adds tremendous value for someone who knows what they want, but not how to do it.
For example, I'm not a great programmer. I'm also a lazy programmer. I had to convert a time to a specific format, in a specific timezone in JS, and I couldn't be bothered looking up documentation for Date.toLocaleTimeString (or is that Date.toLocaleString?).
I wrote a comment outlining exactly what I wanted: `// given a date in ISO format (and UTC timezone), return the time in hh:mm AM/PM format (and x timezone)` and Copilot immediately generated the code I was after.
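For comparison, here is the same task sketched in Python with the stdlib `zoneinfo` module (the timezone and timestamp are made-up stand-ins for the elided "x timezone"; this is not the JS code Copilot produced):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def utc_iso_to_local_time(iso_string, tz_name):
    # Parse an ISO-format UTC timestamp and render it as hh:mm AM/PM
    # in the requested IANA timezone
    utc_dt = datetime.fromisoformat(iso_string)
    local_dt = utc_dt.astimezone(ZoneInfo(tz_name))
    return local_dt.strftime("%I:%M %p")

print(utc_iso_to_local_time("2022-06-30T18:30:00+00:00", "America/New_York"))
# → "02:30 PM" (New York is UTC-4 in June)
```

The point stands either way: the task is easy to state in a comment and fiddly to recall API-by-API.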
Making something easier can definitely mean the difference between doing and not doing — I've taken on a lot of projects I wouldn't have attempted without Copilot.
How do you know it was what you were after? Like you said, it could be .toLocaleTimeString or .toLocaleString (or something else).
How do you verify that the AI isn't giving you broken/incorrect code? I guess you could check the docs, or run the code yourself, but at that point what's the value add for copilot?
"Forget all that. Judged against where AI was 20-25 years ago, when I was a student, a dog is now holding meaningful conversations in English. And people are complaining that the dog isn’t a very eloquent orator, that it often makes grammatical errors and has to start again, that it took heroic effort to train it, and that it’s unclear how much the dog really understands."
Technology that can only make you go ooh and ahh is pretty useless.
And I know it sounds silly, like "I had an idea like that once" (see Office Space), but I actually came up with the idea for Copilot, or at least a similar one, in an offhand comment to a coworker back in 2014 or so. The idea was that as you wrote code, it would display off to the side similar code that had been written by others doing the same or a similar thing, and it would let you automatically upload small processing functions to some sort of cloud library. Same thing for autoformatting, although that's less of a concern now that formatters are becoming popular. The context I was working in was visual languages, though. I had even started writing a tool during an "innovation week" (that I never showed anyone) that would visually classify whether code written in the visual language was "good" or "clean". I never got anywhere with it and mainly just have some diagrams from that project, which were buggy enough that they kind of look like art.
And as a bonus related to the article title, it literally lets you talk to your editor (i.e. you can press the keyboard shortcut and then give edit commands by voice[2]). I've been leaning on it heavily for the last few days and the setup feels really productive!
If you want to try it out you can install it here: https://marketplace.visualstudio.com/items?itemName=clippy-a...
You can also find the full source code here: https://github.com/corbt/clippy-ai/tree/main/vs-code-extensi...
I'd love feedback!
[1]: https://openai.com/blog/gpt-3-edit-insert/
[2]: I just wrote the voice command interface yesterday and it's still highly experimental. Relies on having ffmpeg installed on MacOS and doesn't work with all audio setups yet. But there's a clear path to making it more robust.
Copilot lets you do that in a way that is way beyond what a normal programming language would let you do, which of course has its own, very rigid, abstractions.
For some parts of the code you'll want to dive in and write every single line in painstaking detail. For others, `# give me the industry standard analysis of this dataset` is maybe enough for your purposes. Having that ability, even if you think of it as just another programming language in itself, is huge.
p.s. Does anyone know when Copilot will update the insecure example on their website? Or are they just trying to be honest with the possible quality issues with the generated code?
With this, I don't need to memorize the syntax OR be bottlenecked on looking at documentation or stack overflowing the commands I need.
In other words: you're celebrating the fact that a tool allows you to become more and more incompetent.
I don't have much hope for future generations at this point.
Aren't you at least a bit curious what new possibilities this technology could enable? What new discoveries could e.g. an expert doctor or a biologist achieve given access to programming tools without spending decades learning programming?
More importantly, only about 10% of my time on the job is hands-on technical work, and probably 1% of the total is spent in notebooks with dataframes. Whether I am competent is in no way determined by whether I've memorized the syntax for grouping and counting a dataframe. In fact, I'd argue memorizing it would be a poor use of time.
Whether memorizing things like syntax is part of competence or not is highly dependent on context. The ROI of me memorizing that specific syntax would probably be highly negative.
I'd fathom there are countless examples like that. There are people who only rarely need to code. There are people who code a lot but only rarely need to use a certain library or language. For people like that, making the code more accessible is a huge win (that includes IDEs, auto-complete or easy links to documentation, and things like Copilot).
I guess Stack Overflow has a similar problem, but at least there people provide documentation, explanations, and helpful links. This just force-feeds you some code. I don't see this as a positive development for our industry as a whole.
Can’t wait for this to be true! I will be treated as a demigod compared to them. Job security for life!
After all, "How to parse a CSV file in Python" is longer than `csv.reader(file)`, but without knowing that `csv.reader` exists, you have no other way but to tell Google what you need.
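And the reason `csv.reader` is worth knowing about, rather than splitting on commas yourself, is that it handles quoting correctly; a minimal example:

```python
import csv
import io

# csv.reader handles quoted fields with embedded commas,
# which a naive line.split(",") would botch
data = io.StringIO('name,quote\nAda,"Hello, world"\n')
rows = list(csv.reader(data))
print(rows)  # [['name', 'quote'], ['Ada', 'Hello, world']]
```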
This is how I already think of co-pilot, but these steroids seem to be mostly for prototyping.
SO often have comments and context such as "this works with 98% of browsers", "this isn't recommended, try X instead", "this works but can break library code because it changes the global scope", "this stopped working in version X" etc etc. Context like this can be important to take into account depending on what you're building.
It's a really, really cool tool, and a lot of these comments are just shallow dismissals from people who haven't actually used it and like to be reactionary on the internet, because that's the world we live in, apparently. But I think it works best when used by people with experience.
Hopefully future models with higher accuracy and research in grounding can get us to that point however.
2. I’m very skeptical of a small group of people reading a bunch of online comments and deciding what is “toxic” and “non-toxic”, even more so when it’s done with no clear definitions/guidelines. As their GitHub repo [0] says:
> Rather than operating under a strict definition of toxicity, we asked our team to identify comments that they personally found toxic.
That said, this isn't the robot that replaces us, obviously. Making the process of getting to 80% faster is better for everyone, but the last 20% is tough, and anything beyond it needs real expertise. I like how promising this is for the masses.
For instance, Microsoft Power Automate should rank highly.
I think this technology will really shake up how we code.
It is also highly symbolic that the first AI (Copilot) was created to save humans from repetitive toil, while the second (the classifier) is about controlling and limiting us.
I believe the author chose to apply his method to this particular example intentionally for the two above points, not because of the hype of toxicity.
`// generate 1e13 different versions of bubble sort and add to db`

"In the name of what's Good & Right, you have to behave how we want you to ... or else."
Who is defining toxic speech? Where is that data being taken from?
This is the definition of using AI to set what the edges of “speech” should be based on potentially flawed data.
This is a clown world.
(Emphasis mine)
Following that link:
> Surge AI is a data labeling platform and workforce. Our labeling team pored over tens of thousands of social media comments to build this toxicity dataset. Each comment was then evaluated by multiple members of our team to determine its severity level.
My problem is with the dataset and datasets like this overall that sets the tone through AI of what is acceptable and what is not.
The negativity here just seems like sour grapes or weird goal posts.
Sure, it makes mistakes and needs verification. But know what also makes mistakes and needs verification? All the code I already manually write as I tediously ratchet towards a solution. Removing some cycles from that process is a win.
Just stubbing out close-enough boilerplate is a win by itself, like setting up an NLP pipeline or figuring out which menagerie of classes need to be instantiated and hooked up together to do basic things in some verbose libs/langs.
Can you give an example for this?
Indeed. Every negative comment I have seen here has been a shallow dismissal by someone who clearly hasn't engaged with the tool. I'm not sure why people here are so primed to shit all over anything potentially innovative, seemingly even without background knowledge. Like, is there something inherently offensive to coders about a model that threatens to do their job? Or is it just years and years of people getting burned by previous "AI" projects without knowing that this one is actually rather impressive and comes from good research?
Keep shallow dismissals to yourselves people. It's in the site's rules.
Maybe jealousy - people often downplay others' achievements to make theirs feel better. Or pride - "I don't need no stinking AI assistant! What are you saying? I couldn't write this myself?". I find the latter is a common reaction to static types too.
Which is maybe the point! As the article points out, remembering the correct incantation to get matplotlib to spit out a bar chart is hard[1]; I certainly have to look it up literally every time (well, these days, I just use tools which have more intuitive APIs, but that's maybe beside the point). I don't really know what it means to "binarize" a dataset, but apparently the language model did, and apparently seeing the giant stack trace when trying to plot a precision-recall curve was enough to prompt the article writer to realize such an operation might be useful. When you're doing exploratory analysis like this, keeping a train of thought going is extremely important, so avoiding paging back and forth to the scikit-learn documentation is obviously a huge win.
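For what it's worth, "binarizing" labels here just means turning multi-class labels into one-vs-rest indicator columns, which precision-recall curve utilities typically expect. A pure-Python sketch (standing in for scikit-learn's `label_binarize`, with made-up example labels):

```python
def binarize(labels, classes):
    # One column per class; 1 where the label matches that class, else 0
    return [[1 if label == cls else 0 for cls in classes] for label in labels]

print(binarize(["toxic", "ok", "toxic"], classes=["ok", "toxic"]))
# [[0, 1], [1, 0], [0, 1]]
```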
But, on the other hand, this isn't a "no-code" solution in any real sense, because for all intents and purposes the author really did all the difficult parts which would've been necessary for a "fully coded" solution: they knew the technical outcome they wanted and had very good domain knowledge to guide the solution, and, shoot, they still ended up needing to understand semantics of the programming language and abstractions they were working with in that stacktrace at the end. It's still extremely neat (and, presumably, useful) to see the computer was able to correctly guess at all the syntax and API interfaces for the most part[2], but I don't really think you can fault people for wanting to push back against the idea that this is somehow fundamentally transformative, since I think it's pretty obvious that the human is (still) doing the hard and interesting parts and the computer is (still) doing the tedious and boring parts. Maybe people shouldn't be getting flustered about a click-baity title over-promising a hip new technology, but as you say:
> Or is it just years and years of people getting burned by previous "AI" projects without knowing that this one is actually rather impressive and comes from good research?
There's definitely some of this.
---
[0] I wish I could find the link for this, but I'm very bad at google these days.
[1] To risk ascribing agency to a statistical model of github commits, it is sort of funny that the co-pilot pulled in seaborn as a dependency but then did everything directly with calls to plt and DataFrame.plot.
[2] I don't really have the expertise myself to tell you whether that scikit pipeline is at all reasonable, I suppose. It sure sounds fancy, though.
The problem is when it makes something that looks OK but does the opposite of what you want it to. See: machine translation
You can't classify a comment as boolean toxic; toxicity does not exist in a vacuum. To extend the analogy from its biological counterpart, toxicity depends on the organism. You should never judge a piece of text in isolation and draw conclusions from it. It must be understood in context: that of the subject, the recipient, and the sender.
Does that make sense?
However, the framing of the tutorial is clearly about using automated censorship at scale.
Someone is going to roughly copy-paste this into some forum software and call it a day.
This is some dystopian shit right here. I don't care what fancy models you train on it, or even what funny jokes you make of it. I'm just so done with this.