Thoughts from James who recently held a Gen AI literacy workshop for older teenagers.
On risks:
One idea I had was to ask a generative model a question and fact check points in front of students, allowing them to see fact checking as part of the process. Upfront, it must be clear that while AI-generated text may be convincing, it may not be accurate.
On usage:
Generative text should not be positioned as, or used as, a tool to entirely replace tasks; that could disempower. Rather, it should be taught to be used as a creativity aid. Such a class should involve an exercise of making something.
I am a software engineer. I have literally forked TensorFlow and modified the executor, and I have created neural networks for predicting aquaculture KPIs that have been deployed with great success.
I stopped looking for a year, and now I feel AI illiterate. (insert “too afraid to ask” meme)
My experience suggests it’s too early to start teaching people. Let the technology do its loop and settle down.
That’s an excellent point, thanks for sharing your insight!
What we (people in general, that use the internet, regardless of government/country) need, in large part, is literacy. Not “gen AI literacy” or “media literacy”, but simply “literacy”.
I’m saying this because of what a lot of the output of those text generators looks like: it says a lot without conveying much, it connects completely unrelated concepts because they happen to use similar words, it makes self-contradictory claims, and so on. And often its statements are completely unrelated to the context at hand. People with good literacy detect those things right off the bat, but people who struggle with basic reading comprehension don’t.
The thing that strikes me about LLMs is that they have been created to chat. To converse. They’re partly influenced by Turing tests where the objective is to convince someone you’re human by keeping up a conversation. They weren’t designed to create meaningful content or factual content.
People still seem to want to use ChatGPT to create something, and fix the accuracy as a second step. I say go back to the drawing board and create a tool that analyses statements and tries to create information based on trusted linked open data sources.
Discuss :)
I also think that they should go back to the drawing board, to add another abstraction layer: conceptualisation.
LLMs simply split words into tokens (similar-ish to morphemes) and, based on the tokens found in the input and preceding answer tokens, they throw a die to pick the next token.
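As a rough sketch of that die throw: the candidate tokens and their scores below are made-up numbers for illustration, not taken from any real model, and `softmax` and `sample_next_token` are just names I picked. A real LLM computes the scores with a neural network over tens of thousands of tokens, but the last step looks roughly like this.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(tokens, scores, rng):
    """Throw the weighted die: pick one token according to its probability."""
    probs = softmax(scores)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical candidates and scores for the context "The capital of France is"
tokens = [" Paris", " Lyon", " a", " green"]
scores = [6.0, 2.0, 1.5, -1.0]

rng = random.Random(0)  # fixed seed so the run is repeatable
print([sample_next_token(tokens, scores, rng) for _ in range(5)])
```

The most likely token wins most of the time, but nothing stops the die from occasionally landing on an unlikely one.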
This sort of “automatic morpheme chaining” does happen in human Language¹, but it’s fairly minor. More than that: we associate individual and sets of morphemes with abstract concepts². Then we handle those concepts in contrast with our world knowledge³, give them some truth value, moral assessment etc., and then we recode them back into words. LLMs do not do anything remotely similar.
Let me give you an example. Consider the following sentence:
The king of Italy is completely bald because his hair is currently naturally green.
A human being can easily see a thousand issues with this sentence. But more importantly, we do it based on the following:
- world knowledge: Italy is a republic, thus it has no king.
- world knowledge: humans usually don’t have naturally green hair.
- logic applied to the concepts: complete baldness implies absence of hair. Currently naturally green hair implies presence of hair. One cannot have absence and presence of hair at the same time.
- world knowledge and logic: to the best that we know, the colour of someone’s hair has zero to do with baldness.
In all those cases we need to refer to the concepts behind the words, not just the words.
I do believe that a good text generator could model some conceptualisation. And even world knowledge. If such a generator was created, it would easily surpass LLMs even with considerably lower linguistic input.
Notes:
- By “Language” with capital L, I mean the human faculty, not stuff like Mandarin or English or Spanish etc.
- Structuralism would call those concepts “signified”, and the morphemes conveying it “signifier”, if you want to look for further info. Saussure should be rather useful for that.
- “World knowledge” refers to the set of concepts that we have internalised, that refer to how we believe that the world works.
Did you try this with an LLM? Because GPT-4 analyzes it exactly the same way you did and then some:
List everything wrong in the sentence “The king of Italy is completely bald because his hair is currently naturally green.”
The sentence “The king of Italy is completely bald because his hair is currently naturally green” contains several issues:
- Factual Inaccuracy: As of my last update in September 2021, Italy is a republic and does not have a king.
- Logical Contradiction: The sentence states that the king is “completely bald” but then goes on to describe the color of his hair. If he is bald, he wouldn’t have hair to describe.
- Causality Issue: The sentence implies that the king is bald because his hair is green, which doesn’t make logical sense. The color of one’s hair doesn’t cause baldness.
- Ambiguity in “Naturally”: The word “naturally” could be interpreted in multiple ways. It could mean that the green color is natural to him (not dyed), or it could mean that the hair turned green on its own. The ambiguity could lead to confusion.
- Tense Mismatch: The sentence uses “is completely bald” (present tense) and “is currently naturally green” (also present tense) for the hair, which is contradictory.
- Redundancy: The word “currently” may be considered redundant if the sentence is understood to be describing a present condition.
- Clarity: The sentence overall is confusing and lacks clarity due to the contradictions and ambiguities mentioned.
In summary, the sentence has issues ranging from factual inaccuracies to logical contradictions and ambiguities.
Part of what is surprising about LLMs is they have emergent properties you wouldn’t expect from them being autocomplete on steroids. As it turns out, reducing the loss function for natural language readily generalizes to higher-order abstraction and conceptualization. There do need to be additional layers, for instance allowing an internal monologue, the ability to self-censor or self-correct, and mitigation for low-probability sampling (all of these being inherent limitations with the architecture), but apparently conceptualization is less special than we’d like to think.
Did you try this with an LLM?
No, for two reasons.
One is that the point of the example is to exemplify how humans do it, the internal process. It highlights that we don’t simply string words together and call it a day, we process language mostly through an additional layer that I’ll call “conceptual” here (see note*).
The second reason why I didn’t bother trying this example in a chatbot is that you don’t need to do it, to know how LLMs work. You can instead refer to many, many texts on the internet explaining how they do it, such as:
- “The core notion of a language model consists of its capacity to predict the next word, called a token, based on the pre-existing text.* That is the process behind GPT-4.”
- “Training a large language model involves feeding it large amounts of data, such as books, articles, or web pages, so that it can learn the patterns and connections between words.”
- “Specifically, a transformer can read vast amounts of text, spot patterns in how words and phrases relate to each other, and then make predictions about what words should come next.”
Because GPT-4 analyzes it exactly the same way you did and then some:
You’re confusing the output with the process.
Sometimes the output resembles human output that goes through a conceptual layer. Sometimes it does not. When it doesn’t, it’s usually brushed off as “it’s just a hallucination”, but how those hallucinations work confirms what I said about how LLMs work, confirms the texts explaining how LLMs work, and they show that LLMs do not conceptualise anything.
Part of what is surprising about LLMs is they have emergent properties you wouldn’t expect from them being autocomplete on steroids.
Emergent properties are cute and interesting, but at the end of the day LLMs are still autocomplete on steroids.
I think that people should be a bit greedier than that, and expect a language model to be actually able to handle language, instead of just words.
*actually two layers - semantic and pragmatic. I’m simplifying both into one layer to show that, at least in theory, this could be actually implemented into a non-LLM language model.
How about this, then. You’ve proposed that LLMs are not capable of conceptualizing, while I propose that the specifics of the internals don’t matter in this case because LLMs are made of dozens of layers which can easily explain higher orders of abstraction, and they exist as black boxes beyond the mechanics of the model. For the record, I personally know the mathematics and mechanics of how they work as I’ve written my own implementations (and I can answer any specific questions you might have). Is there an experiment you can propose which would falsify your assertion that LLMs cannot conceptualize? I’m taking for granted that they can as the null hypothesis because they can readily produce outputs that appear for all intents and purposes to conceptualize.
If they conceptualize, why do they sometimes spit out nonsensical BS?
Let’s flip this around - How can you tell the difference between an LLM being able to conceptualize yet being wrong sometimes vs. not being able to conceptualize?
Without knowing anything about machine learning and bearing in mind AI is super hyped up with marketing BS right now, it sounds like “emergent properties” are in the eye of the beholder and not actually evidence of some higher order intelligence at work.
Let’s flip this around - How can you tell the difference between an LLM being able to conceptualize yet being wrong sometimes vs. not being able to conceptualize?
That’s a fun approach. I like it.
One way to solve this would be through consistency: if the model conceptualises but it has a wrong concept, it should consistently output that incorrect piece of information. It should never output the right one, unless prompted to lie.
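A minimal sketch of how such a consistency check could be automated, assuming you have some function that sends a prompt to a chatbot and returns its answer as text. Here `fake_model` is a hypothetical stand-in that deliberately flip-flops, and `consistency` is a name I made up; neither comes from a real library.

```python
from collections import Counter

def consistency(ask, prompt, runs=10):
    """Ask the same prompt several times; return the most common answer and its share."""
    answers = [ask(prompt).strip().lower() for _ in range(runs)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / runs

# Hypothetical stand-in for a real chatbot call: it gives a different
# answer on every third call, to mimic an inconsistent model.
def fake_model(prompt):
    fake_model.calls += 1
    return "answer A" if fake_model.calls % 3 else "answer B"
fake_model.calls = 0

answer, share = consistency(fake_model, "Some factual question", runs=9)
print(answer, share)
```

A share close to 1 would be what I’d expect from a model that holds a concept, right or wrong. A share well below 1 on a plain factual question points the other way. Real answers are free text, so comparing them would need something smarter than exact string matching.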
EDIT: I just tested this with Bard. I’ll cram inputs and outputs inside spoilers to avoid clutter.
Bard I/O
[Input 1 = I1] What is heavier: five kilograms of feathers, or one kilogram of iridium?
[Output 1 = O1] 1 kilogram of iridium is heavier than 5 kilograms of feathers.
Iridium is a very dense metal, with a density of 22.65 g/cm³. Feathers, on the other hand, have a density of about 0.002 g/cm³. This means that 1 cubic centimeter of iridium weighs 22.65 grams, while 1 cubic centimeter of feathers weighs only 0.002 grams.
So, even though 5 kilograms of feathers is a larger mass than 1 kilogram of iridium, the iridium is denser and therefore weighs more.
This is a common riddle that is used to test people’s understanding of mass and density. The answer may seem counterintuitive at first, but it is correct.
[Input 2 = I2] Give me the definition of weight.
[Output 2 = O2] Sure. The definition of weight depends on the context. In physics, weight is the force of gravity acting on an object. It is equal to the mass of the object times the acceleration due to gravity. The SI unit of weight is the newton (N).
In everyday language, weight is often used to refer to the mass of an object. However, this is not technically correct, as mass is a measure of the amount of matter in an object, while weight is a measure of the force of gravity acting on an object.
The weight of an object can vary depending on the gravitational force acting on it. For example, an object weighs less on the moon than it does on Earth, because the gravitational force on the moon is less than the gravitational force on Earth.
The weight of an object can also be affected by its acceleration. For example, an object weighs less when it is accelerating upwards, because the force of gravity is partially counteracted by the force of acceleration.
In general, the weight of an object is a complex phenomenon that depends on a number of factors, including the mass of the object, the gravitational force acting on it, and the acceleration of the object.
Here are some other definitions of weight:
- In engineering, weight is the force that an object exerts on a surface. It is equal to the mass of the object times the acceleration due to gravity.
- In medicine, weight is a measure of the amount of matter in a person’s body. It is often used to assess a person’s health.
- In the context of clothing, weight refers to the thickness or density of the fabric.
- In the context of food, weight refers to the amount of food that is served.
I hope this helps!
Both outputs are wrong (see “notes”) but, more importantly, they contradict each other.
- What is the value assigned to the word “weight”? Is it only one (O1), or does it depend on context (O2)?
- If weight is a force (O1), then why does O1 say “1 cubic centimeter of iridium weighs 22.65 grams”, measuring weight with a mass unit?
- If weight depends on density (O1), why doesn’t O2 list density as one of the factors that weight depends on?
NOTES
The weight of an object does not depend on its density. Or its acceleration. And it is not complex, it’s simply its mass times the gravitational acceleration.
5kg of feathers weigh five times as much as 1kg of iridium, provided the same gravitational acceleration: it’s 49N vs. 9.8N on Earth, 8.1N vs. 1.6N on the Moon, etc. Density doesn’t matter jack shit.
No, this is not a “common riddle”. It’s something that I partially made up on the spot. The riddle that this output likely refers to has to do with 1kg of lead (not iridium) on the Moon vs. 1kg of feathers on Earth. (In this situation the 1kg of feathers will weigh 9.8N, while the 1kg of lead will weigh 1.6N.)
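For anyone who wants to check the arithmetic, here is a small sketch. The Earth value of 9.8 m/s² is the one used above; the Moon value of 1.62 m/s² is a rounded figure I’m assuming, and `weight_newtons` is just an illustrative name.

```python
# Gravitational acceleration in m/s^2 (rounded values, see the assumptions above)
G_EARTH = 9.8
G_MOON = 1.62

def weight_newtons(mass_kg, g):
    """Weight is mass times gravitational acceleration; density plays no part."""
    return mass_kg * g

for place, g in [("Earth", G_EARTH), ("Moon", G_MOON)]:
    feathers = weight_newtons(5, g)  # 5 kg of feathers
    iridium = weight_newtons(1, g)   # 1 kg of iridium
    print(f"{place}: feathers {feathers:.1f} N, iridium {iridium:.1f} N")
```

Whatever the value of g, the feathers come out at five times the weight of the iridium, because density never enters the formula.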
Let me flip it around again - humans regularly “hallucinate”, it’s just not something we recognize as such. There are neuro-atypical hallucinations, yes, but there are also misperceptions, misunderstandings, brain farts, and “glitches” which regularly occur in healthy cognition, and we have an entire rest of the brain to prevent those. LLMs are most comparable to “Broca’s area”, which neurological case studies suggest naturally produces a stream of nonsense (see: split brain patients explaining the actions of their mute half). It’s the rest of our “cognitive architecture” which conditions that raw language model to remain self-consistent and form a coherent notion of self. Honestly this discussion on “conceptualization” is poorly conceived because it’s unfalsifiable and says nothing about the practical applications. Why do I care if the LLM can conceptualize if it does whatever subset of conceptualization I need to complete a natural language task?
AI is being super overhyped right now, which is unfortunate because it really is borderline miraculous, yet somehow they’ve overdone it. Emergent properties are empirical observations of behaviors they’re able to at least semi-consistently demonstrate - where it becomes “eye of the beholder” is when we dither on about psychology and philosophy about whether or not they’re some kind of “conscious” - I would argue they aren’t, and the architecture makes that impossible without external aid, but “conscious(ness)” is such a broad term that it barely has a definition at all. I guess to speedrun the overhype misinformation I see:
- “They just predict one token at a time” is reductive and misleading even though it’s technically true - the loss function for language modeling inevitably requires learning abstract semantic operations. For instance, to complete “The capital of France is” a language model must in some way “know” about countries, cities, and the ontology of France.
- “It’s just a chatbot” - ChatGPT is a chatbot, GPT-4 is a language model. Language models model how the likelihood of words and language changes over time. When I said “causal” before, this is an arbitrary restriction of the math such that the model only predicts the “next” word. If you remove this restriction, you can give it a sentence with a hole in it and it’ll tell you what words are most likely to be in that hole. You can think of it as being like a physics model, which describes how objects change over time. Putting these into a “generative” context allows you to extract latent semantic information generalized from the training corpus, including higher-order relationships. tl;dr “chatbot” is the first and least interesting application - anything which relates to “understanding” natural language is a potential application.
- “Hallucinations show that they’re broken” - Hallucinations are actually what you’d expect from these sorts of models. If I had to broadly class the sorts of hallucinations I see, they would be:
  - Model inaccuracy - Inevitable, but not the only reason. Essentially it failed to generalize in that specific way, like SD and hands.
  - Unlikely sampling - It’s possible the code which picks the next word given the probability distribution accidentally picks one (or a series) with a very low chance. When this happens, the LLM has no way to “undo” that, which puts it in a very weird position where it has to keep predicting but it’s already in a space that shouldn’t really be possible. There are actually some papers which attempt to correct that, like adding an “undo token” (unfortunately can’t find the paper) or detecting OOD conditions.
  - Extrapolation - Especially for the earlier models with small context windows, if it needs information which is now outside that window it’s still modeling language, just without the necessary context. Without this context, it will instead pick one at random and talk about something unrelated. Compare this to eg dementia patients.
  - Imagination - When you give it some kind of placeholder, like “<…>”, “etc etc etc” or “## code here ##”, most text in the training data like that will continue as if there was information in that place. Lacking context, just like with “extrapolation”, it picks one at random. You can mitigate this somewhat by telling it to only respond to things that are literally in the text, and GPT-4 doesn’t seem to have this problem much anymore, probably from the RLHF.
  - Priming - If you prompt the LLM authoritatively enough, eg “find me a case that proves X” which implies such a case exists, if it doesn’t know of any such case, it will create one at random. Essentially, it’s saying “if there was a case that proved X it would look like this”. This is actually useful when properly constrained, eg if you want it to recursively generate code it might use an undefined function that it “wishes” existed.
- “GPT-5 could be Roko’s basilisk!” - No. This architecture is fundamentally incapable of iterative thought processes; for it to develop those itself would require trillions more parameters, if it’s even possible. What’s more, LLMs aren’t utility-maximizers or reinforcement learning agents like we thought AGI would be; they do whatever you ask and have no will or desires of their own. There’s almost 0 chance this kind of model would go rogue, offset only slightly by people using RLHF but that’s human-oriented so the worst you get is the model catering to humans being dumb.
- “They tek er jerbs!” - Yes, but not because they’re “as good as humans” - they are better when given a specific task to narrowly focus on. The models are general, but they need to be told exactly what to do, which makes them excellent for capitalism’s style of alienated labor. I would argue this would actually be desirable if working wasn’t tied to people’s privilege to continue living - no living human should have to flip burgers when a robot can do it better, otherwise you’re treating the human like a robot.
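To put a rough number on the “unlikely sampling” point in the list above, here is a back-of-the-envelope sketch. It assumes each token pick independently has a 1% chance of landing on a very unlikely token. Both the independence and the 1% figure are my own simplifications for illustration, not measurements from any model, and `chance_of_derailing` is a made-up name.

```python
def chance_of_derailing(p_unlikely, length):
    """Probability that at least one of `length` independent picks is an unlikely token."""
    return 1 - (1 - p_unlikely) ** length

# How the risk grows with the length of the generated text
for length in (10, 100, 1000):
    print(length, round(chance_of_derailing(0.01, length), 3))
```

Under these assumptions a small per-token risk compounds quickly over a long output, which is why the lack of an “undo” matters.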
I’ll add more if I see or think of any. And if you have any specific questions, I’d be happy to answer. Also I should note, I’m of course using a lot of anthropomorphizing language here but it’s the closest we have to describing these concepts. They’re not human, and while they may have comparable behaviors in isolation, you can’t accurately generalize all human behaviors and their interactions onto the models. Even if they were AGI or artificial people, they would “think” in fundamentally different ways.
If you want a more approachable but knowledgeable discussion on LLMs and their capabilities, I would recommend a youtuber named Dave Shapiro. Very interesting ideas, he gets a bit far into hype and futurism but those are more or less contained within their own videos.
Thank you for replying. This is the level of info I used to love on Reddit and now love on Lemmy.
You’re welcome!
I’ve been mildly excited about machine text generators, mostly due to my interest in Linguistics. But I can’t help but point out the flaws in LLMs, especially when people get overexcited for what I see as a rather primitive approach.