Overview
- On Pi Day (3/14/2023), OpenAI unveiled its latest GPT model, GPT-4, which offers a multimodal interface that accepts image and text inputs and emits text outputs.
- GPT-4 was pre-trained on a large corpus of internet text (plus data licensed from third-party providers) and fine-tuned using Reinforcement Learning from Human Feedback (RLHF).
- The GPT-4 Technical Report gives us a glimpse of how GPT-4 works and what it is capable of; that is what I have summarized below.
Capabilities of GPT-4
- Let’s do a quick refresher on GPT:
- During pre-training, the model is trained to predict the next token in a sequence of text given the previous tokens.
- Once pre-trained, the model can be fine-tuned for a particular task by adding a few task-specific layers on top of it and training on a smaller, task-specific dataset (a minimal code sketch of both steps follows this refresher).
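- To make this refresher concrete, here is a minimal sketch of both steps, assuming PyTorch and Hugging Face transformers with GPT-2 as a small, openly available stand-in (GPT-4's actual training code and architecture details are not public):

```python
# Minimal sketch of (1) next-token pre-training loss and (2) task-specific
# fine-tuning, using GPT-2 as a stand-in for a GPT-style model. Illustrative only.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# (1) Pre-training objective: predict each next token given the previous ones.
text = "GPT-4 is a Transformer-style model pre-trained to predict the next token."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])  # labels == inputs -> shifted next-token loss
print("next-token cross-entropy loss:", outputs.loss.item())

# (2) Fine-tuning sketch: freeze the pre-trained backbone and train a small
# task-specific head on top (e.g., a 2-class sentiment classifier).
for p in model.parameters():
    p.requires_grad = False

classifier = torch.nn.Linear(model.config.n_embd, 2)
hidden = model.transformer(**inputs).last_hidden_state   # (batch, seq_len, n_embd)
logits = classifier(hidden[:, -1, :])                     # classify from the final token
loss = torch.nn.functional.cross_entropy(logits, torch.tensor([1]))
loss.backward()  # gradients flow only into the new task head
```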
- While the paper is light on technical details about GPT-4, we can still fill in some of the gaps with what we do know.
- As the paper states, “GPT-4 is a Transformer-style model pre-trained to predict the next token in a document, using both publicly available data (such as internet data) and data licensed from third-party providers.” (source)
- In other words, like its predecessors, GPT-4 was trained to predict the next token in a document, drawing on both publicly available and licensed data.
- Another piece of information we glean from the technical report is that GPT-4 uses Reinforcement Learning from Human Feedback (RLHF), much like InstructGPT.
- GPT-4 uses RLHF to closely “align” its outputs with the user’s intent for a given input and to help make its responses safer and more trustworthy (an InstructGPT-style objective is sketched after the table below).
- The table below (sourced from the paper) depicts how GPT-4 performs on a variety of tests:

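- The report does not spell out GPT-4's RLHF recipe, but the objective from the earlier InstructGPT paper (which GPT-4's post-training plausibly follows in spirit) gives a good picture of what is being optimized: maximize a learned reward model's score while a KL penalty keeps the policy close to the supervised fine-tuned model, optionally mixing in the original pre-training loss:

$$
\text{objective}(\phi) = \mathbb{E}_{(x,y) \sim D_{\pi_\phi^{\mathrm{RL}}}}\left[ r_\theta(x, y) - \beta \log \frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \right] + \gamma\, \mathbb{E}_{x \sim D_{\mathrm{pretrain}}}\left[ \log \pi_\phi^{\mathrm{RL}}(x) \right]
$$

- Here, $r_\theta$ is the reward model trained on human preference comparisons, $\pi^{\mathrm{SFT}}$ is the supervised fine-tuned policy being adjusted, $\beta$ weights the KL penalty, and $\gamma$ weights the pre-training mix-in.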
- Additionally, like its predecessors, GPT-4 is able to work with multiple languages and translate between them.
- As per the demo, GPT-4’s coding ability appears to be significantly bolstered compared to its predecessors.
- Now let’s look at some examples involving visual input (source):


- While we don’t have details on the vision architecture, we can see that it is able to take an image and handle either of two scenarios:
- If the image is a document or contains text, it is able to read the text out of the image, understand the context, and return a response.
- Otherwise, if the image contains only objects and no text, it is still able to glean information from the scene and return a response, likely by grounding what it sees in its language understanding (a hypothetical API sketch follows this list).
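- As a thought experiment of what calling such a multimodal model might look like, here is a hypothetical sketch that sends an image plus a text question to an OpenAI-style chat-completions endpoint; image input was only demoed at launch and was not yet generally available through the API, so the model name and request shape below are assumptions:

```python
# Hypothetical sketch: send an image + a text question to a GPT-4-style
# multimodal chat endpoint. Model name and availability are assumptions.
import base64
import os
import requests

def ask_about_image(image_path: str, question: str) -> str:
    # Encode the local image as a base64 data URL.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    payload = {
        "model": "gpt-4-vision-preview",  # assumed model name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        "max_tokens": 300,
    }
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json=payload,
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Example: ask the model to explain what is unusual about a photo.
# print(ask_about_image("unusual_photo.png", "What is unusual about this image?"))
```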
GPT-4 vs. GPT-3
- Let’s now explore the ways in which GPT-4 differs from GPT-3, including its ability to perform tasks that GPT-3 struggled with, as well as the technical features that make it more robust.
- In the demo given by Greg Brockman, President and Co-Founder of OpenAI, the first task that GPT-4 outperformed its predecessor on was summarization.
- Specifically, GPT-4 is able to summarize a corpus under more complex constraints, for example, “Summarize this article, but with every word starting with the letter ‘G’”.
- When using the model as a coding assistant, you can not only ask it to write code for a specific task, but also paste any resulting error messages back in, without any added context, and the model is able to understand them and fix the code (a sketch of this feedback loop appears after the images below).
- One of the coolest tasks GPT-4 was able to perform was taking a blueprint of a website, hand-drawn in a notebook, and building the entire working website in a matter of minutes, as the images below show (source):


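- As an illustration of the error-pasting workflow described above, here is a minimal sketch assuming the openai Python SDK (v1+) and the `gpt-4` chat model; the actual demo was done interactively rather than through code:

```python
# Sketch of the "paste the error back in" workflow as a chat loop,
# assuming the openai Python SDK (>= 1.0); the model name is an assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
history = [{"role": "user",
            "content": "Write a Python function that parses a CSV of user sign-ups "
                       "and returns the count per day."}]

reply = client.chat.completions.create(model="gpt-4", messages=history)
code_suggestion = reply.choices[0].message.content
history.append({"role": "assistant", "content": code_suggestion})

# Suppose running the suggested code raised an error: paste the traceback back in,
# with no extra explanation, and ask for a fix.
traceback_text = "TypeError: strptime() argument 1 must be str, not bytes"
history.append({"role": "user", "content": traceback_text})

fix = client.chat.completions.create(model="gpt-4", messages=history)
print(fix.choices[0].message.content)  # model returns the corrected code
```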
- Additionally, the model now performs remarkably well on academic and professional exams, which shows how much language models have improved in general reasoning capabilities.
- “For example, it passes a simulated bar exam with a score around the top 10% of test takers; in contrast, GPT-3.5’s score was around the bottom 10%.” (source)

- GPT-4 outperforms previous state-of-the-art models on other standardized exams as well, such as the GRE, the SAT, the bar exam, and AP exams, as well as on research benchmarks such as MMLU, HellaSwag, and TextVQA (source).

- Now, let’s look at the technical details behind how GPT-4 outperforms its predecessors.
- GPT-4 comes in two context-length variants, 8,192 tokens and 32,768 tokens (roughly 25,000 words, or ~50 pages of text), which allows it to work with a much longer range of context (see the token-counting sketch after the benchmark figure below).
- The image below (source) shows GPT-4 on traditional machine-learning benchmarks, where it outperforms existing large language models as well as most state-of-the-art (SOTA) models on the majority of benchmarks.

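- As a practical aside on the context window mentioned above, you can estimate whether a prompt fits by counting tokens with OpenAI's tiktoken library; the 8,192/32,768-token limits below are the values documented at launch, and the helper itself is just an illustrative sketch:

```python
# Rough check of whether a prompt fits in GPT-4's context window,
# using OpenAI's tiktoken tokenizer. Limits shown are the launch values.
import tiktoken

GPT4_CONTEXT_LIMITS = {"gpt-4": 8_192, "gpt-4-32k": 32_768}

def fits_in_context(prompt: str, model: str = "gpt-4", reply_budget: int = 1_000) -> bool:
    enc = tiktoken.encoding_for_model("gpt-4")  # resolves to the cl100k_base encoding
    n_tokens = len(enc.encode(prompt))
    limit = GPT4_CONTEXT_LIMITS[model]
    print(f"{n_tokens} prompt tokens vs. {limit}-token limit for {model}")
    # Leave room for the model's reply as well as the prompt itself.
    return n_tokens + reply_budget <= limit

fits_in_context("Summarize the GPT-4 Technical Report in one paragraph.")
```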
Limitations
- GPT-4, like its predecessors, still hallucinates facts and makes reasoning errors, so its outputs need to be verified before they are relied upon.
- Much like ChatGPT, GPT-4 lacks knowledge of events that occurred after its training-data cut-off of September 2021.
Use-case: multimodal search engine
- Unlike with prior GPT-family models, we have far fewer technical details on GPT-4, possibly because it is what powers the new Bing, as confirmed below.

- Moving forward, it’ll be fascinating to see how a multimodal-powered search engine can help improve our lives!