Catch Up with LLMs

Karan Jakhar
5 min read · Jul 20, 2023


This is all you need.

If you are new to the field of LLMs, these papers will bring you up to date. There are other papers worth reading too, but I suggest you start with these. They will clarify the core concepts and help you understand how ChatGPT, and LLMs in general, work. After reading them, you will be able to pick up any new LLM research paper and follow the technical terms.

Attention Is All You Need

Paper Link: https://arxiv.org/pdf/1706.03762.pdf

Here are some key points about the paper:

  • The paper proposes a new neural network architecture called the Transformer that relies entirely on attention mechanisms, eliminating recurrence and convolutions.
  • The Transformer uses a mechanism called multi-headed self-attention, where each position in the input sequence attends to all positions in the previous layer of the encoder/decoder. This allows for modeling long-range dependencies regardless of the distance between input/output positions.
  • Positional encodings are added to the input embeddings to inject information about the relative position of tokens since there is no recurrence. Sinusoidal functions are used for the positional encodings.
  • The Transformer achieves state-of-the-art results on English-German and English-French translation tasks, outperforming previous models like convolutional and recurrent neural networks.
  • The Transformer is much more parallelizable and faster to train than recurrent/convolutional models. The base model reaches strong translation quality after about 12 hours of training on 8 GPUs, and the larger model trains in roughly 3.5 days.
  • Ablation studies show the importance of different components like multi-headed attention, attention dimensionality, and model size. Label smoothing and dropout are also very helpful.
  • The Transformer also shows strong generalization capability by performing well on English constituency parsing with minimal task-specific tuning.

In summary, the Transformer introduces a novel architecture based solely on attention that achieves excellent results on translation and parsing while being much more parallelizable and faster to train. The multi-headed self-attention mechanism is a key contribution.
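To make the two core ideas above concrete, here is a minimal NumPy sketch of scaled dot-product self-attention combined with sinusoidal positional encodings. It is an illustrative toy under simplifying assumptions (a single head, random projection weights, arbitrary dimensions), not the paper's full multi-head implementation.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings, as described in the Transformer paper."""
    positions = np.arange(seq_len)[:, None]             # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                  # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])               # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])               # odd dimensions: cosine
    return pe

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V — the core attention operation."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # (seq_len, d_v)

# Toy example: 5 tokens, model dimension 16, a single attention head.
seq_len, d_model = 5, 16
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)

# Learned query/key/value projections (random here, just for illustration).
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)  # (5, 16): one context-aware vector per input token
```

In the full model this operation runs in several parallel heads whose outputs are concatenated and projected, which is what the paper calls multi-headed attention.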

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Paper Link: https://arxiv.org/abs/1810.04805

Here are the key points from the paper:

  • BERT stands for Bidirectional Encoder Representations from Transformers. It is a new language representation model that uses a bidirectional Transformer encoder.
  • BERT is pre-trained on large amounts of unlabeled text, allowing it to pre-learn deep bidirectional representations that can be fine-tuned for a wide range of NLP tasks.
  • Pre-training uses two objectives: Masked LM and Next Sentence Prediction. Masked LM masks a fraction of the input tokens (15% in the paper) and trains the model to predict them. Next Sentence Prediction trains the model to judge whether the second of two sentences actually follows the first in the original text.
  • For fine-tuning, BERT adds task-specific output layers and is fine-tuned end-to-end on labeled data from the downstream tasks. Minimal task-specific parameters need to be learned from scratch.
  • BERT obtains state-of-the-art results on a variety of NLP tasks like question answering, textual entailment, sentiment analysis, and named entity recognition.
  • BERT shows the importance of bidirectional pre-training and suggests that large-scale pre-training is an integral part of many NLP systems.
  • The Transformer encoder architecture combined with pre-training on large general-domain data like Wikipedia and BooksCorpus allows BERT to learn very general language representations that can efficiently transfer to many tasks.

In summary, BERT demonstrates impressive performance from pre-training a bidirectional Transformer model on unlabeled text and then fine-tuning it on downstream tasks. It has become a popular and powerful approach for NLP.
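As a rough illustration of the masked-LM objective described above, the sketch below applies BERT's 80/10/10 masking rule to a toy token sequence. The tiny vocabulary and whitespace tokens are stand-ins chosen for the example; a real implementation operates on WordPiece token IDs.

```python
import random

MASK_TOKEN = "[MASK]"
# Toy vocabulary standing in for BERT's ~30k-entry WordPiece vocabulary.
VOCAB = ["the", "cat", "sat", "on", "a", "mat", "dog", "ran"]

def mask_for_mlm(tokens, mask_prob=0.15, seed=0):
    """Apply BERT's masking recipe: select ~15% of positions; of those,
    80% become [MASK], 10% become a random token, and 10% stay unchanged.
    Returns (masked_tokens, labels), where labels hold the original token
    at selected positions and None elsewhere."""
    rng = random.Random(seed)
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok                      # model must predict the original
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN           # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(VOCAB)    # 10%: replace with a random token
        # else: 10%: keep the original token unchanged
    return masked, labels

tokens = ["the", "cat", "sat", "on", "the", "mat"]
# The paper uses 15%; a higher rate is used here so the toy example visibly masks something.
print(mask_for_mlm(tokens, mask_prob=0.3))
```

The labels record the original token only at selected positions, which is where the masked-LM loss is computed.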

Language Models are Unsupervised Multitask Learners

Paper Link: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

Here are the key points from the paper:

  • Language models like GPT-2 can perform a variety of natural language tasks in a zero-shot setting without any fine-tuning, suggesting they are learning representations that are useful for multiple tasks.
  • The authors train GPT-2, a 1.5 billion parameter Transformer model, on a new dataset called WebText which contains over 8 million web documents.
  • GPT-2 achieves state-of-the-art results on 7 out of 8 tested language modeling datasets in a zero-shot setting. It also shows promising zero-shot performance on tasks like reading comprehension, summarization, translation, and question-answering.
  • The diversity of tasks GPT-2 can perform without any explicit supervision suggests that scale and task diversity are key factors for unsupervised multitask learning. As models are trained on more data covering more tasks, they learn more generally useful representations.
  • The results demonstrate the potential of pre-trained language models like GPT-2 as general-purpose NLP systems that can perform a variety of tasks in a zero-shot setting. Fine-tuning the models could lead to further performance gains.

In summary, the paper shows that large language models trained on diverse data like WebText implicitly learn representations that are useful for many different NLP tasks, demonstrating the promise of unsupervised multi-task transfer learning.
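The zero-shot behaviour described above comes entirely from prompting: the task is written into the input text and the model simply continues it. As a sketch, the snippet below uses the Hugging Face transformers library (a later tool, not something the paper itself uses) with the public gpt2 checkpoint, and appends the "TL;DR:" cue the paper uses to elicit summaries.

```python
from transformers import pipeline

# Load the publicly released (124M-parameter) GPT-2 checkpoint.
generator = pipeline("text-generation", model="gpt2")

article = (
    "The Transformer architecture replaced recurrence with self-attention, "
    "making training far more parallelizable and enabling much larger models."
)

# Zero-shot summarization: the paper induces summaries by appending "TL;DR:"
# to the text and letting the language model continue from there.
prompt = article + "\nTL;DR:"
result = generator(prompt, max_new_tokens=30, do_sample=False)
print(result[0]["generated_text"])
```

The same pattern, changing only the prompt, covers the other zero-shot tasks reported in the paper, such as translation and question answering.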

Language Models are Few-Shot Learners

Paper Link: https://arxiv.org/pdf/2005.14165.pdf

Here is a high-level summary of the key points in the paper:

  • The paper explores using large-scale language models for few-shot and zero-shot learning of natural language tasks. This involves giving the model just a few examples of a task, or a natural-language description of it, without any gradient updates or fine-tuning (a prompt-construction sketch follows this list).
  • The authors train a 175 billion parameter autoregressive language model called GPT-3. They also train 7 smaller models ranging from 125M to 13B parameters for comparison.
  • GPT-3 is evaluated on over two dozen NLP datasets across a diverse range of tasks like translation, question answering, reading comprehension, etc. It is given either no examples (zero-shot), one example (one-shot), or around 10–100 examples (few-shot) of a task and asked to perform the task.
  • In the few-shot setting, GPT-3 is competitive and sometimes exceeds fine-tuned state-of-the-art models on several tasks. In zero-shot and one-shot settings, it shows promising results but still lags behind fine-tuned models.
  • GPT-3 also shows an ability to perform simple arithmetic, solve analogy problems, and use novel words after seeing them defined only once. It can also generate news articles that humans have difficulty distinguishing from real articles.
  • The authors analyze GPT-3 for biases and find it reflects stereotypes present in the training data related to gender, race, and religion. They discuss concerns around potential misuse and energy usage.
  • Overall, GPT-3 displays a broad set of capabilities and the ability to adapt to many tasks from just a few examples, suggesting promise for few-shot learning approaches using large language models.
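To show what "few-shot, no gradient updates" means in practice, here is a minimal sketch that assembles an in-context prompt from a handful of demonstrations, using the English-to-French pairs from the paper's illustrative figure. The exact formatting is my own choice for the example; the finished prompt is simply fed to the model as ordinary input text.

```python
def build_few_shot_prompt(task_description, demonstrations, query):
    """Assemble a GPT-3-style few-shot prompt: a task description,
    K worked examples, and the new query for the model to complete."""
    lines = [task_description, ""]
    for source, target in demonstrations:
        lines.append(f"English: {source}")
        lines.append(f"French: {target}")
        lines.append("")
    lines.append(f"English: {query}")
    lines.append("French:")            # the model continues from here
    return "\n".join(lines)

demos = [
    ("sea otter", "loutre de mer"),
    ("plush girafe", "girafe peluche"),
    ("cheese", "fromage"),
]
prompt = build_few_shot_prompt("Translate English to French:", demos, "hello world")
print(prompt)
# The prompt is sent to the model with no weight updates; the expected
# translation is read off from the model's continuation of the text.
```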

Conclusion

The development of language models has seen rapid progress in recent years, evolving from recurrent networks to Transformers and now large-scale pre-trained models. Key innovations like attention mechanisms, bidirectional modeling, and pre-training have allowed models to achieve ever-improving performance on a diverse range of natural language tasks.

Models like BERT, GPT-2, and GPT-3 demonstrate the potential of language models as general-purpose NLP systems that can efficiently adapt to new tasks with minimal examples. The unsupervised multitask knowledge gained during pre-training produces representations that transfer readily to downstream tasks after fine-tuning.

While concerns remain around potential misuse, biases, and compute requirements, the capabilities unlocked by scale and task diversity are undeniable. Language models leverage the abundance of text data to learn broadly useful features, an exciting development in AI with many promising applications. Further research will likely continue to push the boundaries of what can be achieved. But the groundwork has been laid for language models that serve as versatile, adaptable platforms for a wide variety of NLP tasks.

Happy Learning!!
