NLP (Natural Language Processing) and LLM (Large Language Model) key terminology primer
With the dawn of the transformer architecture, introduced in the 2017 paper "Attention Is All You Need" and built on the self-attention mechanism, and OpenAI's release of ChatGPT in late 2022, which brought coherent, human-like text generation from simple input prompts to everyday users, LLM technology has taken the world by storm. In a remarkably short time it has become one of the most widely used technologies, with applications across many domains and industries.
Nowadays everyone wants to learn about NLP and LLM technologies in depth. But NLP has its own technical jargon, and some key terminology must be understood before diving deeper into NLP and LLMs themselves.
To start with, this post attempts to provide a clear, beginner-friendly explanation, with suitable examples, of the common key technical terms of NLP and LLM technology: vector, vector database, tokenization, and embedding.
Vector
A vector in NLP is a list of numbers that represents a word, phrase, or document in a way that computers can understand and process.
Example: Let's say we have a simple 3-dimensional vector space where each dimension represents a concept:
- Dimension 1: Animal-ness
- Dimension 2: Size
- Dimension 3: Domestication
- "cat" = [0.9, 0.2, 0.8]
- "elephant" = [0.9, 0.9, 0.3]
- "microbe" = [0.5, 0.01, 0.0]
These vectors capture that cats and elephants are both animals (high value in dimension 1), elephants are larger than cats (higher value in dimension 2 for elephant), and cats are more domesticated than elephants (higher value in dimension 3 for cat).
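To make this concrete, here is a minimal sketch in Python (using NumPy and the toy vectors above) that measures how close two word vectors point via cosine similarity; the numbers are illustrative, not from a real model.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat = np.array([0.9, 0.2, 0.8])
elephant = np.array([0.9, 0.9, 0.3])
microbe = np.array([0.5, 0.01, 0.0])

# cat is closer to elephant (a fellow animal) than to microbe.
print(cosine_similarity(cat, elephant))  # ~0.77
print(cosine_similarity(cat, microbe))   # ~0.74
```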
Vector Database
A vector database is a specialized system for storing and querying these vectors efficiently.
Example: Imagine you're building a recipe recommendation system. You convert each recipe into a vector based on its ingredients, cooking time, cuisine type, etc. With millions of recipes, you need a way to quickly find similar recipes.
A vector database allows you to store these recipe vectors and perform queries like "Find the 10 most similar recipes to this vegetarian pasta dish" very quickly. It uses techniques like approximate nearest neighbour search to efficiently find similar vectors without checking every single one.
Popular vector databases in 2024 include Pinecone, Weaviate, and Milvus, with newer entries like Qdrant gaining a lot of traction.
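Under the hood, the core operation is nearest-neighbour search. The brute-force version below is a minimal sketch with made-up random recipe vectors; real vector databases replace this linear scan with approximate nearest neighbour indexes so queries stay fast at scale.

```python
import numpy as np

rng = np.random.default_rng(0)
recipes = rng.random((100_000, 8))  # 100k recipes, 8 toy features each

def top_k_similar(query, vectors, k=10):
    # Cosine similarity between the query and every stored vector.
    sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    return np.argsort(sims)[-k:][::-1]  # indices of the k best matches

query = rng.random(8)  # e.g. the "vegetarian pasta dish" vector
print(top_k_similar(query, recipes))
```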
Tokenization
Tokenization is the process of breaking text into smaller units (tokens) that a model can process.
Example: Let's tokenize the sentence: "I love NLP! It's fascinating."
Word-level tokenization: ["I", "love", "NLP", "!", "It's", "fascinating", "."]
Subword tokenization (as used by many modern models): ["I", "love", "NL", "P", "!", "It", "'s", "fascin", "ating", "."]
Notice how "NLP" and "fascinating" are broken into smaller parts. This helps the model handle rare or unseen words by understanding their components.
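As a minimal sketch, the word-level split above can be reproduced with a simple regular expression. The subword split, by contrast, depends on a specific model's learned vocabulary (e.g., via libraries such as Hugging Face tokenizers or OpenAI's tiktoken), so the subword pieces shown above are illustrative rather than from any one tokenizer.

```python
import re

sentence = "I love NLP! It's fascinating."

# Split on words (keeping contractions together) and punctuation marks.
word_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)
print(word_tokens)
# ['I', 'love', 'NLP', '!', "It's", 'fascinating', '.']
```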
Embedding
An embedding is a learned vector representation of a token that captures its meaning and relationships to other tokens.
Example: Let's consider a simplified 3-dimensional embedding space:
- "king" = [0.9, 0.1, 0.7]
- "queen" = [0.9, 0.1, -0.7]
- "man" = [0.5, 0.1, 0.7]
- "woman" = [0.5, 0.1, -0.7]
In this space:
- The first dimension might represent "royalty" (high for king/queen, lower for man/woman)
- The second dimension might represent "human" (similar for all)
- The third dimension might represent "gender" (positive for male, negative for female)
These embeddings capture semantic relationships. For instance, one could perform vector arithmetic:
"king" - "man" + "woman" ≈ "queen"
[0.9, 0.1, 0.7] - [0.5, 0.1, 0.7] + [0.5, 0.1, -0.7] ≈ [0.9, 0.1, -0.7]
This demonstrates how embeddings can capture complex relationships between words.
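Here is the same arithmetic as a runnable sketch, using the toy 3-dimensional embeddings from this section:

```python
import numpy as np

king  = np.array([0.9, 0.1,  0.7])
queen = np.array([0.9, 0.1, -0.7])
man   = np.array([0.5, 0.1,  0.7])
woman = np.array([0.5, 0.1, -0.7])

result = king - man + woman
print(result)                      # [ 0.9  0.1 -0.7]
print(np.allclose(result, queen))  # True: king - man + woman ≈ queen
```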
Recent advancements (as of 2024)
- Contextual Embeddings
Modern models like BERT (Bidirectional Encoder Representations from Transformers) and its successors generate different embeddings for the same word based on its context in a sentence.
Example: "I'll bank the plane." vs "I'll go to the bank." The word "bank" would have different embeddings in these sentences, reflecting its different meanings.
- Multimodal Embeddings
These represent not just text, but also images, audio, or video in the same vector space.
Example: A multimodal model might embed the word "cat", an image of a cat, and the sound of a meow in nearby locations in its vector space, allowing for cross-modal understanding and generation.
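As a minimal sketch, assuming the transformers library and the public openai/clip-vit-base-patch32 checkpoint (CLIP embeds text and images in one shared space; cat.jpg is a hypothetical local image):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # hypothetical local photo of a cat
inputs = processor(text=["a photo of a cat", "a photo of a dog"],
                   images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Higher score = the caption and the image are closer in the shared space.
print(outputs.logits_per_image.softmax(dim=1))  # the "cat" caption should win
```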
- Efficient Embeddings
Techniques like product quantization and locality-sensitive hashing have made it possible to work with larger embedding spaces more efficiently.
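A minimal sketch of the locality-sensitive hashing idea with random hyperplanes (the dimensions and counts are arbitrary choices for illustration): vectors pointing in similar directions tend to receive the same bit signature, so a search only has to scan one hash bucket instead of everything.

```python
import numpy as np

rng = np.random.default_rng(0)
planes = rng.standard_normal((16, 128))  # 16 random hyperplanes in 128-d

def lsh_signature(vector):
    # One bit per hyperplane: which side of the plane the vector falls on.
    return tuple((planes @ vector > 0).astype(int))

v = rng.standard_normal(128)
nearby = v + 0.01 * rng.standard_normal(128)  # a slightly perturbed copy
print(lsh_signature(v) == lsh_signature(nearby))  # usually True: same bucket
```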
- Multilingual Embeddings
Advanced models can now create embeddings that work across hundreds of languages simultaneously.
Example: The English word "dog" and the French word "chien" might have very similar embeddings, allowing for improved translation and cross-lingual understanding.
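A minimal sketch, assuming the sentence-transformers library and its public multilingual checkpoint paraphrase-multilingual-MiniLM-L12-v2:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode(["dog", "chien", "car"])

# English "dog" and French "chien" should land close together,
# while "car" should be noticeably further away.
print(util.cos_sim(embeddings[0], embeddings[1]))  # high similarity
print(util.cos_sim(embeddings[0], embeddings[2]))  # lower similarity
```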