NLP (Natural Language Processing) and LLM (Large Language Model) key terminology primer


With the dawn of the transformer architecture, introduced in the 2017 paper "Attention Is All You Need" and built on the self-attention mechanism, and with OpenAI's release of ChatGPT in late 2022, a chatbot designed to generate coherent, human-like text in response to user prompts, LLM technology has taken the world by storm. In a remarkably short time it has become one of the most widely used technologies, with applications across many domains and industries.

Nowadays many people want to learn about NLP and LLM technologies in depth. But NLP has its own technical jargon, and there are some key terms that must be understood before diving deeper into NLP and LLMs themselves.

This post attempts to provide a clear, beginner-friendly explanation, with suitable examples, of the common key technical terms of NLP and LLM technology: vector, vector database, tokenization, embedding, and more.

Vector

A vector in NLP is a list of numbers that represents a word, phrase, or document in a way that computers can understand and process.

Example: Let's say we have a simple 3-dimensional vector space where each dimension represents a concept:

  • Dimension 1: Animal-ness
  • Dimension 2: Size
  • Dimension 3: Domestication
In this space, words might be represented as:
  • "cat" = [0.9, 0.2, 0.8]
  • "elephant" = [0.9, 0.9, 0.3]
  • "microbe" = [0.5, 0.01, 0.0]

These vectors capture that cats and elephants are both animals (high value in dimension 1), elephants are larger than cats (higher value in dimension 2 for elephant), and cats are more domesticated than elephants (higher value in dimension 3 for cat).
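
Because words are now just lists of numbers, similarity between them can be measured numerically, most commonly with cosine similarity. Below is a minimal Python sketch using the toy vectors above (the values are illustrative, not from a real model):

import numpy as np

# Toy 3-dimensional word vectors from the example above (illustrative values)
cat = np.array([0.9, 0.2, 0.8])
elephant = np.array([0.9, 0.9, 0.3])
microbe = np.array([0.5, 0.01, 0.0])

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction; lower means less similar
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print("cat vs elephant:", cosine_similarity(cat, elephant))
print("cat vs microbe:", cosine_similarity(cat, microbe))  # lower than cat vs elephant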

Vector Database

A vector database is a specialized system for storing and querying these vectors efficiently.

Example: Imagine you're building a recipe recommendation system. You convert each recipe into a vector based on its ingredients, cooking time, cuisine type, etc. With millions of recipes, you need a way to quickly find similar recipes.

A vector database allows you to store these recipe vectors and perform queries like "Find the 10 most similar recipes to this vegetarian pasta dish" very quickly. It uses techniques like approximate nearest neighbour search to efficiently find similar vectors without checking every single one.
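
To see what such a query does conceptually, here is a minimal brute-force sketch in Python: every recipe vector is scored against a query vector and the top 10 are kept. A real vector database returns the same kind of ranked list but uses approximate nearest neighbour indexes to avoid scanning every vector. The data below is randomly generated purely for illustration:

import numpy as np

rng = np.random.default_rng(0)

# Pretend 100,000 recipes have already been converted to 64-dimensional vectors
recipe_vectors = rng.normal(size=(100_000, 64)).astype(np.float32)
query = rng.normal(size=64).astype(np.float32)  # vector for a hypothetical vegetarian pasta dish

# Cosine similarity of the query against every recipe (what an ANN index approximates)
scores = recipe_vectors @ query / (np.linalg.norm(recipe_vectors, axis=1) * np.linalg.norm(query))

top10 = np.argsort(-scores)[:10]  # indices of the 10 most similar recipes
print(top10, scores[top10])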

Popular vector databases in 2024 include Pinecone, Weaviate, and Milvus, with newer entries like Qdrant gaining a lot of traction.

Tokenization

Tokenization is the process of breaking text into smaller units (tokens) that a model can process.

Example: Let's tokenize the sentence: "I love NLP! It's fascinating."

Word-level tokenization: ["I", "love", "NLP", "!", "It's", "fascinating", "."]

Subword tokenization (as used by many modern models): ["I", "love", "NL", "P", "!", "It", "'s", "fascin", "ating", "."]

Notice how "NLP" and "fascinating" are broken into smaller parts. This helps the model handle rare or unseen words by understanding their components.  
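
The exact splits depend on the tokenizer and its learned vocabulary, so the pieces above are only illustrative. One quick way to see real subword tokenization is with the Hugging Face transformers library; the bert-base-uncased tokenizer is used here simply as a familiar example, and other models will split the text differently:

from transformers import AutoTokenizer

# Load a pretrained subword tokenizer (WordPiece, in BERT's case)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("I love NLP! It's fascinating.")
print(tokens)  # subword pieces; continuations are prefixed with "##" in WordPiece

ids = tokenizer.encode("I love NLP! It's fascinating.")
print(ids)  # the integer IDs the model actually consumes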

Embedding

An embedding is a learned vector representation of a token that captures its meaning and relationships to other tokens.

Example: Let's consider a simplified 3-dimensional embedding space:

  • "king" = [0.9, 0.1, 0.7]
  • "queen" = [0.9, 0.1, -0.7]
  • "man" = [0.5, 0.1, 0.7]
  • "woman" = [0.5, 0.1, -0.7]

In this space:

  • The first dimension might represent "royalty" (high for king/queen, lower for man/woman)
  • The second dimension might represent "human" (similar for all)
  • The third dimension might represent "gender" (positive for male, negative for female)

These embeddings capture semantic relationships. For instance, one could perform vector arithmetic:

"king" - "man" + "woman" ≈ "queen"

[0.9, 0.1, 0.7] - [0.5, 0.1, 0.7] + [0.5, 0.1, -0.7] ≈ [0.9, 0.1, -0.7]

This demonstrates how embeddings can capture complex relationships between words.
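
This tiny calculation can be reproduced directly with numpy, using the toy embeddings above and cosine similarity to check which word the result lands closest to (again, the values are illustrative rather than taken from a real model):

import numpy as np

# Toy 3-dimensional embeddings from the example above (illustrative values)
embeddings = {
    "king": np.array([0.9, 0.1, 0.7]),
    "queen": np.array([0.9, 0.1, -0.7]),
    "man": np.array([0.5, 0.1, 0.7]),
    "woman": np.array([0.5, 0.1, -0.7]),
}

result = embeddings["king"] - embeddings["man"] + embeddings["woman"]
print(result)  # approximately [0.9, 0.1, -0.7]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The word whose embedding is closest to the result is "queen"
closest = max(embeddings, key=lambda word: cosine(embeddings[word], result))
print(closest)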

Recent advancements (as of 2024)

  • Contextual Embeddings

Modern models like BERT (Bidirectional Encoder Representations from Transformers) and its successors generate different embeddings for the same word based on its context in a sentence.

Example: "I'll bank the plane." vs "I'll go to the bank." The word "bank" would have different embeddings in these sentences, reflecting its different meanings.
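
A minimal sketch of this with the Hugging Face transformers library is shown below: it pulls out the hidden-state vector for the token "bank" in each sentence and compares the two. The choice of bert-base-uncased and of the last hidden layer are convenient assumptions for illustration, not the only way to obtain contextual embeddings:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Run the sentence through BERT and keep the hidden states of the last layer
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # shape: (num_tokens, 768)
    # Locate the token "bank" and return its contextual vector
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

v1 = bank_vector("I'll bank the plane.")
v2 = bank_vector("I'll go to the bank.")

# The two "bank" vectors differ because the surrounding context differs
print(torch.nn.functional.cosine_similarity(v1, v2, dim=0).item())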

  • Multimodal Embeddings

These represent not just text, but also images, audio, or video in the same vector space.

Example: A multimodal model might embed the word "cat", an image of a cat, and the sound of a meow in nearby locations in its vector space, allowing for cross-modal understanding and generation.
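
A well-known example of this idea is CLIP, which embeds images and text into one shared space. The sketch below uses the Hugging Face transformers library to score a local image against a few candidate captions; the image path is hypothetical and the checkpoint openai/clip-vit-base-patch32 is just one publicly available choice:

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # hypothetical local photo of a cat
texts = ["a photo of a cat", "a photo of a dog", "a bowl of pasta"]

# Image and captions are embedded into the same vector space and compared
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Higher probability means the caption sits closer to the image in the shared space
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(texts, probs[0].tolist())))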

  • Efficient Embeddings

Techniques like product quantization and locality-sensitive hashing have made it possible to work with larger embedding spaces more efficiently.
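
As a small illustration of the idea, the sketch below implements random-hyperplane locality-sensitive hashing with numpy: each vector is reduced to a short binary code, and vectors pointing in similar directions tend to agree on most code bits, which makes coarse similarity search far cheaper. This is a toy version, not a production implementation:

import numpy as np

rng = np.random.default_rng(42)

dim, n_bits = 128, 16
vectors = rng.normal(size=(10_000, dim))

# Each random hyperplane contributes one bit: which side of it a vector falls on
hyperplanes = rng.normal(size=(dim, n_bits))

def lsh_code(v):
    # 16-bit binary signature of a vector (or of each row of a matrix)
    return (v @ hyperplanes > 0).astype(np.uint8)

codes = lsh_code(vectors)

query = vectors[0] + 0.05 * rng.normal(size=dim)  # a slightly perturbed copy of vector 0
query_code = lsh_code(query)

# Hamming distance between codes approximates angular distance between vectors
hamming = (codes != query_code).sum(axis=1)
print(hamming[0], hamming[1:].mean())  # the near-duplicate agrees on far more bits than average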

  • Multilingual Embeddings

Advanced models can now create embeddings that work across hundreds of languages simultaneously.

Example: The English word "dog" and the French word "chien" might have very similar embeddings, allowing for improved translation and cross-lingual understanding.
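
A hedged sketch of this with the sentence-transformers library is shown below; the multilingual checkpoint paraphrase-multilingual-MiniLM-L12-v2 is used as one example, and any multilingual embedding model would illustrate the same point:

from sentence_transformers import SentenceTransformer, util

# A multilingual model maps text from many languages into one shared vector space
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

english = model.encode("dog")
french = model.encode("chien")
unrelated = model.encode("spreadsheet")

# "dog" and "chien" should land close together despite being different languages
print(util.cos_sim(english, french).item())
print(util.cos_sim(english, unrelated).item())  # expected to be noticeably lower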


