Text Embeddings

An embedding is a dense vector (commonly 768 to 3072 floats) produced by a model trained so that semantically similar inputs land close together in vector space. Embeddings are the bridge between natural language and numeric similarity.
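
To make "close together in vector space" concrete, here is a minimal sketch of cosine similarity, one common closeness measure. The three-dimensional vectors are invented for illustration; real embeddings come from a model and are far longer.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors, made up for this example (not produced by any real model).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Related inputs should score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```

Scores near 1 mean the vectors point the same way, and scores near 0 mean they are unrelated.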

What you can do with them

Semantic search — find documents close to a query in meaning rather than keywords.
Clustering — group support tickets, news articles, or user queries by topic.
Classification — train a tiny logistic head on top of embeddings to label text cheaply.
Deduplication — detect near-duplicate content that differs in wording.
Recommendation — "users who liked X" where X is represented as an embedding.
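
Two of the uses above, semantic search and deduplication, reduce to comparing vectors that have already been computed. The sketch below assumes the embeddings exist; the document ids, toy vectors, and the 0.95 threshold are illustrative assumptions, and a real system would use a vector index rather than a brute-force loop.

```python
import math

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def top_k(query_vec, doc_vecs, k=3):
    """Return the k (doc_id, score) pairs most similar to the query."""
    q = normalize(query_vec)
    scored = [
        (doc_id, sum(a * b for a, b in zip(q, normalize(vec))))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def near_duplicates(doc_vecs, threshold=0.95):
    """Return pairs of doc ids whose cosine similarity is at least threshold."""
    ids = sorted(doc_vecs)
    unit = {doc_id: normalize(doc_vecs[doc_id]) for doc_id in ids}
    pairs = []
    for n, a in enumerate(ids):
        for b in ids[n + 1:]:
            score = sum(x * y for x, y in zip(unit[a], unit[b]))
            if score >= threshold:
                pairs.append((a, b))
    return pairs

# Hypothetical precomputed embeddings (toy values, 3 dimensions).
docs = {
    "refund-policy": [0.9, 0.1, 0.1],
    "refund-policy-copy": [0.88, 0.12, 0.1],
    "shipping-times": [0.1, 0.9, 0.2],
}
query = [0.85, 0.15, 0.05]

print([doc_id for doc_id, _ in top_k(query, docs, k=2)])
print(near_duplicates(docs))
```

The threshold for deduplication is worth tuning on labeled pairs from your own corpus, since the score at which two texts count as duplicates varies by model.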

Choosing a model

OpenAI text-embedding-3-large — 3072 dims, strong general-purpose baseline, reasonable cost.
OpenAI text-embedding-3-small — 1536 dims, good enough for most apps at a fraction of the cost.
Cohere embed-v3 — multilingual strength; a single model that distinguishes documents from queries through an input_type parameter rather than separate encoders.
Open models — bge-m3, nomic-embed-text-v2, Stella — good if you want local inference and no API dependency.

The cheapest model that meets your recall target is the right one. Test on your own data; generic leaderboards often fail to predict performance on a specific domain.
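
One way to apply this rule is to score each candidate with recall@k on a small labeled set and pick the cheapest model that clears the target. This is a sketch under assumptions: the model names, rankings, relevance labels, and relative costs below are all invented for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant doc ids that appear in the first k retrieved ids."""
    if not relevant:
        raise ValueError("each query needs at least one relevant document")
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mean_recall_at_k(runs, labels, k):
    """Average recall@k over all labeled queries."""
    return sum(recall_at_k(runs[q], labels[q], k) for q in labels) / len(labels)

def cheapest_meeting_target(scores, costs, target):
    """Return the lowest-cost model whose score reaches target, or None."""
    eligible = [name for name in scores if scores[name] >= target]
    return min(eligible, key=lambda name: costs[name]) if eligible else None

# Hypothetical relevance labels: query id -> relevant doc ids.
labels = {"q1": ["d1", "d4"], "q2": ["d2"]}
# Hypothetical ranked results per model: query id -> doc ids, best first.
runs_by_model = {
    "model-a": {"q1": ["d1", "d4", "d3"], "q2": ["d3", "d2", "d5"]},
    "model-b": {"q1": ["d1", "d3", "d5"], "q2": ["d5", "d3", "d2"]},
}
costs = {"model-a": 5.0, "model-b": 1.0}  # made-up relative costs

scores = {name: mean_recall_at_k(runs, labels, k=2) for name, runs in runs_by_model.items()}
print(scores)  # {'model-a': 1.0, 'model-b': 0.25}
print(cheapest_meeting_target(scores, costs, target=0.9))  # model-a
```

Two queries are far too few to trust; the same functions work unchanged on a labeled set of a few hundred queries drawn from real traffic.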

What can go wrong

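Model mismatch — you embedded with one model, switched to another, and kept the index. If the dimensions differ, queries fail outright; if they happen to match, results are silently noise, because vectors from different models are not comparable. Re-embed everything on model change.
Chunk-size mismatch — the query is one sentence, the documents are 2000 tokens each. A long chunk's embedding blends many topics, so a short, focused query tends to match it only weakly. Match your chunk granularity to typical queries.
Language mismatch — an embedding model trained mostly on English produces unreliable vectors for Chinese text. Use a multilingual model or translate first.
Versioning blindness — providers can update or retire embedding models, and vectors from different versions are not guaranteed to be comparable. Pin the version and store it alongside each vector.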
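
Storing the model version alongside each vector makes it possible to refuse a bad comparison instead of returning noise. The following is a minimal sketch; the StoredVector record, the check_compatible helper, and the model names are hypothetical and not tied to any particular vector store.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoredVector:
    doc_id: str
    model: str     # pinned model identifier recorded at embedding time
    values: tuple  # the embedding itself

def check_compatible(record, query_model, query_vec):
    """Raise ValueError if the query must not be compared with this record."""
    if record.model != query_model:
        raise ValueError(
            f"index built with {record.model!r}, query uses {query_model!r}: re-embed"
        )
    if len(record.values) != len(query_vec):
        raise ValueError(
            f"dimension mismatch: {len(record.values)} vs {len(query_vec)}"
        )

# "example-embed-2024-01" is a made-up model name used only for illustration.
record = StoredVector("doc-1", "example-embed-2024-01", (0.1, 0.2, 0.3))
check_compatible(record, "example-embed-2024-01", [0.3, 0.2, 0.1])  # passes silently

try:
    check_compatible(record, "example-embed-2024-06", [0.3, 0.2, 0.1])
except ValueError as err:
    print("rejected:", err)
```

Checking the model name first matters because two models can share a dimension, in which case a dimension check alone would let the mismatch through.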

When NOT to use embeddings

Exact-match lookups (order ID, SKU, UUID) belong in a conventional database index, not a vector store. Embeddings are for fuzzy semantic matching; anything a regex or key lookup can solve is faster and more reliable that way.
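
A small router in front of the search stack can enforce this split. In the sketch below the ORD-123456 order-ID format is a hypothetical example, and the route function is an illustrative helper rather than part of any library.

```python
import re

# Standard 8-4-4-4-12 hexadecimal UUID layout.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)
# Hypothetical order-ID format, assumed for this example.
ORDER_ID_RE = re.compile(r"ORD-\d{6}")

def route(query):
    """Return "exact" for identifier-shaped queries, otherwise "semantic"."""
    q = query.strip()
    if UUID_RE.fullmatch(q) or ORDER_ID_RE.fullmatch(q):
        return "exact"
    return "semantic"

print(route("ORD-123456"))                                # exact
print(route("123e4567-e89b-12d3-a456-426614174000"))      # exact
print(route("where is my refund for the blue jacket?"))   # semantic
```

Queries routed to "exact" go to the database key lookup, and only the rest pay for an embedding call and a vector search.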