Text Embeddings
An embedding is a dense vector (usually 768 to 3072 floats) produced by a model trained so that semantically similar inputs land close together in vector space. Embeddings are the bridge between natural language and numeric similarity.
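To make "close together" concrete, here is a minimal sketch of cosine similarity, one common closeness measure for embeddings. The three-dimensional vectors are made up for illustration and stand in for real model output, which would have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Made-up toy vectors, not output from any real embedding model.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Related concepts should score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))
```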
What you can do with them
- Semantic search — find documents close to a query in meaning rather than keywords.
- Clustering — group support tickets, news articles, or user queries by topic.
- Classification — train a tiny logistic head on top of embeddings to label text cheaply.
- Deduplication — detect near-duplicate content that differs in wording.
- Recommendation — "users who liked X" where X is represented as an embedding.
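As one illustration of the uses above, semantic search can be sketched as ranking stored vectors by cosine similarity to a query vector. This is a brute-force toy, assuming the vectors were already produced by some embedding model; the document ids, vectors, and the top_k helper are all hypothetical, and a real system would typically use a vector index rather than a linear scan.

```python
import math

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def top_k(query_vec, doc_vecs, k=2):
    """Return (doc_id, score) pairs for the k documents most similar to the query."""
    q = normalize(query_vec)
    scored = []
    for doc_id, vec in doc_vecs.items():
        d = normalize(vec)
        scored.append((doc_id, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical pre-computed embeddings for three help-center articles.
docs = {
    "reset-password": [0.9, 0.1, 0.0],
    "change-email": [0.7, 0.3, 0.1],
    "refund-policy": [0.0, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # stand-in for the embedded user query

results = top_k(query, docs, k=2)
print([doc_id for doc_id, _ in results])
```

The same scoring loop covers deduplication (flag pairs above a similarity threshold) and recommendation (rank items against a liked item's vector).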
Choosing a model
- OpenAI text-embedding-3-large — 3072 dims, strong general-purpose baseline, reasonable cost.
- OpenAI text-embedding-3-small — 1536 dims, good enough for most apps at a fraction of the cost.
- Cohere embed-v3 — multilingual strength, separate input types for documents and queries.
- Open models (bge-m3, nomic-embed-text-v2, Stella) — good if you want local inference and no API dependency.
The cheapest model that meets your recall target is the right one. Test on your own data; generic leaderboards often fail to predict how a model performs on your domain.
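Testing on your own data can be as simple as measuring recall@k over a small set of labeled queries. The sketch below assumes you have already run each candidate model and collected its ranked results; the document ids, the two result sets, and the 0.9 target are invented for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant ids that appear in the first k retrieved ids."""
    if not relevant:
        raise ValueError("each query needs at least one relevant id")
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)

# Hypothetical results: for each labeled query, the ranked ids a candidate
# model returned and the ids a human marked as relevant.
small_model_runs = [
    (["d1", "d7", "d3"], ["d1", "d3"]),
    (["d4", "d2", "d9"], ["d2"]),
]
large_model_runs = [
    (["d1", "d3", "d7"], ["d1", "d3"]),
    (["d2", "d4", "d9"], ["d2"]),
]

target = 0.9  # assumed recall target; pick one that fits your application
for name, runs in [("small", small_model_runs), ("large", large_model_runs)]:
    score = mean_recall_at_k(runs, k=3)
    print(name, score, "meets target" if score >= target else "below target")
```

If the cheaper model clears the target at the k your application actually uses, the more expensive one buys nothing measurable.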
What can go wrong
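- Model mismatch — you embedded with one model, switched to another, and kept the index. If the dimensions differ, queries fail outright; if they happen to match, results are silent noise, because vectors from different models do not share a space. Re-embed everything on model change.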
- Chunk-size mismatch — query is one sentence, documents are 2000 tokens each. Similarity is dominated by length bias. Match your chunk granularity to typical queries.
- Language mismatch — an English-trained embedding model produces unreliable vectors for Chinese text. Use a multilingual model or translate first.
- Versioning blindness — a provider may update the model behind a name without notice. Pin the exact version and store it alongside each vector.
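One way to guard against the model-switch and versioning pitfalls above is to store the model identifier with every vector and reject anything that does not match the index. The following is a toy in-memory sketch; StoredVector, VersionedIndex, and the model names are hypothetical and do not correspond to any particular vector store's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoredVector:
    doc_id: str
    model: str      # exact model identifier used at embedding time
    values: tuple   # the embedding itself

class VersionedIndex:
    """Toy index that refuses to mix vectors from different models."""

    def __init__(self, model, dim):
        self.model = model
        self.dim = dim
        self.records = {}

    def add(self, record):
        self._check(record.model, len(record.values))
        self.records[record.doc_id] = record

    def check_query(self, model, vector):
        """Validate a query vector before it is compared against the index."""
        self._check(model, len(vector))

    def _check(self, model, dim):
        if model != self.model:
            raise ValueError(
                f"index built with {self.model!r}, got {model!r}: re-embed first"
            )
        if dim != self.dim:
            raise ValueError(f"expected {self.dim} dims, got {dim}")

# Made-up model names for illustration only.
index = VersionedIndex(model="example-embed-2024-01", dim=3)
index.add(StoredVector("doc-1", "example-embed-2024-01", (0.1, 0.2, 0.3)))

try:
    index.check_query("example-embed-2024-06", [0.1, 0.2, 0.3])
except ValueError as err:
    print("rejected:", err)
```

Checking the model name as well as the dimension matters because two models can share a dimension while producing incompatible vector spaces.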
When NOT to use embeddings
Exact-match lookups (order ID, SKU, UUID) belong in a conventional database index, not a vector store. Embeddings are for fuzzy semantic matching; anything a regex can solve is faster with a regex.
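In practice this often means routing queries before they reach the vector store. The sketch below assumes a made-up order-ID format (ORD- followed by six digits) and the standard UUID layout; the route function and its return labels are hypothetical, so adapt the patterns to your own identifiers.

```python
import re

# Hypothetical identifier formats; adjust the patterns to your own ids.
ORDER_ID = re.compile(r"ORD-\d{6}")
UUID = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)

def route(query):
    """Return ("exact", id) for identifier-shaped queries, else ("semantic", text)."""
    text = query.strip()
    for pattern in (ORDER_ID, UUID):
        if pattern.fullmatch(text):
            return ("exact", text)      # send to a keyed lookup
    return ("semantic", text)           # send to embedding search

print(route("ORD-004217"))
print(route("where is my package?"))
```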