Text Embeddings

An embedding is a dense vector, usually 768 to 3072 floats, produced by a model trained so that semantically similar inputs land close together in vector space. Embeddings are the bridge between natural language and numeric similarity.
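"Close together" is most often measured with cosine similarity. The sketch below is a minimal pure-Python version; the three-dimensional vectors are made-up stand-ins for real embeddings, which would have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values, not output from any real model).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Related inputs should score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```

A score near 1 means the vectors point in nearly the same direction; a score near 0 means they are unrelated.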

What you can do with them

  • Semantic search — find documents close to a query in meaning rather than keywords.
  • Clustering — group support tickets, news articles, or user queries by topic.
  • Classification — train a tiny logistic head on top of embeddings to label text cheaply.
  • Deduplication — detect near-duplicate content that differs in wording.
  • Recommendation — "users who liked X" where X is represented as an embedding.
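Two of the uses above, semantic search and deduplication, reduce to the same operation: score pairs of vectors and keep the best or the ones above a threshold. The following is a brute-force sketch over an in-memory dict. The document ids, vectors, and the 0.95 threshold are all illustrative assumptions; a production system would typically use a vector index instead of a full scan.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, score) pairs most similar to the query, best first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def near_duplicates(doc_vecs, threshold=0.95):
    """Return pairs of doc ids whose similarity is at or above the threshold."""
    ids = sorted(doc_vecs)
    return [
        (a, b)
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
        if cosine(doc_vecs[a], doc_vecs[b]) >= threshold
    ]

# Hypothetical pre-computed embeddings keyed by document id.
DOCS = {
    "reset-password": [0.9, 0.1, 0.1],
    "forgot-login": [0.88, 0.12, 0.1],
    "refund-policy": [0.1, 0.9, 0.2],
}
query = [0.85, 0.15, 0.05]  # stand-in for the embedded query text

print([doc_id for doc_id, _ in top_k(query, DOCS, k=2)])
print(near_duplicates(DOCS))
```

The threshold for deduplication depends on the model and the data, so it is worth tuning on labeled pairs rather than copying a fixed value.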

Choosing a model

  • OpenAI text-embedding-3-large — 3072 dims, strong general-purpose baseline, reasonable cost.
  • OpenAI text-embedding-3-small — 1536 dims, good enough for most apps at a fraction of the cost.
  • Cohere embed-v3 — multilingual strength, with distinct input types for documents and queries.
  • Open models (bge-m3, nomic-embed-text-v2, Stella) — good if you want local inference and no API dependency.

The cheapest model that meets your recall target is the right one. Test on your own data; generic leaderboards often fail to predict performance on a specific corpus.
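One way to make "meets your recall target" concrete is recall@k over a small labeled query set. The sketch below assumes you already have, for each query, the ranked ids a model retrieved and the set of ids a human marked relevant. The model names, costs, and scores are placeholders, not measurements.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top k retrieved ids."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mean_recall_at_k(runs, k: int) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per query."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)

def cheapest_passing(candidates, target: float):
    """candidates: (name, relative_cost, mean_recall). Cheapest name meeting target, else None."""
    passing = [c for c in candidates if c[2] >= target]
    return min(passing, key=lambda c: c[1])[0] if passing else None

# Hypothetical evaluation runs for one model: two queries, top-3 results each.
runs = [
    (["d1", "d7", "d3"], {"d1", "d3"}),  # both relevant docs found
    (["d4", "d2", "d9"], {"d2", "d5"}),  # one of two found
]
print(mean_recall_at_k(runs, k=3))  # 0.75

# Placeholder candidates: (name, relative cost, measured mean recall@k).
candidates = [("model-large", 6.0, 0.91), ("model-small", 1.0, 0.86), ("model-local", 0.5, 0.70)]
print(cheapest_passing(candidates, target=0.85))  # model-small
```

Even a few dozen labeled queries from real traffic tend to be more informative than a public benchmark score.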

What can go wrong

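  • Dimension mismatch — you embedded with one model, switched to another, and kept the old index. If the dimensions differ, queries fail outright; if they happen to match, the vectors still live in unrelated spaces and results are noise. Re-embed everything when the model changes.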
  • Chunk-size mismatch — the query is one sentence, the documents are 2000 tokens each. A long chunk blends several topics into one vector, so similarity reflects chunk length more than relevance. Match your chunk granularity to typical queries.
  • Language mismatch — a model trained mostly on English produces poor embeddings for Chinese text. Use a multilingual model or translate first.
  • Versioning blindness — providers can update or retire embedding models, sometimes with little notice. Pin the model version and store it alongside each vector.
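The model-change and versioning pitfalls can be guarded against by storing the model identifier with every vector and checking it at query time. This is a minimal sketch; `StoredVector`, `ModelMismatchError`, and the model names are hypothetical and not tied to any particular vector store.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoredVector:
    doc_id: str
    model: str                 # pinned model identifier, stored with the vector
    values: tuple[float, ...]

class ModelMismatchError(ValueError):
    """Raised when a query vector and a stored vector are not comparable."""

def check_compatible(query_model: str, query_dim: int, stored: StoredVector) -> None:
    """Refuse to compare vectors from different models or of different sizes."""
    if stored.model != query_model:
        raise ModelMismatchError(
            f"{stored.doc_id} was embedded with {stored.model}, query uses {query_model}"
        )
    if len(stored.values) != query_dim:
        raise ModelMismatchError(
            f"{stored.doc_id} has {len(stored.values)} dims, query has {query_dim}"
        )

def stale_ids(index: list[StoredVector], current_model: str) -> list[str]:
    """Ids that must be re-embedded before the index is consistent again."""
    return [v.doc_id for v in index if v.model != current_model]

# Hypothetical index containing a leftover vector from an older model.
index = [
    StoredVector("a", "example-embed-v2", (0.1, 0.2, 0.3)),
    StoredVector("b", "example-embed-v1", (0.3, 0.2, 0.1)),
]
print(stale_ids(index, "example-embed-v2"))  # ['b']
```

Failing loudly on a mismatch is usually preferable to returning plausible-looking but meaningless rankings.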

When NOT to use embeddings

Exact-match lookups (order ID, SKU, UUID) belong in a conventional database index, not a vector store. Embeddings are for fuzzy semantic matching; anything a regex could solve is faster with a regex.
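In practice this often means a small router in front of the vector store: identifier-shaped queries go to an exact lookup, everything else goes to semantic search. The sketch below assumes a made-up order ID format (`ORD-` plus six digits); the `route` function and its return labels are illustrative only.

```python
import re

# Standard 8-4-4-4-12 hexadecimal UUID layout.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)
# Hypothetical order ID format; replace with your own scheme.
ORDER_ID_RE = re.compile(r"^ORD-\d{6}$")

def route(query: str) -> str:
    """Return "exact" for identifier-shaped queries, "semantic" otherwise."""
    q = query.strip()
    if UUID_RE.match(q) or ORDER_ID_RE.match(q):
        return "exact"      # send to a keyed database lookup
    return "semantic"       # send to embedding search

print(route("ORD-123456"))                            # exact
print(route("123e4567-e89b-12d3-a456-426614174000"))  # exact
print(route("why was my order delayed?"))             # semantic
```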