The Distributional Hypothesis
"You shall know a word by the company it keeps" - J.R. Firth. This principle, combined with high-dimensional geometry and neural network learning, creates the foundation for modern semantic search. Words appearing in similar contexts share semantic properties, enabling mathematical representation of meaning.
This is formalized by the skip-gram objective of word2vec: choose embedding parameters that maximize the probability of each word's context words given the word itself, i.e. minimize

J(θ) = -(1/T) Σ_{t=1..T} Σ_{-w ≤ j ≤ w, j ≠ 0} log P(w_{t+j} | w_t; θ)
Where:
• θ = embedding parameters
• T = corpus size
• w = context window size
• P(w_{t+j} | w_t; θ) = probability of a context word given the center word
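To make the objective concrete, here is a minimal numpy sketch that evaluates J(θ) on a toy corpus using a full softmax. The variable names (W_in, W_out, corpus) and the random toy data are illustrative assumptions; real word2vec training uses negative sampling or hierarchical softmax rather than a full softmax.

```python
import numpy as np

# Minimal sketch: evaluate the skip-gram objective J(θ) on a toy corpus.
# W_in/W_out, the random corpus, and the full softmax are illustrative
# assumptions; real implementations use negative sampling instead.
rng = np.random.default_rng(0)
V, d, w = 10, 8, 2                           # vocab size, embedding dim, window
corpus = rng.integers(0, V, size=50)         # token ids w_1 .. w_T
W_in = rng.normal(scale=0.1, size=(V, d))    # center-word embeddings (part of θ)
W_out = rng.normal(scale=0.1, size=(V, d))   # context-word embeddings (part of θ)

def log_p(context: int, center: int) -> float:
    """log P(w_{t+j} | w_t; θ) via a softmax over all output embeddings."""
    scores = W_out @ W_in[center]
    scores -= scores.max()                   # numerical stability
    return scores[context] - np.log(np.exp(scores).sum())

T = len(corpus)
J, pairs = 0.0, 0
for t, center in enumerate(corpus):
    for j in range(-w, w + 1):
        if j == 0 or not 0 <= t + j < T:
            continue
        J -= log_p(corpus[t + j], center)
        pairs += 1
J /= T                                        # J(θ) as defined above
print(f"J(θ) = {J:.3f}; per-pair NLL = {J * T / pairs:.3f} (log V = {np.log(V):.3f})")
```

With untrained random embeddings the per-pair negative log-likelihood sits near log V, and training pushes it down by moving words that share contexts closer together.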
The remarkable property of learned representations is linear compositionality in the embedding space. The famous example "king - man + woman ≈ queen" demonstrates that semantic relationships become geometric transformations. This emerges naturally from the learning objective as the model encodes consistent vector offsets for similar relationships.
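As a quick illustration of offsets-as-relations, the sketch below ranks words by cosine similarity to the composed vector b - a + c. It assumes a `vectors` dict mapping word → numpy array loaded from some pretrained embedding file (loading not shown), so the names and usage here are hypothetical.

```python
import numpy as np

def analogy(vectors: dict, a: str, b: str, c: str, topn: int = 3) -> list:
    """Rank words by cosine similarity to (b - a + c), excluding a, b, c.

    `vectors` is assumed to map word -> 1-D numpy array (e.g. pretrained
    word2vec or GloVe vectors loaded elsewhere); that loading is not shown.
    """
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)
    scored = []
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        scored.append((float(vec @ target) / np.linalg.norm(vec), word))
    return [word for _, word in sorted(scored, reverse=True)[:topn]]

# With real pretrained vectors, analogy(vectors, "man", "king", "woman")
# is expected to place "queen" at or near the top of the ranking.
```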
The Geometry of Meaning: Manifold Hypothesis
High-dimensional embedding spaces aren't uniformly populated. Semantically valid representations lie on lower-dimensional manifolds within the embedding space. This explains why vector databases work despite the curse of dimensionality—we're navigating meaning manifolds, not the full d-dimensional space.
Correlation Dimension: D_c = lim_{r→0} [log N(r)] / [log r], where N(r) is the fraction of embedding pairs within distance r
Empirical measurements on text embeddings:
• 768-dimensional embeddings → D_c ≈ 50-100, i.e. roughly 50-100 effective dimensions despite 768 ambient dimensions
• Manifold structure enables efficient indexing
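One way to see the gap between ambient and intrinsic dimension is to estimate the log-log slope behind D_c directly. The sketch below does this on synthetic points confined to a low-dimensional linear subspace of a 768-dimensional space, standing in for real text embeddings; the data, sample size, and radius quantiles are illustrative assumptions, and finite samples typically bias the estimate somewhat low.

```python
import numpy as np

# Estimate D_c as the slope of log N(r) vs log r, where N(r) is the fraction
# of point pairs within distance r. Synthetic points on an 8-dimensional
# linear subspace of a 768-dimensional space stand in for real embeddings.
rng = np.random.default_rng(0)
n, ambient_dim, intrinsic_dim = 2000, 768, 8
basis, _ = np.linalg.qr(rng.normal(size=(ambient_dim, intrinsic_dim)))
X = rng.normal(size=(n, intrinsic_dim)) @ basis.T          # shape (n, 768)

# Pairwise distances via the Gram matrix (memory-friendly for n = 2000).
sq = (X ** 2).sum(axis=1)
d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
dists = np.sqrt(d2[np.triu_indices(n, k=1)])

# Sample small radii, compute N(r), and fit the log-log slope.
radii = np.quantile(dists, [0.001, 0.002, 0.005, 0.01, 0.02])
N = np.array([(dists <= r).mean() for r in radii])
slope = np.polyfit(np.log(radii), np.log(N), 1)[0]
print(f"estimated D_c ≈ {slope:.1f} (ambient = {ambient_dim}, intrinsic = {intrinsic_dim})")
```

The estimate tracks the subspace dimension, not the 768 ambient dimensions, which is the manifold picture in miniature.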
Information Capacity Bounds
- Max information: d × b bits (d dimensions, b bits each)
- Actual capacity: Depends on covariance structure
- Differential entropy (Gaussian model): h(X) = ½ log[(2πe)^d |Σ|] (see the sketch after this list)
- Optimal quantization: ~3.32 bits per dimension
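To connect the entropy bound to bits per dimension, the following sketch fits a Gaussian to a batch of embeddings and evaluates h(X) in bits. The decaying-variance synthetic data is an illustrative assumption; a real embedding matrix would be substituted for X.

```python
import numpy as np

# Sketch: differential entropy h(X) = ½ log((2πe)^d |Σ|) for a Gaussian fit
# to a batch of embeddings. Synthetic vectors with decaying per-dimension
# variance stand in for real embeddings (an illustrative assumption).
rng = np.random.default_rng(0)
d = 768
X = rng.normal(size=(5000, d)) * np.linspace(1.0, 0.05, d)  # anisotropic toy data
Sigma = np.cov(X, rowvar=False)

# Use slogdet for numerical stability instead of det() on a 768x768 matrix.
_, logdet = np.linalg.slogdet(Sigma)
h_nats = 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)
h_bits = h_nats / np.log(2)
print(f"h(X) ≈ {h_bits:.0f} bits total, {h_bits / d:.2f} bits per dimension")
```

The more structured the covariance, the fewer bits per dimension the vectors actually carry, which is what leaves room for aggressive quantization.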
Practical Implications
- 4-bit quantization: Near-optimal from info theory
- Dimension reduction: Often succeeds because intrinsic dimensionality is low
- HNSW success: Navigates manifolds, not ambient space
- Storage efficiency: Can compress 8-32x with minimal loss
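As a rough sanity check on these implications, the sketch below applies 4-bit per-vector scalar quantization to synthetic unit-norm embeddings and measures how much cosine similarities move and how much storage is saved. The data and the simple symmetric quantizer are illustrative assumptions, not any particular library's codec.

```python
import numpy as np

# Sketch: 4-bit symmetric scalar quantization of embeddings, then a check of
# how well cosine similarities survive. Synthetic unit-norm vectors stand in
# for real embeddings; the quantizer is a toy, not a production codec.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768)).astype(np.float32)
X /= np.linalg.norm(X, axis=1, keepdims=True)

def quantize(x: np.ndarray, bits: int = 4):
    """Per-vector symmetric quantization to codes in [-(2^(b-1)-1), 2^(b-1)-1]."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=1, keepdims=True) / levels
    return np.round(x / scale).astype(np.int8), scale.astype(np.float32)

def cosine(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

codes, scale = quantize(X, bits=4)
X_hat = codes * scale                                  # dequantized approximation

err = np.abs(cosine(X[:10], X) - cosine(X_hat[:10], X_hat)).mean()
# 4-bit codes pack two per byte; int8 storage here is just for convenience.
compressed_bytes = codes.nbytes / 2 + scale.nbytes
print(f"mean |Δcos| ≈ {err:.4f}, compression ≈ {X.nbytes / compressed_bytes:.1f}x")
```

For float32 inputs this lands near the 8x end of the 8-32x range above; higher ratios come from coarser codes or product quantization.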