GitHub - 0hq/tinyvector: A tiny nearest-neighbor embedding database built with SQLite and PyTorch. (In development!)

tinyvector - the tiny, least-dumb, speedy vector embedding database.
No, you don't need a vector database. You need tinyvector.

In pre-release: prod-ready by late-July. Still in development, not ready!

Features

Tiny: It's in the name. It's just a Flask server, a SQLite DB, and NumPy indexes. Extremely easy to customize, under 500 lines of code.
Fast: Tinyvector will have speed comparable to advanced vector databases on small to medium datasets.
Vertically Scales: Tinyvector stores all indexes in memory for fast querying. Very easy to scale up to 100 million+ vector dimensions without issue.
Open Source: MIT Licensed, free forever.
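
To make the "NumPy indexes in memory" idea concrete, here is a minimal sketch of brute-force cosine search over an in-memory matrix. The helper name `top_k_cosine` and the toy vectors are assumptions for illustration; this is not tinyvector's actual API.

```python
import numpy as np

def top_k_cosine(index: np.ndarray, query: np.ndarray, k: int = 3):
    """Return (ids, scores) of the k rows of `index` most similar to `query`.

    Hypothetical helper for illustration; not part of tinyvector.
    """
    # Normalize rows and the query so a dot product equals cosine similarity.
    index_norm = index / np.linalg.norm(index, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = index_norm @ query_norm
    # Sort by descending score and keep the best k.
    ids = np.argsort(-scores)[:k]
    return ids, scores[ids]

# Toy "index": four 3-dimensional vectors held entirely in memory.
vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
ids, scores = top_k_cosine(vectors, np.array([1.0, 0.0, 0.0]), k=2)
print(ids, scores)
```

Every query scans the whole matrix, which is why this approach stays simple and works well while the dataset fits comfortably in RAM.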

Soon

Powerful Queries: Tinyvector is being upgraded with full SQL querying functionality, something missing from most other databases.
Integrated Models: Soon you won't have to bring your own vectors; just generate them on the server automatically. Will support SBERT, Hugging Face models, OpenAI, Cohere, etc.
Python/JS Client: We'll add a comprehensive Python and JavaScript package for easy integration with tinyvector in the next two weeks.

Versions

🦀 tinyvector in Rust: tinyvector-rs
🐍 tinyvector in Python: tinyvector

We're better than ...

In most cases, most vector databases are overkill for something simple like:

Using embeddings to chat with your documents. Most document search is nowhere close to what you'd need to justify accelerating search speed with HNSW or FAISS.
Doing search for your website or store. Unless you're selling 1,000,000 items, you don't need Pinecone.
Performing complex search queries on a very large database. Even if you have 2 million embeddings, this might still be the better option due to vector databases struggling with complex filtering. Tinyvector doesn't support metadata/filtering just yet, but it's very easy for you to add that yourself.
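
As a rough sketch of adding filtering yourself: keep metadata in a SQLite table, let a `WHERE` clause pick the candidate rows, then rank only those rows by similarity. The table layout, column names, and `search` helper below are assumptions for illustration, not tinyvector's schema.

```python
import sqlite3
import numpy as np

# Hypothetical schema: one row per item, embedding stored as a float32 blob.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, category TEXT, embedding BLOB)")

rows = [
    (1, "shoes", np.array([1.0, 0.0], dtype=np.float32)),
    (2, "shoes", np.array([0.0, 1.0], dtype=np.float32)),
    (3, "hats", np.array([1.0, 0.0], dtype=np.float32)),
]
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(i, cat, vec.tobytes()) for i, cat, vec in rows],
)

def search(query, category, k=1):
    """Filter with SQL first, then rank the survivors by cosine similarity."""
    found = conn.execute(
        "SELECT id, embedding FROM items WHERE category = ?", (category,)
    ).fetchall()
    if not found:
        return []
    ids = [r[0] for r in found]
    mat = np.stack([np.frombuffer(r[1], dtype=np.float32) for r in found])
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    order = np.argsort(-(mat @ q))[:k]
    return [ids[i] for i in order]

result = search(np.array([1.0, 0.0], dtype=np.float32), "shoes")
print(result)
```

Item 3 has the same embedding as item 1 but never reaches the ranking step, because the SQL filter removes it first.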

Usage

# Run the server manually:
pip install -r requirements.txt
python -m server

# Run tests:
pip install pytest pytest-mock
pytest

Embeddings?

What are embeddings?

As simple as possible: Embeddings are a way to compare similar things, in the same way humans compare similar things, by converting text into a small list of numbers. Similar pieces of text will have similar numbers, different ones have very different numbers.
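
A tiny illustration of "similar text, similar numbers", using cosine similarity as the comparison. The three vectors are made up by hand for the example; a real embedding model would produce hundreds or thousands of numbers per text.

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors: near 1 = similar, near 0 = unrelated."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-written toy "embeddings" (assumption: not the output of any real model).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.0]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten))   # close to 1: similar meaning
print(cosine_similarity(cat, invoice))  # close to 0: unrelated meaning
```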

Read OpenAI's explanation.

Get involved

tinyvector is going to be growing a lot (don't worry, will still be tiny). Feel free to make a PR and contribute. If you have questions, just mention @willdepue.

Some ideas for first pulls:

Add metadata and allow querying/filtering. This is especially important since a lot of vector databases literally don't have a WHERE clause lol (or just an extremely weak one). Not a problem here. Read more about this.
Rethink SQLite and choose something else. NoSQL feels fitting for embeddings?
Add embedding functions for easily adding text (sentence transformers, OpenAI, Cohere, etc.)
Let's start GPU accelerating with a PyTorch index. GPUs are great at matmuls -> NN search with a fused kernel. Let's put 32 million vectors on a single GPU.
Help write unit and integration tests.
See all active issues!
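
For anyone picking up the GPU idea, the "NN search is a matmul" observation can be sketched on the CPU first. The example below uses NumPy rather than PyTorch to stay dependency-light, and `batched_top_k` is a hypothetical name; it only shows the shape of the computation, not a fused GPU kernel.

```python
import numpy as np

def batched_top_k(index, queries, k):
    """Top-k inner-product search for many queries with one matrix multiply."""
    scores = queries @ index.T  # shape: (n_queries, n_vectors)
    # argpartition finds the k best columns per row without a full sort.
    part = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    # Order those k candidates from best to worst.
    part_scores = np.take_along_axis(scores, part, axis=1)
    order = np.argsort(-part_scores, axis=1)
    return np.take_along_axis(part, order, axis=1)

rng = np.random.default_rng(0)
index = rng.standard_normal((1000, 64)).astype(np.float32)
index /= np.linalg.norm(index, axis=1, keepdims=True)  # unit-length rows

queries = index[[5, 42]]  # query with two vectors that are already stored
top = batched_top_k(index, queries, k=3)
print(top[:, 0])  # best match per query
```

Because all queries are scored in a single `queries @ index.T`, the same layout maps naturally onto a GPU matrix multiply.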

Known Issues

# Major bugs:
Data corruption SQLite error? Stored vectors end up changing. Replicate by creating a table, inserting vectors, creating an index, and then screwing around till an error happens. Dims end up unmatched (most likely the blob functions or the norm functions, but that doesn't explain why the database is changing).
PCA is not tested, neither is immutable Brute Force index.
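
One possible starting point for narrowing down the dimension mismatch is a round-trip check on blob encoding. The helpers `to_blob` and `from_blob` and the float32 storage type are assumptions for this sketch and may not match the repository's own blob functions.

```python
import sqlite3
import numpy as np

DTYPE = np.float32  # assumption: vectors are stored as float32

def to_blob(vec):
    """Serialize a vector to raw bytes with a fixed dtype."""
    return np.asarray(vec, dtype=DTYPE).tobytes()

def from_blob(blob, dim):
    """Deserialize and fail loudly if the dimension is not what we expect."""
    vec = np.frombuffer(blob, dtype=DTYPE)
    if vec.shape[0] != dim:
        raise ValueError(f"expected {dim} dims, got {vec.shape[0]}")
    return vec

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vecs (id INTEGER PRIMARY KEY, data BLOB)")

original = np.array([0.25, -1.5, 3.0], dtype=DTYPE)
conn.execute("INSERT INTO vecs VALUES (1, ?)", (to_blob(original),))
(blob,) = conn.execute("SELECT data FROM vecs WHERE id = 1").fetchone()

restored = from_blob(blob, dim=3)
print(np.array_equal(original, restored))
```

A dtype mismatch is one way dimensions can silently change: bytes written as float64 and read back as float32 yield twice as many elements, which the check in `from_blob` would catch.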

License

MIT

Latest commit: anantsnh, "Update version in setup", 3d21641, Jul 12, 2023 (65 commits)

Name	Last commit message	Last commit date
assets	large migration 2	Jul 12, 2023
server	migration nearly complete	Jul 12, 2023
tests	migration nearly complete	Jul 12, 2023
tinyvector	package tinyvector and split up server folder	Jul 11, 2023
.Dockerfile	migration nearly complete	Jul 12, 2023
.dockerignore	migration nearly complete	Jul 12, 2023
.gitignore	migration nearly complete	Jul 12, 2023
LICENSE	Create LICENSE	Jul 3, 2023
README.md	Update README.md	Jul 12, 2023
pyproject.toml	Added a test for creating of table	Jul 7, 2023
requirements.txt	migration nearly complete	Jul 12, 2023
setup.py	Update version in setup	Jul 12, 2023
