👨‍💻 SDK

Find it on GitHub.

Embedbase

Before you start, you need to get an API key at app.embedbase.xyz.

Initializing

Installation

npm install embedbase-js

Creating a client

import { createClient } from 'embedbase-js'
 
// you can find the api key at https://embedbase.xyz
const apiKey = 'your api key'
// this is using the hosted instance
const url = 'https://api.embedbase.xyz'
 
const embedbase = createClient(url, apiKey)

Main Operations

Generating text

Embedbase supports 9+ LLMs, including OpenAI, Google, and many state-of-the-art open-source models. If you are interested in using other models, please contact us.

Remember that this counts toward your playground usage; for more information, head to the billing page.

const data = await embedbase
  .dataset('my-documentation')
  .createContext('my-context')
 
const question = 'How do I use Embedbase?'
const prompt =
`Based on the following context:\n${data.join('\n')}\nAnswer the user's question: ${question}`
 
for await (const res of embedbase.useModel('openai/gpt-3.5-turbo-16k').streamText(prompt)) {
    console.log(res)
    // You, can, use, ...
}
 
// or const res = await embedbase.useModel('openai/gpt-3.5-turbo-16k').generateText(prompt)

You can list the available models with await embedbase.getModels().
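For example (a minimal sketch; the exact shape of the entries returned by getModels may differ):

const models = await embedbase.getModels()
console.log(models)
// e.g. [..., 'openai/gpt-3.5-turbo-16k', ...]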

Searching datasets

You can search your dataset(s) using natural language queries. Embedbase will return the most similar items to your query, along with their similarity scores.

// fetching data
const data = await embedbase
  .dataset('test-amazon-product-reviews')
  .search('best hot dogs accessories', { limit: 3 })
 
console.log(data)
// [
//   {
//       "similarity": 0.810843349,
//       "data": "This nice little hot dog toaster is a great addition to our kitchen. It is easy to use and makes a great hot dog. It is also easy to clean. I would recommend this to anyone who likes hot dogs.",
//       "metadata": {
//         "path": "https://amazon.com/hotdogtoaster",
//         "source": "amazon"
//       },
//       "embedding": [0.35332, 0.23423, ...]
//   },
//   {
//       "similarity": 0.294602573,
//       "data": "200 years ago, people would never have guessed that humans in the future would communicate by silently tapping on glass",
//       "embedding": [0.76532, 0.23423, ...]
//   },
//   {
//       "similarity": 0.192932034,
//       "data": "The average car in space is nicer than the average car on Earth",
//       "embedding": [0.52342, 0.23423, ...]
//   },
// ]

You can also filter by metadata:

const data = await embedbase
  .dataset('test-amazon-product-reviews')
  .search('best hot dogs accessories')
  .where('source', '==', 'amazon')

Using Bing Search

The point of internet search in Embedbase is to combine your private information with the latest public information.

Also remember that AIs like ChatGPT only have knowledge up to a certain cutoff date. For example, ask ChatGPT about GPT-4 or about Sam Altman's testimony before the Senate (which happened a few days ago): it will not know about them.

The recommended workflow is like this:

  1. search your question using internet endpoint
  2. (optional) add results to embedbase
  3. (optional) search embedbase with the question
  4. use .streamText() to get your question answered
const data = await embedbase
  // for example, this is a very recent AI paper that LLMs have no knowledge about
  .internetSearch('qlora machine learning paper')
console.log(data)
// [
//   {
//     "title": "Qlora: ...",
//     "description": "We present Qlora, ...",
//     "url": "https://arxiv.org/abs/2104.07540",
//   },
//   ...
// ]
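
Putting the whole workflow together, here is a minimal sketch (the dataset name fresh-news and the question are hypothetical; it reuses the internetSearch result shape and the model from the examples above):

// 1. search the internet for up-to-date information
const results = await embedbase.internetSearch('qlora machine learning paper')

// 2. (optional) store the results in a dataset
await embedbase.dataset('fresh-news').chunkAndBatchAdd(
  results.map((r) => ({ data: r.description, metadata: { path: r.url } }))
)

// 3. (optional) build a context from the dataset with your question
const question = 'What is QLoRA?'
const context = await embedbase.dataset('fresh-news').createContext(question)

// 4. answer the question with the retrieved context
const prompt =
  `Based on the following context:\n${context.join('\n')}\nAnswer the user's question: ${question}`
for await (const res of embedbase.useModel('openai/gpt-3.5-turbo-16k').streamText(prompt)) {
  console.log(res)
}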

Adding Data

You can add data such as text to your dataset(s). If you need to add images or other formats such as audio, please reach out on Discord and we will make it happen instantly!

You can also add metadata alongside the data. This can be useful: for example, if you are feeding an LLM like ChatGPT, a typical best practice is to add the source of the text (for example a URL) as metadata. You can then ask the AI to include links or footnotes in its output.

The simplest way to add data to Embedbase is to use chunkAndBatchAdd, which uses sensible parameters by default:

const documents = [
  { data: 'This is a document' },
  { data: 'This is another document' },
  { data: 'This is a third document' },
  { data: 'This is a fourth document' },
  { data: 'This is a fifth document' },
  { data: 'This is a sixth document' },
  { data: 'This is a seventh document' },
  { data: 'This is an eighth document' },
  { data: 'This is a ninth document' },
  { data: 'This is a tenth document', metadata: { path: 'https://google.com/abcd' }}
]
const data = await embedbase.dataset('test-amazon-product-reviews').chunkAndBatchAdd(documents)
console.log(data)
// [
//   {
//     "id": "eiew823",
//     "data": "This is a document",
//     "embedding": [0.1, 0.2, 0.3, ...]
//   },
//   {
//     "id": "zfuzfv",
//     "data": "This is another document",
//     "embedding": [0.1, 0.2, 0.3, ...]
//   },
//   {
//     "id": "egreegregr",
//     "data": "This is a third document",
//     "embedding": [0.1, 0.2, 0.3, ...]
//   },
//   ...
//   {
//     "id": "vsdfvdvd",
//     "data": "This is a tenth document",
//     "metadata": {
//       "path": "https://google.com/abcd"
//     },
//     "embedding": [0.1, 0.2, 0.3, ...]
//   }
// ]

You can also use add or batchAdd.
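
For example (a minimal sketch; it assumes add takes a single string plus optional metadata, and batchAdd an array of documents like the one above; the dataset name my-dataset is hypothetical):

// add a single document with optional metadata
await embedbase.dataset('my-dataset').add('This is a document', { source: 'notes' })

// add several documents at once, without automatic chunking
await embedbase.dataset('my-dataset').batchAdd([
  { data: 'This is a document' },
  { data: 'This is another document', metadata: { source: 'notes' } },
])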

Extras

Updating data

To update data, Embedbase offers multiple paths:

  1. Suppose you add files to Embedbase that sometimes change on your side, and you want to keep them up to date in Embedbase. In that situation, add a metadata key that identifies each file (for example name), then use replace:
const documents = [
  {
    data: 'Nietzsche - Thus Spoke Zarathustra - Man is a rope, tied between beast and overman — a rope over an abyss.',
    metadata: {
      tag: 'philosophy',
    },
  },
  {
    data: 'Marcus Aurelius - Meditations - He who lives in harmony with himself lives in harmony with the universe',
    metadata: {
      tag: 'philosophy',
    },
  }
 ]
 await embedbase.dataset('library').batchAdd(documents)
 
 const res = await embedbase.dataset('library').replace([{
  data: 'Nietzsche - Thus Spoke Zarathustra - One must have chaos within oneself, to give birth to a dancing star.'
 }, {
  data: 'Marcus Aurelius - Meditations - The happiness of your life depends upon the quality of your thoughts.'
 }, {
  data: 'Lao Tzu - Tao Te Ching - When I let go of what I am, I become what I might be.'
 }], 'tag', '==', 'philosophy')
 console.log(res)
 // [
 //   {
 //     data: 'Nietzsche - Thus Spoke Zarathustra - One must have chaos within oneself, to give birth to a dancing star.',
 //     metadata: {
 //       tag: 'philosophy',
 //     },
 //   },
 //   {
 //     data: 'Marcus Aurelius - Meditations - The happiness of your life depends upon the quality of your thoughts.',
 //     metadata: {
 //       tag: 'philosophy',
 //     },
 //   },
 //   {
 //     data: 'Lao Tzu - Tao Te Ching - When I let go of what I am, I become what I might be.',
 //     metadata: {
 //       tag: 'philosophy',
 //     },
 //   }
 // ]

Note that here it will replace all documents tagged philosophy with the given documents (keeping the previous metadata).

  2. You can also store the ids returned by most Embedbase functions and use them to update your data:
const data =
  await embedbase.dataset('test-amazon-product-reviews').update([{
    // you get this id from add/batchAdd/search/list response
    id: 'eiew823',
    data: 'some new text',
    metadata: {
      path: 'https://google.com/new'
    }
  }])
console.log(data)
// {
//   "id": "eiew823",
//   "data": "some new text",
//   "metadata": {
//     "path": "https://google.com/new"
//   },
//   "embedding": [0.1, 0.2, 0.3, ...]
// }
  3. Lastly, you might simply create a new dataset every time something changes; this assumes you always have access to all the data, for example a GitHub repository. See the sketch below.
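
For example (a minimal sketch of the versioned-dataset approach; the dataset names and commit SHAs are hypothetical):

// hypothetical: derive the dataset name from the current commit SHA
const commitSha = 'abc1234'
await embedbase.dataset(`my-repo-docs-${commitSha}`).chunkAndBatchAdd(documents)
// once the new version is live, drop the previous one
await embedbase.dataset('my-repo-docs-def5678').clear()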

Splitting and chunking large texts

AI models are often limited in the amount of text they can process at once. Embedbase provides a utility function to split large texts into smaller chunks. We highly recommend using this function.

If you're just getting started, we recommend the chunkAndBatchAdd abstraction, which uses sensible parameters by default:

embedbase.chunkAndBatchAdd(...)

Otherwise, to split and chunk large texts yourself, use the splitText function:

import { splitText } from 'embedbase-js';
 
const text = 'some very long text...';
// ⚠️ note that the right chunkSize depends on the embedding model
// used by Embedbase. With models such as OpenAI's embedding model,
// you can use a chunkSize of 500; with other models you may need
// a lower value.
// (Embedbase Cloud uses an OpenAI model at the moment) ⚠️
const chunkSize = 500
// chunkOverlap is the number of tokens shared between consecutive
// chunks; some overlap helps ensure that context is not cut off
// in the middle of a sentence.
const chunkOverlap = 200
// collect the chunks, then add them all in a single batch
const chunks = splitText(text, { chunkSize, chunkOverlap }).map(({ chunk }) => ({ data: chunk }))
await embedbase.dataset('some-data-set').batchAdd(chunks)

Creating a "context"

createContext is very similar to .search, but it returns an array of strings instead of objects. This is useful if you want to feed the results directly to GPT.

// build a context from the documents most relevant to a query
const data = await embedbase
  .dataset('my-documentation')
  .createContext('my-context')
 
console.log(data)
// [
//  "Embedbase API allows to store unstructured data...",
//  "Embedbase API has 3 main functions a) provides a plug and play solution to store embeddings b) makes it easy to connect to get the right data into llms c)..",
//  "Embedbase API lets you use hundreds of llms with a unified api...",
// ]

Listing datasets

const data = await embedbase.datasets()
console.log(data)
// [{"datasetId": "test-amazon-product-reviews", "documentsCount": 2}]

Listing documents

const data = await embedbase.dataset('test-amazon-product-reviews').list()
console.log(data)
// [
//   {
//     "id": "eiew823",
//     "data": "Lightweight. Telescopic. Easy zipper case for storage.
//          Didn't put in dishwasher. Still perfect after many uses.",
//     "metadata": {"path": "https://www.amazon.com/dp/B00004OCNS"},
//     "embedding": [0.1, 0.2, 0.3, ...]
//   },
//   {
//     "id": "uzvuzv",
//     "data": "Lightweight. Telescopic. Easy zipper case for storage.
//          Didn't put in dishwasher. Still perfect after many uses.",
//     "metadata": {"path": "https://www.amazon.com/dp/B00004OCNS"},
//     "embedding": [0.1, 0.2, 0.3, ...]
//   }
// ]

Clearing a dataset

Be careful: this will delete all the documents in the dataset, and they cannot be recovered.

await embedbase.dataset('test-amazon-product-reviews').clear()

Contributing

We welcome contributions to Embedbase.

If you have any feedback or suggestions, please open an issue or join our Discord to discuss your ideas.