Ingest Data

Gathering data from your data sources

The ingestData endpoint is part of the API and is used to send data for ingestion. It returns a task ID once the ingestion has started. You can use this task_id to check the ingestion status on the ingestion status endpoint.

POST https://api.mendable.ai/v1/ingestData

Request

Here is an example request using cURL. The api_key must be a server-side API key, which you can create in the Mendable dashboard.

curl -X POST https://api.mendable.ai/v1/ingestData \
  -H "Content-Type: application/json" \
  -d '{
        "api_key": "YOUR_API_KEY",
        "url": "URL_TO_INGEST",
        "type": "INGESTION_TYPE",
        "include_paths": ["PATH_TO_INCLUDE"],
        "exclude_paths": ["PATH_TO_EXCLUDE"]
      }'

Supported Ingestion Types

Website Crawler

The website crawler is designed to crawl your website and ingest all the pages. The crawler will follow all the links on the website and ingest all the pages it finds. You can also specify paths to include or exclude during the crawl.

# type: website-crawler

{
    "type": "website-crawler",
    "url": "https://docs.mendable.ai",
    "api_key": "YOUR_API_KEY",
    "include_paths": ["/blog/*", "/usecases/*"],
    "exclude_paths": ["/app?*"]
}

Docusaurus

The Docusaurus ingestion type is designed to crawl your Docusaurus website and ingest all the pages. The crawler will follow all the links on the website via the sitemap and ingest all the pages it finds.

# type: docusaurus

{
    "type": "docusaurus",
    "url": "https://docs.mendable.ai",
    "api_key": "YOUR_API_KEY"
}

GitHub

The GitHub ingestion type is designed to ingest all the documentation pages from a GitHub repository. Customization via the API is very limited right now, and ingestion always defaults to your main branch. If you need further customization when ingesting a GitHub repository, try the self-serve dashboard option.

# type: github

{
    "type": "github",
    "url": "https://github.com/nickscamara/nickscamara",
    "api_key": "YOUR_API_KEY"
}

YouTube

The YouTube ingestion type is designed to ingest a YouTube video via its transcript.

# type: youtube

{
    "type": "youtube",
    "url": "https://www.youtube.com/watch?v=123456789",
    "api_key": "YOUR_API_KEY"
}

Single Website URL

The url ingestion type is designed to ingest a single website URL. For now, this method won't work if the website needs JavaScript enabled to render its content. We will be updating it soon.

# type: url

{
    "type": "url",
    "url": "https://docs.mendable.ai/installation",
    "api_key": "YOUR_API_KEY"
}

Sitemap

The Sitemap ingestion type is designed to ingest all the pages from a sitemap.

# type: sitemap

{
    "type": "sitemap",
    "url": "https://docs.mendable.ai/sitemap.xml",
    "api_key": "YOUR_API_KEY"
}

Example Usage

Request

Here is an example request using cURL:

curl -X POST https://api.mendable.ai/v1/ingestData \
  -H "Content-Type: application/json" \
  -d '{
        "api_key": "YOUR_API_KEY",
        "url": "URL_TO_INGEST",
        "type": "INGESTION_TYPE",
        "include_paths": ["PATH_TO_INCLUDE"],
        "exclude_paths": ["PATH_TO_EXCLUDE"]
      }'

or using JavaScript:

const url = 'https://api.mendable.ai/v1/ingestData'

const data = {
  api_key: 'YOUR_API_KEY',
  url: 'URL_TO_INGEST',
  type: 'INGESTION_TYPE',
  include_paths: ['PATH_TO_INCLUDE'],
  exclude_paths: ['PATH_TO_EXCLUDE']
}

fetch(url, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify(data),
})
  .then((response) => response.json())
  .then((data) => console.log(data))
  .catch((error) => console.error('Error:', error))

Response

Here is an example response:

{
  "task_id": 1234567890
}

Request Parameters

Field	Type	Required	Description
api_key	string	Yes	Your Mendable API key
url	string	Yes	URL for data ingestion
type	string	No	Type of ingestion, defaults to "website-crawler". Available types are shown above.
include_paths	array	No	Paths to include during the crawl. Only applicable for "website-crawler" type.
exclude_paths	array	No	Paths to exclude during the crawl. Only applicable for "website-crawler" type.
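
The rules in the table above can be applied on the client before a request is sent. The following is a minimal sketch, not part of the Mendable API: buildIngestRequest and its option names (apiKey, includePaths, excludePaths) are hypothetical, and the list of types is simply copied from the Supported Ingestion Types section.

```javascript
// Hypothetical helper, not part of the Mendable API.
// Ingestion types copied from the "Supported Ingestion Types" section.
const INGESTION_TYPES = [
  "website-crawler",
  "docusaurus",
  "github",
  "youtube",
  "url",
  "sitemap",
];

function buildIngestRequest({ apiKey, url, type, includePaths, excludePaths }) {
  if (!apiKey) throw new Error("api_key is required");
  if (!url) throw new Error("url is required");
  if (type !== undefined && !INGESTION_TYPES.includes(type)) {
    throw new Error(`Unknown ingestion type: ${type}`);
  }

  const body = { api_key: apiKey, url };
  // type is optional; when it is omitted the API defaults to "website-crawler".
  if (type !== undefined) body.type = type;

  // include_paths and exclude_paths only apply to the website crawler,
  // so this sketch drops them for every other type.
  const isCrawler = type === undefined || type === "website-crawler";
  if (isCrawler && includePaths) body.include_paths = includePaths;
  if (isCrawler && excludePaths) body.exclude_paths = excludePaths;

  return body;
}

// Example: the path filter is kept because the type defaults to the crawler.
console.log(
  buildIngestRequest({
    apiKey: "YOUR_API_KEY",
    url: "https://docs.mendable.ai",
    includePaths: ["/blog/*"],
  })
);
```

Dropping the path filters silently for other types is a design choice of this sketch; throwing an error instead would surface the mistake earlier.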

The task_id is returned as an integer. You can use this to check the status of the ingestion task.
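
A minimal sketch of starting an ingestion and keeping the returned task_id, assuming a runtime with a global fetch (for example Node.js 18 or later). Because the request carries a server-side API key, it should run on a server rather than in the browser. startIngestion is a hypothetical helper name, and the fetchImpl parameter exists only so the network call can be stubbed. The shape of error responses is not described here, so the sketch checks only the HTTP status and the presence of task_id.

```javascript
const INGEST_URL = "https://api.mendable.ai/v1/ingestData";

// Hypothetical helper: sends the request body and resolves to the task_id.
// fetchImpl defaults to the global fetch and can be replaced with a stub.
async function startIngestion(body, fetchImpl = fetch) {
  const response = await fetchImpl(INGEST_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!response.ok) {
    throw new Error(`Ingestion request failed with HTTP ${response.status}`);
  }
  const data = await response.json();
  if (data.task_id === undefined) {
    throw new Error("Response did not contain a task_id");
  }
  return data.task_id;
}
```

A call such as startIngestion({ api_key: "YOUR_API_KEY", url: "URL_TO_INGEST" }) resolves to the task_id, which can then be passed to the ingestion status endpoint.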

Ingesting Raw Documents

We also support ingesting raw documents. This is useful if you want to ingest a document that you have already scraped or if you want to ingest a document that is not publicly available.

POST https://api.mendable.ai/v1/ingestDocuments

Example Usage

Request

Here is an example request using cURL:

curl -X POST https://api.mendable.ai/v1/ingestDocuments \
  -H "Content-Type: application/json" \
  -d '{
        "api_key": "SERVER_SIDE_API_KEY",
        "documents": [
          {
            "content": "YOUR_CONTENT_1",
            "source": "yoursource.com",
            "metadata": {
              "version": 10,
              "author": "John Doe"
            },
            "options": {
              "summarize": true,
              "summarize_max_chars": 500
            }
          },
          {
            "content": "YOUR_CONTENT_2",
            "source": "yoursource2.com"
          }
        ]
      }'

Warning: The maximum number of documents is 500. There is also a limit of 2 MB of documents per request, which is around 2,000,000 characters.
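
One way to stay within these limits is to split a large set of documents into several requests. The sketch below is a hypothetical client-side helper, not part of the API. It assumes the 500-document limit applies per request and approximates request size by the character length of each document's content, so the size check is a rough guard rather than an exact byte count.

```javascript
// Limits taken from the warning above.
const MAX_DOCS_PER_REQUEST = 500;
const MAX_CHARS_PER_REQUEST = 2000000;

// Hypothetical helper: splits documents into batches that respect both limits.
// Size is approximated by the length of each document's content string.
function batchDocuments(
  documents,
  maxDocs = MAX_DOCS_PER_REQUEST,
  maxChars = MAX_CHARS_PER_REQUEST
) {
  const batches = [];
  let current = [];
  let currentChars = 0;

  for (const doc of documents) {
    const size = doc.content.length;
    if (size > maxChars) {
      throw new Error(`Document from ${doc.source} exceeds the size limit`);
    }
    // Start a new batch when adding this document would break either limit.
    const full = current.length >= maxDocs || currentChars + size > maxChars;
    if (full && current.length > 0) {
      batches.push(current);
      current = [];
      currentChars = 0;
    }
    current.push(doc);
    currentChars += size;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

Each batch can then be sent as the documents array of its own ingestDocuments request. Leaving some headroom below the character limit is advisable, since metadata and JSON overhead also count toward the request size.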

Metadata and Options

Metadata and options are optional parameters that can be included in the request.

Metadata is a set of key-value pairs that can be used to add additional information about the document. For example, you can include the version of the document or the author's name. This information can be used later for filtering purposes.

Options is another set of key-value pairs that can be used to specify how the document should be processed. For example, you can specify whether the document should be summarized and the maximum number of characters that should be included in the summary.

summarize is a boolean that specifies whether the document should be summarized. The default value is false.
summarize_max_chars is an integer that specifies the maximum number of characters that should be included in the summary. This is not guaranteed: the AI will attempt to follow this limit, but the result may not be exact.
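
The fields described above can be assembled as in the following sketch. makeDocument is a hypothetical helper, not part of the API, and the requirement that summarize_max_chars be a positive integer is an assumption of this example. The sketch builds the same payload as the cURL request earlier in this section.

```javascript
// Hypothetical helper: builds one entry for the "documents" array.
// metadata and options are only attached when they are provided.
function makeDocument(content, source, { metadata, options } = {}) {
  const doc = { content, source };
  if (metadata !== undefined) doc.metadata = metadata;

  if (options !== undefined) {
    // summarize defaults to false, matching the documented default.
    const { summarize = false, summarize_max_chars } = options;
    if (typeof summarize !== "boolean") {
      throw new Error("summarize must be a boolean");
    }
    doc.options = { summarize };
    if (summarize_max_chars !== undefined) {
      // Assumption of this sketch: the limit must be a positive integer.
      if (!Number.isInteger(summarize_max_chars) || summarize_max_chars <= 0) {
        throw new Error("summarize_max_chars must be a positive integer");
      }
      doc.options.summarize_max_chars = summarize_max_chars;
    }
  }
  return doc;
}

const payload = {
  api_key: "SERVER_SIDE_API_KEY",
  documents: [
    makeDocument("YOUR_CONTENT_1", "yoursource.com", {
      metadata: { version: 10, author: "John Doe" },
      options: { summarize: true, summarize_max_chars: 500 },
    }),
    makeDocument("YOUR_CONTENT_2", "yoursource2.com"),
  ],
};

console.log(JSON.stringify(payload, null, 2));
```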

Check Ingestion Status

The ingestionStatus endpoint is part of the API, designed to check the status of an ongoing ingestion task. This function returns the status, metadata, and progress of the ingestion task.

POST https://api.mendable.ai/v1/ingestionStatus

Example Usage

Request

Here is an example request using cURL:

curl -X POST https://api.mendable.ai/v1/ingestionStatus \
  -H "Content-Type: application/json" \
  -d '{
        "task_id": "YOUR_TASK_ID"
      }'

or using JavaScript:

const url = "https://api.mendable.ai/v1/ingestionStatus";

const data = {
  task_id: "YOUR_TASK_ID",
};

fetch(url, {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
  },
  body: JSON.stringify(data),
})
  .then((response) => response.json())
  .then((data) => console.log(data))
  .catch((error) => console.error("Error:", error));

Response

Here is an example response:

{
  "status": "pending",
  "current": 17,
  "current_step": "SCRAPING",
  "metadata": "PENDING",
  "total": 400
}

When the ingestion succeeds, the response will look like this:

{
  "result": {
    "error": "",
    "project_id": 2453,
    "success": true
  },
  "status": "completed"
}

The response provides information about the ongoing task such as the current step, status, and total number of steps.

Request Parameters

Field	Type	Required	Description
task_id	string	Yes	The ID of the task for which the status needs to be fetched

Pending Response Parameters

Field	Type	Description
current	integer	The number of steps that have been completed in the ingestion task.
current_step	string	The current step of the ingestion process, such as "SCRAPING".
metadata	string	The status of metadata, typically "PENDING" until task completion.
status	string	The overall status of the ingestion task, typically "pending", "running", or "completed".
total	integer	The total number of steps in the ingestion task.

Completed Response Parameters

Field	Type	Description
result	object	The result of the ingestion task.
result.error	string	The error message, if any.
result.project_id	integer	The ID of the project that was created.
result.success	boolean	Whether the ingestion task was successful.
status	string	The overall status of the ingestion task, typically "pending" or "completed".

The task_id is a unique identifier for each ingestion task. This ID is used to track the progress of the ingestion. The response includes the current step of the task (current_step), its status (status), current progress (current), and the total number of steps (total).

The status field indicates whether the task is pending, in progress, or completed.

The current and total fields represent the number of steps completed and the total number of steps in the task, respectively.

If the task is in the SCRAPING step, this means that the data is currently being scraped from the provided URL. If the task is in the EMBEDDING step, this means that the data is currently being embedded.
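
The status fields described above lend themselves to a simple polling loop. The following is a minimal sketch, assuming a runtime with a global fetch. waitForIngestion, progressPercent, and the intervalMs, maxAttempts, and fetchImpl options are hypothetical names chosen for this example, and the polling interval is arbitrary. Only the pending and completed response shapes shown above are handled.

```javascript
const STATUS_URL = "https://api.mendable.ai/v1/ingestionStatus";

// Percentage of steps completed, based on the current and total fields.
function progressPercent(status) {
  if (!status.total) return 0;
  return Math.round((status.current / status.total) * 100);
}

// Hypothetical helper: polls the status endpoint until the task reports
// "completed" or the maximum number of attempts is reached.
async function waitForIngestion(
  taskId,
  { intervalMs = 5000, maxAttempts = 120, fetchImpl = fetch } = {}
) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await fetchImpl(STATUS_URL, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ task_id: taskId }),
    });
    if (!response.ok) {
      throw new Error(`Status request failed with HTTP ${response.status}`);
    }

    const status = await response.json();
    if (status.status === "completed") {
      const result = status.result || {};
      if (!result.success) {
        throw new Error(`Ingestion failed: ${result.error}`);
      }
      return result;
    }

    // Still in progress: report the current step and wait before retrying.
    console.log(`${status.current_step}: ${progressPercent(status)}%`);
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("Ingestion did not complete within the allowed attempts");
}
```

On success the promise resolves to the result object, so waitForIngestion(taskId).then((result) => console.log(result.project_id)) prints the project ID once the task completes.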

Keeping your data updated (Auto Sync)

Mendable now offers reingestion for all users through the dashboard. To activate it, go to the Manage Indexes page. After you have ingested from a supported data source (Website/Docs, GitHub, Notion, Zendesk, and others), an Auto Sync option will appear on the Manage Indexes page. Once activated, it syncs your data automatically every 24 hours.

If you previously had a CRON job that was manually set up, it won't show up in the Auto Sync tab just yet.