Overview
This page lists the endpoints and parameters available in the DataClean API.
Getting Started
The DataClean API lets you extract structured data from unstructured text and documents. It provides three main endpoints:
1. Extract Information from files through links or base64 encoded attachments
The /api/file-extract endpoint extracts structured or unstructured information from PDF (text- or image-based) or DOCX documents based on your provided requirements.
POST https://dataclean.tech/api/file-extract
2. Extract Information from text
The /api/extract-info endpoint extracts structured information from text based on your provided format.
POST https://dataclean.tech/api/extract-info
3. Suggest Format
The /api/suggest-format endpoint suggests JSON formats for your text data.
POST https://dataclean.tech/api/suggest-format
Quick Start Guide
1. Sign up for an account and get your API key
2. You get 10,000 free tokens to test out the APIs
3. Purchase a subscription to get more tokens as needed
4. Make API requests using your key
API Key
All API requests require an API key for authentication. Your API key is tied to your subscription and token usage.
Getting Your API Key
1. Sign up for an account on DataClean
2. Navigate to your dashboard
3. Find your API key in the API section
Token Usage
Each API request consumes tokens based on the length of your input text. You can:
- Monitor your token usage in the dashboard
- Purchase additional tokens as needed
- Set up automatic top-ups
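To keep an eye on consumption from your own code as well as the dashboard, you could add up the tokensUsed value that each response reports. The sketch below is only an illustration: the TokenBudget class is a hypothetical helper, not part of the DataClean API, and the default allowance simply mirrors the 10,000 free tokens mentioned in the Quick Start Guide.

```python
class TokenBudget:
    """Tracks cumulative token consumption against a fixed allowance."""

    def __init__(self, allowance: int = 10_000) -> None:
        self.allowance = allowance
        self.used = 0

    @property
    def remaining(self) -> int:
        # Never report a negative balance.
        return max(self.allowance - self.used, 0)

    def record(self, response_body: dict) -> int:
        """Add the tokensUsed value from one response; return tokens remaining."""
        self.used += int(response_body.get("tokensUsed", 0))
        return self.remaining


budget = TokenBudget()
budget.record({"result": [], "tokensUsed": 120})
budget.record({"result": [], "tokensUsed": 80})
print(budget.remaining)  # 9800
```

The dashboard remains the authoritative source for your balance; a local counter like this only helps you stop early before a batch job exhausts it.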
Security
Keep your API key secure and never share it publicly. If you believe your key has been compromised, you can generate a new one from your dashboard.
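One common way to keep the key out of source control is to read it from an environment variable at runtime. The following is a minimal sketch under that assumption; the variable name DATACLEAN_API_KEY and both helper functions are hypothetical choices for illustration, not something the API prescribes.

```python
import os


def load_api_key(env_var: str = "DATACLEAN_API_KEY") -> str:
    """Read the API key from the environment instead of hard-coding it."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key")
    return key


def with_api_key(payload: dict, api_key: str) -> dict:
    """Return a copy of the request body with the apiKey field filled in."""
    return {**payload, "apiKey": api_key}
```

A request body can then be built as with_api_key({"text": "..."}, load_api_key()), so the key never appears in the code itself.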
URL & Endpoints
The DataClean API is accessible through HTTPS endpoints. All requests should be made using the POST method with a JSON body.
Base URL
https://dataclean.tech
Available Endpoints
1. Extract text from PDF/DOCX files using OCR and formatting
/api/file-extract
Required parameters:
- type: "link" or "file"
- attachment: URL or base64 encoded file content
- extraction: "text" or "json"
- apiKey: Your API key
- structure: JSON schema (required when extraction="json")
Supported files:
- PDF files (.pdf)
- Word documents (.docx)
- Google Drive files & Google Docs
- OneDrive links
Maximum file size: 50MB
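The parameter rules above, in particular that structure is needed only when extraction is "json", can be checked on the client before a request is sent. The helper below is a sketch, not an official client: the function name and its error messages are assumptions, while the field names and allowed values come from the list above.

```python
from typing import Optional


def build_file_extract_body(
    source_type: str,
    attachment: str,
    extraction: str,
    api_key: str,
    structure: Optional[dict] = None,
) -> dict:
    """Assemble a /api/file-extract request body and validate it locally."""
    if source_type not in ("link", "file"):
        raise ValueError('type must be "link" or "file"')
    if extraction not in ("text", "json"):
        raise ValueError('extraction must be "text" or "json"')
    if extraction == "json" and structure is None:
        raise ValueError('structure is required when extraction is "json"')

    body = {
        "type": source_type,
        "attachment": attachment,
        "extraction": extraction,
        "apiKey": api_key,
    }
    # Only include structure when the caller supplied one.
    if structure is not None:
        body["structure"] = structure
    return body
```

Validating locally avoids spending a round trip on a request that is missing a required field.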
2. Extract Information
/api/extract-info
Required parameters:
- text: Your input text
- format: Desired JSON format
- apiKey: Your API key
3. Suggest Format
/api/suggest-format
Required parameters:
- text: Your input text
- apiKey: Your API key
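Because /api/suggest-format returns format suggestions and /api/extract-info accepts a format, the two calls can be chained. The sketch below shows one way this might look. It assumes that the first entry of the suggestion result can be passed directly as the format value, which this page does not confirm; post_json and suggest_then_extract are hypothetical helper names, and the host is the one used in the request examples.

```python
import json
import urllib.request
from typing import Callable

BASE_URL = "https://dataclean.tech"


def post_json(path: str, body: dict) -> dict:
    """Send a JSON POST request and decode the JSON response."""
    request = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def suggest_then_extract(
    text: str,
    api_key: str,
    post: Callable[[str, dict], dict] = post_json,
) -> dict:
    """Ask for a format suggestion, then extract using the first suggestion."""
    suggestion = post("/api/suggest-format", {"text": text, "apiKey": api_key})
    candidates = suggestion.get("result") or []
    if not candidates:
        raise ValueError("no format suggestions were returned")
    # Assumption: a suggestion is usable as-is for the "format" parameter.
    chosen_format = candidates[0]
    return post(
        "/api/extract-info",
        {"text": text, "format": chosen_format, "apiKey": api_key},
    )
```

The post argument is injectable so the flow can be exercised with a stub instead of live requests. Note that both calls consume tokens.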
Response Format
All responses are returned in JSON format with the following structure:
{ "result": [/* extracted data or format suggestions */], "tokensUsed": number }
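A client may want to check that a response has this shape before using it. The following sketch parses a raw response string into a small typed object; ApiResponse and parse_response are hypothetical names, and since this page does not describe what an error response looks like, anything without a result field is simply rejected.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class ApiResponse:
    result: Any       # extracted data or format suggestions
    tokens_used: int  # value of the tokensUsed field


def parse_response(raw: str) -> ApiResponse:
    """Decode a response body and check for the documented fields."""
    body = json.loads(raw)
    if not isinstance(body, dict) or "result" not in body:
        raise ValueError("response is missing the 'result' field")
    return ApiResponse(
        result=body["result"],
        tokens_used=int(body.get("tokensUsed", 0)),
    )


parsed = parse_response('{"result": [{"name": "John Doe"}], "tokensUsed": 42}')
print(parsed.tokens_used)  # 42
```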
Example Code
Here are examples of how to use our API endpoints in different programming languages.
Examples for calling the /api/file-extract endpoint
Endpoint: POST /api/file-extract
Required Parameters:
- type: string ("link" | "file") - Source type of the document
- attachment: string - Either a URL or base64 encoded file content
- extraction: string ("text" | "json") - Desired output format
- apiKey: string - Your API authentication key
Optional Parameters:
- structure: object - Required when extraction="json", defines the JSON schema
- prompt: string - Custom extraction instructions
Supported File Types:
- PDF (.pdf)
- Word Documents (.docx)
File Size Limit: 50MB
Special URL Support:
- Google Drive files
- Google Docs
- OneDrive links
- Direct file URLs
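For uploads with type "file", it can help to check the file type and size locally before encoding. The sketch below does that with the standard library. It assumes that the 50MB limit means 50 MiB of raw file content; whether the limit applies before or after base64 encoding (which grows the payload by roughly a third) is not stated here. encode_attachment is a hypothetical helper name.

```python
import base64
from pathlib import Path

# Assumption: the limit is measured on the raw file, in binary megabytes.
MAX_FILE_BYTES = 50 * 1024 * 1024
SUPPORTED_SUFFIXES = {".pdf", ".docx"}


def encode_attachment(path: str) -> str:
    """Validate a local file and return its base64-encoded content."""
    file_path = Path(path)
    if file_path.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ValueError(f"unsupported file type: {file_path.suffix or '(none)'}")
    if file_path.stat().st_size > MAX_FILE_BYTES:
        raise ValueError("file exceeds the 50MB limit")
    return base64.b64encode(file_path.read_bytes()).decode("ascii")
```

The returned string is what goes into the attachment field when type is "file".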
Response Format:
{
"result": string | object,
"tokensUsed": number
}
Example Request Body:
{
"type": "link",
"attachment": "https://example.com/document.pdf",
"extraction": "json",
"structure": {
"name": "string",
"age": "number",
"occupation": "string"
},
"prompt": "Extract person information",
"apiKey": "your_api_key"
}
curl -X POST https://dataclean.tech/api/file-extract \
-H "Content-Type: application/json" \
-d '{
"type": "link",
"attachment": "https://example.com/document.pdf",
"extraction": "json",
"structure": {
"name": "string",
"age": "number",
"occupation": "string"
},
"prompt": "Extract person information",
"apiKey": "your_api_key"
}'
import requests
import base64
url = "https://dataclean.tech/api/file-extract"
headers = {
"Content-Type": "application/json"
}
# For URL/link based extraction
data = {
"type": "link",
"attachment": "https://example.com/document.pdf",
"extraction": "json",
"structure": {
"name": "string",
"age": "number",
"occupation": "string"
},
"prompt": "Extract person information",
"apiKey": "your_api_key"
}
# For base64 file extraction (replaces the data dict above)
with open('document.pdf', 'rb') as file:
    base64_file = base64.b64encode(file.read()).decode('utf-8')
data = {
"type": "file",
"attachment": base64_file,
"extraction": "json",
"structure": {
"name": "string",
"age": "number",
"occupation": "string"
},
"prompt": "Extract person information",
"apiKey": "your_api_key"
}
response = requests.post(url, headers=headers, json=data)
result = response.json()
print(result)
// For URL/link based extraction
fetch('https://dataclean.tech/api/file-extract', {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify({
type: "link",
attachment: "https://example.com/document.pdf",
extraction: "json",
structure: {
name: "string",
age: "number",
occupation: "string"
},
prompt: "Extract person information",
apiKey: "your_api_key"
})
})
.then(response => response.json())
.then(data => console.log(data))
.catch(error => console.error('Error:', error));
// For base64 file extraction
const file = document.querySelector('input[type="file"]').files[0];
const reader = new FileReader();
reader.onload = function() {
const base64File = reader.result.split(',')[1];
fetch('https://dataclean.tech/api/file-extract', {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify({
type: "file",
attachment: base64File,
extraction: "json",
structure: {
name: "string",
age: "number",
occupation: "string"
},
prompt: "Extract person information",
apiKey: "your_api_key"
})
})
.then(response => response.json())
.then(data => console.log(data))
.catch(error => console.error('Error:', error));
};
reader.readAsDataURL(file);
<?php
$url = 'https://dataclean.tech/api/file-extract';
// For URL/link based extraction
$data = array(
'type' => 'link',
'attachment' => 'https://example.com/document.pdf',
'extraction' => 'json',
'structure' => array(
'name' => 'string',
'age' => 'number',
'occupation' => 'string'
),
'prompt' => 'Extract person information',
'apiKey' => 'your_api_key'
);
// For base64 file extraction
$file_content = file_get_contents('document.pdf');
$base64_file = base64_encode($file_content);
$data = array(
'type' => 'file',
'attachment' => $base64_file,
'extraction' => 'json',
'structure' => array(
'name' => 'string',
'age' => 'number',
'occupation' => 'string'
),
'prompt' => 'Extract person information',
'apiKey' => 'your_api_key'
);
$options = array(
'http' => array(
'header' => "Content-type: application/json\r\n",
'method' => 'POST',
'content' => json_encode($data)
)
);
$context = stream_context_create($options);
$result = file_get_contents($url, false, $context);
$response = json_decode($result);
print_r($response);
?>
Examples for calling the /api/extract-info endpoint
...
curl -X POST https://dataclean.tech/api/extract-info \
-H "Content-Type: application/json" \
-d '{
"text": "John Doe is 30 years old and works as a software engineer.",
"format": {
"name": "string",
"age": "number",
"occupation": "string"
},
"apiKey": "your_api_key"
}'
import requests
url = "https://dataclean.tech/api/extract-info"
headers = {
"Content-Type": "application/json"
}
data = {
"text": "John Doe is 30 years old and works as a software engineer.",
"format": {
"name": "string",
"age": "number",
"occupation": "string"
},
"apiKey": "your_api_key"
}
response = requests.post(url, headers=headers, json=data)
result = response.json()
print(result)
fetch('https://dataclean.tech/api/extract-info', {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify({
text: "John Doe is 30 years old and works as a software engineer.",
format: {
name: "string",
age: "number",
occupation: "string"
},
apiKey: "your_api_key"
})
})
.then(response => response.json())
.then(data => console.log(data))
.catch(error => console.error('Error:', error));
<?php
$url = 'https://dataclean.tech/api/extract-info';
$data = array(
'text' => 'John Doe is 30 years old and works as a software engineer.',
'format' => array(
'name' => 'string',
'age' => 'number',
'occupation' => 'string'
),
'apiKey' => 'your_api_key'
);
$options = array(
'http' => array(
'header' => "Content-type: application/json\r\n",
'method' => 'POST',
'content' => json_encode($data)
)
);
$context = stream_context_create($options);
$result = file_get_contents($url, false, $context);
$response = json_decode($result);
print_r($response);
?>
Examples for calling the /api/suggest-format endpoint
...
curl -X POST https://dataclean.tech/api/suggest-format \
-H "Content-Type: application/json" \
-d '{
"text": "John Doe is 30 years old and works as a software engineer.",
"apiKey": "your_api_key"
}'
import requests
url = "https://dataclean.tech/api/suggest-format"
headers = {
"Content-Type": "application/json"
}
data = {
"text": "John Doe is 30 years old and works as a software engineer.",
"apiKey": "your_api_key"
}
response = requests.post(url, headers=headers, json=data)
result = response.json()
print(result)
fetch('https://dataclean.tech/api/suggest-format', {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify({
text: "John Doe is 30 years old and works as a software engineer.",
apiKey: "your_api_key"
})
})
.then(response => response.json())
.then(data => console.log(data))
.catch(error => console.error('Error:', error));
<?php
$url = 'https://dataclean.tech/api/suggest-format';
$data = array(
'text' => 'John Doe is 30 years old and works as a software engineer.',
'apiKey' => 'your_api_key'
);
$options = array(
'http' => array(
'header' => "Content-type: application/json\r\n",
'method' => 'POST',
'content' => json_encode($data)
)
);
$context = stream_context_create($options);
$result = file_get_contents($url, false, $context);
$response = json_decode($result);
print_r($response);
?>