Create Structured Index
Structured indexes in Marqo are tailored for datasets with a defined schema and are particularly effective for complex queries like sorting, grouping, and filtering. They are designed for fast, in-memory operations.
Index settings are supplied when the index is created. The default value of each setting is listed in the tables below.
POST /indexes/{index_name}
Create index with (optional) settings.
This endpoint accepts the application/json content type.
Path parameters
Name | Type | Description |
---|---|---|
index_name | String | Name of the index |
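
As a minimal sketch of the request shape, the snippet below builds (but does not send) a POST request using only Python's standard library. The base URL `http://localhost:8882` is the one used in the examples later on this page. The helper name `build_create_index_request` is hypothetical and is not part of the Marqo client.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_create_index_request(base_url, index_name, settings=None):
    """Build a POST /indexes/{index_name} request without sending it.

    Hypothetical helper for illustration only.
    """
    url = f"{base_url.rstrip('/')}/indexes/{quote(index_name, safe='')}"
    body = json.dumps(settings or {}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_create_index_request(
    "http://localhost:8882",
    "my-first-structured-index",
    {"type": "structured"},
)
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would send it to a running Marqo instance. The body here is deliberately minimal; a real structured index would also need the settings described below.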
Body Parameters
The settings for the index are passed as a nested JSON object. The parameters and their default values are as follows:
Name | Type | Default value | Description |
---|---|---|---|
allFields | List | - | List of fields that might be indexed or queried. Valid only if type is structured |
tensorFields | List | [] | List of fields that are treated as tensors |
model | String | hf/e5-base-v2 | The model to use to vectorise doc content in add_documents() calls for the index |
modelProperties | Dictionary | "" | The model properties object corresponding to model (for custom models) |
normalizeEmbeddings | Boolean | true | Normalize the embeddings to have unit length |
textPreprocessing | Dictionary | "" | The text preprocessing object |
imagePreprocessing | Dictionary | "" | The image preprocessing object |
videoPreprocessing | Dictionary | "" | The video preprocessing object |
audioPreprocessing | Dictionary | "" | The audio preprocessing object |
annParameters | Dictionary | "" | The ANN algorithm parameter object |
type | String | unstructured | Type of the index. The default value is unstructured, but for a structured index this must be set to structured |
vectorNumericType | String | float | Numeric type for vector encoding |
Note: these body parameters are used in both Marqo Open-Source and Marqo Cloud. Marqo Cloud also has additional body parameters. Let's take a look at those now.
Additional Marqo Cloud Body Parameters
Marqo Cloud creates dedicated infrastructure for each index. Using the create index endpoint, you can specify the type of storage for the index (storageClass) and the type of inference (inferenceType). The number of storage instances is set by numberOfShards, the number of replicas by numberOfReplicas, and the number of Marqo inference nodes by numberOfInferences. These parameters are supported only in Marqo Cloud, not in Marqo Open-Source.
Name | Type | Default value | Description | Open Source | Cloud |
---|---|---|---|---|---|
inferenceType | String | marqo.CPU.small | Type of inference for the index. Options are "marqo.CPU.small" (deprecated), "marqo.CPU.large", "marqo.GPU". | ❌ | ✅ |
storageClass | String | marqo.basic | Type of storage for the index. Options are "marqo.basic", "marqo.balanced", "marqo.performance". | ❌ | ✅ |
numberOfShards | Integer | 1 | The number of shards for the index. | ❌ | ✅ |
numberOfReplicas | Integer | 0 | The number of replicas for the index. | ❌ | ✅ |
numberOfInferences | Integer | 1 | The number of inference nodes for the index. | ❌ | ✅ |
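
The Cloud parameters can be layered on top of an existing settings dictionary. The following is a sketch only: the option lists and defaults are copied from the table above, while the helper name `with_cloud_settings` is hypothetical and any further validation that Marqo Cloud performs server-side is not modelled here.

```python
# Options and defaults transcribed from the table above.
INFERENCE_TYPES = {"marqo.CPU.small", "marqo.CPU.large", "marqo.GPU"}
STORAGE_CLASSES = {"marqo.basic", "marqo.balanced", "marqo.performance"}
CLOUD_DEFAULTS = {
    "inferenceType": "marqo.CPU.small",
    "storageClass": "marqo.basic",
    "numberOfShards": 1,
    "numberOfReplicas": 0,
    "numberOfInferences": 1,
}


def with_cloud_settings(settings, **cloud_overrides):
    """Return a copy of `settings` with the Marqo Cloud parameters filled in.

    Hypothetical helper: rejects parameter names and option values
    that do not appear in the table.
    """
    unknown = set(cloud_overrides) - set(CLOUD_DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown Cloud parameters: {sorted(unknown)}")
    cloud = {**CLOUD_DEFAULTS, **cloud_overrides}
    if cloud["inferenceType"] not in INFERENCE_TYPES:
        raise ValueError(f"Unsupported inferenceType: {cloud['inferenceType']!r}")
    if cloud["storageClass"] not in STORAGE_CLASSES:
        raise ValueError(f"Unsupported storageClass: {cloud['storageClass']!r}")
    return {**settings, **cloud}


cloud_settings = with_cloud_settings(
    {"type": "structured"},
    inferenceType="marqo.CPU.large",
    storageClass="marqo.balanced",
)
print(cloud_settings)
```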
Fields
The allFields object contains the fields that might be indexed or queried. Each field has the following parameters:
Name | Type | Default value | Description |
---|---|---|---|
name | String | - | Name of the field |
type | String | - | Type of the field |
features | List | [] | List of features that the field supports |
Available types are:
Field Type | Description | Supported Features |
---|---|---|
text | Text field | lexical_search, filter |
int | 32-bit integer | filter, score_modifier |
float | 32-bit float | filter, score_modifier |
long | 64-bit integer | filter, score_modifier |
double | 64-bit float | filter, score_modifier |
array<text> | Array of text | lexical_search, filter |
array<int> | Array of 32-bit integers | filter |
array<float> | Array of 32-bit floats | filter |
array<long> | Array of 64-bit integers | filter |
array<double> | Array of 64-bit floats | filter |
bool | Boolean | filter |
multimodal_combination | Multimodal combination field | None |
image_pointer | Image URL. Must only be used with a multimodal model such as CLIP | None |
video_pointer | Video URL. Must only be used with a multimodal model such as LanguageBind | None |
audio_pointer | Audio URL. Must only be used with a multimodal model such as LanguageBind | None |
custom_vector | Custom vector, with optional text for lexical/filtering | lexical_search, filter |
map<text, int> | Map of text to integers | score_modifier |
map<text, long> | Map of text to longs | score_modifier |
map<text, float> | Map of text to floats | score_modifier |
map<text, double> | Map of text to doubles | score_modifier |
Available features are:
- lexical_search: The field can be used for lexical search
- filter: The field can be used for exact and range (numerical fields) filtering
- score_modifier: The field can be used to modify the score of the document
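
The type and feature rules above can be checked locally before an index is created. The sketch below transcribes the supported-features table into a lookup and reports any requested feature that a field type does not list. The names `SUPPORTED_FEATURES` and `unsupported_features` are hypothetical, and Marqo's own validation may differ in detail.

```python
# Supported features per field type, transcribed from the table above.
SUPPORTED_FEATURES = {
    "text": {"lexical_search", "filter"},
    "int": {"filter", "score_modifier"},
    "float": {"filter", "score_modifier"},
    "long": {"filter", "score_modifier"},
    "double": {"filter", "score_modifier"},
    "array<text>": {"lexical_search", "filter"},
    "array<int>": {"filter"},
    "array<float>": {"filter"},
    "array<long>": {"filter"},
    "array<double>": {"filter"},
    "bool": {"filter"},
    "multimodal_combination": set(),
    "image_pointer": set(),
    "video_pointer": set(),
    "audio_pointer": set(),
    "custom_vector": {"lexical_search", "filter"},
    "map<text, int>": {"score_modifier"},
    "map<text, long>": {"score_modifier"},
    "map<text, float>": {"score_modifier"},
    "map<text, double>": {"score_modifier"},
}


def unsupported_features(field):
    """Return the requested features that the field's type does not support."""
    field_type = field["type"]
    if field_type not in SUPPORTED_FEATURES:
        raise ValueError(f"Unknown field type: {field_type!r}")
    requested = set(field.get("features", []))
    return sorted(requested - SUPPORTED_FEATURES[field_type])


ok = unsupported_features(
    {"name": "my_int", "type": "int", "features": ["score_modifier"]}
)
bad = unsupported_features(
    {"name": "tags", "type": "array<int>", "features": ["filter", "score_modifier"]}
)
print(ok, bad)
```

An empty list means every requested feature is listed for that type in the table.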
When using multimodal_combination fields, the dependentFields object is required and defines the weights for the multimodal combination field. The dependentFields object is a dictionary where the keys are the names of the fields that are used to create the multimodal combination field and the values are the weights for each field. Field names must refer to fields that are defined in allFields. See the example below for more details.
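
The two rules for dependentFields (it is required, and its keys must name fields defined in allFields) can be sketched as a small local check. The function name `check_dependent_fields` is hypothetical, and this is an illustration rather than Marqo's own validation.

```python
def check_dependent_fields(all_fields):
    """Return a list of problems found in multimodal_combination fields."""
    defined = {field["name"] for field in all_fields}
    problems = []
    for field in all_fields:
        if field["type"] != "multimodal_combination":
            continue
        dependent = field.get("dependentFields")
        if not dependent:
            problems.append(f"{field['name']}: dependentFields is required")
            continue
        for name in dependent:
            if name not in defined:
                problems.append(f"{field['name']}: {name!r} is not in allFields")
    return problems


fields = [
    {"name": "text_field", "type": "text", "features": ["lexical_search"]},
    {"name": "image_field", "type": "image_pointer"},
    {
        "name": "multimodal_field",
        "type": "multimodal_combination",
        "dependentFields": {"image_field": 0.9, "text_field": 0.1},
    },
]
print(check_dependent_fields(fields))
```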
Text Preprocessing Object
The textPreprocessing object contains the specifics of how you want the index to preprocess text. The parameters are as follows:
Name | Type | Default value | Description |
---|---|---|---|
splitLength | Integer | 2 | The length of the chunks after splitting by splitMethod |
splitOverlap | Integer | 0 | The length of overlap between adjacent chunks |
splitMethod | String | sentence | The method by which text is chunked (character, word, sentence, or passage) |
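
To illustrate how splitLength and splitOverlap interact, here is a simplified sliding-window chunker that splits on whitespace, roughly what a splitMethod of word suggests. It is a sketch under that assumption, not Marqo's implementation; in particular, how Marqo detects sentence boundaries and handles the trailing chunk may differ. The function name `chunk_words` is hypothetical.

```python
def chunk_words(text, split_length=2, split_overlap=0):
    """Split text into chunks of `split_length` words.

    Adjacent chunks share `split_overlap` words, so each new chunk
    starts `split_length - split_overlap` words after the previous one.
    """
    if split_length < 1:
        raise ValueError("split_length must be at least 1")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


print(chunk_words("a b c d e", split_length=3, split_overlap=1))
# ['a b c', 'c d e']
```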
Image Preprocessing Object
The imagePreprocessing object contains the specifics of how you want the index to preprocess images. The parameters are as follows:
Name | Type | Default value | Description |
---|---|---|---|
patchMethod | String | null | The method by which images are chunked (simple or frcnn) |
Video Preprocessing Object
The videoPreprocessing object contains the specifics of how you want the index to preprocess videos. The last chunk in the video file will have a start time of the total length of the video file minus the split length.
The parameters are as follows:
Name | Type | Default value | Description |
---|---|---|---|
splitLength | Integer | 20 | The length of each video chunk in seconds |
splitOverlap | Integer | 3 | The length of overlap in seconds between adjacent chunks |
Audio Preprocessing Object
The audioPreprocessing object contains the specifics of how you want the index to preprocess audio. The last chunk in the audio file will have a start time of the total length of the audio file minus the split length.
The parameters are as follows:
Name | Type | Default value | Description |
---|---|---|---|
splitLength | Integer | 20 | The length of each audio chunk in seconds |
splitOverlap | Integer | 3 | The length of overlap in seconds between adjacent chunks |
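
The video and audio sections describe the same rule for the final chunk: it starts at the total length minus the split length. The sketch below computes chunk start times under that rule, assuming whole-second durations and assuming that a file no longer than one chunk yields a single chunk starting at 0. Both assumptions, and the function name `chunk_start_times`, are illustrative and not taken from Marqo.

```python
def chunk_start_times(total_seconds, split_length=20, split_overlap=3):
    """Return the start time (in seconds) of each chunk of a media file."""
    if split_length <= 0 or not 0 <= split_overlap < split_length:
        raise ValueError("need split_length > 0 and 0 <= split_overlap < split_length")
    if total_seconds <= split_length:
        return [0]  # assumption: short files become a single chunk
    step = split_length - split_overlap
    last_start = total_seconds - split_length
    # Regular starts strictly before the final chunk, then the final chunk.
    starts = list(range(0, last_start, step))
    starts.append(last_start)
    return starts


print(chunk_start_times(50))
# [0, 17, 30]
```

With the defaults, a 50-second file gets chunks starting at 0 and 17 seconds, and the last chunk is pulled back to start at 30 seconds so that it ends exactly at the end of the file.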
ANN Algorithm Parameter Object
The annParameters object contains hyperparameters for the approximate nearest neighbour algorithm used for tensor storage within Marqo. The parameters are as follows:
Name | Type | Default value | Description |
---|---|---|---|
spaceType | String | prenormalized-angular | The function used to measure the distance between two points in ANN (angular, euclidean, dotproduct, geodegrees, hamming, or prenormalized-angular). |
parameters | Dict | "" | The hyperparameters for the ANN method (which is always hnsw for Marqo). |
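
The default spaceType, prenormalized-angular, pairs naturally with the default normalizeEmbeddings value of true. The sketch below shows the underlying arithmetic: once vectors have unit length, their dot product equals their cosine similarity, so no per-query normalization is needed. Reading prenormalized-angular as "angular distance that assumes unit-length vectors" is an assumption here, since this page does not define the term.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity of two arbitrary (non-zero) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [1.0, 2.0]
ua, ub = normalize(a), normalize(b)

# The two values agree up to floating-point rounding.
print(cosine(a, b), dot(ua, ub))
```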
HNSW Method Parameters Object
parameters can have the following values:
Name | Type | Default value | Description |
---|---|---|---|
efConstruction | int | 512 | The size of the dynamic list used during k-NN graph creation. Higher values lead to a more accurate graph but slower indexing speed. It is recommended to keep this between 2 and 800 (maximum is 4096) |
m | int | 16 | The number of bidirectional links that the plugin creates for each new element. Increasing and decreasing this value can have a large impact on memory consumption. Keep this value between 2 and 100. |
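
The ranges in the table can be checked before sending a request. In this sketch, exceeding the stated maximum of 4096 for efConstruction is treated as an error, while values outside the recommended ranges only produce warnings. That split, and the function name `check_hnsw_parameters`, are assumptions made for illustration.

```python
def check_hnsw_parameters(ef_construction=512, m=16):
    """Return (errors, warnings) for the given HNSW parameters."""
    errors, warnings = [], []
    if ef_construction > 4096:
        errors.append("efConstruction exceeds the maximum of 4096")
    elif not 2 <= ef_construction <= 800:
        warnings.append("efConstruction is outside the recommended range 2-800")
    if not 2 <= m <= 100:
        warnings.append("m is outside the recommended range 2-100")
    return errors, warnings


print(check_hnsw_parameters())           # defaults pass cleanly
print(check_hnsw_parameters(1000, 200))  # two warnings, no errors
```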
Model Properties Object
The modelProperties object is flexible and is used to set up models that aren't available in Marqo by default (models available by default are listed here). The structure of this object will vary depending on the model.
For Open CLIP models, see here for modelProperties format and example usage.
For Generic SBERT models, see here for modelProperties format and example usage.
Example 1: Creating a structured index for combining text and images
cURL (Marqo Open Source):

```
curl -X POST 'http://localhost:8882/indexes/my-first-structured-index' \
-H "Content-Type: application/json" \
-d '{
    "type": "structured",
    "vectorNumericType": "float",
    "model": "open_clip/ViT-B-32/laion2b_s34b_b79k",
    "normalizeEmbeddings": true,
    "textPreprocessing": {
        "splitLength": 2,
        "splitOverlap": 0,
        "splitMethod": "sentence"
    },
    "allFields": [
        {"name": "text_field", "type": "text", "features": ["lexical_search"]},
        {"name": "caption", "type": "text", "features": ["lexical_search", "filter"]},
        {"name": "tags", "type": "array<text>", "features": ["filter"]},
        {"name": "image_field", "type": "image_pointer"},
        {"name": "my_int", "type": "int", "features": ["score_modifier"]},
        {
            "name": "multimodal_field",
            "type": "multimodal_combination",
            "dependentFields": {"image_field": 0.9, "text_field": 0.1}
        }
    ],
    "tensorFields": ["multimodal_field"],
    "annParameters": {
        "spaceType": "prenormalized-angular",
        "parameters": {"efConstruction": 512, "m": 16}
    }
}'
```
Python (Marqo Open Source):

```python
import marqo

settings = {
    "type": "structured",
    "vectorNumericType": "float",
    "model": "open_clip/ViT-B-32/laion2b_s34b_b79k",
    "normalizeEmbeddings": True,
    "textPreprocessing": {
        "splitLength": 2,
        "splitOverlap": 0,
        "splitMethod": "sentence",
    },
    "imagePreprocessing": {"patchMethod": None},
    "allFields": [
        {"name": "text_field", "type": "text", "features": ["lexical_search"]},
        {"name": "caption", "type": "text", "features": ["lexical_search", "filter"]},
        {"name": "tags", "type": "array<text>", "features": ["filter"]},
        {"name": "image_field", "type": "image_pointer"},
        {"name": "my_int", "type": "int", "features": ["score_modifier"]},
        # This field maps the above image field and text field into a multimodal combination.
        {
            "name": "multimodal_field",
            "type": "multimodal_combination",
            "dependentFields": {"image_field": 0.9, "text_field": 0.1},
        },
    ],
    "tensorFields": ["multimodal_field"],
    "annParameters": {
        "spaceType": "prenormalized-angular",
        "parameters": {"efConstruction": 512, "m": 16},
    },
}

mq = marqo.Client(url="http://localhost:8882", api_key=None)
mq.create_index("my-first-structured-index", settings_dict=settings)
```
cURL (Marqo Cloud):

```
curl -X POST 'https://api.marqo.ai/api/v2/indexes/my-first-structured-index' \
-H 'x-api-key: XXXXXXXXXXXXXXX' \
-H "Content-Type: application/json" \
-d '{
    "type": "structured",
    "vectorNumericType": "float",
    "model": "open_clip/ViT-B-32/laion2b_s34b_b79k",
    "normalizeEmbeddings": true,
    "textPreprocessing": {
        "splitLength": 2,
        "splitOverlap": 0,
        "splitMethod": "sentence"
    },
    "allFields": [
        {"name": "text_field", "type": "text", "features": ["lexical_search"]},
        {"name": "caption", "type": "text", "features": ["lexical_search", "filter"]},
        {"name": "tags", "type": "array<text>", "features": ["filter"]},
        {"name": "image_field", "type": "image_pointer"},
        {"name": "my_int", "type": "int", "features": ["score_modifier"]},
        {
            "name": "multimodal_field",
            "type": "multimodal_combination",
            "dependentFields": {"image_field": 0.9, "text_field": 0.1}
        }
    ],
    "tensorFields": ["multimodal_field"],
    "annParameters": {
        "spaceType": "prenormalized-angular",
        "parameters": {"efConstruction": 512, "m": 16}
    },
    "numberOfShards": 1,
    "numberOfReplicas": 0,
    "inferenceType": "marqo.CPU.large",
    "storageClass": "marqo.basic",
    "numberOfInferences": 1
}'
```
Python (Marqo Cloud):

```python
import marqo

settings = {
    "type": "structured",
    "vectorNumericType": "float",
    "model": "open_clip/ViT-B-32/laion2b_s34b_b79k",
    "normalizeEmbeddings": True,
    "textPreprocessing": {
        "splitLength": 2,
        "splitOverlap": 0,
        "splitMethod": "sentence",
    },
    "imagePreprocessing": {"patchMethod": None},
    "allFields": [
        {"name": "text_field", "type": "text", "features": ["lexical_search"]},
        {"name": "caption", "type": "text", "features": ["lexical_search", "filter"]},
        {"name": "tags", "type": "array<text>", "features": ["filter"]},
        {"name": "image_field", "type": "image_pointer"},
        {"name": "my_int", "type": "int", "features": ["score_modifier"]},
        # This field maps the above image field and text field into a multimodal combination.
        {
            "name": "multimodal_field",
            "type": "multimodal_combination",
            "dependentFields": {"image_field": 0.9, "text_field": 0.1},
        },
    ],
    "tensorFields": ["multimodal_field"],
    "annParameters": {
        "spaceType": "prenormalized-angular",
        "parameters": {"efConstruction": 512, "m": 16},
    },
    "numberOfShards": 1,
    "numberOfReplicas": 0,
    "inferenceType": "marqo.CPU.large",
    "storageClass": "marqo.basic",
    "numberOfInferences": 1,
}

mq = marqo.Client("https://api.marqo.ai", api_key="XXXXXXXXXXXXXXX")
mq.create_index("my-first-structured-index", settings_dict=settings)
```
Example 2: Creating a structured index with no model for use with custom vectors
cURL (Marqo Open Source):

```
curl -X POST 'http://localhost:8882/indexes/my-hybrid-index' \
-H "Content-Type: application/json" \
-d '{
    "model": "no_model",
    "modelProperties": {
        "type": "no_model",
        "dimensions": 3072
    },
    "type": "structured",
    "allFields": [
        {"name": "title", "type": "custom_vector", "features": ["lexical_search"]},
        {"name": "description", "type": "text", "features": ["lexical_search", "filter"]},
        {"name": "time_added_epoch", "type": "int", "features": ["score_modifier"]}
    ],
    "tensorFields": ["title"]
}'
```
Python (Marqo Open Source):

```python
import marqo

mq = marqo.Client("http://localhost:8882", api_key=None)

mq.create_index(
    index_name="my-hybrid-index",
    type="structured",
    model="no_model",
    model_properties={"type": "no_model", "dimensions": 3072},
    all_fields=[
        {"name": "title", "type": "custom_vector", "features": ["lexical_search"]},
        {
            "name": "description",
            "type": "text",
            "features": ["lexical_search", "filter"],
        },
        {"name": "time_added_epoch", "type": "float", "features": ["score_modifier"]},
    ],
    tensor_fields=["title"],
)
```
cURL (Marqo Cloud):

```
curl -X POST 'https://api.marqo.ai/api/v2/indexes/my-hybrid-index' \
-H 'x-api-key: XXXXXXXXXXXXXXX' \
-H "Content-Type: application/json" \
-d '{
    "model": "no_model",
    "modelProperties": {
        "type": "no_model",
        "dimensions": 3072
    },
    "type": "structured",
    "allFields": [
        {"name": "title", "type": "custom_vector", "features": ["lexical_search"]},
        {"name": "description", "type": "text", "features": ["lexical_search", "filter"]},
        {"name": "time_added_epoch", "type": "int", "features": ["score_modifier"]}
    ],
    "tensorFields": ["title"],
    "numberOfShards": 1,
    "numberOfReplicas": 0,
    "inferenceType": "marqo.CPU.large",
    "storageClass": "marqo.basic",
    "numberOfInferences": 1
}'
```
Python (Marqo Cloud):

```python
import marqo

mq = marqo.Client("https://api.marqo.ai", api_key="XXXXXXXXXXXXXXX")

mq.create_index(
    index_name="my-hybrid-index",
    type="structured",
    model="no_model",
    model_properties={"type": "no_model", "dimensions": 3072},
    all_fields=[
        {"name": "title", "type": "custom_vector", "features": ["lexical_search"]},
        {
            "name": "description",
            "type": "text",
            "features": ["lexical_search", "filter"],
        },
        {"name": "time_added_epoch", "type": "float", "features": ["score_modifier"]},
    ],
    tensor_fields=["title"],
)
```
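
With no_model, vectors must be supplied by the caller, and their length should match the dimensions value in modelProperties (3072 in this example). The sketch below builds a document for such an index and checks the vector length first. The document shape, a dictionary with "content" and "vector" keys for the custom_vector field, is an assumption not confirmed on this page, and the helper name `make_custom_vector_document` is hypothetical.

```python
DIMENSIONS = 3072  # matches modelProperties.dimensions in the example above


def make_custom_vector_document(doc_id, title_text, vector, description):
    """Build a document whose `title` field carries a caller-supplied vector.

    The {"content": ..., "vector": ...} shape is an assumption about the
    expected custom_vector format, not something this page specifies.
    """
    if len(vector) != DIMENSIONS:
        raise ValueError(f"expected {DIMENSIONS} dimensions, got {len(vector)}")
    return {
        "_id": doc_id,
        "title": {"content": title_text, "vector": list(vector)},
        "description": description,
    }


doc = make_custom_vector_document(
    "doc1",
    "An example title",
    [0.0] * DIMENSIONS,  # placeholder vector; real vectors come from your own model
    "An example description",
)
print(len(doc["title"]["vector"]))
```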