Index Your Data: Azure Cognitive Search – Part 3

Azure, Cloud Native

In this article series, you learn how to use the Azure Cognitive Search, prepare your application data to make it searchable, and improve your search results' performance and quality. In the first article, I introduced Azure Cognitive Search Services. In the second part, I demonstrated how to build an index that contains your searchable data. Now, you will see how to fill your index automatically from an Azure data source.

Article Series

Part 1: Introduction to Azure Cognitive Search
Part 2: The Search Index
Part 3: Index your Data ⬅
Part 4: Integrate Azure Cognitive Search Into Your Application

If you want to search your data, the index we created in the last article has to be filled with information first. Azure Cognitive Search offers different ways to bring data into an index. If you have your own data source, you have to push data into the index programmatically using, for instance, the .NET SDK or the REST API. However, if you have a data source on Azure, you are able to use indexers. Think of the indexer as a microservice that runs on Azure and pushes data into your index actively. This indexer can read data from different data sources like Azure Blob Storage or Azure SQL.

If you look at the architectural diagram from the introduction article, you can see how index and indexer rely on each other. The indexer grabs the data from the data storage and puts it into the index based on the defined fields.

Create an Indexer

Like the index, there are also different ways to create the indexer. Using the Azure Portal, you can create an indexer when importing data from an Azure Storage service. However, using the import wizard also forces you to create a new index. There is no way to create an indexer for an existing index in the Azure portal. Therefore, you have to use an SDK or the REST API. Building it in the Azure portal offers nearly the same options as creating the indexer with the SDK:

The only required parameter to create an indexer is the name, which has to be unique in your search service. There are additional parameters, like the schedule for execution or a description. For advanced configuration, there are more options like the parsing format or excluded items.

If you do not want to create the indexer in the portal, you can also do so in the .NET SDK in your own code. Before we start creating an indexer, we need an Azure data source where the indexer can access the data he has to index. To create a data source, the SearchServiceClient offers methods to create and delete data sources. With DataSources.CreateAsync we can pass a AzureBlobStorage which we created in the last article.

				
					// Delete data source if existing
if (await serviceClient.DataSources.ExistsAsync(searchIndexerConfig.DataSourceName))
		await serviceClient.DataSources.DeleteAsync(searchIndexerConfig.DataSourceName);

// Create data source
await serviceClient.DataSources.CreateAsync(
	DataSource.AzureBlobStorage(
		searchIndexerConfig.DataSourceName,
		"<AzureBlogStorageConnectionString>",
    "<IndexContainerName>"));

To create an indexer, you need a configuration that describes it. I created a Json file that includes all needed information to create an indexer that can fill our index with the data from the PokeApi that is stored in the Blobstorage we made in the last article.

				
					{
	// name of the index
  "name": "pokeapi-indexer", 
	// Azure data source
  "dataSourceName": "pokeapi-datasource", 
  // name of the index that has to be filled
  "targetIndexName": "pokeapi-index", 
  "disabled": null,
  "schedule": null,
  "parameters": {
    "configuration": {
	    // Specify which data will be used to fill the index
      "dataToExtract": "contentAndMetadata",
      // Format of the data that has to be parsed.
      "parsingMode": "json"
    }
  }
}

Defining the indexer allows us to specify which data has to be indexed and which parsingMode should be used. For the dataToExtract, we can specify that the content and the metadata should be indexed. But it is also possible to only use the metadata so the storage item’s content will not be used for filling the index. If you are using Azure Blob storage, you can also define the parsingMode if you want to use the content to fill the index. So it is possible to choose a parsing mode like JSON or only plain Text. The dataSourceName is the name of your Azure data source we have created earlier. This is necessary for the indexer to know where to get the data from.

Before you can work with indexers, you need a SearchServiceClient instance. This may be the same client object you have created to use the index (see “.NET SDK” in the index article).

				
					// Check if the indexer exists
if (await serviceClient.Indexers.ExistsAsync(indexerConfig.Name)) {
    // Delete the indexer
    await serviceClient.Indexers.DeleteAsync(indexerConfig.Name);
}

// Create the new indexer
await serviceClient.Indexers.CreateAsync(indexerConfig);

The SearchServiceClient has a property called Indexers, which you can use to access operations dedicated to indexers. Before creating a new indexer, we check if an indexer with the same name exists by calling the ExistsAsync method. If there is an existing indexer, we delete it before creating a new one. We create a new indexer by invoking the CreateAsync method, passing the indexer-configuration.

Start an Indexer-run

When you have created your indexer, you can schedule it to run periodically, or start the indexer manually. As with all other functions for the indexes or indexer, you can run the indexer in the Azure portal or with the REST API/SDK. For every indexer, you have an execution history in the Azure portal where you can see if the indexer was successful or has failed.

You can also see how many documents have been indexed in the last indexer runs. If you want to run your indexer in your own code, you just have to call the RunAsync method with the indexer name as a parameter. The method is available on the Indexers property on your SearchServiceClient instance.

				
					await serviceClient.Indexers.RunAsync(indexerConfig.Name);

Important to know:

If you run your indexer, an exception is thrown based on the lowest pricing tier if you run the indexer more often than every 180 seconds.

Based on your data source, the indexer might not delete data that is deleted on your data source. If you are using Blob Storage, the indexer does not recognize if a blob is deleted. It only registers new/updated blobs. So it is necessary to delete the index entry separately or to rebuild the entire index. In my demo application, I entirely recreate the index (delete -> create), so it is not necessary to delete the items.

I run my code in an Azure Function as an easy-to-get-started-API directly running in Azure. This function combines the import of the data from the PokeApi into the BlobStorage, creating the index and creating the indexer to fill the index. You can find the complete Azure function in the GitHub Repository.

Upcoming

We have filled the index that we created in the last article with data. Now we want to search through this data. In the next and final article, I will show you how you can query the search service from your JavaScript frontend and how to include the search into your application.

Current articles, screencasts and interviews by our experts

Don’t miss any content on Angular, .NET Core, Blazor, Azure, and Kubernetes and sign up for our free monthly dev newsletter.

Improved RAG: More effective Semantic Search with content transformations

One of the more pragmatic ways to get going on the current AI hype, and to get some value out of it, is by leveraging semantic search. This is, in itself, a relatively simple concept: You have a bunch of documents and want to find the correct one based on a given query. The semantic part now allows you to find the correct document based on the meaning of its contents, in contrast to simply finding words or parts of words in it like we usually do with lexical search. In our last projects, we gathered some experience with search bots, and with this article, I'd love to share our insights with you.

read article >

17.05.2024

| Sebastian Gingter

Angular

View Transition API Integration in Angular—a brave new world (Part 1)

If you previously wanted to integrate view transitions into your Angular application, this was only possible in a very cumbersome way that needed a lot of detailed knowledge about Angular internals. Now, Angular 17 introduced a feature to integrate the View Transition API with the router. In this two-part series, we will look at how to leverage the feature for route transitions and how we could use it for single-page animations.

read article >

15.04.2024

| Sascha Lehmann

.NET

Data Access in .NET Native AOT with Sessions

.NET 8 brings Native AOT to ASP.NET Core, but many frameworks and libraries rely on unbound reflection internally and thus cannot support this scenario yet. This is true for ORMs, too: EF Core and Dapper will only bring full support for Native AOT in later releases. In this post, we will implement a database access layer with Sessions using the Humble Object pattern to get a similar developer experience. We will use Npgsql as a plain ADO.NET provider targeting PostgreSQL.

read article >

15.11.2023

| Kenny Pflug

Index Your Data: Azure Cognitive Search – Part 3

In this article:

Create an Indexer

Start an Indexer-run

Upcoming

Current articles, screencasts and interviews by our experts

Improved RAG: More effective Semantic Search with content transformations

View Transition API Integration in Angular—a brave new world (Part 1)

Data Access in .NET Native AOT with Sessions