- Lab
- A Cloud Guru
Apply OCR to Search Scanned Documents in Azure AI Search
In this hands-on lab, you will develop an AI-enhanced Azure AI Search solution: creating the search service, retrieving a variety of image types from an external source, importing that data to generate a search index, and adding a skillset, powered by the Azure AI Vision service, that performs OCR (optical character recognition) over the images. You will validate your work by searching the data using the JSON query editor in the Search Explorer and comparing the results against your own observations of the images. All work takes place in the Azure portal, and no coding is required to create the resources and test your solution. However, you will need permission to download a few image files to your local environment in order to complete the lab objectives.
Challenge
Create an Azure AI Search Service
You should already be logged in to the Azure portal, using the credentials provided with the lab. When you first log in to the Azure portal, you will land on the overview page for the resource group already deployed for you. Note the location/region of the resource group, and note the Azure Storage account already deployed for you, with a name that starts with "labstorage...".
Using the Azure portal, create an Azure AI Search service with the following configuration:
- Create the service in the existing resource group.
- Use any valid name you choose.
- Ensure the location matches the location of the resource group for the lab.
- Create the service in the Basic pricing tier. Do not choose the Free tier, as only one is allowed per subscription, and your lab environment is on a shared subscription with other students.
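The lab itself needs no code, but the configuration above can be written down as data for readers who want to see it outside the portal. The following is a minimal sketch, not the lab's method: the helper names `is_valid_service_name` and `build_search_service_request` are hypothetical, the naming rules are a simplified reading of the documented constraints (lowercase letters, digits, and dashes; 2 to 60 characters; no dash in the first two or the last position; no consecutive dashes), and the returned dictionary only mirrors the `location` and `sku` settings chosen in the portal.

```python
import re


def is_valid_service_name(name: str) -> bool:
    """Simplified check of the search service naming rules (an assumption,
    not an official validator)."""
    if not 2 <= len(name) <= 60:
        return False
    if not re.fullmatch(r"[a-z0-9-]+", name):
        return False
    # No dash in the first two characters, none at the end, none doubled.
    if "-" in name[:2] or name.endswith("-") or "--" in name:
        return False
    return True


def build_search_service_request(name: str, location: str, sku: str = "basic") -> dict:
    """Hypothetical helper: return the settings this challenge asks for."""
    if not is_valid_service_name(name):
        raise ValueError(f"invalid search service name: {name!r}")
    if sku == "free":
        # The lab subscription is shared, and only one Free service is allowed.
        raise ValueError("use the Basic tier in this lab, not Free")
    # The location should match the lab resource group's location.
    return {"location": location, "sku": {"name": sku}}
```

Passing the resource group's region as `location` keeps the service in the same place as the lab's storage account, which is what the challenge requires.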
Challenge
Import External Data and Add Skillset to Perform OCR
In this objective, you will complete the tasks required to import sample data and build an index on your search service, based on that data. The indexer that populates the index will also include a skillset, powered by Azure AI Vision, that performs OCR on the images to make their text searchable.
Retrieve and store sample data
- You should have already created an Azure AI Search service. Navigate to the newly deployed service, and on the "Overview" page, select the Resource Group link to return to the resource group that was set up for you (the landing page when you first logged in to the lab).
- Navigate to the Azure Storage resource already created for you and navigate to the blob container called lab-container.
- In a separate browser window, navigate to the GitHub repository link provided in the "Additional Information and Resources" section of this lab.
- You should see several files in the OCR folder. Download all of the files to your local environment, then return to the storage account in the Azure portal and upload all of them to the blob container called lab-container.
- Use the breadcrumb trail to return to the resource group Overview page. Navigate to the newly deployed Azure AI Search service.
Import data and configure index and skillset
On the "Overview" page for the search service, select "Import Data" to kick off the wizard. Let the wizard guide you through the process, including the following specific details and properties.
- Choose to import data from Azure Blob Storage. Set up the new Data Source with any name you prefer, include all content and metadata, and set the parsing mode to Default.
- Select the pre-existing storage account and lab-container. You will not need to specifically reference files in the container; data will be pulled from all valid sources of data in the selected blob storage container.
- Under Add cognitive skills, Attach AI Services, confirm that the AI Services Resource Name is "Free (Limited enrichments)." This setting determines the compute resource used to power the AI enrichment(s) you select. The free option lets you use non-billed resources instead of setting up and paying for compute and storage on an Azure AI multi-service resource.
- Under Add Enrichments, check the box for Enable OCR and merge all text into merged_content field.
- All other features on the Add cognitive skills tab can remain at their defaults.
- Under Customize target index, familiarize yourself with the fields extracted from the source data and the fields generated to capture the OCR text. For simplicity, make all fields Retrievable and Searchable, including the newly added fields from the skillset.
- Move on to the Create an indexer tab and select Submit to create the index, the skillset, and the indexer, which will kick off the first run of the indexer.
- When the indexer run is complete, navigate to the index populated by the indexer and note the number of JSON documents created and the storage size. There should be 6 documents, one for each image in the source data.
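For readers curious about what the wizard sets up behind the scenes, the choices above correspond roughly to a skillset with an OCR skill followed by a text merge skill, plus a few indexer settings. The sketch below is an approximation under stated assumptions: the function names are hypothetical, the skill types and input/output names follow the documented `OcrSkill` and `MergeSkill` definitions, and the exact JSON the wizard generates in your lab may differ in names and defaults.

```python
def build_ocr_skillset(name: str) -> dict:
    """Hypothetical helper: approximate the skillset created by
    'Enable OCR and merge all text into merged_content field'."""
    ocr_skill = {
        "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
        # Runs once per image extracted from each blob.
        "context": "/document/normalized_images/*",
        "inputs": [{"name": "image", "source": "/document/normalized_images/*"}],
        "outputs": [{"name": "text", "targetName": "text"}],
    }
    merge_skill = {
        "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
        "context": "/document",
        "inputs": [
            {"name": "text", "source": "/document/content"},
            {"name": "itemsToInsert", "source": "/document/normalized_images/*/text"},
            {"name": "offsets", "source": "/document/normalized_images/*/contentOffset"},
        ],
        # The merged text lands in the field the lab asks you to inspect.
        "outputs": [{"name": "mergedText", "targetName": "merged_content"}],
    }
    return {"name": name, "skills": [ocr_skill, merge_skill]}


def build_indexer_parameters() -> dict:
    """Indexer settings matching the data source choices in the wizard."""
    return {
        "configuration": {
            "dataToExtract": "contentAndMetadata",  # all content and metadata
            "parsingMode": "default",
            # Needed so images are handed to the OCR skill.
            "imageAction": "generateNormalizedImages",
        }
    }
```

Comparing this sketch with the JSON shown on the Skillsets page of your search service is a quick way to see which parts the wizard filled in for you.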
Tip: If the UI Indicates 0 Documents and 0 Bytes
If the UI in the index screen appears to indicate that there are no documents, make sure the indexer has completed running. However, there may also be a quirk in the UI. You can also perform a quick query by putting an asterisk (*) in the search bar and running the query. If the query returns documents, the UI just hasn't caught up with the underlying data statistics; you can proceed with the next objective.
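The same check can be expressed as a JSON query for the JSON view of Search Explorer. The sketch below builds such a query body in Python; `build_query` is a hypothetical helper, and `metadata_storage_name` is assumed to be present in your index (it is a standard blob metadata field, but confirm it in your own field list). Setting `count` to true asks the service to report the total number of matches, which should agree with the document count once the statistics catch up.

```python
import json


def build_query(search_text: str = "*",
                fields: tuple = ("metadata_storage_name", "merged_content")) -> str:
    """Hypothetical helper: return a JSON query body for Search Explorer."""
    body = {
        "search": search_text,        # "*" matches every document
        "count": True,                # include the total match count
        "select": ", ".join(fields),  # return only the fields of interest
    }
    return json.dumps(body, indent=2)


if __name__ == "__main__":
    # Paste the printed JSON into the JSON view of Search Explorer.
    print(build_query())
```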
Challenge
Compare OCR Output to Visual Observation of Document Contents
Context: For this objective, you should already be on the index page for your search index, which defaults to the Search Explorer pane.
Browse the index data with Search Explorer
- Using the Search Explorer, browse the index to examine the output of the source data, especially the merged_content field. Tips: To see all 6 documents, execute a search with nothing or a * in the search bar. Use the horizontal scroll bar at the bottom to view the OCR text extracted from each image. Better yet, copy and paste the search output into a text editor that provides word wrapping for easier viewing.
- Open the images on your desktop or in the GitHub user interface to compare the text in the scanned images to the OCR output. Think first about how useful the output is for typical search scenarios, which often don't require much precision in the OCR output: the search application will return the image itself to the user, not necessarily the OCR text. Then consider what other AI-driven enhancements could be added to the skillset to make the data even more useful for Azure AI Search use cases and/or for extended use cases.
- Click on the search service name in the breadcrumb trail at the top of the page.
- Click on Skillsets and click on the skillset you created to learn more about building more complex AI-driven skillsets.
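As an optional alternative to pasting the output into a text editor, the comparison described in this objective can be roughed out with a few lines of Python. This is only a sketch under assumptions: the function names are hypothetical, the search output is assumed to be the JSON copied from Search Explorer (with results under `value`), and `metadata_storage_name` is assumed to be among the returned fields. The similarity score is a crude character-level ratio from the standard library, not a formal OCR accuracy metric.

```python
import json
import textwrap
from difflib import SequenceMatcher


def wrapped_ocr_text(search_output: str, width: int = 80) -> dict:
    """Map each document name to its merged_content, word-wrapped."""
    results = json.loads(search_output)
    wrapped = {}
    for doc in results.get("value", []):
        name = doc.get("metadata_storage_name", "(unnamed)")
        # Collapse line breaks and repeated spaces before wrapping.
        text = " ".join((doc.get("merged_content") or "").split())
        wrapped[name] = textwrap.fill(text, width=width)
    return wrapped


def similarity(ocr_text: str, expected_text: str) -> float:
    """Rough 0..1 similarity between OCR output and your own transcription."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return SequenceMatcher(None, normalize(ocr_text), normalize(expected_text)).ratio()
```

To use it, save the copied search output to a file, pass its contents to `wrapped_ocr_text`, and compare any entry against a sentence you typed while looking at the corresponding image. A ratio well below 1.0 can still be perfectly adequate for search, which is the point this objective asks you to consider.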