15 Essential Azure Data Factory Interview Questions
Azure Data Factory is a Microsoft cloud-based solution that captures raw, unorganized data and turns it into meaningful information. It is an ETL (extract, transform, and load) data integration service that automates the transformation of raw data. In this post, we will look at the most common questions asked during Azure Data Factory employment interviews.
Data can arrive in any format because it comes from various sources. Such a wide range of sources transfers or channels the data in a variety of ways, often in mixed formats. Therefore, it is critical to ensure that this data is efficiently managed whenever we send it to the cloud or to a specific storage location. As a result, you must transform the data and eliminate any redundant information before storing it.
Azure Data Factory is a cloud-based integration solution that allows you to create data-driven processes for organizing and automating data transportation and transformation in the cloud.
You can use Azure Data Factory to construct and plan data-driven processes (also known as pipelines) that may consume data from various sources. Then, it can process and transform the data using compute services such as HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning.
Note: Important Azure Data Factory interview question
This can be done by creating a new Data Factory instance using ADF's Copy Data tool.
Creating a Data Factory instance is quite a simple task. First, ensure that your Azure user account is assigned the Contributor or Owner role, or is an administrator, of the Azure subscription. You can use a general-purpose Azure Storage account (specifically Blob storage) as both the source and destination data store. You will also need the name of your Azure Storage account.
Next, create a blob container and make sure the two folders involved are present in it; otherwise, create the relevant folders and upload the required files.
Next, you need to create a data factory by going to the Azure Portal. To do this, follow these steps:
- Go to the home page of Azure Data Factory and start the Copy Data tool.
- Next, create a new connection using Azure Blob storage.
- Select the source file from your directory, and then select the folder in your Blob storage container where the files should be copied for the destination data store.
- And you're done!
Note: It is considered good practice to select Monitor on the complete deployment page to monitor the pipeline you created.
You can create a schedule trigger to schedule a pipeline to run periodically (hourly, daily, etc.).
When creating a schedule trigger, you specify a schedule (start date, recurrence, end date, etc.) for the trigger and associate it with a pipeline. Pipelines and triggers have a many-to-many relationship. This means multiple triggers can kick off a single pipeline, and a single trigger can kick off multiple pipelines.
To do this, follow these steps:
- Switch to the Edit tab in Data Factory, or the Integrate tab in Azure Synapse.
- Select Trigger on the menu.
- Go to Add Triggers and select New.
- Then select settings as per your requirements.
- Select “Publish all” to publish the changes. The trigger doesn't start triggering the pipeline runs until you publish the changes.
Alternatively, you can even use Azure PowerShell to create, start, and monitor a schedule trigger. For this, the very first step is to create a JSON file:
{
    "properties": {
        "name": "MyTrigger",
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Minute",
                "interval": 15,
                "startTime": "2017-12-08T00:00:00Z",
                "endTime": "2017-12-08T01:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "type": "PipelineReference",
                    "referenceName": "Adfv2QuickStartPipeline"
                },
                "parameters": {
                    "inputPath": "adftutorial/input",
                    "outputPath": "adftutorial/output"
                }
            }
        ]
    }
}
This JSON file is then saved as MyTrigger.json in the C:\ADFv2QuickStartPSH\ folder.
Then, you can create a trigger by using the Set-AzDataFactoryV2Trigger cmdlet:
Set-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name "MyTrigger" -DefinitionFile "C:\ADFv2QuickStartPSH\MyTrigger.json"
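Besides PowerShell, the same trigger can be created and started with the .NET SDK. The following is only a minimal sketch, assuming a DataFactoryManagementClient named client and the resourceGroup and dataFactoryName variables used in the C# examples later in this article:
// Hedged sketch: define a schedule trigger equivalent to the JSON above,
// then create and start it (client, resourceGroup, dataFactoryName assumed to exist).
TriggerResource trigger = new TriggerResource(
    new ScheduleTrigger
    {
        Recurrence = new ScheduleTriggerRecurrence
        {
            Frequency = RecurrenceFrequency.Minute,
            Interval = 15,
            StartTime = new DateTime(2017, 12, 8, 0, 0, 0, DateTimeKind.Utc),
            EndTime = new DateTime(2017, 12, 8, 1, 0, 0, DateTimeKind.Utc),
            TimeZone = "UTC"
        },
        Pipelines = new List<TriggerPipelineReference>
        {
            new TriggerPipelineReference
            {
                PipelineReference = new PipelineReference("Adfv2QuickStartPipeline"),
                Parameters = new Dictionary<string, object>
                {
                    { "inputPath", "adftutorial/input" },
                    { "outputPath", "adftutorial/output" }
                }
            }
        }
    }
);
client.Triggers.CreateOrUpdate(resourceGroup, dataFactoryName, "MyTrigger", trigger);

// The trigger is created in a stopped state; start it explicitly.
client.Triggers.Start(resourceGroup, dataFactoryName, "MyTrigger");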
Note: Important Azure Data Factory interview question
Yes, users can specify the input and output format using Data Factory functions and system variables. For example:
{
    "type": "SqlSource",
    "sqlReaderQuery": "$$Text.Format('SELECT * FROM MyTable WHERE StartTime = \\'{0:yyyyMMdd-HH}\\'', WindowStart)"
}
In the above example, the sqlReaderQuery property is assigned the value returned by the Text.Format function. The example also uses the WindowStart system variable, which specifies the start time of the activity's run window.
The Azure Data Factory integration runtime is the compute infrastructure that enables the following data integration capabilities across a variety of network environments:
- Data movement: The integration runtime handles data movement by moving data between the source and destination data stores while also supporting built-in connectors, format conversion, column mapping, and fast and scalable data transfer.
- Data flow: The integration runtime executes data flows in a managed Azure compute environment.
- Dispatch activities: The integration runtime dispatches and monitors transformation activities running on a variety of compute services, such as Azure HDInsight, Azure Machine Learning, Azure SQL Database, and SQL Server.
- Execute SSIS packages: The integration runtime natively executes SSIS packages in a managed Azure compute environment.
To move and transform data, you can use one or more instances of the integration runtime. The integration runtime can run on either a public or private Azure network (on-premises, Azure Virtual Network, or Amazon Web Services virtual private cloud [VPC]).
In the Azure Data Factory UX, integration runtimes can be created from the management hub, or from any activity, dataset, or data flow that references them.
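Integration runtimes can also be created programmatically. Below is a hedged sketch that registers a self-hosted integration runtime through the .NET SDK, assuming the same client, resourceGroup, and dataFactoryName variables used in the C# examples later in this article; the runtime name is a placeholder:
// Hedged sketch: register a self-hosted integration runtime through the .NET SDK
// (client, resourceGroup, and dataFactoryName assumed to exist).
IntegrationRuntimeResource selfHostedIr = new IntegrationRuntimeResource(
    new SelfHostedIntegrationRuntime
    {
        Description = "Self-hosted IR for data stores in a private network"
    }
);
client.IntegrationRuntimes.CreateOrUpdate(
    resourceGroup, dataFactoryName, "MySelfHostedIR", selfHostedIr
);
You would then install the self-hosted integration runtime on a machine inside the private network and register it with one of the runtime's authentication keys.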
Note: Commonly asked Azure Data Factory interview question
Yes, it is possible to create a send-email workflow using Azure Data Factory. For this, you will need both an Azure subscription and a Logic App to trigger sending an email in the pipeline.
1. Create the email workflow in your Logic App
First, you need to create an email workflow. To do this, you can use the Logic App to create a sample workflow named SendEmailFromPipeline. Next, you define a workflow trigger, say When an HTTP request is received, and add an action of Office 365 Outlook - Send an email (V2).
Now, you define an HTTP request trigger. Here is an example JSON for the Request Body JSON Schema:
{
    "properties": {
        "dataFactoryName": {
            "type": "string"
        },
        "message": {
            "type": "string"
        },
        "pipelineName": {
            "type": "string"
        },
        "receiver": {
            "type": "string"
        }
    },
    "type": "object"
}
Finally, you can define the Send an email (V2) action. Using the Request Body JSON Schema, you can customize how you wish to format the email. Save this workflow, browse to find the workflow URL, and note it down for the next step.
2. Create a pipeline to trigger your Logic App email workflow
Now that you’ve created the Logic App workflow, you can trigger it using web activity.
- First, create a new pipeline. You will find the Web activity under the General category; drag it onto the editing canvas to create a new activity called Web1.
- Next, select the new Web1 activity and open the Settings tab. Here, add the workflow URL you copied in the previous step and use the following example JSON as the request body:
{
    "message": "This is a custom dynamic message from your pipeline with run ID @{pipeline().RunId}.",
    "dataFactoryName": "@{pipeline().DataFactory}",
    "pipelineName": "@{pipeline().Pipeline}",
    "receiver": "@{pipeline().parameters.receiver}"
}
You can also use dynamic expressions, as in the message above, to generate different messages for events in your pipelines.
- Next, select the background area of the pipeline designer to open the pipeline properties page. Here, add a new parameter called receiver and provide an email address as its default value.
- Finally, you can publish your pipeline, and just as a test, trigger it manually to confirm whether the email is sent as expected.
ADF supports 3 types of triggers: schedule trigger, tumbling window trigger, and event-based trigger.
The Schedule Trigger is a popular trigger that allows you to run a Data Pipeline on a predefined schedule. It provides numerous options and allows you to schedule triggers at minute, hour, day, week, or month intervals. Furthermore, you can specify start and end dates so that the trigger runs only within that period. Additionally, the Schedule Trigger can be set to run on future calendar days and hours.
The Tumbling Window Trigger runs Data Pipelines at a predetermined time interval or at a specific time slice. When copying or migrating historical data, it is substantially more advantageous than Schedule Triggers.
The Event-based Azure Data Factory Trigger runs Data Pipelines in response to blob-related events, such as generating or deleting a blob file present in an Azure Blob Storage. With the Event-based Triggers, you can schedule the Data Pipelines to execute in response to an event from Azure Blob Storage.
In addition, Event-based Triggers are compatible not only with blobs but also with ADLS (Azure Data Lake Storage). Similar to Schedule Triggers, Event-based Triggers also support many-to-many relationships: a single Event Trigger can run several Pipelines, and multiple Event Triggers can run a single Pipeline.
Mapping Data Flows
Data can be visually transformed using Mapping Data Flows without writing any code. You can concentrate on the transformations and logic while Azure Data Factory takes care of the heavy lifting. It converts your logic and transformations into code that runs on scaled-out Apache Spark clusters for maximum performance.
Wrangling Data Flows
Using the Power Query Online mashup editor, Wrangling Data Flows are used to explore and prepare datasets visually. You can concentrate on the modeling and logic while Azure Data Factory takes care of the heavy lifting. It converts the underlying M code to Spark code that runs in a managed environment for maximum performance.
Key differences between Mapping and Wrangling Data Flows
Both Mapping and Wrangling Data Flows have some overlapping features. However, there are a few key differences:
- Transformations:
With Mapping and Wrangling, you can do many of the same transformations, including filtering rows, merging/joining datasets, sorting, grouping, and adding and renaming columns. However, Mapping Data Flows can also handle inserts, deletes, updates, and upserts.
- Schema Drift:
Mapping Data Flows can be used to handle schema drift if your source frequently changes. For example, if some columns are removed or added, the destination can be updated automatically to exclude or include those new columns. However, you have to make these changes manually in Wrangling Data Flows.
- File and Table Handling:
File management, such as transferring files after they've been read, is built into Mapping Data Flows. During execution, you can also select to recreate sink tables. This eliminates the need to construct and execute T-SQL scripts before data loading.
First, we create a data factory using simple C# code. Then we create an Azure Storage linked service by adding the following C# code to the Main method:
// Create an Azure Storage linked service
Console.WriteLine("Creating linked service " + storageLinkedServiceName + "...");
LinkedServiceResource storageLinkedService = new LinkedServiceResource(
    new AzureStorageLinkedService
    {
        ConnectionString = new SecureString(
            "DefaultEndpointsProtocol=https;AccountName=" + storageAccount +
            ";AccountKey=" + storageKey
        )
    }
);
client.LinkedServices.CreateOrUpdate(
    resourceGroup, dataFactoryName, storageLinkedServiceName, storageLinkedService
);
Console.WriteLine(
    SafeJsonConvert.SerializeObject(storageLinkedService, client.SerializationSettings)
);
This is followed by creating an Azure SQL Database linked service by adding this code to the main method:
// Create an Azure SQL Database linked service
Console.WriteLine("Creating linked service " + sqlDbLinkedServiceName + "...");
LinkedServiceResource sqlDbLinkedService = new LinkedServiceResource(
    new AzureSqlDatabaseLinkedService
    {
        ConnectionString = new SecureString(azureSqlConnString)
    }
);
client.LinkedServices.CreateOrUpdate(
    resourceGroup, dataFactoryName, sqlDbLinkedServiceName, sqlDbLinkedService
);
Console.WriteLine(
    SafeJsonConvert.SerializeObject(sqlDbLinkedService, client.SerializationSettings)
);
You will now create a dataset for the source Azure Blob storage, which represents the source data, followed by a dataset for the sink Azure SQL Database; a hedged sketch of both follows.
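A minimal sketch of those two datasets using the same .NET SDK client, assuming the linked services created above; blobDatasetName, sqlDatasetName, and the folder, file, and table names are placeholders:
// Source dataset: delimited text file in Azure Blob storage.
DatasetResource blobDataset = new DatasetResource(
    new AzureBlobDataset
    {
        LinkedServiceName = new LinkedServiceReference { ReferenceName = storageLinkedServiceName },
        FolderPath = "adftutorial/input",            // placeholder folder
        FileName = "emp.txt",                        // placeholder file
        Format = new TextFormat { ColumnDelimiter = ",", RowDelimiter = "\n" }
    }
);
client.Datasets.CreateOrUpdate(resourceGroup, dataFactoryName, blobDatasetName, blobDataset);

// Sink dataset: target table in Azure SQL Database.
DatasetResource sqlDataset = new DatasetResource(
    new AzureSqlTableDataset
    {
        LinkedServiceName = new LinkedServiceReference { ReferenceName = sqlDbLinkedServiceName },
        TableName = "dbo.emp"                        // placeholder table
    }
);
client.Datasets.CreateOrUpdate(resourceGroup, dataFactoryName, sqlDatasetName, sqlDataset);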
From here, things get relatively simple: we only have to create a pipeline with a copy activity that takes the Blob dataset as the source and the SQL dataset as the sink, and then monitor the pipeline and activity runs to ensure successful completion (see the sketch below).
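Here is a hedged sketch of that copy pipeline and a simple polling loop, assuming the dataset names above and a hypothetical pipelineName variable; it is a fragment of the same Main method rather than a complete program:
// Hedged sketch: pipeline with a single copy activity from Blob to SQL.
PipelineResource pipeline = new PipelineResource
{
    Activities = new List<Activity>
    {
        new CopyActivity
        {
            Name = "CopyFromBlobToSql",
            Inputs = new List<DatasetReference> { new DatasetReference { ReferenceName = blobDatasetName } },
            Outputs = new List<DatasetReference> { new DatasetReference { ReferenceName = sqlDatasetName } },
            Source = new BlobSource(),
            Sink = new SqlSink()
        }
    }
};
client.Pipelines.CreateOrUpdate(resourceGroup, dataFactoryName, pipelineName, pipeline);

// Trigger a run and poll its status until it completes.
CreateRunResponse runResponse = client.Pipelines
    .CreateRunWithHttpMessagesAsync(resourceGroup, dataFactoryName, pipelineName)
    .Result.Body;
PipelineRun pipelineRun;
while (true)
{
    pipelineRun = client.PipelineRuns.Get(resourceGroup, dataFactoryName, runResponse.RunId);
    Console.WriteLine("Pipeline run status: " + pipelineRun.Status);
    if (pipelineRun.Status == "InProgress" || pipelineRun.Status == "Queued")
        System.Threading.Thread.Sleep(15000);
    else
        break;
}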
Note: Commonly asked Azure Data Factory interview question
In this case, the "Allow Azure services and resources to access this server" option on the "Firewall and virtual networks" page should be selected. Doing this permits Azure services to access your server, so that the Data Factory service can write data to the SQL Database.
Data Factory only stores pipeline run data for 45 days. When you query programmatically for data about Data Factory pipeline runs, there are no maximum dates for the optional LastUpdatedAfter and LastUpdatedBefore parameters.
A pipeline run has different statuses during its lifecycle, such as Queued, InProgress, Succeeded, Failed, Canceling, and Cancelled.
You can monitor a pipeline run using .NET, Python, the REST API, or the Azure Az PowerShell module.
Let's take the Python example:
# Monitor the pipeline run
# (assumes: import time; from datetime import datetime, timedelta;
#  from azure.mgmt.datafactory.models import RunFilterParameters)
time.sleep(30)
pipeline_run = adf_client.pipeline_runs.get(
    rg_name, df_name, run_response.run_id)
print("\n\tPipeline run status: {}".format(pipeline_run.status))
filter_params = RunFilterParameters(
    last_updated_after=datetime.now() - timedelta(1),
    last_updated_before=datetime.now() + timedelta(1))
query_response = adf_client.activity_runs.query_by_pipeline_run(
    rg_name, df_name, pipeline_run.run_id, filter_params)
print_activity_run_details(query_response.value[0])
Note: Commonly asked Azure Data Factory interview question
In this scenario, data is extracted from an Azure SQL Server database, processed as required, and then stored in the Data Lake Store.
To create an ETL process, follow these steps:
- First, create a Linked Service (which is an SQL Server database) for the source data store (see the sketch after this list).
- Assume that we have a car dataset.
- Next, create a Linked Service (which is an Azure Data Lake Store) for the destination datastore.
- Create a dataset for saving the data.
- Create the pipeline and add copy activity.
- Schedule the pipeline by adding a trigger.
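Below is a hedged sketch of the two linked-service steps only, reusing the .NET SDK client from the Blob-to-SQL example above; the connection string, Data Lake Store URI, service principal values, and linked-service names are placeholders:
// Hedged sketch: source (SQL Server) and destination (Data Lake Store) linked services.
LinkedServiceResource sqlServerLinkedService = new LinkedServiceResource(
    new SqlServerLinkedService
    {
        ConnectionString = new SecureString(sqlServerConnString)   // placeholder connection string
    }
);
client.LinkedServices.CreateOrUpdate(
    resourceGroup, dataFactoryName, "SqlServerLinkedService", sqlServerLinkedService
);

LinkedServiceResource dataLakeLinkedService = new LinkedServiceResource(
    new AzureDataLakeStoreLinkedService
    {
        DataLakeStoreUri = dataLakeStoreUri,                       // placeholder ADLS URI
        ServicePrincipalId = servicePrincipalId,                   // placeholder service principal
        ServicePrincipalKey = new SecureString(servicePrincipalKey),
        Tenant = tenant
    }
);
client.LinkedServices.CreateOrUpdate(
    resourceGroup, dataFactoryName, "AzureDataLakeStoreLinkedService", dataLakeLinkedService
);
// The dataset, copy-activity pipeline, and trigger are then created with the
// same pattern shown in the Blob-to-SQL example above.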
Built on Azure Blob Storage, Azure Data Lake Storage Gen2 is a suite of capabilities targeted for big data analytics. The functionalities of Azure Data Lake Storage Gen1 and Azure Blob Storage are combined in Data Lake Storage Gen2.
There are two levels of security in ADLS Gen2:
Role-Based Access Control (RBAC)
RBAC includes built-in Azure roles such as reader, contributor, and owner, as well as custom roles. RBAC is typically assigned for two reasons:
- One is to identify who will be in charge of the service (i.e., update settings and properties for the storage account).
- Another purpose is to allow users to use built-in data explorer tools that require reader access.
Access Control List (ACL)
ACLs define which data items a user can read, write, or execute (execute is required to browse the directory structure). ACLs are POSIX-compliant; therefore, people with Unix or Linux experience will be familiar with them.
The code actually defines a linked service in JSON format that allows you to establish a connection with your data stores.
The code has 4 properties: name, type, typeProperties, and connectVia.
As seen in the JSON, the name property holds the name of the linked service.
type contains the type of the linked service. For example, AzureBlobStorage (data store) or AzureBatch (compute).
The properties inside typeProperties depend on the type of linked service, so typeProperties differs for each data store or compute type. In the example above, it simply holds the connection string for the AzureBlobStorage data store.
connectVia stores the Integration Runtime to be used to connect to the data store. You can use Azure Integration Runtime or Self-hosted Integration Runtime (if your data store is located in a private network). If not specified, it uses the default Azure Integration Runtime.
This is the code that you can add to the main method to create an Azure Blob dataset.
You create a dataset that represents the data to be copied from a source to a sink. However, this example is incomplete and first requires the creation of a linked service for the blob dataset:
LinkedServiceResource storageLinkedService = new LinkedServiceResource(
    new AzureStorageLinkedService
    {
        ConnectionString = new SecureString(
            "DefaultEndpointsProtocol=https;AccountName=" + storageAccount +
            ";AccountKey=" + storageKey)
    }
);
client.LinkedServices.CreateOrUpdate(
    resourceGroup, dataFactoryName, storageLinkedServiceName, storageLinkedService);
Console.WriteLine(SafeJsonConvert.SerializeObject(
    storageLinkedService, client.SerializationSettings));
After the linked service is created, the dataset code in question is run, which ties the dataset to the Azure Storage linked service we have just created.
The dataset has a parameter whose value is set by the dataset-consuming operation; that argument is then used to build the folderPath that points to the location of the data. A hedged sketch of such a dataset is shown below.
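For illustration, a minimal sketch of such a parameterized Blob dataset, following the pattern described above; blobDatasetName is a placeholder:
// Hedged sketch: a Blob dataset with a "path" parameter; the consuming copy
// activity supplies the value, which is used to build folderPath.
DatasetResource blobDataset = new DatasetResource(
    new AzureBlobDataset
    {
        LinkedServiceName = new LinkedServiceReference { ReferenceName = storageLinkedServiceName },
        FolderPath = new Expression { Value = "@{dataset().path}" },
        Parameters = new Dictionary<string, ParameterSpecification>
        {
            { "path", new ParameterSpecification { Type = ParameterType.String } }
        }
    }
);
client.Datasets.CreateOrUpdate(resourceGroup, dataFactoryName, blobDatasetName, blobDataset);
Console.WriteLine(SafeJsonConvert.SerializeObject(blobDataset, client.SerializationSettings));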
Note: Commonly asked Azure Data Factory interview question