The data engineering team maintains the following code. Assuming that this code produces logically correct results and the data in the source tables has been de-duplicated and validated, which statement describes what will occur when this code is executed?
A. A batch job will update the enriched_itemized_orders_by_account table, replacing only...
The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table. The following logic is used to process these records. Which statement describes this implementation?
A. The customers table is implemented as a Type 3 table; old values are maintained...
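The answer choices for this question distinguish between slowly changing dimension (SCD) strategies. As a point of reference, the following is a minimal pure-Python sketch of Type 2 semantics, where a changed record closes out the old row and appends a new current version rather than overwriting in place. All field names (name, start_date, end_date, current) are illustrative and not taken from the original code block.

```python
from datetime import date

def scd2_merge(customers, updates, today):
    """Apply Type 2 SCD logic: close out changed rows, append new versions.

    customers: list of dicts with keys id, name, start_date, end_date, current.
    updates:   list of dicts with keys id, name (the incremental batch).
    Field names here are hypothetical, chosen only to illustrate the pattern.
    """
    by_id = {r["id"]: r for r in customers if r["current"]}
    out = list(customers)
    for u in updates:
        cur = by_id.get(u["id"])
        if cur is not None and cur["name"] == u["name"]:
            continue  # no change: keep the existing current row as-is
        if cur is not None:
            cur["end_date"] = today   # close out the superseded version
            cur["current"] = False
        out.append({"id": u["id"], "name": u["name"],
                    "start_date": today, "end_date": None, "current": True})
    return out
```

By contrast, a Type 1 table would simply overwrite the old values, and a Type 3 table would keep the previous value in a dedicated column on the same row.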
Which configuration parameter directly affects the size of a Spark partition upon ingestion of data into Spark?
A. spark.sql.files.maxPartitionBytes
B. spark.sql.autoBroadcastJoinThreshold
C. spark.sql.files.openCostInBytes
D. spark.sql.adaptive.coalescePartitions.minPartitionNum
E. spark.sql.adaptive.advisoryPartitionSizeInBytes
Answer: A
Explanation: spark.sql.files.maxPartitionBytes is the configuration parameter that directly affects the size of a Spark partition upon ingestion...
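To see how this parameter interacts with the others listed, here is a pure-Python sketch of Spark's file-split sizing rule (a simplification of FilePartition.maxSplitBytes in Spark's source; exact behavior may vary across versions). Note that spark.sql.files.openCostInBytes also participates, but only as a lower bound; maxPartitionBytes is the direct cap.

```python
def max_split_bytes(total_bytes, num_files,
                    max_partition_bytes=128 * 1024 * 1024,  # spark.sql.files.maxPartitionBytes default
                    open_cost_in_bytes=4 * 1024 * 1024,     # spark.sql.files.openCostInBytes default
                    default_parallelism=8):
    """Rough sketch of how Spark caps file-based input partition size on read.

    bytes_per_core spreads the scan across available cores; the result is
    clamped below by the open cost and above by maxPartitionBytes.
    """
    bytes_per_core = (total_bytes + num_files * open_cost_in_bytes) // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))

# A 10 GiB scan is capped at the 128 MiB maxPartitionBytes default,
# while a tiny 1 MiB file is padded up to the 4 MiB open cost:
big = max_split_bytes(10 * 1024**3, num_files=10)
small = max_split_bytes(1024 * 1024, num_files=1)
```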
The data engineering team is migrating an enterprise system with thousands of tables and views into the Lakehouse. They plan to implement the target architecture using a series of bronze, silver, and gold tables. Bronze tables will almost exclusively be used by production data engineering workloads, while silver tables will...
Which statement exemplifies best practices for implementing this system?
When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps costs low?
A. Cluster: New Job Cluster; Retries: Unlimited; Maximum Concurrent Runs: Unlimited
B. Cluster: New Job Cluster; Retries: None; Maximum Concurrent Runs: 1
C. Cluster: Existing All-Purpose Cluster; Retries: Unlimited; Maximum Concurrent Runs:...
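The cost-effective, self-recovering combination for a production streaming job is generally a new job cluster (cheaper than an all-purpose cluster and recreated per run), unlimited retries, and a single concurrent run so a restart never races the failed instance. A hedged sketch of what such a configuration looks like as a Databricks Jobs API 2.1-style settings payload (field names follow that API; the notebook path and cluster sizing are illustrative placeholders):

```python
# Sketch of Jobs API 2.1-style settings for a production Structured
# Streaming job; treat this as illustrative, not an exact request body.
job_settings = {
    "name": "prod-streaming-job",          # hypothetical job name
    "max_concurrent_runs": 1,              # one run at a time: a retry never overlaps a live run
    "tasks": [{
        "task_key": "stream",
        "notebook_task": {"notebook_path": "/Prod/streaming"},  # hypothetical path
        "max_retries": -1,                 # -1 = retry indefinitely on failure
        "new_cluster": {                   # job cluster: provisioned per run, lower cost
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
    }],
}
```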
The data engineering team has configured a job to process customer requests to be forgotten (have their data deleted). All user data that needs to be deleted is stored in Delta Lake tables using default table settings. The team has decided to process all deletions from the previous week as...
Assuming all delete logic is correctly implemented, which statement correctly addresses this concern?
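The usual concern in this scenario is that, with default Delta settings, a DELETE only tombstones the old data files and writes new ones; the deleted records remain physically present in storage until VACUUM removes files that have been tombstoned longer than the retention threshold (7 days by default). A simplified pure-Python model of that eligibility check:

```python
from datetime import datetime, timedelta

DEFAULT_RETENTION = timedelta(days=7)  # Delta's default VACUUM retention window

def vacuum_eligible(file_removed_at, now, retention=DEFAULT_RETENTION):
    """A deleted record is only physically gone once the data file that held
    it has been tombstoned for longer than the retention window AND a VACUUM
    has actually run. (Simplified model of Delta Lake's behavior.)"""
    return now - file_removed_at > retention

now = datetime(2024, 1, 15)
# File tombstoned by a DELETE two weeks ago: past retention, vacuumable.
old_file = vacuum_eligible(datetime(2024, 1, 1), now)
# File tombstoned yesterday: still retained for time travel.
fresh_file = vacuum_eligible(datetime(2024, 1, 14), now)
```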
A junior data engineer on your team has implemented the following code block. The view new_events contains a batch of records with the same schema as the events Delta table. The event_id field serves as a unique key for this table.
When this query is executed, what will happen with new records that have the same event_id as an existing record?
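The pattern described (a unique key plus a batch view) is typically handled with an insert-only MERGE on event_id, under which incoming records whose key already exists in the target are silently skipped rather than updated or duplicated. A pure-Python simulation of those semantics (a simplified model; real Delta MERGE operates on data files, not Python dicts):

```python
def insert_only_merge(events, new_events, key="event_id"):
    """Mimic `MERGE ... WHEN NOT MATCHED THEN INSERT *` semantics:
    records whose key already exists in the target are ignored."""
    existing = {r[key] for r in events}
    merged = list(events)
    for r in new_events:
        if r[key] not in existing:
            merged.append(r)      # genuinely new key: inserted
            existing.add(r[key])  # also dedupes within the batch itself
        # matched key: skipped, existing record is left untouched
    return merged
```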
A data architect has designed a system in which two Structured Streaming jobs will concurrently write to a single bronze Delta table. Each job is subscribing to a different topic from an Apache Kafka source, but they will write data with the same schema. To keep the directory structure simple,...
Which statement describes whether this checkpoint directory structure is valid for the given scenario and why?
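The key constraint here is that every Structured Streaming query must own a dedicated checkpoint directory: two queries may share a target table, but never a checkpoint location, since the checkpoint tracks per-query progress and state. A small illustrative helper (not a Spark API) that enforces this invariant:

```python
def validate_checkpoints(queries):
    """Raise if two streaming queries share a checkpoint location.

    Each query needs its own checkpoint directory even when several
    queries write to the same target table. (Illustrative helper only.)
    """
    seen = {}
    for q in queries:
        cp = q["checkpointLocation"]
        if cp in seen:
            raise ValueError(
                f"queries {seen[cp]!r} and {q['name']!r} share checkpoint {cp}")
        seen[cp] = q["name"]

# Two Kafka-topic readers feeding one bronze table, each with its own
# checkpoint subdirectory (paths are hypothetical):
validate_checkpoints([
    {"name": "topic_a", "checkpointLocation": "/checkpoints/bronze/topic_a"},
    {"name": "topic_b", "checkpointLocation": "/checkpoints/bronze/topic_b"},
])
```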
The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs UI. A new data engineering hire is onboarding to the team and has requested access to one of these notebooks to review the production logic.
What are the maximum notebook permissions that can be granted to the user without allowing accidental changes to production code or data?
To reduce storage and compute costs, the data engineering team has been tasked with curating a series of aggregate tables leveraged by business intelligence dashboards, customer-facing applications, production machine learning models, and ad hoc analytical queries. The data engineering team has been made aware of new requirements from a customer-facing...
Which of the solutions addresses the situation while minimally interrupting other teams in the organization without increasing the number of tables that need to be managed?