Which of the following likely explains these smaller file sizes?

A production workload incrementally applies updates from an external Change Data Capture feed to a Delta Lake table as an always-on Structured Streaming job. When data was initially migrated for this table, OPTIMIZE was executed and most data files were resized to 1 GB. Auto Optimize and Auto Compaction were...

May 18, 2025

If this alert raises notifications for 3 consecutive minutes and then stops, which statement must be true?

The data engineering team has configured a Databricks SQL query and alert to monitor the values in a Delta Lake table. The recent_sensor_recordings table contains an identifying sensor_id alongside the timestamp and temperature for the most recent 5 minutes of recordings. The below query is used to create the alert:...

May 14, 2025

If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?

An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day as indicated by the date variable: Assume that the fields customer_id and order_id serve as a composite key...

May 10, 2025
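The code from this question is not shown in the excerpt, so the following is only an illustrative sketch in plain Python, not the question's code. It assumes a hypothetical helper, `dedupe_batch`, that removes duplicates on the composite key (customer_id, order_id) within a single batch, and shows why duplicates that land in different nightly batches would not be caught by per-batch deduplication alone. The sample records are invented for illustration.

```python
def dedupe_batch(records, key_fields=("customer_id", "order_id")):
    """Keep the first record seen for each composite key within ONE batch."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical data: order ("B") is emitted again hours later, on the next day.
day_1 = [
    {"customer_id": 1, "order_id": "A", "amount": 10.0},
    {"customer_id": 1, "order_id": "A", "amount": 10.0},  # same-day duplicate
    {"customer_id": 2, "order_id": "B", "amount": 5.0},
]
day_2 = [
    {"customer_id": 2, "order_id": "B", "amount": 5.0},  # cross-day duplicate
]

# Each nightly run deduplicates only its own batch, then appends to the target.
target = dedupe_batch(day_1) + dedupe_batch(day_2)
keys = [(r["customer_id"], r["order_id"]) for r in target]
print(keys)
```

The same-day duplicate is dropped, but the key (2, "B") still appears twice in `target`, because each run only sees its own batch.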

Which statement describes how data will be filtered?

A Delta Lake table representing metadata about content posts from users has the following schema: user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE This table is partitioned by the date column. A query is run with the following filter: longitude < 20 & longitude...

May 5, 2025
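Because the table in the question above is partitioned by date while the filter is on longitude, partition pruning alone would not narrow the scan. Delta Lake also records per-file minimum and maximum column values in its transaction log, which can be used to skip files. The sketch below models that idea in plain Python. The `FileStats` class, the file names, the statistics, and the lower bound of the range are all hypothetical (the filter in the excerpt is truncated), and this is not Delta Lake's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Per-file min/max statistics for the longitude column (hypothetical)."""
    path: str
    min_longitude: float
    max_longitude: float


def files_to_scan(files, lower, upper):
    """Return files whose [min, max] range could hold a value v with lower < v < upper."""
    return [
        f for f in files
        if f.max_longitude > lower and f.min_longitude < upper
    ]


files = [
    FileStats("part-0000.parquet", -170.0, -95.0),
    FileStats("part-0001.parquet", -30.0, 5.0),
    FileStats("part-0002.parquet", 45.0, 120.0),
]

# Assumed range for illustration only: -10 < longitude < 20.
candidates = files_to_scan(files, lower=-10.0, upper=20.0)
print([f.path for f in candidates])
```

Only files whose value range overlaps the filter range remain candidates. Every record in a candidate file still has to be checked against the filter, since the statistics only rule files out.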

Which code block should be used to create the date Python variable used in the above code block?

An upstream system has been configured to pass the date for a given batch of data to the Databricks Jobs API as a parameter. The notebook to be scheduled will use this parameter to load data with the following code: df = spark.read.format("parquet").load(f"/mnt/source/{date}") Which code block should be used to...

April 27, 2025
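As a rough sketch of how the load path in the question above depends on the date parameter, the snippet below builds the path with an f-string, using {date} to interpolate the variable. The helper name `source_path` and the sample date value are hypothetical. In a Databricks notebook, one common way to read a parameter passed to the job is a widget, which is shown only as a comment here so the sketch runs outside Databricks.

```python
def source_path(date: str, root: str = "/mnt/source") -> str:
    """Build the directory path for one batch from the date parameter."""
    return f"{root}/{date}"


# In a Databricks notebook the value could be read from a widget, e.g.:
#   date = dbutils.widgets.get("date")
# A literal stands in here (hypothetical value and format).
date = "2025-04-27"
print(source_path(date))
```

The resulting string is what would be passed to spark.read.format("parquet").load(...).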

Which of the following is true of Delta Lake and the Lakehouse?

A. Because Parquet compresses data row by row, strings will only be compressed when a character is repeated multiple times.
B. Delta Lake automatically collects statistics on the first 32 columns of each table which are leveraged in...

April 26, 2025

Assuming that all configurations and referenced resources are available, which statement describes the result of executing this workload three times?

A junior data engineer has configured a workload that posts the following JSON to the Databricks REST API endpoint 2.0/jobs/create. Assuming that all configurations and referenced resources are available, which statement describes the result of executing this workload three times?
A. Three new jobs named "Ingest new data" will be...

April 24, 2025

Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?

A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30...

April 21, 2025

Which statement characterizes the general programming model used by Spark Structured Streaming?

A. Structured Streaming leverages the parallel processing of GPUs to achieve highly parallel data throughput.
B. Structured Streaming is implemented as a messaging bus and is derived from Apache Kafka.
C. Structured Streaming uses specialized hardware and I/O...

April 17, 2025

Which statement describes the execution and results of running the above query multiple times?

A junior data engineer seeks to leverage Delta Lake's Change Data Feed functionality to create a Type 1 table representing all of the values that have ever been valid for all rows in a bronze table created with the property delta.enableChangeDataFeed = true. They plan to execute the following code...

April 15, 2025