Latest Databricks Databricks-Certified-Professional-Data-Engineer Practice Test Questions, Databricks Certified Professional Data Engineer Exam Exam Dumps [Q29-Q49]


Oct-2024 Pass Databricks Databricks-Certified-Professional-Data-Engineer Exam in First Attempt Easily


Databricks is a platform that offers a cloud-based environment for data engineering, data science, and machine learning. It is designed to simplify data processing and analysis, allowing users to collaborate on projects, access pre-built libraries, and scale their workloads. To ensure that users have the necessary skills and knowledge to work with Databricks, the company offers a certification program. One of the certifications available is the Databricks-Certified-Professional-Data-Engineer (Databricks Certified Professional Data Engineer) certification exam.

 

NEW QUESTION # 29
Which of the following commands can be used to run one notebook from another notebook?

  • A. dbutils.notebook.run("full notebook path")
  • B. spark.notebook.run("full notebook path")
  • C. only job clusters can run notebook
  • D. notebook.utils.run("full notebook path")
  • E. execute.utils.run("full notebook path")

Answer: A

Explanation:
The answer is dbutils.notebook.run("full notebook path").
The full signature, with its additional options, is:
run(path: String, timeout_seconds: int, arguments: Map): String
For example:
dbutils.notebook.run("full-notebook-name", 60, {"argument": "data", "argument2": "data2", ...})
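Because dbutils.notebook.run raises an exception when the child notebook fails or times out, a caller may want to retry it. The following is a minimal sketch of that idea, not an official helper: the function name run_with_retry and the max_retries parameter are assumptions made for illustration, and dbutils is passed in explicitly so the function can also be exercised outside a Databricks notebook.

```python
def run_with_retry(dbutils, path, timeout_seconds=60, arguments=None, max_retries=2):
    """Run a child notebook, retrying when the call raises (hypothetical helper)."""
    arguments = arguments or {}
    attempt = 0
    while True:
        try:
            # Returns the string the child passes to dbutils.notebook.exit().
            return dbutils.notebook.run(path, timeout_seconds, arguments)
        except Exception as exc:
            if attempt >= max_retries:
                raise  # out of retries: surface the last failure to the caller
            attempt += 1
            print(f"Run of {path} failed ({exc}); retry {attempt} of {max_retries}")
```

Inside a notebook this would be called as run_with_retry(dbutils, "full notebook path", 60, {"argument": "data"}).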


NEW QUESTION # 30
The data engineering team maintains the following code:

Assuming that this code produces logically correct results and the data in the source table has been de-duplicated and validated, which statement describes what will occur when this code is executed?

  • A. An incremental job will leverage running information in the state store to update aggregate values in the gold_customer_lifetime_sales_summary table.
  • B. A batch job will update the gold_customer_lifetime_sales_summary table, replacing only those rows that have different values than the current version of the table, using customer_id as the primary key.
  • C. The gold_customer_lifetime_sales_summary table will be overwritten by aggregated values calculated from all records in the silver_customer_sales table as a batch job.
  • D. An incremental job will detect if new rows have been written to the silver_customer_sales table; if new rows are detected, all aggregates will be recalculated and used to overwrite the gold_customer_lifetime_sales_summary table.
  • E. The silver_customer_sales table will be overwritten by aggregated values calculated from all records in the gold_customer_lifetime_sales_summary table as a batch job.

Answer: C

Explanation:
This code is using the pyspark.sql.functions library to group the silver_customer_sales table by customer_id and then aggregate the data using the minimum sale date, maximum sale total, and sum of distinct order ids.
The resulting aggregated data is then written to the gold_customer_lifetime_sales_summary table, overwriting any existing data in that table. This is a batch job that does not use any incremental or streaming logic, and does not perform any merge or update operations. Therefore, the code will overwrite the gold table with the aggregated values from the silver table every time it is executed.
References:
* https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-python.html
* https://docs.databricks.com/spark/latest/dataframes-datasets/transforming-data-with-dataframes.html
* https://docs.databricks.com/spark/latest/dataframes-datasets/aggregating-data-with-dataframes.html


NEW QUESTION # 31
Steve, a team member who has the ability to create views, created a new view called regional_sales_vw on the existing table sales, which is owned by John. A second team member, Kevin, who works with regional sales managers, wants to query the data in regional_sales_vw, so Steve granted him permission using the command GRANT VIEW, USAGE ON regional_sales_vw to [email protected]. Why is Kevin still unable to access the view?

  • A. Table access control is not enabled on the table and view
  • B. Kevin needs select access on the table sales
  • C. Kevin is not the owner of the sales table
  • D. Steve is not the owner of the sales table
  • E. Kevin needs owner access on the view regional_sales_vw

Answer: D

Explanation:
Ownership determines whether you can grant privileges on derived objects to other users. Because Steve is not the owner of the underlying sales table, he cannot grant access to the table, or to the data in the table, indirectly through the view.
Only the owner (a user or group) can grant access to an object.
https://docs.microsoft.com/en-us/azure/databricks/security/access-control/table-acls/object-privileges#a-user-has Data object privileges - Azure Databricks | Microsoft Docs


NEW QUESTION # 32
A table is registered with the following code:

Both users and orders are Delta Lake tables. Which statement describes the results of querying recent_orders?

  • A. All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.
  • B. All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried.
  • C. The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.
  • D. Results will be computed and cached when the table is defined; these cached results will incrementally update as new records are inserted into source tables.
  • E. All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.

Answer: B


NEW QUESTION # 33
How do you handle failures gracefully when writing code in PySpark? Fill in the blanks to complete the statement below.
_____
    spark.read.table("table_name").select("column").write.mode("append").saveAsTable("new_table_name")
_____
    print(f"query failed")

  • A. try: catch:
  • B. try: except:
  • C. try: fail:
  • D. try: failure:
  • E. try: error:

Answer: B

Explanation:
The answer is try: and except:
Python handles exceptions with try and except; it has no catch, fail, failure, or error keywords.
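The completed statement can be sketched as a small function. This is only an illustration under stated assumptions: the function name append_column and its return value are hypothetical, the spark session is passed in as a parameter, and catching the broad Exception class is a simplification that real code would usually narrow.

```python
def append_column(spark, source_table, column, target_table):
    """Append one column of source_table to target_table; return True on success."""
    try:
        (spark.read.table(source_table)
              .select(column)
              .write.mode("append")
              .saveAsTable(target_table))
        return True
    except Exception as exc:  # broad on purpose for this sketch
        print(f"query failed: {exc}")
        return False
```

Returning a flag instead of re-raising lets the caller decide whether the failure should stop the rest of the notebook.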


NEW QUESTION # 34
The data engineering team is required to share data with the data science team. Both teams use different workspaces in the same organization. Which of the following techniques can be used to simplify sharing data across workspaces?
*Please note the question is asking how data is shared within an organization across multiple workspaces.

  • A. Unity Catalog
  • B. DELTA lake
  • C. Use a single storage location
  • D. Data Sharing
  • E. DELTA LIVE Pipelines

Answer: A

Explanation:
The answer is Unity Catalog.
Unity Catalog works at the account level: it can create a metastore and attach that metastore to many workspaces, so a single metastore can be shared by both workspaces. Prior to Unity Catalog, the option was to use a single cloud object storage location and manually mount it in the second Databricks workspace; Unity Catalog simplifies that considerably.
https://databricks.com/product/unity-catalog


NEW QUESTION # 35
Which of the following commands will return records from an existing Delta table my_table where duplicates have been removed?

  • A. MERGE INTO my_table a
    USING new_records b;
  • B. DROP DUPLICATES
    FROM my_table;
  • C. SELECT *
    FROM my_table
    WHERE duplicate = False;
  • D. SELECT DISTINCT *
    FROM my_table;
  • E. MERGE INTO my_table a
    USING new_records b ON a.id = b.id
    WHEN NOT MATCHED
    THEN INSERT *;

Answer: D
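
The behavior of SELECT DISTINCT * can be tried out locally. The sketch below uses Python's built-in sqlite3 module as a stand-in for a Delta table, which is an assumption made purely so the example runs anywhere; the column names and sample rows are hypothetical. It shows that the query returns de-duplicated rows without modifying the table itself.

```python
import sqlite3

# In-memory SQLite database standing in for the Delta table my_table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [(1, "a"), (1, "a"), (2, "b")],  # the first two rows are exact duplicates
)

# SELECT DISTINCT * returns each fully identical row only once.
distinct_rows = conn.execute("SELECT DISTINCT * FROM my_table ORDER BY id").fetchall()
# The stored data is untouched: all three rows are still in the table.
total_rows = conn.execute("SELECT COUNT(*) FROM my_table").fetchone()[0]

print(distinct_rows)  # [(1, 'a'), (2, 'b')]
print(total_rows)     # 3
```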


NEW QUESTION # 36
A member of the data engineering team has submitted a short notebook that they wish to schedule as part of a larger data pipeline. Assume that the commands provided below produce the logically correct results when run as presented.

Which command should be removed from the notebook before scheduling it as a job?

  • A. Cmd 6
  • B. Cmd 3
  • C. Cmd 2
  • D. Cmd 4
  • E. Cmd 5

Answer: A

Explanation:
Cmd 6 is the command that should be removed from the notebook before scheduling it as a job. This command is selecting all the columns from the finalDF dataframe and displaying them in the notebook. This is not necessary for the job, as the finalDF dataframe is already written to a table in Cmd 7. Displaying the dataframe in the notebook will only consume resources and time, and it will not affect the output of the job. Therefore, Cmd 6 is redundant and should be removed.
The other commands are essential for the job, as they perform the following tasks:
Cmd 1: Reads the raw_data table into a Spark dataframe called rawDF.
Cmd 2: Prints the schema of the rawDF dataframe, which is useful for debugging and understanding the data structure.
Cmd 3: Selects all the columns from the rawDF dataframe, as well as the nested columns from the values struct column, and creates a new dataframe called flattenedDF.
Cmd 4: Drops the values column from the flattenedDF dataframe, as it is no longer needed after flattening, and creates a new dataframe called finalDF.
Cmd 5: Explains the physical plan of the finalDF dataframe, which is useful for optimizing and tuning the performance of the job.
Cmd 7: Writes the finalDF dataframe to a table called flat_data, using the append mode to add new data to the existing table.


NEW QUESTION # 37
All records from an Apache Kafka producer are being ingested into a single Delta Lake table with the following schema:
key BINARY, value BINARY, topic STRING, partition LONG, offset LONG, timestamp LONG
There are 5 unique topics being ingested. Only the "registration" topic contains Personal Identifiable Information (PII). The company wishes to restrict access to PII. The company also wishes to only retain records containing PII in this table for 14 days after initial ingestion. However, for non-PII information, it would like to retain these records indefinitely.
Which of the following solutions meets the requirements?

  • A. Separate object storage containers should be specified based on the partition field, allowing isolation at the storage level.
  • B. Because the value field is stored as binary data, this information is not considered PII and no special precautions should be taken.
  • C. Data should be partitioned by the topic field, allowing ACLs and delete statements to leverage partition boundaries.
  • D. Data should be partitioned by the registration field, allowing ACLs and delete statements to be set for the PII directory.
  • E. All data should be deleted biweekly; Delta Lake's time travel functionality should be leveraged to maintain a history of non-PII information.

Answer: C

Explanation:
Partitioning the data by the topic field allows the company to apply different access control policies and retention policies for different topics. For example, the company can use the Table Access Control feature to grant or revoke permissions to the registration topic based on user roles or groups. The company can also use the DELETE command to remove records from the registration topic that are older than 14 days, while keeping the records from other topics indefinitely. Partitioning by the topic field also improves the performance of queries that filter by the topic field, as they can skip reading irrelevant partitions.
References:
Table Access Control: https://docs.databricks.com/security/access-control/table-acls/index.html
DELETE: https://docs.databricks.com/delta/delta-update.html#delete-from-a-table


NEW QUESTION # 38
A data engineer wants to run unit tests using common Python testing frameworks on Python functions defined across several Databricks notebooks currently used in production.
How can the data engineer run unit tests against functions that work with data in production?

  • A. Define unit tests and functions within the same notebook
  • B. Define and unit test functions using Files in Repos
  • C. Define and import unit test functions from a separate Databricks notebook
  • D. Run unit tests against non-production data that closely mirrors production

Answer: D

Explanation:
The best practice for running unit tests on functions that interact with data is to use a dataset that closely mirrors the production data. This approach allows data engineers to validate the logic of their functions without the risk of affecting the actual production data. It's important to have a representative sample of production data to catch edge cases and ensure the functions will work correctly when used in a production environment.
Reference:
Databricks Documentation on Testing: Testing and Validation of Data and Notebooks


NEW QUESTION # 39
A production cluster has 3 executor nodes and uses the same virtual machine type for the driver and executor.
When evaluating the Ganglia Metrics for this cluster, which indicator would signal a bottleneck caused by code executing on the driver?

  • A. Bytes Received never exceeds 80 million bytes per second
  • B. Total Disk Space remains constant
  • C. The five Minute Load Average remains consistent/flat
  • D. Network I/O never spikes
  • E. Overall cluster CPU utilization is around 25%

Answer: E

Explanation:
Overall cluster CPU utilization of around 25% is the indicator of a bottleneck caused by code executing on the driver. The cluster has four nodes of the same virtual machine type (the driver plus 3 executors), so 25% utilization means that roughly one node out of four is using its full CPU capacity while the other three are idle or underutilized. This suggests that the code executing on the driver is taking too long or consuming too much CPU, preventing the executors from receiving tasks or data to process. This can happen when the code has driver-side operations that are not parallelized or distributed, such as collecting large amounts of data to the driver, performing complex calculations on the driver, or using non-Spark libraries on the driver.
Verified References: [Databricks Certified Data Engineer Professional], under "Spark Core" section; Databricks Documentation, under "View cluster status and event logs - Ganglia metrics" section; Databricks Documentation, under "Avoid collecting large RDDs" section.
In a Spark cluster, the driver node is responsible for managing the execution of the Spark application, including scheduling tasks, managing the execution plan, and interacting with the cluster manager. When the driver itself is busy with non-distributed work, the executors wait for tasks, which is why low overall CPU utilization (e.g., around 25%) can point to a driver bottleneck.
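The 25% figure follows from simple arithmetic over the four equally sized nodes. The sketch below works that out; the per-node utilization values are an idealized assumption (driver fully busy, executors completely idle), and the function name is hypothetical.

```python
def cluster_cpu_utilization(per_node_utilization):
    """Average CPU utilization across nodes of the same VM type (0.0 to 1.0)."""
    return sum(per_node_utilization) / len(per_node_utilization)

# Driver saturated, three executors idle (idealized values for illustration).
nodes = [1.0, 0.0, 0.0, 0.0]
overall = cluster_cpu_utilization(nodes)
print(f"{overall:.0%}")  # 25%
```

In practice the executors are rarely at exactly zero, so the observed value hovers around 25% rather than matching it exactly.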


NEW QUESTION # 40
A junior data engineer seeks to leverage Delta Lake's Change Data Feed functionality to create a Type 1 table representing all of the values that have ever been valid for all rows in a bronze table created with the property delta.enableChangeDataFeed = true. They plan to execute the following code as a daily job:

Which statement describes the execution and results of running the above query multiple times?

  • A. Each time the job is executed, newly updated records will be merged into the target table, overwriting previous values with the same primary keys.
  • B. Each time the job is executed, the target table will be overwritten using the entire history of inserted or updated records, giving the desired result.
  • C. Each time the job is executed, the entire available history of inserted or updated records will be appended to the target table, resulting in many duplicate entries.
  • D. Each time the job is executed, the differences between the original and current versions are calculated; this may result in duplicate entries for some records.
  • E. Each time the job is executed, only those records that have been inserted or updated since the last execution will be appended to the target table giving the desired result.

Answer: C

Explanation:
Reading a table's changes captured by CDF with spark.read treats them as a static (batch) source. Each time the query runs, all of the table's changes, starting from the specified startingVersion, are read again, which is why appending them to the target table produces many duplicate entries.
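The effect of re-reading a static source on every run can be imitated in plain Python. This is a simulation only, not Spark or Delta Lake code: the list change_feed stands in for the table's change data feed, and the function name, record layout, and values are all hypothetical.

```python
# Simulated change feed: (version, key, value) records.
change_feed = [(1, "a", 10), (2, "b", 20)]
target = []  # simulated target table written in append mode


def daily_batch_job(feed, target_table, starting_version=0):
    # A static read returns every change at or after starting_version, every run.
    changes = [row for row in feed if row[0] >= starting_version]
    target_table.extend(changes)  # append, with no merge or de-duplication


daily_batch_job(change_feed, target)   # day 1: appends 2 records
change_feed.append((3, "a", 11))       # one new change arrives
daily_batch_job(change_feed, target)   # day 2: appends all 3 records again

print(len(target))                  # 5
print(target.count((1, "a", 10)))   # 2
```

Only one new change arrived between the runs, yet the target grew by three rows, because the fixed starting version makes every run replay the whole history.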


NEW QUESTION # 41
The data engineering team has configured a job to process customer requests to be forgotten (have their data deleted). All user data that needs to be deleted is stored in Delta Lake tables using default table settings.
The team has decided to process all deletions from the previous week as a batch job at 1am each Sunday. The total duration of this job is less than one hour. Every Monday at 3am, a batch job executes a series of VACUUM commands on all Delta Lake tables throughout the organization.
The compliance officer has recently learned about Delta Lake's time travel functionality. They are concerned that this might allow continued access to deleted data.
Assuming all delete logic is correctly implemented, which statement correctly addresses this concern?

  • A. Because the default data retention threshold is 24 hours, data files containing deleted records will be retained until the vacuum job is run the following day.
  • B. Because Delta Lake time travel provides full access to the entire history of a table, deleted records can always be recreated by users with full admin privileges.
  • C. Because the vacuum command permanently deletes all files containing deleted records, deleted records may be accessible with time travel for around 24 hours.
  • D. Because Delta Lake's delete statements have ACID guarantees, deleted records will be permanently purged from all storage systems as soon as a delete job completes.
  • E. Because the default data retention threshold is 7 days, data files containing deleted records will be retained until the vacuum job is run 8 days later.

Answer: E

Explanation:
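With default table settings, VACUUM only removes data files that are no longer referenced by the table and are older than the 7-day retention threshold. When the vacuum job runs on Monday at 3am, the files invalidated by Sunday's delete job are only about a day old, so they are retained and remain reachable through time travel until the vacuum job runs again the following Monday, 8 days after the deletion.
https://learn.microsoft.com/en-us/azure/databricks/delta/vacuum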
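The timeline behind this answer can be checked with a short date calculation. The sketch below assumes the default 7-day retention threshold and picks arbitrary example dates (6 October 2024 is a Sunday); the variable and function names are hypothetical.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=7)  # default retention threshold for removed files

# Example dates chosen for illustration only.
delete_done = datetime(2024, 10, 6, 2, 0)   # Sunday: job starts 1am, done within the hour
first_vacuum = datetime(2024, 10, 7, 3, 0)  # Monday 3am, the next scheduled VACUUM


def first_purging_vacuum(deleted_at, vacuum_at):
    """Return the first weekly VACUUM run at which the files are old enough to remove."""
    while vacuum_at - deleted_at < RETENTION:
        vacuum_at += timedelta(weeks=1)
    return vacuum_at


purge = first_purging_vacuum(delete_done, first_vacuum)
print(purge)                       # 2024-10-14 03:00:00
print((purge - delete_done).days)  # 8
```

The first Monday run comes only about a day after the delete, so the files survive it and are removed at the run one week later.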


NEW QUESTION # 42
A data architect has designed a system in which two Structured Streaming jobs will concurrently write to a single bronze Delta table. Each job is subscribing to a different topic from an Apache Kafka source, but they will write data with the same schema. To keep the directory structure simple, a data engineer has decided to nest a checkpoint directory to be shared by both streams.
The proposed directory structure is displayed below:

Which statement describes whether this checkpoint directory structure is valid for the given scenario and why?

  • A. No; Delta Lake manages streaming checkpoints in the transaction log.
  • B. No; only one stream can write to a Delta Lake table.
  • C. Yes; Delta Lake supports infinite concurrent writers.
  • D. Yes; both of the streams can share a single checkpoint directory.
  • E. No; each of the streams needs to have its own checkpoint directory.

Answer: E

Explanation:
This is the correct answer because checkpointing is a critical feature of Structured Streaming that provides fault tolerance and recovery in case of failures. Checkpointing stores the current state and progress of a streaming query in a reliable storage system, such as DBFS or S3. Each streaming query must have its own checkpoint directory that is unique and exclusive to that query. If two streaming queries share the same checkpoint directory, they will interfere with each other and cause unexpected errors or data loss. Verified References: [Databricks Certified Data Engineer Professional], under "Structured Streaming" section; Databricks Documentation, under "Checkpointing" section.


NEW QUESTION # 43
The data engineering team has a job currently set up to run a task that loads data into a reporting table every day at 8:00 AM and takes about 20 minutes. The operations team is planning to use that data to run a second job, so they need access to the latest complete set of data. What is the best way to orchestrate this job setup?

  • A. Use Auto Loader to run every 20 mins to read the initial table and set the trigger to once and create a second job
  • B. Set up a Delta Live Table based on the first table, set the job to run in continuous mode
  • C. Add Operation reporting task in the same job and set the operations reporting task to depend on Data Engineering task
  • D. Set up a second job to run at 8:20 AM in the same workspace
  • E. Add Operation reporting task in the same job and set the Data Engineering task to depend on Operations reporting task

Answer: C

Explanation:
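The answer is Add Operation reporting task in the same job and set the operations reporting task to depend on Data Engineering task.
A task dependency starts the second task only after the first one completes successfully, regardless of how long the load actually takes, whereas a fixed 8:20 AM schedule could start before the data is complete.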
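
A two-task job with this dependency might be described along the following lines. The dictionary loosely follows the tasks / task_key / depends_on layout used for multi-task job definitions, but the job name, task keys, and notebook paths are hypothetical, and the run_order helper is only an illustration of how the dependency fixes the execution order.

```python
# Hypothetical job definition: the reporting task depends on the load task.
job_settings = {
    "name": "daily_reporting",
    "tasks": [
        {
            "task_key": "data_engineering_load",
            "notebook_task": {"notebook_path": "/Jobs/load_reporting_table"},
        },
        {
            "task_key": "operations_reporting",
            "depends_on": [{"task_key": "data_engineering_load"}],
            "notebook_task": {"notebook_path": "/Jobs/operations_report"},
        },
    ],
}


def run_order(tasks):
    """Return task keys in an order that respects every depends_on entry."""
    remaining = {
        t["task_key"]: {d["task_key"] for d in t.get("depends_on", [])}
        for t in tasks
    }
    order = []
    while remaining:
        ready = sorted(k for k, deps in remaining.items() if deps <= set(order))
        if not ready:
            raise ValueError("cycle in task dependencies")
        order.extend(ready)
        for key in ready:
            del remaining[key]
    return order


order = run_order(job_settings["tasks"])
print(order)  # ['data_engineering_load', 'operations_reporting']
```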


NEW QUESTION # 44
A team of data engineers is adding tables to a DLT pipeline that contain repetitive expectations for many of the same data quality checks.
One member of the team suggests reusing these data quality rules across all tables defined for this pipeline.
What approach would allow them to do this?

  • A. Use global Python variables to make expectations visible across DLT notebooks included in the same pipeline.
  • B. Add data quality constraints to tables in this pipeline using an external job with access to pipeline configuration files.
  • C. Maintain data quality rules in a Delta table outside of this pipeline's target schema, providing the schema name as a pipeline parameter.
  • D. Maintain data quality rules in a separate Databricks notebook that each DLT notebook of file.

Answer: C

Explanation:
Maintaining data quality rules in a centralized Delta table allows for the reuse of these rules across multiple DLT (Delta Live Tables) pipelines. By storing these rules outside the pipeline's target schema and referencing the schema name as a pipeline parameter, the team can apply the same set of data quality checks to different tables within the pipeline. This approach ensures consistency in data quality validations and reduces redundancy in code by not having to replicate the same rules in each DLT notebook or file.
References:
* Databricks Documentation on Delta Live Tables: Delta Live Tables Guide


NEW QUESTION # 45
Why does AUTO LOADER require schema location?

  • A. Schema location is used to identify the schema of target table
  • B. AUTO LOADER does not require schema location, because it supports schema evolution
  • C. Schema location is used to store schema inferred by AUTO LOADER
  • D. Schema location is used to store user provided schema
  • E. Schema location is used to identify the schema of target table and source table

Answer: C

Explanation:
The answer is: schema location is used to store the schema inferred by AUTO LOADER, so that subsequent runs start faster; instead of inferring the schema every time, Auto Loader uses the last known schema.
Auto Loader samples the first 50 GB or 1000 files that it discovers, whichever limit is crossed first. To avoid incurring this inference cost at every stream start up, and to be able to provide a stable schema across stream restarts, you must set the option cloudFiles.schemaLocation. Auto Loader creates a hidden directory _schemas at this location to track schema changes to the input data over time.
The link below contains detailed documentation on the different options:
Auto Loader options | Databricks on AWS
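Setting the schema location on an Auto Loader stream might look like the sketch below. The option keys cloudFiles.format and cloudFiles.schemaLocation come from the Auto Loader options; the function name, its parameters, and the default file format are assumptions made for illustration, and the spark session is passed in rather than created here.

```python
def auto_loader_stream(spark, source_path, schema_location, file_format="json"):
    """Build an Auto Loader streaming read with a persistent schema location."""
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", file_format)
        # Auto Loader keeps the inferred schema under <schema_location>/_schemas.
        .option("cloudFiles.schemaLocation", schema_location)
        .load(source_path)
    )
```

Keeping the schema location stable across restarts is what lets the stream reuse the previously inferred schema.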


NEW QUESTION # 46
Data science team members are using a single cluster to perform data analysis. Although the cluster size was chosen to handle multiple users and auto-scaling was enabled, the team realized that queries are still running slowly. What would be the suggested fix for this?

  • A. Setup multiple clusters so each team member has their own cluster
  • B. Disable the auto-scaling feature
  • C. Use High concurrency mode instead of the standard mode
  • D. Increase the size of the driver node

Answer: C

Explanation:
The answer is: use High Concurrency mode instead of Standard mode.
High Concurrency clusters are ideal for groups of users who need to share resources or run ad-hoc jobs. Databricks recommends enabling autoscaling for High Concurrency clusters.
https://docs.databricks.com/clusters/cluster-config-best-practices.html#cluster-mode


NEW QUESTION # 47
When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps costs low?

  • A. Cluster: Existing All-Purpose Cluster;
    Retries: Unlimited;
    Maximum Concurrent Runs: 1
  • B. Cluster: New Job Cluster;
    Retries: None;
    Maximum Concurrent Runs: 1
  • C. Cluster: New Job Cluster;
    Retries: Unlimited;
    Maximum Concurrent Runs: Unlimited
  • D. Cluster: Existing All-Purpose Cluster;
    Retries: None;
    Maximum Concurrent Runs: 1
  • E. Cluster: New Job Cluster;
    Retries: Unlimited;
    Maximum Concurrent Runs: 1

Answer: E

Explanation:
The configuration that automatically recovers from query failures and keeps costs low is to use a new job cluster, set retries to unlimited, and set maximum concurrent runs to 1. This configuration has the following advantages:
* A new job cluster is a cluster that is created and terminated for each job run. This means that the cluster resources are only used when the job is running, and no idle costs are incurred. This also ensures that the cluster is always in a clean state and has the latest configuration and libraries for the job.
* Setting retries to unlimited means that the job will automatically restart the query in case of any failure, such as network issues, node failures, or transient errors. This improves the reliability and availability of the streaming job, and avoids data loss or inconsistency.
* Setting maximum concurrent runs to 1 means that only one instance of the job can run at a time. This prevents multiple queries from competing for the same resources or writing to the same output location, which can cause performance degradation or data corruption.
Therefore, this configuration is the best practice for scheduling Structured Streaming jobs for production, as it ensures that the job is resilient, efficient, and consistent.
References: Job clusters, Job retries, Maximum concurrent runs


NEW QUESTION # 48
What is a method of installing a Python package scoped at the notebook level to all nodes in the currently active cluster?

  • A. Run source env/bin/activate in a notebook setup script
  • B. Use %pip install in a notebook cell
  • C. Use %sh install in a notebook cell
  • D. Install libraries from PyPI using the cluster UI

Answer: B

Explanation:
Running %pip install in a notebook cell installs a notebook-scoped library: the package is made available on the driver and on all worker nodes of the currently active cluster, but only for that notebook's session. Installing a library from PyPI through the cluster UI also reaches all nodes, but it is cluster-scoped, so it affects every notebook attached to the cluster rather than a single notebook.
Reference:
Databricks Documentation on Libraries: Libraries


NEW QUESTION # 49
......

Free Databricks-Certified-Professional-Data-Engineer Exam Files Downloaded Instantly 100% Dumps & Practice Exam: https://realpdf.pass4suresvce.com/Databricks-Certified-Professional-Data-Engineer-pass4sure-vce-dumps.html