[Q12-Q37] Get instant access to Databricks-Certified-Professional-Data-Engineer Practice Tests 2022 Free Updated Today!

Welcome to download the newest PassLeader Databricks-Certified-Professional-Data-Engineer PDF dumps ( 61 Q&As)

NEW QUESTION 12
A table customerLocations exists with the following schema:
id STRING,
date STRING,
city STRING,
country STRING
A senior data engineer wants to create a new table from this table using the following command:
CREATE TABLE customersPerCountry AS
SELECT country,
       COUNT(*) AS customers
FROM customerLocations
GROUP BY country;
A junior data engineer asks why the schema is not being declared for the new table. Which of the following
responses explains why declaring the schema is not necessary?

  • A. CREATE TABLE AS SELECT statements result in tables where schemas are optional
  • B. CREATE TABLE AS SELECT statements result in tables that do not support schemas
  • C. CREATE TABLE AS SELECT statements infer the schema by scanning the data
  • D. CREATE TABLE AS SELECT statements adopt schema details from the source table and query
  • E. CREATE TABLE AS SELECT statements assign all columns the type STRING

Answer: D
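
The behavior in answer D can be sketched outside Databricks. The snippet below is only an illustration and uses Python's built-in sqlite3 module as a stand-in engine, so the column types (TEXT rather than STRING) and the sample rows are assumptions, not part of the question. It shows that the new table takes its column names from the query without any declared schema.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a Databricks table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customerLocations (id TEXT, date TEXT, city TEXT, country TEXT)"
)
# Hypothetical sample rows, invented for this illustration.
conn.executemany(
    "INSERT INTO customerLocations VALUES (?, ?, ?, ?)",
    [
        ("1", "2022-01-01", "Paris", "FR"),
        ("2", "2022-01-02", "Lyon", "FR"),
        ("3", "2022-01-03", "Berlin", "DE"),
    ],
)

# No column list is declared; the columns come from the SELECT.
conn.execute(
    "CREATE TABLE customersPerCountry AS "
    "SELECT country, COUNT(*) AS customers "
    "FROM customerLocations GROUP BY country"
)

# Inspect the resulting column names and contents.
columns = [row[1] for row in conn.execute("PRAGMA table_info(customersPerCountry)")]
rows = sorted(conn.execute("SELECT country, customers FROM customersPerCountry"))
print(columns)
print(rows)
```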

 

NEW QUESTION 13
A data engineer needs to dynamically create a table name string using three Python variables: region, store,
and year. An example of a table name is below when region = "nyc", store = "100", and year = "2021":
nyc100_sales_2021
Which of the following commands should the data engineer use to construct the table name in Python?

  • A. "{region}{store}_sales_{year}"
  • B. f"{region}+{store}+_sales_+{year}"
  • C. "{region}+{store}+_sales_+{year}"
  • D. "{region}+{store}+"_sales_"+{year}"
  • E. f"{region}{store}_sales_{year}"

Answer: E
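
A minimal sketch of why the f prefix matters, using the example values from the question. The `{year}` placeholder is taken from the question text, which lists year as the third variable; the name `literal` is only for illustration.

```python
# Example values from the question.
region = "nyc"
store = "100"
year = "2021"

# With the f prefix, each {name} is replaced by the variable's value.
table_name = f"{region}{store}_sales_{year}"

# Without the f prefix, the braces are kept as literal text.
literal = "{region}{store}_sales_{year}"

print(table_name)  # nyc100_sales_2021
print(literal)     # {region}{store}_sales_{year}
```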

 

NEW QUESTION 14
What are the advantages of feature hashing?

  • A. Requires less memory
  • B. One less pass through the training data
  • C. Easily reverse engineer vectors to determine which original feature mapped to a vector location

Answer: A,B

Explanation:
SGD-based classifiers avoid the need to predetermine vector size by simply picking a reasonable size and
shoehorning the training data into vectors of that size. This approach is known as feature hashing. The
shoehorning is done by picking one or more locations by using a hash of the name of the variable for
continuous variables or a hash of the variable name and the category name or word for categorical, text-like, or
word-like data.
This hashed feature approach has the distinct advantage of requiring less memory and one less pass through
the training data, but it can make it much harder to reverse engineer vectors to determine which original
feature mapped to a vector location. This is because multiple features may hash to the same location. With
large vectors or with multiple locations per feature, this isn't a problem for accuracy but it can make it hard to
understand what a classifier is doing.
An additional benefit of feature hashing is that the unknown and unbounded vocabularies typical of word-like
variables aren't a problem.
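
The hashing scheme described above can be sketched in a few lines. This is an illustration under assumptions, not the method of any particular library: the function name `hash_features`, the vector size of 16, the use of MD5, and the single location per feature are all choices made for the example.

```python
import hashlib


def hash_features(features, size=16):
    """Map a dict of feature name -> value into a fixed-size vector."""
    vector = [0.0] * size
    for name, value in features.items():
        if isinstance(value, str):
            # Categorical or word-like: hash the variable name plus the category.
            key, weight = f"{name}={value}", 1.0
        else:
            # Continuous: hash the variable name only and store the value.
            key, weight = name, float(value)
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        index = int(digest, 16) % size
        # Different features may land on the same index (a collision),
        # which is why the vector is hard to reverse engineer.
        vector[index] += weight
    return vector


example = hash_features({"age": 42, "city": "Paris"})
print(len(example))
```

The vector length is fixed before any data is seen, so no separate pass is needed to build a vocabulary, and an unseen category still maps to some location.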

 

NEW QUESTION 15
Which of the following is a continuous probability distribution?

  • A. Normal probability distribution
  • B. Binomial probability distribution
  • C. Negative binomial distribution
  • D. Poisson probability distribution

Answer: A
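
The distinction can be illustrated with the Python standard library. This is a sketch with arbitrarily chosen parameters (a standard normal, and a binomial with n = 10 and p = 0.3); the helper `binomial_pmf` is written here for illustration.

```python
from math import comb
from statistics import NormalDist

# Continuous: the normal distribution has a density at every real x,
# and probabilities are assigned to intervals.
normal = NormalDist(mu=0.0, sigma=1.0)
density_at_half = normal.pdf(0.5)
prob_interval = normal.cdf(1.0) - normal.cdf(-1.0)


# Discrete: the binomial distribution assigns probability only to whole counts.
def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Summing the probability mass over all possible counts gives 1.
pmf_total = sum(binomial_pmf(k, 10, 0.3) for k in range(11))

print(round(prob_interval, 4))
print(round(pmf_total, 4))
```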

 

NEW QUESTION 16
Which of the following Structured Streaming queries is performing a hop from a Bronze table to a Silver
table?

  • A. (spark.readStream.load(rawSalesLocation)
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("append")
    .table("uncleanedSales")
    )
  • B. (spark.read.load(rawSalesLocation)
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("append")
    .table("uncleanedSales")
    )
  • C. (spark.table("sales")
    .withColumn("avgPrice", col("sales") / col("units"))
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("append")
    .table("cleanedSales")
    )
  • D. (spark.table("sales")
    .agg(sum("sales"),
    sum("units"))
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("complete")
    .table("aggregatedSales")
    )
  • E. (spark.table("sales")
    .groupBy("store")
    .agg(sum("sales"))
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("complete")
    .table("aggregatedSales")
    )

Answer: C

 

NEW QUESTION 17
A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then
perform a streaming write into a new table. The code block used by the data engineer is below:
(spark.table("sales")
.withColumn("avg_price", col("sales") / col("units"))
.writeStream
.option("checkpointLocation", checkpointPath)
.outputMode("complete")
._____
.table("new_sales")
)
If the data engineer only wants the query to execute a single micro-batch to process all of the available data,
which of the following lines of code should the data engineer use to fill in the blank?

  • A. .processingTime("once")
  • B. .processingTime(1)
  • C. .trigger(once=True)
  • D. .trigger(processingTime="once")
  • E. .trigger(continuous="once")

Answer: C

 

NEW QUESTION 18
A data engineer has created a Delta table as part of a data pipeline. Downstream data analysts now need
SELECT permission on the Delta table.
Assuming the data engineer is the Delta table owner, which part of the Databricks Lakehouse Platform can
the data engineer use to grant the data analysts the appropriate access?

  • A. Databricks Filesystem
  • B. Data Explorer
  • C. Jobs
  • D. Dashboards
  • E. Repos

Answer: B

 

NEW QUESTION 19
Which of the following SQL keywords can be used to append new rows to an existing Delta table?

  • A. COPY
  • B. INSERT INTO
  • C. UPDATE
  • D. UNION
  • E. DELETE

Answer: B

 

NEW QUESTION 20
Which of the following describes how Databricks Repos can help facilitate CI/CD workflows on the
Databricks Lakehouse Platform?

  • A. Databricks Repos can commit or push code changes to trigger a CI/CD process
  • B. Databricks Repos can be used to design, develop, and trigger Git automation pipelines
  • C. Databricks Repos can merge changes from a secondary Git branch into a main Git branch
  • D. Databricks Repos can facilitate the pull request, review, and approval process before merging branches
  • E. Databricks Repos can store the single-source-of-truth Git repository

Answer: A

 

NEW QUESTION 21
Which of the following describes a benefit of a data lakehouse that is unavailable in a traditional data
warehouse?

  • A. A data lakehouse couples storage and compute for complete control
  • B. A data lakehouse provides a relational system of data management
  • C. A data lakehouse enables both batch and streaming analytics
  • D. A data lakehouse utilizes proprietary storage formats for data
  • E. A data lakehouse captures snapshots of data for version control purposes

Answer: C

 

NEW QUESTION 22
A data analyst has noticed that their Databricks SQL queries are running too slowly. They claim that this issue
is affecting all of their sequentially run queries. They ask the data engineering team for help. The data
engineering team notices that each of the queries uses the same SQL endpoint, but the SQL endpoint is not
used by any other user.
Which of the following approaches can the data engineering team use to improve the latency of the data
analyst's queries?

  • A. They can turn on the Serverless feature for the SQL endpoint
  • B. They can increase the cluster size of the SQL endpoint
  • C. They can turn on the Auto Stop feature for the SQL endpoint
  • D. They can increase the maximum bound of the SQL endpoint's scaling range
  • E. They can turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy to
    "Reliability Optimized"

Answer: B

 

NEW QUESTION 23
A Delta Live Table pipeline includes two datasets defined using STREAMING LIVE TABLE.
Three datasets are defined against Delta Lake table sources using LIVE TABLE. The pipeline is configured to
run in Development mode using the Triggered Pipeline Mode.
Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after
clicking Start to update the pipeline?

  • A. All datasets will be updated continuously and the pipeline will not shut down. The compute resources
    will persist with the pipeline
  • B. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will
    persist after the pipeline is stopped to allow for additional testing
  • C. All datasets will be updated once and the pipeline will shut down. The compute resources will persist to
    allow for additional testing
  • D. All datasets will be updated once and the pipeline will shut down. The compute resources will be
    terminated
  • E. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will
    be deployed for the update and terminated when the pipeline is stopped

Answer: C

 

NEW QUESTION 24
Which of the following locations hosts the driver and worker nodes of a Databricks-managed cluster?

  • A. Databricks Filesystem
  • B. Data plane
  • C. JDBC data source
  • D. Databricks web application
  • E. Control plane

Answer: B

Explanation:
See the Databricks high-level architecture

 

NEW QUESTION 25
A data engineering team has created a series of tables using Parquet data stored in an external system. The
team is noticing that after appending new rows to the data in the external system, their queries within
Databricks are not returning the new rows. They identify the caching of the previous data as the cause of this
issue.
Which of the following approaches will ensure that the data returned by queries is always up-to-date?

  • A. The tables should be refreshed in the writing cluster before the next query is run
  • B. The tables should be updated before the next query is run
  • C. The tables should be stored in a cloud-based external system
  • D. The tables should be converted to the Delta format
  • E. The tables should be altered to include metadata to not cache

Answer: D

 

NEW QUESTION 26
A data engineer is designing a data pipeline. The source system generates files in a shared directory that is also
used by other processes. As a result, the files should be kept as is and will accumulate in the directory. The
data engineer needs to identify which files are new since the previous run in the pipeline, and set up the
pipeline to only ingest those new files with each run.
Which of the following tools can the data engineer use to solve this problem?

  • A. Databricks SQL
  • B. Delta Lake
  • C. Unity Catalog
  • D. Data Explorer
  • E. Auto Loader

Answer: E
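
The requirement in this question (leave the files in place, ingest only those not seen in an earlier run) can be pictured with a small standard-library sketch. It is a conceptual illustration only and does not show how Auto Loader is implemented; the function `new_files`, the JSON checkpoint file, and the file names are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def new_files(source_dir, checkpoint_file):
    """Return files not seen in earlier runs and record them in the checkpoint."""
    checkpoint = Path(checkpoint_file)
    # Load the names recorded by previous runs, if any.
    seen = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    current = {p.name for p in Path(source_dir).iterdir() if p.is_file()}
    fresh = sorted(current - seen)
    # Persist the updated state; the source files themselves are left untouched.
    checkpoint.write_text(json.dumps(sorted(seen | current)))
    return fresh


with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as state:
    ckpt = Path(state) / "seen.json"
    (Path(src) / "a.json").write_text("{}")
    first_run = new_files(src, ckpt)   # only a.json exists so far
    (Path(src) / "b.json").write_text("{}")
    second_run = new_files(src, ckpt)  # a.json is still present but already recorded

print(first_run)
print(second_run)
```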

 

NEW QUESTION 27
In which phase of the data analytics lifecycle do Data Scientists spend the most time in a project?

  • A. Data Preparation
  • B. Model Building
  • C. Discovery
  • D. Communicate Results

Answer: A

 

NEW QUESTION 28
An engineering manager uses a Databricks SQL query to monitor their team's progress on fixes related to
customer-reported bugs. The manager checks the results of the query every day, but they are manually
rerunning the query each day and waiting for the results.
Which of the following approaches can the manager use to ensure the results of the query are updated each
day?

  • A. They can schedule the query to refresh every 1 day from the SQL endpoint's page in Databricks SQL
  • B. They can schedule the query to run every 12 hours from the Jobs UI
  • C. They can schedule the query to run every 1 day from the Jobs UI
  • D. They can schedule the query to refresh every 1 day from the query's page in Databricks SQL
  • E. They can schedule the query to refresh every 12 hours from the SQL endpoint's page in Databricks SQL

Answer: D

 

NEW QUESTION 29
A data engineering manager has noticed that each of the queries in a Databricks SQL dashboard takes a few
minutes to update when they manually click the "Refresh" button. They are curious why this might be
occurring, so a team member provides a variety of reasons on why the delay might be occurring.
Which of the following reasons fails to explain why the dashboard might be taking a few minutes to update?

  • A. The queries attached to the dashboard might first be checking to determine if new data is available
  • B. The queries attached to the dashboard might take a few minutes to run under normal circumstances
  • C. The queries attached to the dashboard might all be connected to their own, unstarted Databricks clusters
  • D. The SQL endpoint being used by each of the queries might need a few minutes to start up
  • E. The Job associated with updating the dashboard might be using a non-pooled endpoint

Answer: E

 

NEW QUESTION 30
A data engineer needs to create a database called customer360 at the location /customer/customer360. The
data engineer is unsure if one of their colleagues has already created the database.
Which of the following commands should the data engineer run to complete this task?

  • A. CREATE DATABASE customer360 DELTA LOCATION '/customer/customer360';
  • B. CREATE DATABASE IF NOT EXISTS customer360 DELTA LOCATION '/customer/customer360';
  • C. CREATE DATABASE IF NOT EXISTS customer360 LOCATION '/customer/customer360';
  • D. CREATE DATABASE IF NOT EXISTS customer360;
  • E. CREATE DATABASE customer360 LOCATION '/customer/customer360';

Answer: C

 

NEW QUESTION 31
A junior data engineer needs to create a Spark SQL table my_table for which Spark manages both the data and
the metadata. The metadata and data should also be stored in the Databricks Filesystem (DBFS).
Which of the following commands should a senior data engineer share with the junior data engineer to
complete this task?

  • A. CREATE TABLE my_table (id STRING, value STRING);
  • B. CREATE TABLE my_table (id STRING, value STRING) USING DBFS;
  • C. CREATE TABLE my_table (id STRING, value STRING) USING
    org.apache.spark.sql.parquet OPTIONS (PATH "storage-path")
  • D. CREATE MANAGED TABLE my_table (id STRING, value STRING) USING
    org.apache.spark.sql.parquet OPTIONS (PATH "storage-path");
  • E. CREATE MANAGED TABLE my_table (id STRING, value STRING);

Answer: A

 

NEW QUESTION 32
You are using k-means clustering to classify heart patients for a hospital. You have chosen Patient Sex,
Height, Weight, Age and Income as measures and have used 3 clusters. When you create a pair-wise plot of
the clusters, you notice that there is significant overlap between the clusters. What should you do?

  • A. Remove one of the measures
  • B. Increase the number of clusters
  • C. Decrease the number of clusters
  • D. Identify additional measures to add to the analysis

Answer: C

 

NEW QUESTION 33
Which of the following data workloads will utilize a Bronze table as its source?

  • A. A job that enriches data by parsing its timestamps into a human-readable format
  • B. A job that ingests raw data from a streaming source into the Lakehouse
  • C. A job that aggregates cleaned data to create standard summary statistics
  • D. A job that queries aggregated data to publish key insights into a dashboard
  • E. A job that develops a feature set for a machine learning application

Answer: A

 

NEW QUESTION 34
A data engineer is overwriting data in a table by deleting the table and recreating the table. Another data
engineer suggests that this is inefficient and the table should simply be overwritten instead.
Which of the following reasons to overwrite the table instead of deleting and recreating the table is incorrect?

  • A. Overwriting a table allows for concurrent queries to be completed while in progress
  • B. Overwriting a table results in a clean table history for logging and audit purposes
  • C. Overwriting a table is efficient because no files need to be deleted
  • D. Overwriting a table is an atomic operation and will not leave the table in an unfinished state
  • E. Overwriting a table maintains the old version of the table for Time Travel

Answer: B

 

NEW QUESTION 35
A data engineering team is in the process of converting their existing data pipeline to utilize Auto Loader for
incremental processing in the ingestion of JSON files. One data engineer comes across the following code
block in the Auto Loader documentation:
streaming_df = (spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", schemaLocation)
.load(sourcePath))
Assuming that schemaLocation and sourcePath have been set correctly, which of the following changes does
the data engineer need to make to convert this code block to use Auto Loader to ingest the data?

  • A. There is no change required. The data engineer needs to ask their administrator to turn on Auto Loader
  • B. The data engineer needs to add the .autoLoader line before the .load(sourcePath) line
  • C. There is no change required. The inclusion of format("cloudFiles") enables the use of Auto Loader
  • D. The data engineer needs to change the format("cloudFiles") line to format("autoLoader")
  • E. There is no change required. Databricks automatically uses Auto Loader for streaming reads

Answer: C

 

NEW QUESTION 36
......

Oct-2022 Latest Pass4suresVCE Databricks-Certified-Professional-Data-Engineer Exam Dumps with PDF and Exam Engine: https://realpdf.pass4suresvce.com/Databricks-Certified-Professional-Data-Engineer-pass4sure-vce-dumps.html