Databricks Certified Associate Developer for Apache Spark 3.0 Exam Online Training

Question #1

Which of the following code blocks silently writes DataFrame itemsDf in avro format to location fileLocation if a file does not yet exist at that location?

  • A. itemsDf.write.avro(fileLocation)
  • B. itemsDf.write.format("avro").mode("ignore").save(fileLocation)
  • C. itemsDf.write.format("avro").mode("errorifexists").save(fileLocation)
  • D. itemsDf.save.format("avro").mode("ignore").write(fileLocation)
  • E. spark.DataFrameWriter(itemsDf).format("avro").write(fileLocation)
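
For reference, a minimal sketch of the silent-write pattern, assuming itemsDf and fileLocation are defined and the external spark-avro package is available on the cluster:

    # mode("ignore") silently skips the write if data already exists
    # at the target location, instead of raising an error.
    itemsDf.write.format("avro").mode("ignore").save(fileLocation)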

Question #2

Which of the elements that are labeled with a circle and a number contain an error or are misrepresented?

[The diagram referenced by this question is not reproduced in this transcript.]

  • A. 1, 10
  • B. 1, 8
  • C. 10
  • D. 7, 9, 10
  • E. 1, 4, 6, 9

Question #30

5

Question #32

Which of the following code blocks displays the 10 rows with the smallest values of column value in DataFrame transactionsDf in a nicely formatted way?

  • A. transactionsDf.sort(asc(value)).show(10)
  • B. transactionsDf.sort(col("value")).show(10)
  • C. transactionsDf.sort(col("value").desc()).head()
  • D. transactionsDf.sort(col("value").asc()).print(10)
  • E. transactionsDf.orderBy("value").asc().show(10)
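
A minimal sketch of the ascending-sort-and-show pattern, assuming transactionsDf has a numeric column value:

    from pyspark.sql.functions import col

    # sort() orders ascending by default; show(10) prints the first
    # 10 rows as a formatted ASCII table.
    transactionsDf.sort(col("value")).show(10)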

Question #34

Which of the following code blocks can be used to save DataFrame transactionsDf to memory only, recalculating partitions that do not fit in memory when they are needed?

  • A. from pyspark import StorageLevel; transactionsDf.cache(StorageLevel.MEMORY_ONLY)
  • B. transactionsDf.cache()
  • C. transactionsDf.storage_level('MEMORY_ONLY')
  • D. transactionsDf.persist()
  • E. transactionsDf.clear_persist()
  • F. from pyspark import StorageLevel; transactionsDf.persist(StorageLevel.MEMORY_ONLY)
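
A minimal sketch of memory-only persistence, assuming transactionsDf is defined:

    from pyspark import StorageLevel

    # MEMORY_ONLY keeps partitions in memory only; partitions that do not
    # fit are recomputed from lineage when they are needed again.
    transactionsDf.persist(StorageLevel.MEMORY_ONLY)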

Question #55

spark.read.json(filePath, schema=schema)

C. spark.read.json(filePath, schema=schema_of_json(json_schema))

D. spark.read.json(filePath, schema=spark.read.json(json_schema))
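
The surviving fragments point to reading JSON with an explicit schema. A minimal sketch, assuming filePath points at JSON files; the schema below is purely illustrative, since the question's actual schema is not reproduced:

    from pyspark.sql.types import StructType, StructField, IntegerType, StringType

    # Illustrative schema; the question's real schema was lost.
    schema = StructType([
        StructField("id", IntegerType()),
        StructField("name", StringType()),
    ])

    df = spark.read.json(filePath, schema=schema)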

Question #77

"left_semi"

Question #98

parquet
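
Only the answer fragment "parquet" survives here; it names Spark's default columnar file format. A minimal read/write sketch with hypothetical paths:

    # Parquet is the default source for DataFrameReader and DataFrameWriter.
    df = spark.read.parquet("/path/to/input")
    df.write.mode("overwrite").parquet("/path/to/output")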

Question #120

parquet

Question #121

Which of the following is a viable way to improve Spark’s performance when dealing with large amounts of data, given that there is only a single application running on the cluster?

  • A. Increase values for the properties spark.default.parallelism and spark.sql.shuffle.partitions
  • B. Decrease values for the properties spark.default.parallelism and spark.sql.partitions
  • C. Increase values for the properties spark.sql.parallelism and spark.sql.partitions
  • D. Increase values for the properties spark.sql.parallelism and spark.sql.shuffle.partitions
  • E. Increase values for the properties spark.dynamicAllocation.maxExecutors, spark.default.parallelism, and spark.sql.shuffle.partitions
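
A minimal sketch of raising these two properties, with purely illustrative values:

    # spark.sql.shuffle.partitions can be changed at runtime and controls
    # the number of partitions used for DataFrame shuffles.
    spark.conf.set("spark.sql.shuffle.partitions", 2000)

    # spark.default.parallelism applies to RDD operations and must be set
    # when the SparkContext is created, e.g. via
    #   spark-submit --conf spark.default.parallelism=2000 ...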

Question #122

Which of the following is the deepest level in Spark’s execution hierarchy?

  • A. Job
  • B. Task
  • C. Executor
  • D. Slot
  • E. Stage

Question #135

count()

Question #136

Which of the following code blocks returns all unique values across all values in columns value and productId in DataFrame transactionsDf in a one-column DataFrame?

  • A. transactionsDf.select('value').join(transactionsDf.select('productId'), col('value')==col('productId'), 'outer')
  • B. transactionsDf.select(col('value'), col('productId')).agg({'*': 'count'})
  • C. transactionsDf.select('value', 'productId').distinct()
  • D. transactionsDf.select('value').union(transactionsDf.select('productId')).distinct()
  • E. transactionsDf.agg({'value': 'collect_set', 'productId': 'collect_set'})
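
A minimal sketch of the union-then-distinct pattern, assuming transactionsDf has columns value and productId:

    # union() stacks the two single-column DataFrames; distinct() then
    # removes duplicates, yielding one column of all unique values.
    transactionsDf.select("value").union(transactionsDf.select("productId")).distinct()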

Question #145

[The question's code block and its expected output table are not reproduced in this transcript.]

  • A. The column names should be listed directly as arguments to the operator and not as a list.
  • B. The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all column names should be expressed as strings without being wrapped in a col() operator.
  • C. The select operator should be replaced by a drop operator.
  • D. The column names should be listed directly as arguments to the operator and not as a list and following the pattern of how column names are expressed in the code block, columns productId and f should be replaced by transactionId, predError, value and storeId.
  • E. The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all col() operators should be removed.
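
The options revolve around replacing a select of unwanted columns with drop. A minimal sketch of the drop pattern they describe, using the column names productId and f mentioned in option D (the original code block is not reproduced above):

    # drop() takes column names directly as string arguments,
    # not as a list and not wrapped in col().
    transactionsDf.drop("productId", "f")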

Question #166

col("value")

Question #167

Which of the following code blocks returns a DataFrame showing the mean value of column "value" of DataFrame transactionsDf, grouped by its column storeId?

  • A. transactionsDf.groupBy(col(storeId).avg())
  • B. transactionsDf.groupBy("storeId").avg(col("value"))
  • C. transactionsDf.groupBy("storeId").agg(avg("value"))
  • D. transactionsDf.groupBy("storeId").agg(average("value"))
  • E. transactionsDf.groupBy("value").average()
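
A minimal sketch of the group-and-average pattern, assuming transactionsDf has columns storeId and value:

    from pyspark.sql.functions import avg

    # Groups rows by storeId and computes the mean of value per group.
    transactionsDf.groupBy("storeId").agg(avg("value"))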

Question #169

spark.createDataFrame([("red",), ("blue",), ("green",)], "color")

  • A . Instead of calling spark.createDataFrame, just DataFrame should be called.
  • B . The commas in the tuples with the colors should be eliminated.
  • C . The colors red, blue, and green should be expressed as a simple Python list, and not a list of tuples.
  • D . Instead of color, a data type should be specified.
  • E . The "color" expression needs to be wrapped in brackets, so it reads ["color"].

Question #171

Which of the following code blocks stores DataFrame itemsDf in executor memory and, if insufficient memory is available, serializes it and saves it to disk?

  • A . itemsDf.persist(StorageLevel.MEMORY_ONLY)
  • B . itemsDf.cache(StorageLevel.MEMORY_AND_DISK)
  • C . itemsDf.store()
  • D . itemsDf.cache()
  • E . itemsDf.write.option('destination', 'memory').save()
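
For reference, a minimal sketch of explicit memory-and-disk persistence; note that DataFrame.cache() also uses the MEMORY_AND_DISK level by default:

from pyspark import StorageLevel
# keep partitions in executor memory, spilling serialized data to disk when memory runs out
itemsDf.persist(StorageLevel.MEMORY_AND_DISK)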

Question #183

  • A . transactionsDf.max('value').min('value')
  • B . transactionsDf.agg(max('value').alias('highest'), min('value').alias('lowest'))
  • C . transactionsDf.groupby(col(productId)).agg(max(col(value)).alias("highest"), min(col(value)).alias("lowest"))
  • D . transactionsDf.groupby('productId').agg(max('value').alias('highest'), min('value').alias('lowest'))
  • E . transactionsDf.groupby("productId").agg({"highest": max("value"), "lowest": min("value")})

Question #187

spark.createDataFrame((("summer", 4.5), ("winter", 7.5)), T.StructType([T.StructField("season", T.CharType()), T.StructField("season", T.DoubleType())]))

  • D . spark.newDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])
  • E . spark.createDataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})
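
A hedged sketch of the two-column creation pattern, assuming a SparkSession named spark:

# a list of tuples plus a list of column names is a valid createDataFrame call
seasonsDf = spark.createDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])
seasonsDf.show()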

Question #200

articlesDf = articlesDf.groupby("col").count()

  • B . 4, 5
  • C . 2, 5, 3
  • D . 5, 2
  • E . 2, 3, 4
  • F . 2, 5, 4

Question #217

"MM d (EEE)"

Question #220

itemsDf.withColumnRenamed("supplier", "feature1")

  • C . itemsDf.withColumnRenamed(col("attributes"), col("feature0"), col("supplier"), col("feature1"))
  • D . itemsDf.withColumnRenamed("attributes", "feature0").withColumnRenamed("supplier", "feature1")
  • E . itemsDf.withColumn("attributes", "feature0").withColumn("supplier", "feature1")
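
A minimal sketch of chained renames, the pattern option D shows, assuming itemsDf has columns attributes and supplier:

# withColumnRenamed takes the existing name first and the new name second; chain one call per column
itemsDf.withColumnRenamed("attributes", "feature0").withColumnRenamed("supplier", "feature1")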

Question #227

importedDf = spark.read.json(jsonPath)

  • A . 4, 1, 2
  • B . 5, 1, 3
  • C . 5, 2
  • D . 4, 1, 3
  • E . 5, 1, 2
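
A hedged sketch of the JSON import this line belongs to (jsonPath, and the explicit schema, are assumptions):

# schema() pins the structure up front instead of letting spark.read.json infer it
importedDf = spark.read.schema(schema).json(jsonPath)
importedDf.show()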

Question #229

Which of the following code blocks returns a copy of DataFrame transactionsDf where the column storeId has been converted to string type?

  • A . transactionsDf.withColumn("storeId", convert("storeId", "string"))
  • B . transactionsDf.withColumn("storeId", col("storeId", "string"))
  • C . transactionsDf.withColumn("storeId", col("storeId").convert("string"))
  • D . transactionsDf.withColumn("storeId", col("storeId").cast("string"))
  • E . transactionsDf.withColumn("storeId", convert("storeId").as("string"))

Question #230

Which of the following code blocks writes DataFrame itemsDf to disk at storage location filePath, making sure to substitute any existing data at that location?

  • A . itemsDf.write.mode("overwrite").parquet(filePath)
  • B . itemsDf.write.option("parquet").mode("overwrite").path(filePath)
  • C . itemsDf.write(filePath, mode="overwrite")
  • D . itemsDf.write.mode("overwrite").path(filePath)
  • E . itemsDf.write().parquet(filePath, mode="overwrite")
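
A minimal sketch of an overwriting parquet write, assuming filePath points at the target location:

# mode("overwrite") replaces any data already present at filePath
itemsDf.write.mode("overwrite").parquet(filePath)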

Question #232

Which of the following statements about executors is correct, assuming that one can consider each of the JVMs working as executors as a pool of task execution slots?

  • A . Slot is another name for executor.
  • B . There must be fewer executors than tasks.
  • C . An executor runs on a single core.
  • D . There must be more slots than tasks.
  • E . Tasks run in parallel via slots.

Question #233

Which of the following code blocks returns a DataFrame with an added column to DataFrame transactionsDf that shows the unix epoch timestamps in column transactionDate as strings in the format month/day/year in column transactionDateFormatted?

Excerpt of DataFrame transactionsDf:

  • A . transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="dd/MM/yyyy"))
  • B . transactionsDf.withColumnRenamed("transactionDate", "transactionDateFormatted", from_unixtime("transactionDateFormatted", format="MM/dd/yyyy"))
  • C . transactionsDf.apply(from_unixtime(format="MM/dd/yyyy")).asColumn("transactionDateFormatted")
  • D . transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="MM/dd/yyyy"))
  • E . transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate"))

Question #235

The code block displayed below contains an error. The code block is intended to write DataFrame transactionsDf to disk as a parquet file in location /FileStore/transactions_split, using column storeId as key for partitioning. Find the error.

Code block:

transactionsDf.write.format("parquet").partitionOn("storeId").save("/FileStore/transactions_split")

  • A . The format("parquet") expression is inappropriate to use here, "parquet" should be passed as first argument to the save() operator and "/FileStore/transactions_split" as the second argument.
  • B . Partitioning data by storeId is possible with the partitionBy expression, so partitionOn should be replaced by partitionBy.
  • C . Partitioning data by storeId is possible with the bucketBy expression, so partitionOn should be replaced by bucketBy.
  • D . partitionOn("storeId") should be called before the write operation.
  • E . The format("parquet") expression should be removed and instead, the information should be added to the write expression like so: write("parquet").

Question #250

spark.sql(statement).drop("value", "storeId", "attributes")
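
A hedged sketch of how a statement-plus-drop pipeline like this fits together (the view name and the SQL string are assumptions):

# a temp view makes the DataFrame queryable from SQL
transactionsDf.createOrReplaceTempView("transactionsDf")
statement = "SELECT * FROM transactionsDf"  # hypothetical statement
spark.sql(statement).drop("value", "storeId", "attributes")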

Question #260

transactionsDf.withColumn("result", evaluateTestSuccess(col("storeId")))

Question #261

Which of the following statements about broadcast variables is correct?

  • A . Broadcast variables are serialized with every single task.
  • B . Broadcast variables are commonly used for tables that do not fit into memory.
  • C . Broadcast variables are immutable.
  • D . Broadcast variables are occasionally dynamically updated on a per-task basis.
  • E . Broadcast variables are local to the worker node and not shared across the cluster.
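
For reference, a minimal broadcast-variable sketch; the variable is shipped to each executor once and is read-only there:

# broadcast a small lookup structure and access it via .value
lookup = spark.sparkContext.broadcast({"store_1": "Berlin", "store_2": "Oslo"})
print(lookup.value["store_1"])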

Question #262

The code block displayed below contains an error. The code block should count the number of rows that have a predError of either 3 or 6. Find the error.

Code block:

transactionsDf.filter(col('predError').in([3, 6])).count()

  • A . The number of rows cannot be determined with the count() operator.
  • B . Instead of filter, the select method should be used.
  • C . The method used on column predError is incorrect.
  • D . Instead of a list, the values need to be passed as single arguments to the in operator.
  • E . Numbers 3 and 6 need to be passed as string variables.
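
For reference, the working variant uses Column.isin; .in is not even valid Python attribute syntax, since in is a reserved keyword:

from pyspark.sql.functions import col
# count rows whose predError is 3 or 6
transactionsDf.filter(col("predError").isin([3, 6])).count()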

Question #263

Which of the following code blocks returns a new DataFrame with the same columns as DataFrame transactionsDf, except for columns predError and value which should be removed?

  • A . transactionsDf.drop(["predError", "value"])
  • B . transactionsDf.drop("predError", "value")
  • C . transactionsDf.drop(col("predError"), col("value"))
  • D . transactionsDf.drop(predError, value)
  • E . transactionsDf.drop("predError & value")

Question #275

spark.read.options("modifiedBefore", "2029-03-20T05:44:46").schema(schema).load(filePath)

  • A . The attributes array is specified incorrectly, Spark cannot identify the file format, and the syntax of the call to Spark’s DataFrameReader is incorrect.
  • B . Columns in the schema definition use the wrong object type and the syntax of the call to Spark’s DataFrameReader is incorrect.
  • C . The data type of the schema is incompatible with the schema() operator and the modification date threshold is specified incorrectly.
  • D . Columns in the schema definition use the wrong object type, the modification date threshold is specified incorrectly, and Spark cannot identify the file format.
  • E . Columns in the schema are unable to handle empty values and the modification date threshold is specified incorrectly.
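
A hedged sketch of a corrected reader call (schema and filePath are assumptions, and the file is taken to be parquet):

# option() sets one key-value pair; modifiedBefore filters input files by modification timestamp
spark.read.option("modifiedBefore", "2029-03-20T05:44:46").schema(schema).format("parquet").load(filePath)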

Question #276

Which of the following code blocks stores a part of the data in DataFrame itemsDf on executors?

  • A . itemsDf.cache().count()
  • B . itemsDf.cache(eager=True)
  • C . cache(itemsDf)
  • D . itemsDf.cache().filter()
  • E . itemsDf.rdd.storeCopy()
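
For reference, caching is lazy; an action such as count() is what materializes data on the executors:

# count() triggers evaluation, so the scanned partitions end up stored in memory
itemsDf.cache().count()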

Question #277

The code block displayed below contains an error. The code block is intended to join DataFrame itemsDf with the larger DataFrame transactionsDf on column itemId. Find the error.

Code block:

transactionsDf.join(itemsDf, "itemId", how="broadcast")

  • A . The syntax is wrong, how= should be removed from the code block.
  • B . The join method should be replaced by the broadcast method.
  • C . Spark will only perform the broadcast operation if this behavior has been enabled on the Spark cluster.
  • D . The larger DataFrame transactionsDf is being broadcasted, rather than the smaller DataFrame itemsDf.
  • E . broadcast is not a valid join type.
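
A minimal sketch of a broadcast join expressed through the broadcast() function rather than a join type:

from pyspark.sql.functions import broadcast
# mark the smaller DataFrame as broadcastable; the join itself stays a regular inner join
transactionsDf.join(broadcast(itemsDf), "itemId")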

Question #279

  • A . print(itemsDf.types)
  • B . itemsDf.printSchema()
  • C . spark.schema(itemsDf)
  • D . itemsDf.rdd.printSchema()
  • E . itemsDf.print.schema()
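
For reference, the schema printer is a DataFrame method:

# prints column names, types and nullability as a tree
itemsDf.printSchema()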

Question #281

The code block displayed below contains an error. The code block should return the average of rows in column value grouped by unique storeId. Find the error.

Code block:

transactionsDf.agg("storeId").avg("value")

  • A . Instead of avg("value"), avg(col("value")) should be used.
  • B . The avg("value") should be specified as a second argument to agg() instead of being appended to it.
  • C . All column names should be wrapped in col() operators.
  • D . agg should be replaced by groupBy.
  • E . "storeId" and "value" should be swapped.

Question #282

Which of the following statements about the differences between actions and transformations is correct?

  • A . Actions are evaluated lazily, while transformations are not evaluated lazily.
  • B . Actions generate RDDs, while transformations do not.
  • C . Actions do not send results to the driver, while transformations do.
  • D . Actions can be queued for delayed execution, while transformations can only be processed immediately.
  • E . Actions can trigger Adaptive Query Execution, while transformations cannot.
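
For reference, a two-line illustration of that split, assuming transactionsDf exists:

filtered = transactionsDf.filter("value > 0")  # transformation: builds a plan, nothing executes yet
filtered.count()  # action: triggers execution and returns a result to the driver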

Question #290

Code block:

itemsDf.withColumnRenamed("itemNameElements", split("itemName"))

  • A . All column names need to be wrapped in the col() operator.
  • B . Operator withColumnRenamed needs to be replaced with operator withColumn and a second argument "," needs to be passed to the split method.
  • C . Operator withColumnRenamed needs to be replaced with operator withColumn and the split method needs to be replaced by the splitString method.
  • D . Operator withColumnRenamed needs to be replaced with operator withColumn and a second argument " " needs to be passed to the split method.
  • E . The expressions "itemNameElements" and split("itemName") need to be swapped.

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

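For reference, a minimal sketch of the rename described above: withColumnRenamed takes the existing column name first and the new name second, and it already returns a new DataFrame, so no extra copy() call is needed.

# Returns a new DataFrame with transactionId renamed to transactionNumber.
renamedDf = transactionsDf.withColumnRenamed("transactionId", "transactionNumber")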

Question #343

transactionsDf.select(count_to_target_udf('predError'))

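The code line above applies a Python UDF to column predError. A minimal sketch of how such a UDF might be defined and wrapped is shown below; the function body, the target value of 10, and the return type are illustrative assumptions, not taken from the original question.

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

def count_to_target(value):
    # Hypothetical logic: count from value up to (but excluding) an assumed target of 10.
    if value is None:
        return None
    return list(range(value, 10))

count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))
transactionsDf.select(count_to_target_udf('predError')).show()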

Question #344

Which of the following code blocks shuffles DataFrame transactionsDf, which has 8 partitions, so that it has 10 partitions?

  • A . transactionsDf.repartition(transactionsDf.getNumPartitions()+2)
  • B . transactionsDf.repartition(transactionsDf.rdd.getNumPartitions()+2)
  • C . transactionsDf.coalesce(10)
  • D . transactionsDf.coalesce(transactionsDf.getNumPartitions()+2)
  • E . transactionsDf.repartition(transactionsDf._partitions+2)

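A minimal sketch of the partition arithmetic behind the options above: the current partition count is exposed on the underlying RDD, not on the DataFrame itself, and repartition() performs a full shuffle.

current = transactionsDf.rdd.getNumPartitions()        # 8 in this question
transactionsDf = transactionsDf.repartition(current + 2)  # full shuffle; now 10 partitions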

Question #352

A file with the name shown below is stored at location filePath:

part-00003-tid-2754546451699747124-10eb85bf-8d91-4dd0-b60b-2f3c02eeecaa-301-1-c000.csv.gz

The code block below, which is meant to read this gzip-compressed CSV file, including its header, into a DataFrame, contains an error:

spark.option("header",True).csv(filePath)

Which of the following code blocks reads the file correctly?

  • A . spark.read.format("csv").option("header",True).option("compression","zip").load(filePath)
  • B . spark.read().option("header",True).load(filePath)
  • C . spark.read.format("csv").option("header",True).load(filePath)
  • D . spark.read.load(filePath)

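As a minimal sketch, assuming filePath points at the file listed above: Spark infers the gzip codec from the .gz extension, so only the header option needs to be set.

df = spark.read.format("csv").option("header", True).load(filePath)
# Equivalent shorthand via the csv() convenience method:
df = spark.read.csv(filePath, header=True)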

Question #353

Which of the following statements about stages is correct?

  • A . Different stages in a job may be executed in parallel.
  • B . Stages consist of one or more jobs.
  • C . Stages ephemerally store transactions, before they are committed through actions.
  • D . Tasks in a stage may be executed by multiple machines at the same time.
  • E . Stages may contain multiple actions, narrow, and wide transformations.


Question #354

Which of the following code blocks reads in parquet file /FileStore/imports.parquet as a DataFrame?

  • A . spark.mode("parquet").read("/FileStore/imports.parquet")
  • B . spark.read.path("/FileStore/imports.parquet", source="parquet")
  • C . spark.read().parquet("/FileStore/imports.parquet")
  • D . spark.read.parquet("/FileStore/imports.parquet")
  • E . spark.read().format('parquet').open("/FileStore/imports.parquet")

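A minimal sketch of the parquet read: spark.read is a property that returns a DataFrameReader, so it is not called with parentheses.

# DataFrameReader exposes a parquet() convenience method.
df = spark.read.parquet("/FileStore/imports.parquet")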

Question #356

Which of the following describes the difference between client and cluster execution modes?

  • A . In cluster mode, the driver runs on the worker nodes, while the client mode runs the driver on the client machine.
  • B . In cluster mode, the driver runs on the edge node, while the client mode runs the driver in a worker node.
  • C . In cluster mode, each node will launch its own executor, while in client mode, executors will exclusively run on the client machine.
  • D . In client mode, the cluster manager runs on the same host as the driver, while in cluster mode, the cluster manager runs on a separate node.
  • E . In cluster mode, the driver runs on the master node, while in client mode, the driver runs on a virtual machine in the cloud.


Question #357

Which of the following statements about Spark’s configuration properties is incorrect?

  • A . The maximum number of tasks that an executor can process at the same time is controlled by the spark.task.cpus property.
  • B . The maximum number of tasks that an executor can process at the same time is controlled by the spark.executor.cores property.
  • C . The default value for spark.sql.autoBroadcastJoinThreshold is 10MB.
  • D . The default number of partitions to use when shuffling data for joins or aggregations is 300.
  • E . The default number of partitions returned from certain transformations can be controlled by the spark.default.parallelism property.

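A minimal sketch of inspecting the properties named above at runtime via spark.conf (assuming an active SparkSession named spark):

print(spark.conf.get("spark.sql.shuffle.partitions"))          # shuffle partitions for joins/aggregations, default "200"
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))  # broadcast-join threshold, default 10MB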

Question #358

Which of the following describes a valid concern about partitioning?

  • A . A shuffle operation returns 200 partitions if not explicitly set.
  • B . Decreasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.
  • C . No data is exchanged between executors when coalesce() is run.
  • D . Short partition processing times are indicative of low skew.
  • E . The coalesce() method should be used to increase the number of partitions.

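A minimal sketch contrasting the two partition operators referenced above: coalesce() only merges existing partitions without a full shuffle and cannot increase the count, while repartition() shuffles and can change the count in either direction.

fewer = transactionsDf.coalesce(2)     # no full shuffle; merges existing partitions
more = transactionsDf.repartition(16)  # full shuffle; can increase the partition count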

Question #359

Which of the following code blocks selects all rows from DataFrame transactionsDf in which column productId is zero or smaller, or equal to 3?

  • A . transactionsDf.filter(productId==3 or productId<1)
  • B . transactionsDf.filter((col("productId")==3) or (col("productId")<1))
  • C . transactionsDf.filter(col("productId")==3 | col("productId")<1)
  • D . transactionsDf.where("productId"=3).or("productId"<1))
  • E . transactionsDf.filter((col("productId")==3) | (col("productId")<1))

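A minimal sketch of the column-expression syntax at issue: in Python, | binds more tightly than == and <, so each comparison must be wrapped in parentheses.

from pyspark.sql.functions import col

result = transactionsDf.filter((col("productId") == 3) | (col("productId") < 1))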

Question #361

Which of the following describes a difference between Spark’s cluster and client execution modes?

  • A . In cluster mode, the cluster manager resides on a worker node, while it resides on an edge node in client mode.
  • B . In cluster mode, executor processes run on worker nodes, while they run on gateway nodes in client mode.
  • C . In cluster mode, the driver resides on a worker node, while it resides on an edge node in client mode.
  • D . In cluster mode, a gateway machine hosts the driver, while it is co-located with the executor in client mode.
  • E . In cluster mode, the Spark driver is not co-located with the cluster manager, while it is co-located in client mode.

Reveal Solution Hide Solution

Question #362

Which of the following code blocks reads in the JSON file stored at filePath as a DataFrame?

  • A . spark.read.json(filePath)
  • B . spark.read.path(filePath, source="json")
  • C . spark.read().path(filePath)
  • D . spark.read().json(filePath)
  • E . spark.read.path(filePath)

Reveal Solution Hide Solution
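
For reference, a minimal sketch of the reader pattern this question tests, with filePath assumed to point at an existing JSON file: spark.read is a property, not a method, so it takes no parentheses.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-read").getOrCreate()
filePath = "/tmp/example.json"  # assumed path for illustration

df = spark.read.json(filePath)            # the pattern in option A
# Equivalent long form via the generic load API:
df2 = spark.read.format("json").load(filePath)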

Question #364

The code block displayed below contains an error. The code block should arrange the rows of DataFrame transactionsDf using information from two columns in an ordered fashion, arranging first by

column value, showing smaller numbers at the top and greater numbers at the bottom, and then by column predError, for which all values should be arranged in the inverse way of the order of items

in column value. Find the error.

Code block:

transactionsDf.orderBy('value', asc_nulls_first(col('predError')))

  • A . Two orderBy statements with calls to the individual columns should be chained, instead of having both columns in one orderBy statement.
  • B . Column value should be wrapped by the col() operator.
  • C . Column predError should be sorted in a descending way, putting nulls last.
  • D . Column predError should be sorted by desc_nulls_first() instead.
  • E . Instead of orderBy, sort should be used.

Reveal Solution Hide Solution
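
Direction-wise, the description amounts to sorting value ascending and predError descending. A hedged sketch with toy data (leaving aside null placement, which the correct option pins down):

from pyspark.sql import SparkSession
from pyspark.sql.functions import asc, desc

spark = SparkSession.builder.appName("orderby-demo").getOrCreate()
transactionsDf = spark.createDataFrame(
    [(1, 9.0), (1, 3.0), (2, 5.0)], ["value", "predError"])  # toy data

# Smaller `value` values first; within equal values, larger predError first.
transactionsDf.orderBy(asc("value"), desc("predError")).show()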

Question #365

Which of the following code blocks removes all rows in the 6-column DataFrame transactionsDf that have missing data in at least 3 columns?

  • A . transactionsDf.dropna("any")
  • B . transactionsDf.dropna(thresh=4)
  • C . transactionsDf.drop.na("",2)
  • D . transactionsDf.dropna(thresh=2)
  • E . transactionsDf.dropna("",4)

Reveal Solution Hide Solution
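
The trap here is that dropna()'s thresh parameter counts the minimum number of non-null values a row must retain, not the number of missing ones. With 6 columns, "missing in at least 3" means fewer than 4 non-nulls, hence thresh=4. A small sketch with made-up data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dropna-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, 2, 3, 4, None, None),      # 4 non-null values -> kept by thresh=4
     (1, None, None, None, 5, 6)],  # 3 non-null values -> dropped
    ["a", "b", "c", "d", "e", "f"])

df.dropna(thresh=4).show()  # keeps only the first row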

Question #367

Which of the following describes tasks?

  • A . A task is a command sent from the driver to the executors in response to a transformation.
  • B . Tasks transform jobs into DAGs.
  • C . A task is a collection of slots.
  • D . A task is a collection of rows.
  • E . Tasks get assigned to the executors by the driver.

Reveal Solution Hide Solution

Question #385

Which of the following is a problem with using accumulators?

  • A . Only unnamed accumulators can be inspected in the Spark UI.
  • B . Only numeric values can be used in accumulators.
  • C . Accumulator values can only be read by the driver, but not by executors.
  • D . Accumulators do not obey lazy evaluation.
  • E . Accumulators are difficult to use for debugging because they will only be updated once, regardless of whether a task has to be re-run due to hardware failure.

Reveal Solution Hide Solution
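
A minimal sketch of the accumulator contract referenced above, using a toy RDD: executor-side tasks can only add to an accumulator, and reading .value is meaningful only on the driver.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accumulator-demo").getOrCreate()
sc = spark.sparkContext

counter = sc.accumulator(0)  # numeric accumulator, starts at 0

def count_row(x):
    counter.add(1)  # executors may only write (add), never read

sc.parallelize(range(100)).foreach(count_row)  # foreach is an action
print(counter.value)  # 100 -- reading the value happens on the driver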

Question #386

The code block displayed below contains an error. The code block should combine data from DataFrames itemsDf and transactionsDf, showing all rows of DataFrame itemsDf that have a matching value in column itemId with a value in column transactionId of DataFrame transactionsDf.

Find the error.

Code block:

itemsDf.join(itemsDf.itemId==transactionsDf.transactionId)

  • A . The join statement is incomplete.
  • B . The union method should be used instead of join.
  • C . The join method is inappropriate.
  • D . The merge method should be used instead of join.
  • E . The join expression is malformed.

Reveal Solution Hide Solution
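
For contrast with the broken fragment, here is a hedged sketch of a complete join call on assumed toy frames: join() takes the other DataFrame first, then the join expression, which is exactly what the fragment omits.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-demo").getOrCreate()
itemsDf = spark.createDataFrame([(1, "pen"), (2, "ink")], ["itemId", "name"])
transactionsDf = spark.createDataFrame([(1,), (3,)], ["transactionId"])

# Other DataFrame first, then the join expression.
joined = itemsDf.join(transactionsDf,
                      itemsDf.itemId == transactionsDf.transactionId)
joined.show()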

Question #387

Which of the following statements about lazy evaluation is incorrect?

  • A . Predicate pushdown is a feature resulting from lazy evaluation.
  • B . Execution is triggered by transformations.
  • C . Spark will fail a job only during execution, but not during definition.
  • D . Accumulators do not change the lazy evaluation model of Spark.
  • E . Lineages allow Spark to coalesce transformations into stages.

Reveal Solution Hide Solution
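
A minimal demonstration of the lazy-evaluation model the options revolve around: transformations merely extend the plan, and nothing runs until an action is invoked.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

df = spark.range(1_000_000)
filtered = df.filter(col("id") % 2 == 0)            # transformation: nothing executes yet
doubled = filtered.withColumn("x2", col("id") * 2)  # still nothing executes

print(doubled.count())  # count() is an action -- execution starts here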

Question #388

Which of the following are valid execution modes?

  • A . Kubernetes, Local, Client
  • B . Client, Cluster, Local
  • C . Server, Standalone, Client
  • D . Cluster, Server, Local
  • E . Standalone, Client, Cluster

Reveal Solution Hide Solution

Question #389

Which of the following describes characteristics of the Dataset API?

  • A . The Dataset API does not support unstructured data.
  • B . In Python, the Dataset API mainly resembles Pandas’ DataFrame API.
  • C . In Python, the Dataset API’s schema is constructed via type hints.
  • D . The Dataset API is available in Scala, but it is not available in Python.
  • E . The Dataset API does not provide compile-time type safety.

Reveal Solution Hide Solution

Question #397

[The question text and the itemsDf table preview are missing from the source; the options below operate on the array column attributes of DataFrame itemsDf.]

  • A . itemsDf.withColumn('attributes', sort_array(col('attributes').desc()))
  • B . itemsDf.withColumn('attributes', sort_array(desc('attributes')))
  • C . itemsDf.withColumn('attributes', sort(col('attributes'), asc=False))
  • D . itemsDf.withColumn("attributes", sort_array("attributes", asc=False))
  • E . itemsDf.select(sort_array("attributes"))

Reveal Solution Hide Solution
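
Since the question's table preview did not survive, the sketch below uses made-up attributes values; it shows the sort_array(column, asc=False) pattern from option D for sorting an array column in descending order.

from pyspark.sql import SparkSession
from pyspark.sql.functions import sort_array

spark = SparkSession.builder.appName("sortarray-demo").getOrCreate()
itemsDf = spark.createDataFrame(
    [(1, ["blue", "winter", "cozy"])], ["itemId", "attributes"])

# sort_array takes the column and an asc flag; asc=False sorts descending.
itemsDf.withColumn("attributes", sort_array("attributes", asc=False)).show()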

Question #398

Which is the highest level in Spark’s execution hierarchy?

  • A . Task
  • B . Executor
  • C . Slot
  • D . Job
  • E . Stage

Reveal Solution Hide Solution

Question #410

[Only fragments of this question survive in the source.]

spark.sql("FROM transactionsDf SELECT predError, value WHERE transactionId % 2 = 2")

F. transactionsDf.filter(col(transactionId).isin([3,4,6]))

Reveal Solution Hide Solution

Question #411

Which of the following describes a way for resizing a DataFrame from 16 to 8 partitions in the most efficient way?

  • A . Use operation DataFrame.repartition(8) to shuffle the DataFrame and reduce the number of partitions.
  • B . Use operation DataFrame.coalesce(8) to fully shuffle the DataFrame and reduce the number of partitions.
  • C . Use a narrow transformation to reduce the number of partitions.
  • D . Use a wide transformation to reduce the number of partitions.
  • E . Use operation DataFrame.coalesce(0.5) to halve the number of partitions in the DataFrame.

Reveal Solution Hide Solution
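
A short sketch of the 16-to-8 resize: coalesce() merges existing partitions through a narrow dependency instead of a full shuffle, which is what makes it the cheaper choice over repartition(8) here.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-demo").getOrCreate()

df = spark.range(0, 1000, numPartitions=16)
print(df.rdd.getNumPartitions())       # 16

resized = df.coalesce(8)               # narrow transformation: no full shuffle
print(resized.rdd.getNumPartitions())  # 8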

Question #417

[Only a fragment of one answer option survives in the source.]

col(["transactionId", "predError", "value", "f"])

Reveal Solution Hide Solution

Question #423

[Only fragments of this question survive in the source.]

dfDates = dfDates.withColumnRenamed("date", to_datetime("date", "yyyy-MM-dd HH:mm:ss"))

E. dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])

Reveal Solution Hide Solution

Question #424

Which of the following is a characteristic of the cluster manager?

  • A . Each cluster manager works on a single partition of data.
  • B . The cluster manager receives input from the driver through the SparkContext.
  • C . The cluster manager does not exist in standalone mode.
  • D . The cluster manager transforms jobs into DAGs.
  • E . In client mode, the cluster manager runs on the edge node.

Reveal Solution Hide Solution

Question #425

Which of the following statements about DAGs is correct?

  • A . DAGs help direct how Spark executors process tasks, but are a limitation to the proper execution of a query when an executor fails.
  • B . DAG stands for "Directing Acyclic Graph".
  • C . Spark strategically hides DAGs from developers, since the high degree of automation in Spark means that developers never need to consider DAG layouts.
  • D . In contrast to transformations, DAGs are never lazily executed.
  • E . DAGs can be decomposed into tasks that are executed in parallel.

Reveal Solution Hide Solution
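
To see the plan Spark derives from a query's DAG of transformations, explain() can be used. In the sketch below (toy query, arbitrary app name), the shuffle introduced by groupBy marks a stage boundary, and each stage is split into tasks that run in parallel.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("dag-demo").getOrCreate()

df = (spark.range(100)
      .filter(col("id") > 10)
      .groupBy((col("id") % 3).alias("k"))
      .count())
df.explain()  # prints the physical plan; the exchange marks a stage boundary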
