Exam4Training

Databricks Certified Associate Developer for Apache Spark 3.0 Exam Online Training

Question #1

Which of the following code blocks silently writes DataFrame itemsDf in avro format to location fileLocation if a file does not yet exist at that location?

  • A . itemsDf.write.avro(fileLocation)
  • B . itemsDf.write.format("avro").mode("ignore").save(fileLocation)
  • C . itemsDf.write.format("avro").mode("errorifexists").save(fileLocation)
  • D . itemsDf.save.format("avro").mode("ignore").write(fileLocation)
  • E . spark.DataFrameWriter(itemsDf).format("avro").write(fileLocation)

Correct Answer: B

Explanation:

The trick in this question is knowing the modes of the DataFrameWriter. Mode ignore skips the write if data already exists at the target location: it neither replaces the existing data nor throws an error. Mode errorifexists, the default mode of the DataFrameWriter, throws an error instead. The question explicitly asks for the DataFrame to be written "silently" if no file exists yet, so you need to specify mode("ignore") here to keep Spark from reporting an error if the file already exists.

The overwrite mode would not be right here: although it would be silent, it would overwrite the already-existing file, which is not what the question asks for.

It is worth noting that the option starting with spark.DataFrameWriter(itemsDf) cannot work, since spark references the SparkSession object, and that object does not expose a DataFrameWriter. As the documentation (below) shows, DataFrameWriter is part of PySpark's SQL API, but not of its SparkSession API.

More info:

DataFrameWriter: pyspark.sql.DataFrameWriter.save ― PySpark 3.1.1 documentation

SparkSession API: Spark SQL ― PySpark 3.1.1 documentation

Static notebook | Dynamic notebook: See test 1, 59. (Databricks import instructions)
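
To make the difference between the writer modes concrete, here is a minimal sketch. The names itemsDf and fileLocation come from the question; the sample rows, column names, and path are assumptions added for illustration, and writing Avro requires the external spark-avro package.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data; any DataFrame and writable path would do.
itemsDf = spark.createDataFrame([(1, "sandals"), (2, "boots")], ["itemId", "itemName"])
fileLocation = "/tmp/items_avro"

# mode("ignore"): silently skip the write if data already exists at fileLocation.
itemsDf.write.format("avro").mode("ignore").save(fileLocation)

# mode("errorifexists") is the default and would raise an error instead:
# itemsDf.write.format("avro").mode("errorifexists").save(fileLocation)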

Question #2

Which of the elements that are labeled with a circle and a number contain an error or are misrepresented?

  • A . 1, 10
  • B . 1, 8
  • C . 10
  • D . 7, 9, 10
  • E . 1, 4, 6, 9

Correct Answer: B

Explanation:

1: Correct. This element should just read "API" or "DataFrame API". The DataFrame is not part of the SQL API. To make a DataFrame accessible via SQL, you first need to create a DataFrame view; that view can then be accessed via SQL.

4: Although "K_38_INU" looks odd, it is a completely valid name for a DataFrame column.

6: No, StringType is a correct type.

7: Although a StringType may not be the most efficient way to store a phone number, there is nothing fundamentally wrong with using this type here.

8: Correct. TreeType is not a type that Spark supports.

9: No, Spark DataFrames support ArrayType variables. In this case, the variable would represent a sequence of elements with type LongType, which is also a valid type for Spark DataFrames.

10: There is nothing wrong with this row.

More info: Data Types – Spark 3.1.1 Documentation (https://bit.ly/3aAPKJT)
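
Since the labeled diagram is not reproduced here, the following is a small, hypothetical schema sketch illustrating the points above. Only the column name K_38_INU is taken from the explanation; the other column names are illustrative assumptions.

from pyspark.sql.types import StructType, StructField, StringType, LongType, ArrayType

# Hypothetical schema built only to demonstrate the valid types discussed above.
schema = StructType([
    StructField("K_38_INU", StringType(), True),               # odd-looking but valid column name
    StructField("phoneNumber", StringType(), True),            # StringType is fine for phone numbers
    StructField("measurements", ArrayType(LongType()), True),  # arrays of LongType are supported
])
# There is no TreeType in pyspark.sql.types, so a schema referencing it cannot be built.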

Question #30

5

Correct Answer: A

Explanation:

This question deals with the parameters of Spark’s split operator for strings. To solve it, you first need to understand the difference between DataFrame.withColumn() and DataFrame.withColumnRenamed(). The correct option here is DataFrame.withColumn(), since, according to the question, we want to add a column and not rename an existing column. This leaves you with only 3 answers to consider.

The second gap should be filled with the name of the new column to be added to the DataFrame. One of the remaining answers states the column name as itemNameBetweenSeparators, while the other two state it as "itemNameBetweenSeparators". The correct option here is "itemNameBetweenSeparators", since the other option would let Python try to interpret itemNameBetweenSeparators as the name of a variable, which we have not defined. This leaves you with 2 answers to consider.

The decision boils down to how to fill gap 5: either with 4 or with 5. The question asks for arrays of at most four strings. The code in gap 5 relates to the limit parameter of Spark’s split operator (see documentation linked below). The documentation states that "the resulting array’s length will not be more than limit", meaning that we should pick the answer option with 4 as the code in the fifth gap.

On a side note: one answer option includes a function str_split. This function does not exist in PySpark.

More info: pyspark.sql.functions.split ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3,
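
The full code block with its gaps is not reproduced above, but based on the explanation, a hedged sketch of the kind of expression being completed could look like this. The sample data, source column itemName, and the "," separator are assumptions, not taken from the original gap text.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed example data and separator; the real question defines these in its gaps.
itemsDf = spark.createDataFrame([("a,b,c,d,e",)], ["itemName"])

itemsDf = itemsDf.withColumn(
    "itemNameBetweenSeparators",
    F.split("itemName", ",", 4)  # limit=4: the resulting array holds at most four strings
)
itemsDf.show(truncate=False)     # the last element keeps the remainder: ["a", "b", "c", "d,e"]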

Question #32

Which of the following code blocks displays the 10 rows with the smallest values of column value in DataFrame transactionsDf in a nicely formatted way?

  • A . transactionsDf.sort(asc(value)).show(10)
  • B . transactionsDf.sort(col("value")).show(10)
  • C . transactionsDf.sort(col("value").desc()).head()
  • D . transactionsDf.sort(col("value").asc()).print(10)
  • E . transactionsDf.orderBy("value").asc().show(10)

Correct Answer: B

Explanation:

show() is the correct method to look for here, since the question specifically asks for displaying the rows in a nicely formatted way. Here is the output of show() (only a few rows shown):

+-------------+---------+-----+-------+---------+----+---------------+
|transactionId|predError|value|storeId|productId|   f|transactionDate|
+-------------+---------+-----+-------+---------+----+---------------+
|            3|        3|    1|     25|        3|null|     1585824821|
|            5|     null|    2|   null|        2|null|     1575285427|
|            4|     null|    3|      3|        2|null|     1583244275|
+-------------+---------+-----+-------+---------+----+---------------+

With regard to the sorting, specifically in ascending order since the smallest values should be shown first, the following expressions are all valid:

– transactionsDf.sort(col("value")) ("ascending" is the default sort direction in the sort method)

– transactionsDf.sort(asc(col("value")))

– transactionsDf.sort(asc("value"))

– transactionsDf.sort(transactionsDf.value.asc())

– transactionsDf.sort(transactionsDf.value)

Also, orderBy is just an alias of sort, so all of these expressions work equally well using orderBy.

Static notebook | Dynamic notebook: See test 1,
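
A short sketch of the correct option and a few of the equivalent ascending sorts listed above; transactionsDf is assumed to exist with a value column, as in the question.

from pyspark.sql import functions as F

# Correct option: ascending is the default sort direction, show() pretty-prints the rows.
transactionsDf.sort(F.col("value")).show(10)

# Equivalent formulations:
transactionsDf.sort(F.asc("value")).show(10)
transactionsDf.sort(transactionsDf.value.asc()).show(10)
transactionsDf.orderBy("value").show(10)  # orderBy is an alias of sort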

Question #34

Which of the following code blocks can be used to save DataFrame transactionsDf to memory only, recalculating partitions that do not fit in memory when they are needed?

  • A . from pyspark import StorageLevel transactionsDf.cache(StorageLevel.MEMORY_ONLY)
  • B . transactionsDf.cache()
  • C . transactionsDf.storage_level('MEMORY_ONLY')
  • D . transactionsDf.persist()
  • E . transactionsDf.clear_persist()
  • F . from pyspark import StorageLevel transactionsDf.persist(StorageLevel.MEMORY_ONLY)

Correct Answer: F

Explanation:

from pyspark import StorageLevel
transactionsDf.persist(StorageLevel.MEMORY_ONLY)

Correct. Note that the storage level MEMORY_ONLY means that all partitions that do not fit into memory will be recomputed when they are needed.

transactionsDf.cache()

This is wrong because the default storage level of DataFrame.cache() is MEMORY_AND_DISK, meaning that partitions that do not fit into memory are stored on disk.

transactionsDf.persist()

This is wrong because the default storage level of DataFrame.persist() is MEMORY_AND_DISK.

transactionsDf.clear_persist()

Incorrect, since clear_persist() is not a method of DataFrame.

transactionsDf.storage_level('MEMORY_ONLY')

Wrong. storage_level is not a method of DataFrame.

More info: RDD Programming Guide – Spark 3.0.0 Documentation, pyspark.sql.DataFrame.persist ― PySpark 3.0.0 documentation (https://bit.ly/3sxHLVC , https://bit.ly/3j2N6B9)
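
A short sketch of the correct option, contrasted with the MEMORY_AND_DISK defaults; transactionsDf is assumed to exist, as in the question.

from pyspark import StorageLevel

# Correct: keep partitions in memory only; anything that does not fit is
# recomputed from lineage when it is needed again.
transactionsDf.persist(StorageLevel.MEMORY_ONLY)

# Both of these default to MEMORY_AND_DISK and would spill to disk instead:
# transactionsDf.cache()
# transactionsDf.persist()

print(transactionsDf.storageLevel)  # inspect the storage level that was applied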

Question #34

Which of the following code blocks can be used to save DataFrame transactionsDf to memory only, recalculating partitions that do not fit in memory when they are needed?

  • A . from pyspark import StorageLevel transactionsDf.cache(StorageLevel.MEMORY_ONLY)
  • B . transactionsDf.cache()
  • C . transactionsDf.storage_level(‘MEMORY_ONLY’)
  • D . transactionsDf.persist()
  • E . transactionsDf.clear_persist()
  • F . from pyspark import StorageLevel transactionsDf.persist(StorageLevel.MEMORY_ONLY)

Reveal Solution Hide Solution

Correct Answer: F
F

Explanation:

from pyspark import StorageLevel transactionsDf.persist(StorageLevel.MEMORY_ONLY) Correct. Note that the storage level MEMORY_ONLY means that all partitions that do not fit into memory will be recomputed when they are needed. transactionsDf.cache()

This is wrong because the default storage level of DataFrame.cache() is

MEMORY_AND_DISK, meaning that partitions that do not fit into memory are stored on disk.

transactionsDf.persist()

This is wrong because the default storage level of DataFrame.persist() is

MEMORY_AND_DISK.

transactionsDf.clear_persist()

Incorrect, since clear_persist() is not a method of DataFrame.

transactionsDf.storage_level(‘MEMORY_ONLY’)

Wrong. storage_level is not a method of DataFrame.

More info: RDD Programming Guide – Spark 3.0.0 Documentation, pyspark.sql.DataFrame.persist ― PySpark 3.0.0 documentation (https://bit.ly/3sxHLVC , https://bit.ly/3j2N6B9)

Question #34

Which of the following code blocks can be used to save DataFrame transactionsDf to memory only, recalculating partitions that do not fit in memory when they are needed?

  • A . from pyspark import StorageLevel transactionsDf.cache(StorageLevel.MEMORY_ONLY)
  • B . transactionsDf.cache()
  • C . transactionsDf.storage_level(‘MEMORY_ONLY’)
  • D . transactionsDf.persist()
  • E . transactionsDf.clear_persist()
  • F . from pyspark import StorageLevel transactionsDf.persist(StorageLevel.MEMORY_ONLY)

Reveal Solution Hide Solution

Correct Answer: F
F

Explanation:

from pyspark import StorageLevel transactionsDf.persist(StorageLevel.MEMORY_ONLY) Correct. Note that the storage level MEMORY_ONLY means that all partitions that do not fit into memory will be recomputed when they are needed. transactionsDf.cache()

This is wrong because the default storage level of DataFrame.cache() is

MEMORY_AND_DISK, meaning that partitions that do not fit into memory are stored on disk.

transactionsDf.persist()

This is wrong because the default storage level of DataFrame.persist() is

MEMORY_AND_DISK.

transactionsDf.clear_persist()

Incorrect, since clear_persist() is not a method of DataFrame.

transactionsDf.storage_level(‘MEMORY_ONLY’)

Wrong. storage_level is not a method of DataFrame.

More info: RDD Programming Guide – Spark 3.0.0 Documentation, pyspark.sql.DataFrame.persist ― PySpark 3.0.0 documentation (https://bit.ly/3sxHLVC , https://bit.ly/3j2N6B9)

Question #34

Which of the following code blocks can be used to save DataFrame transactionsDf to memory only, recalculating partitions that do not fit in memory when they are needed?

  • A . from pyspark import StorageLevel transactionsDf.cache(StorageLevel.MEMORY_ONLY)
  • B . transactionsDf.cache()
  • C . transactionsDf.storage_level(‘MEMORY_ONLY’)
  • D . transactionsDf.persist()
  • E . transactionsDf.clear_persist()
  • F . from pyspark import StorageLevel transactionsDf.persist(StorageLevel.MEMORY_ONLY)

Reveal Solution Hide Solution

Correct Answer: F
F

Explanation:

from pyspark import StorageLevel transactionsDf.persist(StorageLevel.MEMORY_ONLY) Correct. Note that the storage level MEMORY_ONLY means that all partitions that do not fit into memory will be recomputed when they are needed. transactionsDf.cache()

This is wrong because the default storage level of DataFrame.cache() is

MEMORY_AND_DISK, meaning that partitions that do not fit into memory are stored on disk.

transactionsDf.persist()

This is wrong because the default storage level of DataFrame.persist() is

MEMORY_AND_DISK.

transactionsDf.clear_persist()

Incorrect, since clear_persist() is not a method of DataFrame.

transactionsDf.storage_level(‘MEMORY_ONLY’)

Wrong. storage_level is not a method of DataFrame.

More info: RDD Programming Guide – Spark 3.0.0 Documentation, pyspark.sql.DataFrame.persist ― PySpark 3.0.0 documentation (https://bit.ly/3sxHLVC , https://bit.ly/3j2N6B9)

Question #34

Which of the following code blocks can be used to save DataFrame transactionsDf to memory only, recalculating partitions that do not fit in memory when they are needed?

  • A . from pyspark import StorageLevel transactionsDf.cache(StorageLevel.MEMORY_ONLY)
  • B . transactionsDf.cache()
  • C . transactionsDf.storage_level(‘MEMORY_ONLY’)
  • D . transactionsDf.persist()
  • E . transactionsDf.clear_persist()
  • F . from pyspark import StorageLevel transactionsDf.persist(StorageLevel.MEMORY_ONLY)

Reveal Solution Hide Solution

Correct Answer: F
F

Explanation:

from pyspark import StorageLevel transactionsDf.persist(StorageLevel.MEMORY_ONLY) Correct. Note that the storage level MEMORY_ONLY means that all partitions that do not fit into memory will be recomputed when they are needed. transactionsDf.cache()

This is wrong because the default storage level of DataFrame.cache() is

MEMORY_AND_DISK, meaning that partitions that do not fit into memory are stored on disk.

transactionsDf.persist()

This is wrong because the default storage level of DataFrame.persist() is

MEMORY_AND_DISK.

transactionsDf.clear_persist()

Incorrect, since clear_persist() is not a method of DataFrame.

transactionsDf.storage_level(‘MEMORY_ONLY’)

Wrong. storage_level is not a method of DataFrame.

More info: RDD Programming Guide – Spark 3.0.0 Documentation, pyspark.sql.DataFrame.persist ― PySpark 3.0.0 documentation (https://bit.ly/3sxHLVC , https://bit.ly/3j2N6B9)

Question #34

Which of the following code blocks can be used to save DataFrame transactionsDf to memory only, recalculating partitions that do not fit in memory when they are needed?

  • A . from pyspark import StorageLevel transactionsDf.cache(StorageLevel.MEMORY_ONLY)
  • B . transactionsDf.cache()
  • C . transactionsDf.storage_level(‘MEMORY_ONLY’)
  • D . transactionsDf.persist()
  • E . transactionsDf.clear_persist()
  • F . from pyspark import StorageLevel transactionsDf.persist(StorageLevel.MEMORY_ONLY)

Reveal Solution Hide Solution

Correct Answer: F
F

Explanation:

from pyspark import StorageLevel transactionsDf.persist(StorageLevel.MEMORY_ONLY) Correct. Note that the storage level MEMORY_ONLY means that all partitions that do not fit into memory will be recomputed when they are needed. transactionsDf.cache()

This is wrong because the default storage level of DataFrame.cache() is

MEMORY_AND_DISK, meaning that partitions that do not fit into memory are stored on disk.

transactionsDf.persist()

This is wrong because the default storage level of DataFrame.persist() is

MEMORY_AND_DISK.

transactionsDf.clear_persist()

Incorrect, since clear_persist() is not a method of DataFrame.

transactionsDf.storage_level(‘MEMORY_ONLY’)

Wrong. storage_level is not a method of DataFrame.

More info: RDD Programming Guide – Spark 3.0.0 Documentation, pyspark.sql.DataFrame.persist ― PySpark 3.0.0 documentation (https://bit.ly/3sxHLVC , https://bit.ly/3j2N6B9)

Question #34

Which of the following code blocks can be used to save DataFrame transactionsDf to memory only, recalculating partitions that do not fit in memory when they are needed?

  • A . from pyspark import StorageLevel transactionsDf.cache(StorageLevel.MEMORY_ONLY)
  • B . transactionsDf.cache()
  • C . transactionsDf.storage_level(‘MEMORY_ONLY’)
  • D . transactionsDf.persist()
  • E . transactionsDf.clear_persist()
  • F . from pyspark import StorageLevel transactionsDf.persist(StorageLevel.MEMORY_ONLY)

Reveal Solution Hide Solution

Correct Answer: F
F

Explanation:

from pyspark import StorageLevel transactionsDf.persist(StorageLevel.MEMORY_ONLY) Correct. Note that the storage level MEMORY_ONLY means that all partitions that do not fit into memory will be recomputed when they are needed. transactionsDf.cache()

This is wrong because the default storage level of DataFrame.cache() is

MEMORY_AND_DISK, meaning that partitions that do not fit into memory are stored on disk.

transactionsDf.persist()

This is wrong because the default storage level of DataFrame.persist() is

MEMORY_AND_DISK.

transactionsDf.clear_persist()

Incorrect, since clear_persist() is not a method of DataFrame.

transactionsDf.storage_level(‘MEMORY_ONLY’)

Wrong. storage_level is not a method of DataFrame.

More info: RDD Programming Guide – Spark 3.0.0 Documentation, pyspark.sql.DataFrame.persist ― PySpark 3.0.0 documentation (https://bit.ly/3sxHLVC , https://bit.ly/3j2N6B9)

Question #34

Which of the following code blocks can be used to save DataFrame transactionsDf to memory only, recalculating partitions that do not fit in memory when they are needed?

  • A . from pyspark import StorageLevel transactionsDf.cache(StorageLevel.MEMORY_ONLY)
  • B . transactionsDf.cache()
  • C . transactionsDf.storage_level(‘MEMORY_ONLY’)
  • D . transactionsDf.persist()
  • E . transactionsDf.clear_persist()
  • F . from pyspark import StorageLevel transactionsDf.persist(StorageLevel.MEMORY_ONLY)

Reveal Solution Hide Solution

Correct Answer: F
F

Explanation:

from pyspark import StorageLevel transactionsDf.persist(StorageLevel.MEMORY_ONLY) Correct. Note that the storage level MEMORY_ONLY means that all partitions that do not fit into memory will be recomputed when they are needed. transactionsDf.cache()

This is wrong because the default storage level of DataFrame.cache() is

MEMORY_AND_DISK, meaning that partitions that do not fit into memory are stored on disk.

transactionsDf.persist()

This is wrong because the default storage level of DataFrame.persist() is

MEMORY_AND_DISK.

transactionsDf.clear_persist()

Incorrect, since clear_persist() is not a method of DataFrame.

transactionsDf.storage_level(‘MEMORY_ONLY’)

Wrong. storage_level is not a method of DataFrame.

More info: RDD Programming Guide – Spark 3.0.0 Documentation, pyspark.sql.DataFrame.persist ― PySpark 3.0.0 documentation (https://bit.ly/3sxHLVC , https://bit.ly/3j2N6B9)

Question #34

Which of the following code blocks can be used to save DataFrame transactionsDf to memory only, recalculating partitions that do not fit in memory when they are needed?

  • A . from pyspark import StorageLevel transactionsDf.cache(StorageLevel.MEMORY_ONLY)
  • B . transactionsDf.cache()
  • C . transactionsDf.storage_level(‘MEMORY_ONLY’)
  • D . transactionsDf.persist()
  • E . transactionsDf.clear_persist()
  • F . from pyspark import StorageLevel transactionsDf.persist(StorageLevel.MEMORY_ONLY)

Reveal Solution Hide Solution

Correct Answer: F
F

Explanation:

from pyspark import StorageLevel transactionsDf.persist(StorageLevel.MEMORY_ONLY) Correct. Note that the storage level MEMORY_ONLY means that all partitions that do not fit into memory will be recomputed when they are needed. transactionsDf.cache()

This is wrong because the default storage level of DataFrame.cache() is

MEMORY_AND_DISK, meaning that partitions that do not fit into memory are stored on disk.

transactionsDf.persist()

This is wrong because the default storage level of DataFrame.persist() is

MEMORY_AND_DISK.

transactionsDf.clear_persist()

Incorrect, since clear_persist() is not a method of DataFrame.

transactionsDf.storage_level(‘MEMORY_ONLY’)

Wrong. storage_level is not a method of DataFrame.

More info: RDD Programming Guide – Spark 3.0.0 Documentation, pyspark.sql.DataFrame.persist ― PySpark 3.0.0 documentation (https://bit.ly/3sxHLVC , https://bit.ly/3j2N6B9)

Question #34

Which of the following code blocks can be used to save DataFrame transactionsDf to memory only, recalculating partitions that do not fit in memory when they are needed?

  • A . from pyspark import StorageLevel transactionsDf.cache(StorageLevel.MEMORY_ONLY)
  • B . transactionsDf.cache()
  • C . transactionsDf.storage_level(‘MEMORY_ONLY’)
  • D . transactionsDf.persist()
  • E . transactionsDf.clear_persist()
  • F . from pyspark import StorageLevel transactionsDf.persist(StorageLevel.MEMORY_ONLY)

Reveal Solution Hide Solution

Correct Answer: F
F

Explanation:

from pyspark import StorageLevel transactionsDf.persist(StorageLevel.MEMORY_ONLY) Correct. Note that the storage level MEMORY_ONLY means that all partitions that do not fit into memory will be recomputed when they are needed. transactionsDf.cache()

This is wrong because the default storage level of DataFrame.cache() is

MEMORY_AND_DISK, meaning that partitions that do not fit into memory are stored on disk.

transactionsDf.persist()

This is wrong because the default storage level of DataFrame.persist() is

MEMORY_AND_DISK.

transactionsDf.clear_persist()

Incorrect, since clear_persist() is not a method of DataFrame.

transactionsDf.storage_level(‘MEMORY_ONLY’)

Wrong. storage_level is not a method of DataFrame.

More info: RDD Programming Guide – Spark 3.0.0 Documentation, pyspark.sql.DataFrame.persist ― PySpark 3.0.0 documentation (https://bit.ly/3sxHLVC , https://bit.ly/3j2N6B9)

Question #55

spark.read.json(filePath, schema=schema)

C. spark.read.json(filePath, schema=schema_of_json(json_schema))

D. spark.read.json(filePath, schema=spark.read.json(json_schema))

Reveal Solution Hide Solution

Correct Answer: B

Explanation:

Spark provides a way to digest JSON-formatted strings as schemas. However, it is not trivial to use. Although slightly above exam difficulty, this question is beneficial to your exam preparation, since it helps you to familiarize yourself with the concept of enforcing schemas on data you are reading in – a topic within the scope of the exam.

The first answer that jumps out here is the one that uses spark.read.schema instead of spark.read.json. Looking at the documentation of spark.read.schema (linked below), we notice that the operator expects types pyspark.sql.types.StructType or str as its first argument. While variable json_schema is a string, the documentation states that the str should be "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". Variable json_schema does not contain a string in this type of format, so this answer option must be wrong.

With four potentially correct answers to go, we now look at the schema parameter of spark.read.json() (documentation linked below). Here, too, the schema parameter expects an input of type pyspark.sql.types.StructType or "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". We already know that json_schema does not follow this format, so we should focus on how we can transform json_schema into pyspark.sql.types.StructType. Hereby, we also eliminate the option where schema=json_schema.

The option that includes schema=spark.read.json(json_schema) is also a wrong pick, since spark.read.json returns a DataFrame, and not a pyspark.sql.types.StructType type.

Ruling out the option which includes schema_of_json(json_schema) is rather difficult. The operator’s documentation (linked below) states that it "[p]arses a JSON string and infers its schema in DDL format". This use case is slightly different from the case at hand: json_schema already is a schema definition, it does not make sense to "infer" a schema from it. In the documentation you can see an example use case which helps you understand the difference better. Here, you pass string ‘{a: 1}’ to schema_of_json() and the method infers a DDL-format schema STRUCT<a: BIGINT> from it.

In our case, we may end up with the output schema of schema_of_json() describing the schema of the JSON schema, instead of using the schema itself. This is not the right answer option.
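
As a quick, hypothetical illustration of that distinction (the SparkSession setup below is assumed, not taken from the question), schema_of_json() infers a schema from a JSON data string rather than parsing an existing schema definition:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, schema_of_json

spark = SparkSession.builder.appName("schema-of-json-demo").getOrCreate()

# Infers the schema of this particular JSON document and returns it as a DDL string
ddl = spark.range(1).select(schema_of_json(lit('{"a": 1}'))).first()[0]
print(ddl)  # something like struct<a:bigint>, depending on the Spark version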

Now you may consider looking at the StructType.fromJson() method. It returns a variable of type StructType – exactly the type which the schema parameter of spark.read.json expects. Although we could have looked at the correct answer option earlier, this explanation is kept as exhaustive as necessary to teach you how to systematically eliminate wrong answer options.
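
Below is a hedged sketch of that approach; the JSON schema string and the file path are hypothetical stand-ins for the json_schema and filePath variables referenced in the question:

import json

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.appName("fromjson-demo").getOrCreate()

# Hypothetical JSON-formatted schema definition, standing in for json_schema
json_schema = """
{"type": "struct", "fields": [
  {"name": "itemId", "type": "long", "nullable": true, "metadata": {}},
  {"name": "itemName", "type": "string", "nullable": true, "metadata": {}}]}
"""

# StructType.fromJson expects a parsed dict, so run the string through json.loads first
schema = StructType.fromJson(json.loads(json_schema))

filePath = "/tmp/items.json"  # hypothetical location
itemsDf = spark.read.json(filePath, schema=schema)
itemsDf.printSchema()

json.loads() turns the JSON text into the Python dictionary that fromJson() expects; passing the raw string directly would fail.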

More info:

– pyspark.sql.DataFrameReader.schema ― PySpark 3.1.2 documentation

– pyspark.sql.DataFrameReader.json ― PySpark 3.1.2 documentation

– pyspark.sql.functions.schema_of_json ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3,

Question #55

spark.read.json(filePath, schema=schema)

C. spark.read.json(filePath, schema=schema_of_json(json_schema))

D. spark.read.json(filePath, schema=spark.read.json(json_schema))

Reveal Solution Hide Solution

Correct Answer: B

Explanation:

Spark provides a way to digest JSON-formatted strings as schema. However, it is not trivial to use. Although slightly above exam difficulty, this QUESTION NO: is beneficial to your exam preparation, since

it helps you to familiarize yourself with the concept of enforcing schemas on data you are reading in – a topic within the scope of the exam.

The first answer that jumps out here is the one that uses spark.read.schema instead of spark.read.json. Looking at the documentation of spark.read.schema (linked below), we notice that the operator expects types pyspark.sql.types.StructType or str as its first argument. While variable json_schema is a string, the documentation states that the str should be "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". Variable json_schema does not contain a string in this type of format, so this answer option must be wrong.

With four potentially correct answers to go, we now look at the schema parameter of spark.read.json() (documentation linked below). Here, too, the schema parameter expects an input of type pyspark.sql.types.StructType or "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". We already know that json_schema does not follow this format, so we should focus on how we can transform json_schema into pyspark.sql.types.StructType. Hereby, we also eliminate the option where schema=json_schema.

The option that includes schema=spark.read.json(json_schema) is also a wrong pick, since spark.read.json returns a DataFrame, and not a pyspark.sql.types.StructType type.

Ruling out the option which includes schema_of_json(json_schema) is rather difficult. The operator’s documentation (linked below) states that it "[p]arses a JSON string and infers its schema in DDL format". This use case is slightly different from the case at hand: json_schema already is a schema definition, it does not make sense to "infer" a schema from it. In the documentation you can see an example use case which helps you understand the difference better. Here, you pass string ‘{a: 1}’ to schema_of_json() and the method infers a DDL-format schema STRUCT<a: BIGINT> from it.

In our case, we may end up with the output schema of schema_of_json() describing the schema of the JSON schema, instead of using the schema itself. This is not the right answer option.

Now you may consider looking at the StructType.fromJson() method. It returns a variable of type StructType – exactly the type which the schema parameter of spark.read.json expects. Although we could have looked at the correct answer option earlier, this explanation is kept as exhaustive as necessary to teach you how to systematically eliminate wrong answer options.

More info:

– pyspark.sql.DataFrameReader.schema ― PySpark 3.1.2 documentation

– pyspark.sql.DataFrameReader.json ― PySpark 3.1.2 documentation

– pyspark.sql.functions.schema_of_json ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3,

Question #55

spark.read.json(filePath, schema=schema)

C. spark.read.json(filePath, schema=schema_of_json(json_schema))

D. spark.read.json(filePath, schema=spark.read.json(json_schema))

Reveal Solution Hide Solution

Correct Answer: B

Explanation:

Spark provides a way to digest JSON-formatted strings as schema. However, it is not trivial to use. Although slightly above exam difficulty, this QUESTION NO: is beneficial to your exam preparation, since

it helps you to familiarize yourself with the concept of enforcing schemas on data you are reading in – a topic within the scope of the exam.

The first answer that jumps out here is the one that uses spark.read.schema instead of spark.read.json. Looking at the documentation of spark.read.schema (linked below), we notice that the operator expects types pyspark.sql.types.StructType or str as its first argument. While variable json_schema is a string, the documentation states that the str should be "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". Variable json_schema does not contain a string in this type of format, so this answer option must be wrong.

With four potentially correct answers to go, we now look at the schema parameter of spark.read.json() (documentation linked below). Here, too, the schema parameter expects an input of type pyspark.sql.types.StructType or "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". We already know that json_schema does not follow this format, so we should focus on how we can transform json_schema into pyspark.sql.types.StructType. Hereby, we also eliminate the option where schema=json_schema.

The option that includes schema=spark.read.json(json_schema) is also a wrong pick, since spark.read.json returns a DataFrame, and not a pyspark.sql.types.StructType type.

Ruling out the option which includes schema_of_json(json_schema) is rather difficult. The operator’s documentation (linked below) states that it "[p]arses a JSON string and infers its schema in DDL format". This use case is slightly different from the case at hand: json_schema already is a schema definition, it does not make sense to "infer" a schema from it. In the documentation you can see an example use case which helps you understand the difference better. Here, you pass string ‘{a: 1}’ to schema_of_json() and the method infers a DDL-format schema STRUCT<a: BIGINT> from it.

In our case, we may end up with the output schema of schema_of_json() describing the schema of the JSON schema, instead of using the schema itself. This is not the right answer option.

Now you may consider looking at the StructType.fromJson() method. It returns a variable of type StructType – exactly the type which the schema parameter of spark.read.json expects. Although we could have looked at the correct answer option earlier, this explanation is kept as exhaustive as necessary to teach you how to systematically eliminate wrong answer options.

More info:

– pyspark.sql.DataFrameReader.schema ― PySpark 3.1.2 documentation

– pyspark.sql.DataFrameReader.json ― PySpark 3.1.2 documentation

– pyspark.sql.functions.schema_of_json ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3,

Question #55

spark.read.json(filePath, schema=schema)

C. spark.read.json(filePath, schema=schema_of_json(json_schema))

D. spark.read.json(filePath, schema=spark.read.json(json_schema))

Reveal Solution Hide Solution

Correct Answer: B

Explanation:

Spark provides a way to digest JSON-formatted strings as schema. However, it is not trivial to use. Although slightly above exam difficulty, this QUESTION NO: is beneficial to your exam preparation, since

it helps you to familiarize yourself with the concept of enforcing schemas on data you are reading in – a topic within the scope of the exam.

The first answer that jumps out here is the one that uses spark.read.schema instead of spark.read.json. Looking at the documentation of spark.read.schema (linked below), we notice that the operator expects types pyspark.sql.types.StructType or str as its first argument. While variable json_schema is a string, the documentation states that the str should be "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". Variable json_schema does not contain a string in this type of format, so this answer option must be wrong.

With four potentially correct answers to go, we now look at the schema parameter of spark.read.json() (documentation linked below). Here, too, the schema parameter expects an input of type pyspark.sql.types.StructType or "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". We already know that json_schema does not follow this format, so we should focus on how we can transform json_schema into pyspark.sql.types.StructType. Hereby, we also eliminate the option where schema=json_schema.

The option that includes schema=spark.read.json(json_schema) is also a wrong pick, since spark.read.json returns a DataFrame, and not a pyspark.sql.types.StructType type.

Ruling out the option which includes schema_of_json(json_schema) is rather difficult. The operator’s documentation (linked below) states that it "[p]arses a JSON string and infers its schema in DDL format". This use case is slightly different from the case at hand: json_schema already is a schema definition, it does not make sense to "infer" a schema from it. In the documentation you can see an example use case which helps you understand the difference better. Here, you pass string ‘{a: 1}’ to schema_of_json() and the method infers a DDL-format schema STRUCT<a: BIGINT> from it.

In our case, we may end up with the output schema of schema_of_json() describing the schema of the JSON schema, instead of using the schema itself. This is not the right answer option.

Now you may consider looking at the StructType.fromJson() method. It returns a variable of type StructType – exactly the type which the schema parameter of spark.read.json expects. Although we could have looked at the correct answer option earlier, this explanation is kept as exhaustive as necessary to teach you how to systematically eliminate wrong answer options.

More info:

– pyspark.sql.DataFrameReader.schema ― PySpark 3.1.2 documentation

– pyspark.sql.DataFrameReader.json ― PySpark 3.1.2 documentation

– pyspark.sql.functions.schema_of_json ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3,

Question #55

spark.read.json(filePath, schema=schema)

C. spark.read.json(filePath, schema=schema_of_json(json_schema))

D. spark.read.json(filePath, schema=spark.read.json(json_schema))

Reveal Solution Hide Solution

Correct Answer: B

Explanation:

Spark provides a way to digest JSON-formatted strings as schema. However, it is not trivial to use. Although slightly above exam difficulty, this QUESTION NO: is beneficial to your exam preparation, since

it helps you to familiarize yourself with the concept of enforcing schemas on data you are reading in – a topic within the scope of the exam.

The first answer that jumps out here is the one that uses spark.read.schema instead of spark.read.json. Looking at the documentation of spark.read.schema (linked below), we notice that the operator expects types pyspark.sql.types.StructType or str as its first argument. While variable json_schema is a string, the documentation states that the str should be "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". Variable json_schema does not contain a string in this type of format, so this answer option must be wrong.

With four potentially correct answers to go, we now look at the schema parameter of spark.read.json() (documentation linked below). Here, too, the schema parameter expects an input of type pyspark.sql.types.StructType or "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". We already know that json_schema does not follow this format, so we should focus on how we can transform json_schema into pyspark.sql.types.StructType. Hereby, we also eliminate the option where schema=json_schema.

The option that includes schema=spark.read.json(json_schema) is also a wrong pick, since spark.read.json returns a DataFrame, and not a pyspark.sql.types.StructType type.

Ruling out the option which includes schema_of_json(json_schema) is rather difficult. The operator’s documentation (linked below) states that it "[p]arses a JSON string and infers its schema in DDL format". This use case is slightly different from the case at hand: json_schema already is a schema definition, it does not make sense to "infer" a schema from it. In the documentation you can see an example use case which helps you understand the difference better. Here, you pass string ‘{a: 1}’ to schema_of_json() and the method infers a DDL-format schema STRUCT<a: BIGINT> from it.

In our case, we may end up with the output schema of schema_of_json() describing the schema of the JSON schema, instead of using the schema itself. This is not the right answer option.

Now you may consider looking at the StructType.fromJson() method. It returns a variable of type StructType – exactly the type which the schema parameter of spark.read.json expects. Although we could have looked at the correct answer option earlier, this explanation is kept as exhaustive as necessary to teach you how to systematically eliminate wrong answer options.

More info:

– pyspark.sql.DataFrameReader.schema ― PySpark 3.1.2 documentation

– pyspark.sql.DataFrameReader.json ― PySpark 3.1.2 documentation

– pyspark.sql.functions.schema_of_json ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3,

Question #55

spark.read.json(filePath, schema=schema)

C. spark.read.json(filePath, schema=schema_of_json(json_schema))

D. spark.read.json(filePath, schema=spark.read.json(json_schema))

Reveal Solution Hide Solution

Correct Answer: B

Explanation:

Spark provides a way to digest JSON-formatted strings as schema. However, it is not trivial to use. Although slightly above exam difficulty, this QUESTION NO: is beneficial to your exam preparation, since

it helps you to familiarize yourself with the concept of enforcing schemas on data you are reading in – a topic within the scope of the exam.

The first answer that jumps out here is the one that uses spark.read.schema instead of spark.read.json. Looking at the documentation of spark.read.schema (linked below), we notice that the operator expects types pyspark.sql.types.StructType or str as its first argument. While variable json_schema is a string, the documentation states that the str should be "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". Variable json_schema does not contain a string in this type of format, so this answer option must be wrong.

With four potentially correct answers to go, we now look at the schema parameter of spark.read.json() (documentation linked below). Here, too, the schema parameter expects an input of type pyspark.sql.types.StructType or "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". We already know that json_schema does not follow this format, so we should focus on how we can transform json_schema into pyspark.sql.types.StructType. Hereby, we also eliminate the option where schema=json_schema.

The option that includes schema=spark.read.json(json_schema) is also a wrong pick, since spark.read.json returns a DataFrame, and not a pyspark.sql.types.StructType type.

Ruling out the option which includes schema_of_json(json_schema) is rather difficult. The operator’s documentation (linked below) states that it "[p]arses a JSON string and infers its schema in DDL format". This use case is slightly different from the case at hand: json_schema already is a schema definition, it does not make sense to "infer" a schema from it. In the documentation you can see an example use case which helps you understand the difference better. Here, you pass string ‘{a: 1}’ to schema_of_json() and the method infers a DDL-format schema STRUCT<a: BIGINT> from it.

In our case, we may end up with the output schema of schema_of_json() describing the schema of the JSON schema, instead of using the schema itself. This is not the right answer option.

Now you may consider looking at the StructType.fromJson() method. It returns a variable of type StructType – exactly the type which the schema parameter of spark.read.json expects. Although we could have looked at the correct answer option earlier, this explanation is kept as exhaustive as necessary to teach you how to systematically eliminate wrong answer options.

More info:

– pyspark.sql.DataFrameReader.schema ― PySpark 3.1.2 documentation

– pyspark.sql.DataFrameReader.json ― PySpark 3.1.2 documentation

– pyspark.sql.functions.schema_of_json ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3,

Question #55

spark.read.json(filePath, schema=schema)

C. spark.read.json(filePath, schema=schema_of_json(json_schema))

D. spark.read.json(filePath, schema=spark.read.json(json_schema))

Reveal Solution Hide Solution

Correct Answer: B

Explanation:

Spark provides a way to digest JSON-formatted strings as schema. However, it is not trivial to use. Although slightly above exam difficulty, this QUESTION NO: is beneficial to your exam preparation, since

it helps you to familiarize yourself with the concept of enforcing schemas on data you are reading in – a topic within the scope of the exam.

The first answer that jumps out here is the one that uses spark.read.schema instead of spark.read.json. Looking at the documentation of spark.read.schema (linked below), we notice that the operator expects types pyspark.sql.types.StructType or str as its first argument. While variable json_schema is a string, the documentation states that the str should be "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". Variable json_schema does not contain a string in this type of format, so this answer option must be wrong.

With four potentially correct answers to go, we now look at the schema parameter of spark.read.json() (documentation linked below). Here, too, the schema parameter expects an input of type pyspark.sql.types.StructType or "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". We already know that json_schema does not follow this format, so we should focus on how we can transform json_schema into pyspark.sql.types.StructType. Hereby, we also eliminate the option where schema=json_schema.

The option that includes schema=spark.read.json(json_schema) is also a wrong pick, since spark.read.json returns a DataFrame, and not a pyspark.sql.types.StructType type.

Ruling out the option which includes schema_of_json(json_schema) is rather difficult. The operator’s documentation (linked below) states that it "[p]arses a JSON string and infers its schema in DDL format". This use case is slightly different from the case at hand: json_schema already is a schema definition, it does not make sense to "infer" a schema from it. In the documentation you can see an example use case which helps you understand the difference better. Here, you pass string ‘{a: 1}’ to schema_of_json() and the method infers a DDL-format schema STRUCT<a: BIGINT> from it.

In our case, we may end up with the output schema of schema_of_json() describing the schema of the JSON schema, instead of using the schema itself. This is not the right answer option.

Now you may consider looking at the StructType.fromJson() method. It returns a variable of type StructType – exactly the type which the schema parameter of spark.read.json expects. Although we could have looked at the correct answer option earlier, this explanation is kept as exhaustive as necessary to teach you how to systematically eliminate wrong answer options.

More info:

– pyspark.sql.DataFrameReader.schema ― PySpark 3.1.2 documentation

– pyspark.sql.DataFrameReader.json ― PySpark 3.1.2 documentation

– pyspark.sql.functions.schema_of_json ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3,

Question #55

spark.read.json(filePath, schema=schema)

C. spark.read.json(filePath, schema=schema_of_json(json_schema))

D. spark.read.json(filePath, schema=spark.read.json(json_schema))

Reveal Solution Hide Solution

Correct Answer: B

Explanation:

Spark provides a way to digest JSON-formatted strings as schema. However, it is not trivial to use. Although slightly above exam difficulty, this QUESTION NO: is beneficial to your exam preparation, since

it helps you to familiarize yourself with the concept of enforcing schemas on data you are reading in – a topic within the scope of the exam.

The first answer that jumps out here is the one that uses spark.read.schema instead of spark.read.json. Looking at the documentation of spark.read.schema (linked below), we notice that the operator expects types pyspark.sql.types.StructType or str as its first argument. While variable json_schema is a string, the documentation states that the str should be "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". Variable json_schema does not contain a string in this type of format, so this answer option must be wrong.

With four potentially correct answers to go, we now look at the schema parameter of spark.read.json() (documentation linked below). Here, too, the schema parameter expects an input of type pyspark.sql.types.StructType or "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". We already know that json_schema does not follow this format, so we should focus on how we can transform json_schema into pyspark.sql.types.StructType. Hereby, we also eliminate the option where schema=json_schema.

The option that includes schema=spark.read.json(json_schema) is also a wrong pick, since spark.read.json returns a DataFrame, and not a pyspark.sql.types.StructType type.

Ruling out the option which includes schema_of_json(json_schema) is rather difficult. The operator’s documentation (linked below) states that it "[p]arses a JSON string and infers its schema in DDL format". This use case is slightly different from the case at hand: json_schema already is a schema definition, it does not make sense to "infer" a schema from it. In the documentation you can see an example use case which helps you understand the difference better. Here, you pass string ‘{a: 1}’ to schema_of_json() and the method infers a DDL-format schema STRUCT<a: BIGINT> from it.

In our case, we may end up with the output schema of schema_of_json() describing the schema of the JSON schema, instead of using the schema itself. This is not the right answer option.

Now you may consider looking at the StructType.fromJson() method. It returns a variable of type StructType – exactly the type which the schema parameter of spark.read.json expects. Although we could have looked at the correct answer option earlier, this explanation is kept as exhaustive as necessary to teach you how to systematically eliminate wrong answer options.

More info:

– pyspark.sql.DataFrameReader.schema ― PySpark 3.1.2 documentation

– pyspark.sql.DataFrameReader.json ― PySpark 3.1.2 documentation

– pyspark.sql.functions.schema_of_json ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3,

Question #55

spark.read.json(filePath, schema=schema)

C. spark.read.json(filePath, schema=schema_of_json(json_schema))

D. spark.read.json(filePath, schema=spark.read.json(json_schema))

Reveal Solution Hide Solution

Correct Answer: B

Explanation:

Spark provides a way to digest JSON-formatted strings as schema. However, it is not trivial to use. Although slightly above exam difficulty, this QUESTION NO: is beneficial to your exam preparation, since

it helps you to familiarize yourself with the concept of enforcing schemas on data you are reading in – a topic within the scope of the exam.

The first answer that jumps out here is the one that uses spark.read.schema instead of spark.read.json. Looking at the documentation of spark.read.schema (linked below), we notice that the operator expects types pyspark.sql.types.StructType or str as its first argument. While variable json_schema is a string, the documentation states that the str should be "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". Variable json_schema does not contain a string in this type of format, so this answer option must be wrong.

With four potentially correct answers to go, we now look at the schema parameter of spark.read.json() (documentation linked below). Here, too, the schema parameter expects an input of type pyspark.sql.types.StructType or "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". We already know that json_schema does not follow this format, so we should focus on how we can transform json_schema into pyspark.sql.types.StructType. Hereby, we also eliminate the option where schema=json_schema.

The option that includes schema=spark.read.json(json_schema) is also a wrong pick, since spark.read.json returns a DataFrame, and not a pyspark.sql.types.StructType type.

Ruling out the option which includes schema_of_json(json_schema) is rather difficult. The operator’s documentation (linked below) states that it "[p]arses a JSON string and infers its schema in DDL format". This use case is slightly different from the case at hand: json_schema already is a schema definition, it does not make sense to "infer" a schema from it. In the documentation you can see an example use case which helps you understand the difference better. Here, you pass string ‘{a: 1}’ to schema_of_json() and the method infers a DDL-format schema STRUCT<a: BIGINT> from it.

In our case, we may end up with the output schema of schema_of_json() describing the schema of the JSON schema, instead of using the schema itself. This is not the right answer option.

Now you may consider looking at the StructType.fromJson() method. It returns a variable of type StructType – exactly the type which the schema parameter of spark.read.json expects. Although we could have looked at the correct answer option earlier, this explanation is kept as exhaustive as necessary to teach you how to systematically eliminate wrong answer options.

More info:

– pyspark.sql.DataFrameReader.schema ― PySpark 3.1.2 documentation

– pyspark.sql.DataFrameReader.json ― PySpark 3.1.2 documentation

– pyspark.sql.functions.schema_of_json ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3,

Question #77

"left_semi"

Reveal Solution Hide Solution

Correct Answer: C

Explanation:

Correct code block:

transactionsDf.join(broadcast(itemsDf), "transactionId", "left_semi")

This QUESTION NO: is extremely difficult and exceeds the difficulty of questions in the exam by far.

A first indication of what is asked of you here is the remark that "the query should be executed in an optimized way". You also have qualitative information about the sizes of itemsDf and transactionsDf. Given that itemsDf is "very small" and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, broadcasting the "very small" DataFrame itemsDf to all executors. You can explicitly suggest this to Spark by wrapping itemsDf in the broadcast() operator. One answer option does not include this operator, so you can disregard it. Another answer option wraps the broadcast() operator around transactionsDf, the bigger of the two DataFrames. This does not make sense in the optimization context and can likewise be disregarded.

When thinking about the broadcast() operator, you may also remember that it is a function in pyspark.sql.functions. One answer option, however, resolves to itemsDf.broadcast([…]). The DataFrame class has no broadcast() method, so this answer option can be eliminated as well.

Both remaining answer options resolve to transactionsDf.join([…]) in the first two gaps, so you now have to figure out the details of the join. You can pick between an outer and a left semi join. An outer join would include columns from both DataFrames, whereas a left semi join only includes columns from the "left" table, here transactionsDf, just as asked for by the question. So, the correct answer is the one that uses the left_semi join.
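
For reference, a runnable version of the correct code block, assuming transactionsDf and itemsDf exist as described in the question and share a transactionId column:

from pyspark.sql.functions import broadcast

# Hint that the small DataFrame should be shipped to every executor, avoiding a
# shuffle of the large transactionsDf; the left semi join then keeps only
# transactionsDf's columns, restricted to rows whose transactionId also appears in itemsDf
result = transactionsDf.join(broadcast(itemsDf), "transactionId", "left_semi")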

Question #77

"left_semi"

Reveal Solution Hide Solution

Correct Answer: C

Explanation:

Correct code block:

transactionsDf.join(broadcast(itemsDf), "transactionId", "left_semi")

This QUESTION NO: is extremely difficult and exceeds the difficulty of questions in the exam by far.

A first indication of what is asked from you here is the remark that "the query should be executed in an optimized way". You also have qualitative information about the size of itemsDf and transactionsDf. Given that itemsDf is "very small" and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, broadcasting the "very small" DataFrame itemsDf to all executors. You can explicitly suggest this to Spark via wrapping itemsDf into a broadcast() operator. One answer option does not include this operator, so you can disregard it. Another answer option wraps the broadcast() operator around transactionsDf – the bigger of the two DataFrames. This answer option does not make sense in the optimization context and can likewise be disregarded.

When thinking about the broadcast() operator, you may also remember that it is a method of pyspark.sql.functions. One answer option, however, resolves to itemsDf.broadcast([…]). The DataFrame

class has no broadcast() method, so this answer option can be eliminated as well.

All two remaining answer options resolve to transactionsDf.join([…]) in the first 2 gaps, so you will have to figure out the details of the join now. You can pick between an outer and a left semi join. An outer join would include columns from both DataFrames, where a left semi join only includes columns from the "left" table, here transactionsDf, just as asked for by the question. So, the correct answer is the one that uses the left_semi join.

Question #77

"left_semi"

Reveal Solution Hide Solution

Correct Answer: C

Explanation:

Correct code block:

transactionsDf.join(broadcast(itemsDf), "transactionId", "left_semi")

This QUESTION NO: is extremely difficult and exceeds the difficulty of questions in the exam by far.

A first indication of what is asked from you here is the remark that "the query should be executed in an optimized way". You also have qualitative information about the size of itemsDf and transactionsDf. Given that itemsDf is "very small" and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, broadcasting the "very small" DataFrame itemsDf to all executors. You can explicitly suggest this to Spark via wrapping itemsDf into a broadcast() operator. One answer option does not include this operator, so you can disregard it. Another answer option wraps the broadcast() operator around transactionsDf – the bigger of the two DataFrames. This answer option does not make sense in the optimization context and can likewise be disregarded.

When thinking about the broadcast() operator, you may also remember that it is a method of pyspark.sql.functions. One answer option, however, resolves to itemsDf.broadcast([…]). The DataFrame

class has no broadcast() method, so this answer option can be eliminated as well.

All two remaining answer options resolve to transactionsDf.join([…]) in the first 2 gaps, so you will have to figure out the details of the join now. You can pick between an outer and a left semi join. An outer join would include columns from both DataFrames, where a left semi join only includes columns from the "left" table, here transactionsDf, just as asked for by the question. So, the correct answer is the one that uses the left_semi join.

Question #77

"left_semi"

Reveal Solution Hide Solution

Correct Answer: C

Explanation:

Correct code block:

transactionsDf.join(broadcast(itemsDf), "transactionId", "left_semi")

This QUESTION NO: is extremely difficult and exceeds the difficulty of questions in the exam by far.

A first indication of what is asked from you here is the remark that "the query should be executed in an optimized way". You also have qualitative information about the size of itemsDf and transactionsDf. Given that itemsDf is "very small" and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, broadcasting the "very small" DataFrame itemsDf to all executors. You can explicitly suggest this to Spark via wrapping itemsDf into a broadcast() operator. One answer option does not include this operator, so you can disregard it. Another answer option wraps the broadcast() operator around transactionsDf – the bigger of the two DataFrames. This answer option does not make sense in the optimization context and can likewise be disregarded.

When thinking about the broadcast() operator, you may also remember that it is a method of pyspark.sql.functions. One answer option, however, resolves to itemsDf.broadcast([…]). The DataFrame

class has no broadcast() method, so this answer option can be eliminated as well.

All two remaining answer options resolve to transactionsDf.join([…]) in the first 2 gaps, so you will have to figure out the details of the join now. You can pick between an outer and a left semi join. An outer join would include columns from both DataFrames, where a left semi join only includes columns from the "left" table, here transactionsDf, just as asked for by the question. So, the correct answer is the one that uses the left_semi join.

Question #77

"left_semi"

Reveal Solution Hide Solution

Correct Answer: C

Explanation:

Correct code block:

transactionsDf.join(broadcast(itemsDf), "transactionId", "left_semi")

This QUESTION NO: is extremely difficult and exceeds the difficulty of questions in the exam by far.

A first indication of what is asked from you here is the remark that "the query should be executed in an optimized way". You also have qualitative information about the size of itemsDf and transactionsDf. Given that itemsDf is "very small" and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, broadcasting the "very small" DataFrame itemsDf to all executors. You can explicitly suggest this to Spark via wrapping itemsDf into a broadcast() operator. One answer option does not include this operator, so you can disregard it. Another answer option wraps the broadcast() operator around transactionsDf – the bigger of the two DataFrames. This answer option does not make sense in the optimization context and can likewise be disregarded.

When thinking about the broadcast() operator, you may also remember that it is a method of pyspark.sql.functions. One answer option, however, resolves to itemsDf.broadcast([…]). The DataFrame

class has no broadcast() method, so this answer option can be eliminated as well.

All two remaining answer options resolve to transactionsDf.join([…]) in the first 2 gaps, so you will have to figure out the details of the join now. You can pick between an outer and a left semi join. An outer join would include columns from both DataFrames, where a left semi join only includes columns from the "left" table, here transactionsDf, just as asked for by the question. So, the correct answer is the one that uses the left_semi join.

Question #77

"left_semi"

Reveal Solution Hide Solution

Correct Answer: C

Explanation:

Correct code block:

transactionsDf.join(broadcast(itemsDf), "transactionId", "left_semi")

This QUESTION NO: is extremely difficult and exceeds the difficulty of questions in the exam by far.

A first indication of what is asked from you here is the remark that "the query should be executed in an optimized way". You also have qualitative information about the size of itemsDf and transactionsDf. Given that itemsDf is "very small" and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, broadcasting the "very small" DataFrame itemsDf to all executors. You can explicitly suggest this to Spark via wrapping itemsDf into a broadcast() operator. One answer option does not include this operator, so you can disregard it. Another answer option wraps the broadcast() operator around transactionsDf – the bigger of the two DataFrames. This answer option does not make sense in the optimization context and can likewise be disregarded.

When thinking about the broadcast() operator, you may also remember that it is a method of pyspark.sql.functions. One answer option, however, resolves to itemsDf.broadcast([…]). The DataFrame

class has no broadcast() method, so this answer option can be eliminated as well.

All two remaining answer options resolve to transactionsDf.join([…]) in the first 2 gaps, so you will have to figure out the details of the join now. You can pick between an outer and a left semi join. An outer join would include columns from both DataFrames, where a left semi join only includes columns from the "left" table, here transactionsDf, just as asked for by the question. So, the correct answer is the one that uses the left_semi join.

Question #77

"left_semi"

Reveal Solution Hide Solution

Correct Answer: C

Explanation:

Correct code block:

transactionsDf.join(broadcast(itemsDf), "transactionId", "left_semi")

This QUESTION NO: is extremely difficult and exceeds the difficulty of questions in the exam by far.

A first indication of what is asked from you here is the remark that "the query should be executed in an optimized way". You also have qualitative information about the size of itemsDf and transactionsDf. Given that itemsDf is "very small" and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, broadcasting the "very small" DataFrame itemsDf to all executors. You can explicitly suggest this to Spark via wrapping itemsDf into a broadcast() operator. One answer option does not include this operator, so you can disregard it. Another answer option wraps the broadcast() operator around transactionsDf – the bigger of the two DataFrames. This answer option does not make sense in the optimization context and can likewise be disregarded.

When thinking about the broadcast() operator, you may also remember that it is a method of pyspark.sql.functions. One answer option, however, resolves to itemsDf.broadcast([…]). The DataFrame

class has no broadcast() method, so this answer option can be eliminated as well.

All two remaining answer options resolve to transactionsDf.join([…]) in the first 2 gaps, so you will have to figure out the details of the join now. You can pick between an outer and a left semi join. An outer join would include columns from both DataFrames, where a left semi join only includes columns from the "left" table, here transactionsDf, just as asked for by the question. So, the correct answer is the one that uses the left_semi join.

Question #77

"left_semi"

Reveal Solution Hide Solution

Correct Answer: C

Explanation:

Correct code block:

transactionsDf.join(broadcast(itemsDf), "transactionId", "left_semi")

This QUESTION NO: is extremely difficult and exceeds the difficulty of questions in the exam by far.

A first indication of what is asked from you here is the remark that "the query should be executed in an optimized way". You also have qualitative information about the size of itemsDf and transactionsDf. Given that itemsDf is "very small" and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, broadcasting the "very small" DataFrame itemsDf to all executors. You can explicitly suggest this to Spark via wrapping itemsDf into a broadcast() operator. One answer option does not include this operator, so you can disregard it. Another answer option wraps the broadcast() operator around transactionsDf – the bigger of the two DataFrames. This answer option does not make sense in the optimization context and can likewise be disregarded.

When thinking about the broadcast() operator, you may also remember that it is a method of pyspark.sql.functions. One answer option, however, resolves to itemsDf.broadcast([…]). The DataFrame

class has no broadcast() method, so this answer option can be eliminated as well.

All two remaining answer options resolve to transactionsDf.join([…]) in the first 2 gaps, so you will have to figure out the details of the join now. You can pick between an outer and a left semi join. An outer join would include columns from both DataFrames, where a left semi join only includes columns from the "left" table, here transactionsDf, just as asked for by the question. So, the correct answer is the one that uses the left_semi join.

Question #77

"left_semi"

Reveal Solution Hide Solution

Correct Answer: C

Explanation:

Correct code block:

transactionsDf.join(broadcast(itemsDf), "transactionId", "left_semi")

This QUESTION NO: is extremely difficult and exceeds the difficulty of questions in the exam by far.

A first indication of what is asked from you here is the remark that "the query should be executed in an optimized way". You also have qualitative information about the size of itemsDf and transactionsDf. Given that itemsDf is "very small" and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, broadcasting the "very small" DataFrame itemsDf to all executors. You can explicitly suggest this to Spark via wrapping itemsDf into a broadcast() operator. One answer option does not include this operator, so you can disregard it. Another answer option wraps the broadcast() operator around transactionsDf – the bigger of the two DataFrames. This answer option does not make sense in the optimization context and can likewise be disregarded.

When thinking about the broadcast() operator, you may also remember that it is a method of pyspark.sql.functions. One answer option, however, resolves to itemsDf.broadcast([…]). The DataFrame

class has no broadcast() method, so this answer option can be eliminated as well.

All two remaining answer options resolve to transactionsDf.join([…]) in the first 2 gaps, so you will have to figure out the details of the join now. You can pick between an outer and a left semi join. An outer join would include columns from both DataFrames, where a left semi join only includes columns from the "left" table, here transactionsDf, just as asked for by the question. So, the correct answer is the one that uses the left_semi join.

Question #77

"left_semi"

Reveal Solution Hide Solution

Correct Answer: C

Explanation:

Correct code block:

transactionsDf.join(broadcast(itemsDf), "transactionId", "left_semi")

This QUESTION NO: is extremely difficult and exceeds the difficulty of questions in the exam by far.

A first indication of what is asked from you here is the remark that "the query should be executed in an optimized way". You also have qualitative information about the size of itemsDf and transactionsDf. Given that itemsDf is "very small" and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, broadcasting the "very small" DataFrame itemsDf to all executors. You can explicitly suggest this to Spark via wrapping itemsDf into a broadcast() operator. One answer option does not include this operator, so you can disregard it. Another answer option wraps the broadcast() operator around transactionsDf – the bigger of the two DataFrames. This answer option does not make sense in the optimization context and can likewise be disregarded.

When thinking about the broadcast() operator, you may also remember that it is a method of pyspark.sql.functions. One answer option, however, resolves to itemsDf.broadcast([…]). The DataFrame

class has no broadcast() method, so this answer option can be eliminated as well.

All two remaining answer options resolve to transactionsDf.join([…]) in the first 2 gaps, so you will have to figure out the details of the join now. You can pick between an outer and a left semi join. An outer join would include columns from both DataFrames, where a left semi join only includes columns from the "left" table, here transactionsDf, just as asked for by the question. So, the correct answer is the one that uses the left_semi join.

Question #77

"left_semi"

Reveal Solution Hide Solution

Correct Answer: C

Explanation:

Correct code block:

transactionsDf.join(broadcast(itemsDf), "transactionId", "left_semi")

This QUESTION NO: is extremely difficult and exceeds the difficulty of questions in the exam by far.

A first indication of what is asked from you here is the remark that "the query should be executed in an optimized way". You also have qualitative information about the size of itemsDf and transactionsDf. Given that itemsDf is "very small" and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, broadcasting the "very small" DataFrame itemsDf to all executors. You can explicitly suggest this to Spark via wrapping itemsDf into a broadcast() operator. One answer option does not include this operator, so you can disregard it. Another answer option wraps the broadcast() operator around transactionsDf – the bigger of the two DataFrames. This answer option does not make sense in the optimization context and can likewise be disregarded.

When thinking about the broadcast() operator, you may also remember that it is a method of pyspark.sql.functions. One answer option, however, resolves to itemsDf.broadcast([…]). The DataFrame

class has no broadcast() method, so this answer option can be eliminated as well.

All two remaining answer options resolve to transactionsDf.join([…]) in the first 2 gaps, so you will have to figure out the details of the join now. You can pick between an outer and a left semi join. An outer join would include columns from both DataFrames, where a left semi join only includes columns from the "left" table, here transactionsDf, just as asked for by the question. So, the correct answer is the one that uses the left_semi join.

Question #77

"left_semi"

Reveal Solution Hide Solution

Correct Answer: C

Explanation:

Correct code block:

transactionsDf.join(broadcast(itemsDf), "transactionId", "left_semi")

This QUESTION NO: is extremely difficult and exceeds the difficulty of questions in the exam by far.

A first indication of what is asked from you here is the remark that "the query should be executed in an optimized way". You also have qualitative information about the size of itemsDf and transactionsDf. Given that itemsDf is "very small" and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, broadcasting the "very small" DataFrame itemsDf to all executors. You can explicitly suggest this to Spark via wrapping itemsDf into a broadcast() operator. One answer option does not include this operator, so you can disregard it. Another answer option wraps the broadcast() operator around transactionsDf – the bigger of the two DataFrames. This answer option does not make sense in the optimization context and can likewise be disregarded.

When thinking about the broadcast() operator, you may also remember that it is a method of pyspark.sql.functions. One answer option, however, resolves to itemsDf.broadcast([…]). The DataFrame

class has no broadcast() method, so this answer option can be eliminated as well.

All two remaining answer options resolve to transactionsDf.join([…]) in the first 2 gaps, so you will have to figure out the details of the join now. You can pick between an outer and a left semi join. An outer join would include columns from both DataFrames, where a left semi join only includes columns from the "left" table, here transactionsDf, just as asked for by the question. So, the correct answer is the one that uses the left_semi join.

Question #77

"left_semi"

Reveal Solution Hide Solution

Correct Answer: C

Explanation:

Correct code block:

transactionsDf.join(broadcast(itemsDf), "transactionId", "left_semi")

This QUESTION NO: is extremely difficult and exceeds the difficulty of questions in the exam by far.

A first indication of what is asked from you here is the remark that "the query should be executed in an optimized way". You also have qualitative information about the size of itemsDf and transactionsDf. Given that itemsDf is "very small" and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, broadcasting the "very small" DataFrame itemsDf to all executors. You can explicitly suggest this to Spark via wrapping itemsDf into a broadcast() operator. One answer option does not include this operator, so you can disregard it. Another answer option wraps the broadcast() operator around transactionsDf – the bigger of the two DataFrames. This answer option does not make sense in the optimization context and can likewise be disregarded.

When thinking about the broadcast() operator, you may also remember that it is a method of pyspark.sql.functions. One answer option, however, resolves to itemsDf.broadcast([…]). The DataFrame

class has no broadcast() method, so this answer option can be eliminated as well.

All two remaining answer options resolve to transactionsDf.join([…]) in the first 2 gaps, so you will have to figure out the details of the join now. You can pick between an outer and a left semi join. An outer join would include columns from both DataFrames, where a left semi join only includes columns from the "left" table, here transactionsDf, just as asked for by the question. So, the correct answer is the one that uses the left_semi join.

Question #77

"left_semi"

Reveal Solution Hide Solution

Correct Answer: C

Explanation:

Correct code block:

transactionsDf.join(broadcast(itemsDf), "transactionId", "left_semi")

This QUESTION NO: is extremely difficult and exceeds the difficulty of questions in the exam by far.

A first indication of what is asked from you here is the remark that "the query should be executed in an optimized way". You also have qualitative information about the size of itemsDf and transactionsDf. Given that itemsDf is "very small" and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, broadcasting the "very small" DataFrame itemsDf to all executors. You can explicitly suggest this to Spark via wrapping itemsDf into a broadcast() operator. One answer option does not include this operator, so you can disregard it. Another answer option wraps the broadcast() operator around transactionsDf – the bigger of the two DataFrames. This answer option does not make sense in the optimization context and can likewise be disregarded.

When thinking about the broadcast() operator, you may also remember that it is a method of pyspark.sql.functions. One answer option, however, resolves to itemsDf.broadcast([…]). The DataFrame

class has no broadcast() method, so this answer option can be eliminated as well.

All two remaining answer options resolve to transactionsDf.join([…]) in the first 2 gaps, so you will have to figure out the details of the join now. You can pick between an outer and a left semi join. An outer join would include columns from both DataFrames, where a left semi join only includes columns from the "left" table, here transactionsDf, just as asked for by the question. So, the correct answer is the one that uses the left_semi join.

Question #77

"left_semi"

Reveal Solution Hide Solution

Correct Answer: C

Explanation:

Correct code block:

transactionsDf.join(broadcast(itemsDf), "transactionId", "left_semi")

This QUESTION NO: is extremely difficult and exceeds the difficulty of questions in the exam by far.

A first indication of what is asked from you here is the remark that "the query should be executed in an optimized way". You also have qualitative information about the size of itemsDf and transactionsDf. Given that itemsDf is "very small" and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, broadcasting the "very small" DataFrame itemsDf to all executors. You can explicitly suggest this to Spark via wrapping itemsDf into a broadcast() operator. One answer option does not include this operator, so you can disregard it. Another answer option wraps the broadcast() operator around transactionsDf – the bigger of the two DataFrames. This answer option does not make sense in the optimization context and can likewise be disregarded.

When thinking about the broadcast() operator, you may also remember that it is a method of pyspark.sql.functions. One answer option, however, resolves to itemsDf.broadcast([…]). The DataFrame

class has no broadcast() method, so this answer option can be eliminated as well.

All two remaining answer options resolve to transactionsDf.join([…]) in the first 2 gaps, so you will have to figure out the details of the join now. You can pick between an outer and a left semi join. An outer join would include columns from both DataFrames, where a left semi join only includes columns from the "left" table, here transactionsDf, just as asked for by the question. So, the correct answer is the one that uses the left_semi join.

Question #98

parquet

Reveal Solution Hide Solution

Correct Answer: D

Explanation:

Correct code block:

transactionsDf.write.format("parquet").mode("overwrite").option("compression", "snappy").save(storeDir)

Solving this question requires you to know how to access the DataFrameWriter (link below) from the DataFrame API – through DataFrame.write.

Another nuance here is knowing the different modes available for writing parquet files, which determine Spark’s behavior when dealing with existing files. These, together with the compression options, are explained in the DataFrameWriter.parquet documentation linked below.

Finally, bracket __5__ poses a certain challenge. You need to know which command you can use to pass the file path down to the DataFrameWriter. Both save and parquet are valid options here.
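As a hedged illustration of the full writer chain: the SparkSession, the sample data, and the storeDir path below are placeholders, not values from the original question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-write-demo").getOrCreate()
transactionsDf = spark.createDataFrame([(1, 100.0), (2, 25.0)], ["transactionId", "value"])
storeDir = "/tmp/transactions_parquet"  # placeholder output path

# DataFrame.write returns a DataFrameWriter; mode("overwrite") replaces any existing
# output, and the "compression" option selects the codec used for the parquet files.
transactionsDf.write.format("parquet").mode("overwrite").option("compression", "snappy").save(storeDir)

# Equivalent shortcut: parquet() fixes the format and accepts the path directly.
transactionsDf.write.mode("overwrite").parquet(storeDir, compression="snappy")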

More info:

– DataFrame.write: pyspark.sql.DataFrame.write ― PySpark 3.1.1 documentation

– DataFrameWriter.parquet: pyspark.sql.DataFrameWriter.parquet ― PySpark 3.1.1 documentation

Static notebook | Dynamic notebook: See test 1,

Question #120

parquet

Reveal Solution Hide Solution

Correct Answer: C

Question #121

Which of the following is a viable way to improve Spark’s performance when dealing with large amounts of data, given that there is only a single application running on the cluster?

  • A . Increase values for the properties spark.default.parallelism and spark.sql.shuffle.partitions
  • B . Decrease values for the properties spark.default.parallelism and spark.sql.partitions
  • C . Increase values for the properties spark.sql.parallelism and spark.sql.partitions
  • D . Increase values for the properties spark.sql.parallelism and spark.sql.shuffle.partitions
  • E . Increase values for the properties spark.dynamicAllocation.maxExecutors, spark.default.parallelism, and spark.sql.shuffle.partitions

Reveal Solution Hide Solution

Correct Answer: A
A

Explanation:

Decrease values for the properties spark.default.parallelism and spark.sql.partitions

No, these values need to be increased. Also note that there is no property spark.sql.partitions; the shuffle-related property is called spark.sql.shuffle.partitions.

Increase values for the properties spark.sql.parallelism and spark.sql.partitions

Wrong, there is no property spark.sql.parallelism.

Increase values for the properties spark.sql.parallelism and spark.sql.shuffle.partitions

See above.

Increase values for the properties spark.dynamicAllocation.maxExecutors, spark.default.parallelism, and spark.sql.shuffle.partitions

The property spark.dynamicAllocation.maxExecutors is only in effect if dynamic allocation is enabled via the spark.dynamicAllocation.enabled property, which is disabled by default. Dynamic allocation can be useful when running multiple applications on the same cluster in parallel. However, in this case there is only a single application running on the cluster, so enabling dynamic allocation would not yield a performance benefit.

More info: Practical Spark Tips For Data Scientists | Experfy.com and Basics of Apache Spark Configuration Settings | by Halil Ertan | Towards Data Science (https://bit.ly/3gA0A6w, https://bit.ly/2QxhNTr)
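As a hedged sketch of how these two properties could be raised: the application name and the value 200 are illustrative assumptions, not recommendations from the original explanation.

from pyspark.sql import SparkSession

# spark.default.parallelism controls the default number of partitions for RDD
# operations; spark.sql.shuffle.partitions controls how many partitions a
# DataFrame shuffle (join, groupBy, ...) produces.
spark = (
    SparkSession.builder
    .appName("parallelism-demo")
    .config("spark.default.parallelism", 200)
    .config("spark.sql.shuffle.partitions", 200)
    .getOrCreate()
)

# spark.sql.shuffle.partitions can also be adjusted at runtime:
spark.conf.set("spark.sql.shuffle.partitions", 400)
print(spark.conf.get("spark.sql.shuffle.partitions"))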

Question #122

Which of the following is the deepest level in Spark’s execution hierarchy?

  • A . Job
  • B . Task
  • C . Executor
  • D . Slot
  • E . Stage

Reveal Solution Hide Solution

Correct Answer: B
B

Explanation:

The hierarchy is, from top to bottom: Job, Stage, Task.

Executors and slots facilitate the execution of tasks, but they are not directly part of the hierarchy. Executors are launched by the driver on worker nodes for the purpose of running a specific Spark application. Slots help Spark parallelize work. An executor can have multiple slots, which enable it to process multiple tasks in parallel.
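A small, hedged illustration of the hierarchy (the data and the grouping key below are made up): each call to an action such as collect() launches one job, the shuffle introduced by groupBy() splits that job into stages, and every stage runs one task per partition, which you can verify in the Spark UI.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hierarchy-demo").getOrCreate()

df = spark.range(0, 1_000_000, numPartitions=8)  # 8 input partitions -> 8 tasks in the first stage

# One action -> one job; the shuffle from groupBy() splits it into stages, and the
# number of tasks in the post-shuffle stage follows spark.sql.shuffle.partitions.
df.groupBy((df.id % 10).alias("bucket")).count().collect()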

Question #135

count()

Reveal Solution Hide Solution

Correct Answer: E

Explanation:

Correct code block:

from pyspark import StorageLevel

transactionsDf.persist(StorageLevel.MEMORY_ONLY_2).count()

Only persist() takes different storage levels, so any option using cache() cannot be correct. persist() is evaluated lazily, so an action needs to follow this command. select() is not an action, but count() is, so all options using select() are incorrect.

Finally, the question states that "the executors’ memory should be utilized as much as possible, but not writing anything to disk". This points to a MEMORY_ONLY storage level. With this storage level, partitions that do not fit into memory will be recomputed when they are needed, instead of being written to disk as with the MEMORY_AND_DISK option. Since the data need to be duplicated across two executors, _2 needs to be appended to the storage level.

Static notebook | Dynamic notebook: See test 2, 25. (Databricks import instructions)
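A brief, hedged sketch of the full pattern follows; the SparkSession setup and the sample data are invented for illustration and are not from the original question.

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-demo").getOrCreate()
transactionsDf = spark.createDataFrame([(1, 100.0), (2, 25.0)], ["transactionId", "value"])

# persist() only marks the DataFrame for caching; the count() action materializes it.
# MEMORY_ONLY_2 keeps partitions in memory, replicated to two executors, and never
# spills to disk; partitions that do not fit are recomputed when needed.
transactionsDf.persist(StorageLevel.MEMORY_ONLY_2)
transactionsDf.count()

print(transactionsDf.storageLevel)  # confirm which storage level was applied
transactionsDf.unpersist()          # release the cached partitions when done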

Question #136

Which of the following code blocks returns all unique values across all values in columns value and productId in DataFrame transactionsDf in a one-column DataFrame?

  • A . transactionsDf.select(‘value’).join(transactionsDf.select(‘productId’), col(‘value’)==col(‘productId’), ‘outer’)
  • B . transactionsDf.select(col(‘value’), col(‘productId’)).agg({‘*’: ‘count’})
  • C . transactionsDf.select(‘value’, ‘productId’).distinct()
  • D . transactionsDf.select(‘value’).union(transactionsDf.select(‘productId’)).distinct()
  • E . transactionsDf.agg({‘value’: ‘collect_set’, ‘productId’: ‘collect_set’})

Reveal Solution Hide Solution

Correct Answer: D
D

Explanation:

transactionsDf.select(‘value’).union(transactionsDf.select(‘productId’)).distinct()

Correct. This code block uses a common pattern for finding the unique values across multiple columns: union and distinct. In fact, it is so common that it is even mentioned in the Spark documentation for the union command (link below).

transactionsDf.select(‘value’, ‘productId’).distinct()

Wrong. This code block returns unique rows, but not unique values.

transactionsDf.agg({‘value’: ‘collect_set’, ‘productId’: ‘collect_set’})

Incorrect. This code block will output a one-row, two-column DataFrame where each cell has an array of unique values in the respective column (even omitting any nulls).

transactionsDf.select(col(‘value’), col(‘productId’)).agg({‘*’: ‘count’})

No. This command will count the number of rows, but will not return unique values.

transactionsDf.select(‘value’).join(transactionsDf.select(‘productId’), col(‘value’)==col(‘productId’), ‘outer’)

Wrong. This command will perform an outer join of the value and productId columns. As such, it will return a two-column DataFrame. If you picked this answer, it might be a good idea for you to read up on the difference between union and join; a link is posted below.
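To illustrate the correct pattern end to end, here is a hedged sketch with made-up data; the sample rows are not from the original question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("union-distinct-demo").getOrCreate()

# Hypothetical stand-in for transactionsDf with overlapping value/productId entries.
transactionsDf = spark.createDataFrame([(1, 3), (2, 3), (3, 5)], ["value", "productId"])

# union() stacks the two single-column projections by position; distinct() then
# removes duplicates, yielding one column with all unique values from both columns.
uniqueValues = transactionsDf.select("value").union(transactionsDf.select("productId")).distinct()
uniqueValues.show()  # rows: 1, 2, 3, 5 (in some order)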

More info: pyspark.sql.DataFrame.union ― PySpark 3.1.2 documentation, sql – What is the difference between JOIN and UNION? – Stack Overflow

Static notebook | Dynamic notebook: See test 3,

Question #136

Which of the following code blocks returns all unique values across all values in columns value and productId in DataFrame transactionsDf in a one-column DataFrame?

  • A . tranactionsDf.select(‘value’).join(transactionsDf.select(‘productId’), col(‘value’)==col(‘productId’), ‘outer’)
  • B . transactionsDf.select(col(‘value’), col(‘productId’)).agg({‘*’: ‘count’})
  • C . transactionsDf.select(‘value’, ‘productId’).distinct()
  • D . transactionsDf.select(‘value’).union(transactionsDf.select(‘productId’)).distinct()
  • E . transactionsDf.agg({‘value’: ‘collect_set’, ‘productId’: ‘collect_set’})

Reveal Solution Hide Solution

Correct Answer: D
D

Explanation:

transactionsDf.select(‘value’).union(transactionsDf.select(‘productId’)).distinct() Correct. This code block uses a common pattern for finding the unique values across multiple columns: union and distinct. In fact, it is so common that it is even mentioned in the Spark documentation for the union command (link below).

transactionsDf.select(‘value’, ‘productId’).distinct()

Wrong. This code block returns unique rows, but not unique values.

transactionsDf.agg({‘value’: ‘collect_set’, ‘productId’: ‘collect_set’})

Incorrect. This code block will output a one-row, two-column DataFrame where each cell has an array of unique values in the respective column (even omitting any nulls).

transactionsDf.select(col(‘value’), col(‘productId’)).agg({‘*’: ‘count’})

No. This command will count the number of rows, but will not return unique values.

transactionsDf.select(‘value’).join(transactionsDf.select(‘productId’), col(‘value’)==col(‘productId’), ‘outer’)

Wrong. This command will perform an outer join of the value and productId columns. As such, it will return a two-column DataFrame. If you picked this answer, it might be a good idea for you to read

up on the difference between union and join, a link is posted below.

More info: pyspark.sql.DataFrame.union ― PySpark 3.1.2 documentation, sql – What is the difference between JOIN and UNION? – Stack Overflow

Static notebook | Dynamic notebook: See test 3,

Question #145

The code block displayed below contains multiple errors. The code block should return a copy of DataFrame transactionsDf in which the columns productId and f have been removed. Find the errors.

Code block:

transactionsDf.select([col(productId), col(f)])

Which of the following statements describes how the code block needs to be changed?

  • A . The column names should be listed directly as arguments to the operator and not as a list.
  • B . The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all column names should be expressed as strings without being wrapped in a col() operator.
  • C . The select operator should be replaced by a drop operator.
  • D . The column names should be listed directly as arguments to the operator and not as a list and following the pattern of how column names are expressed in the code block, columns productId and f should be replaced by transactionId, predError, value and storeId.
  • E . The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all col() operators should be removed.

Correct Answer: B

Explanation:

Correct code block: transactionsDf.drop("productId", "f")

This question requires a lot of thinking to get right. To solve it, you may take advantage of the digital notepad that is provided to you during the test. You have probably noticed that the code block includes multiple errors. In the real exam, you are usually confronted with a code block that contains only a single error; practicing on this challenging multi-error question will make single-error questions easier to handle.

The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all column names should be expressed as strings without being wrapped in a col() operator.

Correct! Here, you need to figure out the many things that are wrong with the initial code block. While the task could also be solved with a select statement, a drop statement is the correct choice given the answer options. The documentation then tells you that drop does not take a list as an argument, but only the column names that should be dropped. Finally, the column names should be expressed as strings, not as Python variable names as in the original code block.
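A minimal sketch of the corrected call, using a hypothetical transactionsDf whose column names follow those mentioned in the answer options:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("drop-sketch").getOrCreate()

# Hypothetical transactionsDf; the column names mirror those in the answer options.
transactionsDf = spark.createDataFrame(
    [(1, 0.5, 4, 25, 2, 7)],
    ["transactionId", "predError", "value", "storeId", "productId", "f"],
)

# Correct: drop takes the column names directly, as plain strings.
reduced = transactionsDf.drop("productId", "f")

print(reduced.columns)
# ['transactionId', 'predError', 'value', 'storeId']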

The column names should be listed directly as arguments to the operator and not as a list. Incorrect. While this is a good first step and part of the correct solution (see above), this modification is insufficient to solve the question.

The column names should be listed directly as arguments to the operator and not as a list and following the pattern of how column names are expressed in the code block, columns productId and f should be replaced by transactionId, predError, value and storeId.

Wrong. If you use the same pattern as in the original code block (col(productId), col(f)), you are still making a mistake: col(productId) triggers Python to search for a variable named productId instead of telling Spark to use the column productId. To refer to the column, you need to express its name as a string.

The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all col() operators should be removed.

No. This still leaves Python trying to interpret the column names as Python variables (see above).

The select operator should be replaced by a drop operator.

Wrong, this is not enough to solve the question. If you do this, you will still face problems, since you are passing a Python list to drop and the column names are still interpreted as Python variables (see above).
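Continuing with the hypothetical transactionsDf from the sketch above, here is what the rejected variants run into (the exact error messages may differ between Spark versions):

from pyspark.sql.functions import col

# Passing a list: drop() expects individual column names or Column objects,
# so a Python list raises a TypeError (observed on Spark 3.1).
try:
    transactionsDf.drop(["productId", "f"])
except TypeError as error:
    print("TypeError:", error)

# Unquoted names: Python looks for variables called productId and f before
# Spark ever sees the expression, so this fails with a NameError.
try:
    transactionsDf.drop(col(productId), col(f))
except NameError as error:
    print("NameError:", error)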

More info: pyspark.sql.DataFrame.drop ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3, 30.(Databricks import instructions)

Question #166

col("value")

Reveal Solution Hide Solution

Correct Answer: C

Explanation:

Correct code block:

transactionsDf.withColumn("cos", round(cos(degrees(transactionsDf.value)),2))

This question is especially confusing because col and "cos" look so similar. Similar-looking answer options can also appear in the exam and, just like in this question, you need to pay attention to the details to identify the correct answer option.

The first answer option to throw out is the one that starts with withColumnRenamed: the question speaks specifically of adding a column. The withColumnRenamed operator only renames an existing column, however, so you cannot use it here.

Next, you have to decide what should be in gap 2, the first argument of transactionsDf.withColumn(). Looking at the documentation (linked below), you can find out that the first argument of withColumn needs to be a string with the name of the column to be added. So, any answer that includes col("cos") as the option for gap 2 can be disregarded.

This leaves you with two possible answers. The real difference between them is where the cos and degrees methods go: either in gaps 3 and 4, or vice versa. From the question you can find out that the new column should have "the values in column value converted to degrees and having the cosine of those converted values taken". This prescribes a clear order of operations: first, you convert the values from column value to degrees, and then you take the cosine of those converted values. So the inner parentheses (gap 4) should contain the degrees method and, logically, gap 3 holds the cos method. This leaves you with just one possible correct answer.

More info: pyspark.sql.DataFrame.withColumn ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3, 49.(Databricks import instructions)
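
If you want to try the correct code block, here is a minimal runnable sketch; the transactionsDf built below, with a single value column, is a hypothetical stand-in for the question's DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.functions import cos, degrees, round

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for transactionsDf with a numeric "value" column
transactionsDf = spark.createDataFrame([(0.5,), (1.0,), (2.5,)], ["value"])

# Gap 4: degrees() is applied first; gap 3: cos() wraps the converted values; rounded to 2 decimals
transactionsDf.withColumn("cos", round(cos(degrees(transactionsDf.value)), 2)).show()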

Question #167

Which of the following code blocks returns a DataFrame showing the mean value of column "value" of DataFrame transactionsDf, grouped by its column storeId?

  • A . transactionsDf.groupBy(col(storeId).avg())
  • B . transactionsDf.groupBy("storeId").avg(col("value"))
  • C . transactionsDf.groupBy("storeId").agg(avg("value"))
  • D . transactionsDf.groupBy("storeId").agg(average("value"))
  • E . transactionsDf.groupBy("value").average()

Reveal Solution Hide Solution

Correct Answer: C
C

Explanation:

This question tests your knowledge of the groupBy and agg pattern in Spark. Using the documentation, you can find out that there is no average() function in pyspark.sql.functions.

Static notebook | Dynamic notebook: See test 2, 42.(Databricks import instructions)
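
A minimal runnable sketch of the correct answer, using hypothetical sample rows in place of the question's transactionsDf:

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample rows
transactionsDf = spark.createDataFrame(
    [(1, 10.0), (1, 20.0), (2, 5.0)], ["storeId", "value"]
)

# Mean of column "value" per storeId
transactionsDf.groupBy("storeId").agg(avg("value")).show()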

Question #169

spark.createDataFrame([("red",), ("blue",), ("green",)], "color")

Instead of calling spark.createDataFrame, just DataFrame should be called.

  • A . The commas in the tuples with the colors should be eliminated.
  • B . The colors red, blue, and green should be expressed as a simple Python list, and not a list of tuples.
  • C . Instead of color, a data type should be specified.
  • D . The "color" expression needs to be wrapped in brackets, so it reads ["color"].

Reveal Solution Hide Solution

Correct Answer: D
D

Explanation:

Correct code block:

spark.createDataFrame([("red",), ("blue",), ("green",)], ["color"])

The createDataFrame syntax is not exactly straightforward, but luckily the documentation (linked below) provides several examples of how to use it. It also shows an example very similar to the code block presented here, which should help you answer this question correctly.

More info: pyspark.sql.SparkSession.createDataFrame ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,
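
Here is a small runnable sketch of the corrected call; colorsDf is just a hypothetical name for the resulting DataFrame:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rows are one-element tuples; the schema argument is a list with one column name,
# which is the fix described in the correct answer option
colorsDf = spark.createDataFrame([("red",), ("blue",), ("green",)], ["color"])
colorsDf.show()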

Question #171

Which of the following code blocks stores DataFrame itemsDf in executor memory and, if insufficient memory is available, serializes it and saves it to disk?

  • A . itemsDf.persist(StorageLevel.MEMORY_ONLY)
  • B . itemsDf.cache(StorageLevel.MEMORY_AND_DISK)
  • C . itemsDf.store()
  • D . itemsDf.cache()
  • E . itemsDf.write.option(‘destination’, ‘memory’).save()

Reveal Solution Hide Solution

Correct Answer: D
D

Explanation:

The key to solving this question is knowing (or reading in the documentation) that, by default, cache() stores values in memory and writes any partitions for which there is insufficient memory to disk. persist() can achieve the exact same behavior, however not with the StorageLevel.MEMORY_ONLY option listed here. It is also worth noting that cache() does not take any arguments.

If you have trouble finding the storage level information in the documentation, please also see the student Q&A thread that sheds some light on this.

Static notebook | Dynamic notebook: See test 2,
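
A brief sketch contrasting the two equivalent ways of getting this behavior; the itemsDf built below is a hypothetical stand-in for the question's DataFrame:

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for itemsDf
itemsDf = spark.createDataFrame([(1, "shirt"), (2, "socks")], ["itemId", "itemName"])

# cache() takes no arguments; partitions that do not fit in memory spill to disk
itemsDf.cache()

# persist() can express the same behavior explicitly (do not combine it with cache()
# on the same DataFrame, since a storage level cannot be changed once set):
# itemsDf.persist(StorageLevel.MEMORY_AND_DISK)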

Question #183

+————-+———+—–+——-+———+—-+

  • A . transactionsDf.max(‘value’).min(‘value’)
  • B . transactionsDf.agg(max(‘value’).alias(‘highest’), min(‘value’).alias(‘lowest’))
  • C . transactionsDf.groupby(col(productId)).agg(max(col(value)).alias("highest"), min(col(value)).alias("lowest"))
  • D . transactionsDf.groupby(‘productId’).agg(max(‘value’).alias(‘highest’), min(‘value’).alias(‘lowest’))
  • E . transactionsDf.groupby("productId").agg({"highest": max("value"), "lowest": min("value")})

Reveal Solution Hide Solution

Correct Answer: D
D

Explanation:

transactionsDf.groupby(‘productId’).agg(max(‘value’).alias(‘highest’), min(‘value’).alias(‘lowest’))

Correct. groupby and aggregate is a common pattern to investigate aggregated values of groups.

transactionsDf.groupby("productId").agg({"highest": max("value"), "lowest": min("value")})

Wrong. While DataFrame.agg() accepts dictionaries, the syntax of the dictionary in this code block is wrong. If you use a dictionary, the syntax should be like {"value": "max"}, so using the column name as the key and the aggregating function as value.

transactionsDf.agg(max(‘value’).alias(‘highest’), min(‘value’).alias(‘lowest’))

Incorrect. While this is valid Spark syntax, it does not achieve what the question asks for. The question specifically asks for values to be aggregated per value in column productId, but this column is not considered here. Instead, the max() and min() values are calculated as if the entire DataFrame were a single group.

transactionsDf.max(‘value’).min(‘value’)

Wrong. There is no DataFrame.max() method in Spark, so this command will fail.

transactionsDf.groupby(col(productId)).agg(max(col(value)).alias("highest"), min(col(value)).alias("lowest"))

No. While this might work if the column names were expressed as strings, it will not work as is. Python will interpret the column names as variables and, as a result, PySpark will not understand which columns you want to aggregate.

More info: pyspark.sql.DataFrame.agg ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3,
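
A self-contained sketch of the correct pattern, again with hypothetical sample rows standing in for transactionsDf:

from pyspark.sql import SparkSession
from pyspark.sql.functions import max, min  # note: these shadow Python's built-in max/min

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample rows
transactionsDf = spark.createDataFrame(
    [(3, 10.0), (3, 25.0), (7, 4.0)], ["productId", "value"]
)

# Highest and lowest value per productId, with readable column aliases
transactionsDf.groupby("productId").agg(
    max("value").alias("highest"), min("value").alias("lowest")
).show()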

Question #187

spark.createDataFrame((("summer", 4.5), ("winter", 7.5)), T.StructType([T.StructField("season", T.CharType()), T.StructField("season", T.DoubleType())]))

D. spark.newDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])

E. spark.createDataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})

Reveal Solution Hide Solution

Correct Answer: B

Explanation:

spark.createDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"]) Correct. This command uses the Spark Session’s createDataFrame method to create a new DataFrame. Notice how rows, columns, and column names are passed in here: The rows are specified as a Python list. Every entry in the list is a new row. Columns are specified as Python tuples (for example ("summer", 4.5)). Every column is one entry in the tuple.

The column names are specified as the second argument to createDataFrame(). The documentation (link below) shows that "when schema is a list of column names, the type of each column will be

inferred from data" (the first argument). Since values 4.5 and 7.5 are both float variables, Spark will correctly infer the double type for column wind_speed_ms. Given that all values in column

"season" contain only strings, Spark will cast the column appropriately as string. Find out more about SparkSession.createDataFrame() via the link below. spark.newDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"]) No, the SparkSession does not have a newDataFrame method. from pyspark.sql import types as T

spark.createDataFrame((("summer", 4.5), ("winter", 7.5)),

T.StructType([T.StructField("season", T.CharType()), T.StructField("season",

T.DoubleType())]))

No. pyspark.sql.types does not have a CharType type. See link below for available data types in Spark.

spark.createDataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]}) No, this is not correct Spark syntax. If you have considered this option to be correct, you may have some experience with Python’s pandas package, in which this would be correct syntax. To create a Spark DataFrame from a Pandas DataFrame, you can simply use spark.createDataFrame(pandasDf) where pandasDf is the Pandas DataFrame.

Find out more about Spark syntax options using the examples in the documentation for SparkSession.createDataFrame linked below.

spark.DataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})

No, the Spark Session (indicated by spark in the code above) does not have a DataFrame method.
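
As a brief, hedged illustration of the type inference described for the correct answer, printing the schema should show the inferred types along these lines:

spark.createDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"]).printSchema()
# root
#  |-- season: string (nullable = true)
#  |-- wind_speed_ms: double (nullable = true)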

More info: pyspark.sql.SparkSession.createDataFrame ― PySpark 3.1.1 documentation and Data Types – Spark 3.1.2 Documentation

Static notebook | Dynamic notebook: See test 1,

Question #200

articlesDf = articlesDf.groupby("col").count()

B. 4, 5

C. 2, 5, 3

D. 5, 2

E. 2, 3, 4

F. 2, 5, 4

Reveal Solution Hide Solution

Correct Answer: E

Explanation:

Correct code block:

articlesDf = articlesDf.select(explode(col('attributes')))

articlesDf = articlesDf.groupby('col').count()

articlesDf = articlesDf.sort('count', ascending=False).select('col')

Output of correct code block:

+-------+
|    col|
+-------+
| summer|
| winter|
|   blue|
|   cozy|
| travel|
|  fresh|
|    red|
|cooling|
|  green|
+-------+
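
As a side note, explode() names its output column col by default when no alias is given, which is why the subsequent groupby('col') works. A hedged, self-contained sketch of the same pipeline (assuming articlesDf has an array column attributes):

from pyspark.sql.functions import col, explode

articlesDf = articlesDf.select(explode(col('attributes')))    # one row per array element, column named 'col'
articlesDf = articlesDf.groupby('col').count()                # count occurrences of each attribute
articlesDf = articlesDf.sort('count', ascending=False).select('col')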

Static notebook | Dynamic notebook: See test 2,

Question #217

"MM d (EEE)"

Reveal Solution Hide Solution

Correct Answer: C

Explanation:

Correct code block:

transactionsDf.withColumn("transactionDateForm", from_unixtime("transactionDate", "MMM d (EEEE)"))

The QUESTION NO: specifically asks about "adding" a column. In the context of all presented answers, DataFrame.withColumn() is the correct command for this. In theory, DataFrame.select() could also be used for this purpose, if all existing columns are selected and a new one is added. DataFrame.withColumnRenamed() is not the appropriate command, since it can only rename existing columns, but cannot add a new column or change the value of a column.

Once DataFrame.withColumn() is chosen, you can read in the documentation (see below) that the first input argument to the method should be the column name of the new column.

The final difficulty is the date format. The QUESTION NO: indicates that the date format Apr 26 (Sunday) is desired. The answers give "MMM d (EEEE)" and "MM d (EEE)" as options. It can be hard to know the details of the date format that is used in Spark. Specifically, knowing the difference between MMM and MM is probably not something you deal with every day. But there is an easy way to remember it: M (one letter) is usually the shortest form: 4 for April. MM includes padding: 04 for April. MMM (three letters) is the three-letter month abbreviation: Apr for April. And MMMM is the longest possible form: April. Knowing this progression helps you select the correct option here.
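
A minimal, hedged sketch of the correct option (assuming transactionDate holds Unix timestamps in seconds, which is what from_unixtime expects):

from pyspark.sql.functions import from_unixtime

transactionsDf.withColumn("transactionDateForm", from_unixtime("transactionDate", "MMM d (EEEE)")).show()
# the new column renders timestamps like Apr 26 (Sunday)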

More info: pyspark.sql.DataFrame.withColumn ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3,

Question #220

itemsDf.withColumnRenamed("supplier", "feature1")

C. itemsDf.withColumnRenamed(col("attributes"), col("feature0"), col("supplier"), col("feature1"))

D. itemsDf.withColumnRenamed("attributes", "feature0").withColumnRenamed("supplier", "feature1")

E. itemsDf.withColumn("attributes", "feature0").withColumn("supplier", "feature1")

Reveal Solution Hide Solution

Correct Answer: D

Explanation:

itemsDf.withColumnRenamed("attributes", "feature0").withColumnRenamed("supplier",

"feature1")

Correct! Spark’s DataFrame.withColumnRenamed syntax makes it relatively easy to change the name of a column.

itemsDf.withColumnRenamed(attributes, feature0).withColumnRenamed(supplier, feature1)

Incorrect. In this code block, the Python interpreter will try to use attributes and the other column names as variables. Needless to say, they are undefined, and as a result the block will not run.

itemsDf.withColumnRenamed(col("attributes"), col("feature0"), col("supplier"), col("feature1"))

Wrong. The DataFrame.withColumnRenamed() operator takes exactly two string arguments. So in this answer, both the use of col() and the four arguments are wrong.

itemsDf.withColumnRenamed("attributes", "feature0")

itemsDf.withColumnRenamed("supplier", "feature1")

No. In this answer, the returned DataFrame will only have column supplier renamed, since the result of the first line is not written back to itemsDf.

itemsDf.withColumn("attributes", "feature0").withColumn("supplier", "feature1")

Incorrect. While withColumn works for adding and naming new columns, you cannot use it to rename existing columns.
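
A minimal, hedged sketch of the chained rename (assuming itemsDf has columns attributes and supplier):

itemsDf = itemsDf.withColumnRenamed("attributes", "feature0").withColumnRenamed("supplier", "feature1")
itemsDf.printSchema()    # feature0 and feature1 now appear in place of attributes and supplier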

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3,

Question #220

itemsDf.withColumnRenamed("supplier", "feature1")

C. itemsDf.withColumnRenamed(col("attributes"), col("feature0"), col("supplier"), col("feature1"))

D. itemsDf.withColumnRenamed("attributes", "feature0").withColumnRenamed("supplier", "feature1")

E. itemsDf.withColumn("attributes", "feature0").withColumn("supplier", "feature1")

Reveal Solution Hide Solution

Correct Answer: D

Explanation:

itemsDf.withColumnRenamed("attributes", "feature0").withColumnRenamed("supplier",

"feature1")

Correct! Spark’s DataFrame.withColumnRenamed syntax makes it relatively easy to change the name of a column.

itemsDf.withColumnRenamed(attributes, feature0).withColumnRenamed(supplier, feature1)

Incorrect. In this code block, the Python interpreter will try to use attributes and the other column names as variables. Needless to say, they are undefined, and as a result the block will not run.

itemsDf.withColumnRenamed(col("attributes"), col("feature0"), col("supplier"), col("feature1"))

Wrong. The DataFrame.withColumnRenamed() operator takes exactly two string arguments. So, in this answer both using col() and using four arguments is wrong.

itemsDf.withColumnRenamed("attributes", "feature0")

itemsDf.withColumnRenamed("supplier", "feature1")

No. In this answer, the returned DataFrame will only have column supplier be renamed, since the result of the first line is not written back to itemsDf.

itemsDf.withColumn("attributes", "feature0").withColumn("supplier", "feature1")

Incorrect. While withColumn works for adding and naming new columns, you cannot use it to rename existing columns.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3,

Question #220

itemsDf.withColumnRenamed("supplier", "feature1")

C. itemsDf.withColumnRenamed(col("attributes"), col("feature0"), col("supplier"), col("feature1"))

D. itemsDf.withColumnRenamed("attributes", "feature0").withColumnRenamed("supplier", "feature1")

E. itemsDf.withColumn("attributes", "feature0").withColumn("supplier", "feature1")

Reveal Solution Hide Solution

Correct Answer: D

Explanation:

itemsDf.withColumnRenamed("attributes", "feature0").withColumnRenamed("supplier",

"feature1")

Correct! Spark’s DataFrame.withColumnRenamed syntax makes it relatively easy to change the name of a column.

itemsDf.withColumnRenamed(attributes, feature0).withColumnRenamed(supplier, feature1)

Incorrect. In this code block, the Python interpreter will try to use attributes and the other column names as variables. Needless to say, they are undefined, and as a result the block will not run.

itemsDf.withColumnRenamed(col("attributes"), col("feature0"), col("supplier"), col("feature1"))

Wrong. The DataFrame.withColumnRenamed() operator takes exactly two string arguments. So, in this answer both using col() and using four arguments is wrong.

itemsDf.withColumnRenamed("attributes", "feature0")

itemsDf.withColumnRenamed("supplier", "feature1")

No. In this answer, the returned DataFrame will only have column supplier be renamed, since the result of the first line is not written back to itemsDf.

itemsDf.withColumn("attributes", "feature0").withColumn("supplier", "feature1")

Incorrect. While withColumn works for adding and naming new columns, you cannot use it to rename existing columns.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3,

Question #227

importedDf = spark.read.json(jsonPath)

  • A . 4, 1, 2
  • B . 5, 1, 3
  • C . 5, 2
  • D . 4, 1, 3
  • E . 5, 1, 2

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

importedDf = spark.read.json(jsonPath)

importedDf.createOrReplaceTempView("importedDf")

spark.sql("SELECT * FROM importedDf WHERE productId != 3")

Option 5 is the only correct way listed to read a JSON file in PySpark. Calling option("format", "json") is not the correct way to tell Spark’s DataFrameReader that you want to read a JSON file; you would do this through format("json") instead. Also, you communicate the specific path of the JSON file to the DataFrameReader using the load() method, not the path() method.

In order to use a SQL command through the SparkSession spark, you first need to create a temporary view through DataFrame.createOrReplaceTempView().

The SQL statement should start with the SELECT operator. The FILTER operator that SQL provides is not the correct one to use here.
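As a sketch of the points above (jsonPath is a placeholder path and a running SparkSession named spark is assumed), the shorthand json() reader and the generic format()/load() reader are interchangeable:

# Two equivalent ways of reading a JSON file with the DataFrameReader
importedDf = spark.read.json(jsonPath)
importedDf = spark.read.format("json").load(jsonPath)

# Register a temporary view so the DataFrame can be queried via SQL
importedDf.createOrReplaceTempView("importedDf")
spark.sql("SELECT * FROM importedDf WHERE productId != 3").show()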

Static notebook | Dynamic notebook: See test 2,

Question #229

Which of the following code blocks returns a copy of DataFrame transactionsDf where the column storeId has been converted to string type?

  • A . transactionsDf.withColumn("storeId", convert("storeId", "string"))
  • B . transactionsDf.withColumn("storeId", col("storeId", "string"))
  • C . transactionsDf.withColumn("storeId", col("storeId").convert("string"))
  • D . transactionsDf.withColumn("storeId", col("storeId").cast("string"))
  • E . transactionsDf.withColumn("storeId", convert("storeId").as("string"))

Reveal Solution Hide Solution

Correct Answer: D
D

Explanation:

This question asks for your knowledge of the cast syntax. cast is a method of the Column class. It is worth noting that one could also convert a column’s type using the Column.astype() method, which is just an alias for cast.
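A minimal sketch of both variants, assuming transactionsDf is an existing DataFrame with a numeric column storeId:

from pyspark.sql.functions import col

# cast() returns a new Column with the target type; astype() is an alias for it
converted = transactionsDf.withColumn("storeId", col("storeId").cast("string"))
convertedAlias = transactionsDf.withColumn("storeId", col("storeId").astype("string"))

converted.printSchema()  # storeId is now of type string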

Find more info in the documentation linked below.

More info: pyspark.sql.Column.cast ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 2, 33.(Databricks import instructions)

Question #230

Which of the following code blocks writes DataFrame itemsDf to disk at storage location filePath, making sure to substitute any existing data at that location?

  • A . itemsDf.write.mode("overwrite").parquet(filePath)
  • B . itemsDf.write.option("parquet").mode("overwrite").path(filePath)
  • C . itemsDf.write(filePath, mode="overwrite")
  • D . itemsDf.write.mode("overwrite").path(filePath)
  • E . itemsDf.write().parquet(filePath, mode="overwrite")

Reveal Solution Hide Solution

Correct Answer: A
A

Explanation:

itemsDf.write.mode("overwrite").parquet(filePath)

Correct! itemsDf.write returns a pyspark.sql.DataFrameWriter instance whose overwriting behavior can be modified via the mode setting or by passing mode="overwrite" to the parquet() command.

Although the parquet format is not prescribed by this question, parquet() is a valid method for instructing Spark to write the data to disk.

itemsDf.write.mode("overwrite").path(filePath)

No. A pyspark.sql.DataFrameWriter instance does not have a path() method.

itemsDf.write.option("parquet").mode("overwrite").path(filePath)

Incorrect, see above. In addition, a file format cannot be passed via the option() method.

itemsDf.write(filePath, mode="overwrite")

Wrong. Unfortunately, this is too simple. You need to obtain access to a DataFrameWriter for the DataFrame by calling itemsDf.write, upon which you can apply further methods to control how Spark should write the data to disk. You cannot, however, pass arguments to itemsDf.write directly.

itemsDf.write().parquet(filePath, mode="overwrite")

False. See above.
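To illustrate the two valid ways of requesting overwrite behavior mentioned above (filePath is a placeholder location):

# Variant 1: set the mode on the DataFrameWriter, then write parquet
itemsDf.write.mode("overwrite").parquet(filePath)

# Variant 2: pass the mode directly to the parquet() call
# Note: write is an attribute, not a method, so there are no parentheses after it
itemsDf.write.parquet(filePath, mode="overwrite")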

More info: pyspark.sql.DataFrameWriter.parquet ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3,

Question #232

Which of the following statements about executors is correct, assuming that one can consider each of the JVMs working as executors as a pool of task execution slots?

  • A . Slot is another name for executor.
  • B . There must be less executors than tasks.
  • C . An executor runs on a single core.
  • D . There must be more slots than tasks.
  • E . Tasks run in parallel via slots.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Tasks run in parallel via slots.

Correct. Given the assumption, an executor then has one or more "slots", defined by the equation spark.executor.cores / spark.task.cpus. With the executor’s resources divided into slots, each task takes up a slot, and multiple tasks can be executed in parallel.

Slot is another name for executor.

No, a slot is part of an executor.

An executor runs on a single core.

No, an executor can occupy multiple cores. This is set by the spark.executor.cores option.

There must be more slots than tasks.

No. Slots just process tasks. One could imagine a scenario where there was just a single slot for multiple tasks, processing one task at a time. Granted, this is the opposite of what Spark should be used for, which is distributed data processing over multiple cores and machines, performing many tasks in parallel.

There must be less executors than tasks.

No, there is no such requirement.
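As a sketch of the slot arithmetic described above (the configuration values are made up for illustration), the number of slots per executor follows from spark.executor.cores divided by spark.task.cpus:

from pyspark.sql import SparkSession

# Hypothetical configuration: 4 cores per executor and 1 CPU per task
# => 4 / 1 = 4 task slots per executor, so up to 4 tasks can run in parallel on each executor
spark = (SparkSession.builder
         .config("spark.executor.cores", "4")
         .config("spark.task.cpus", "1")
         .getOrCreate())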

More info: Spark Architecture | Distributed Systems Architecture (https://bit.ly/3x4MZZt)

Question #233

Which of the following code blocks returns a DataFrame with an added column to DataFrame transactionsDf that shows the unix epoch timestamps in column transactionDate as strings in the format month/day/year in column transactionDateFormatted?

Excerpt of DataFrame transactionsDf:

  • A . transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="dd/MM/yyyy"))
  • B . transactionsDf.withColumnRenamed("transactionDate", "transactionDateFormatted", from_unixtime("transactionDateFormatted", format="MM/dd/yyyy"))
  • C . transactionsDf.apply(from_unixtime(format="MM/dd/yyyy")).asColumn("transactionDateFormatted")
  • D . transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="MM/dd/yyyy"))
  • E . transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate"))

Reveal Solution Hide Solution

Correct Answer: D
D

Explanation:

transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate",

format="MM/dd/yyyy"))

Correct. This code block adds a new column with the name transactionDateFormatted to DataFrame transactionsDf, using Spark’s from_unixtime method to transform values in column

transactionDate into strings, following the format requested in the question.

transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="dd/MM/yyyy"))

No. Although almost correct, this uses the wrong format for the timestamp to date conversion: day/month/year instead of month/day/year.

transactionsDf.withColumnRenamed("transactionDate", "transactionDateFormatted", from_unixtime("transactionDateFormatted", format="MM/dd/yyyy"))

Incorrect. This answer uses wrong syntax. DataFrame.withColumnRenamed() is for renaming an existing column and takes only two string parameters, specifying the old and the new name of the column.

transactionsDf.apply(from_unixtime(format="MM/dd/yyyy")).asColumn("transactionDateFormatted")

Wrong. Although this answer looks very tempting, it is actually incorrect Spark syntax. In Spark, there is no method DataFrame.apply(). Spark has an apply() method that can be used on grouped data, but this is irrelevant for this question, since we do not deal with grouped data here.

transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate")) No. Although this is valid Spark syntax, the strings in column transactionDateFormatted would look like this: 2020-04-26 15:35:32, the default format specified in Spark for from_unixtime and not

what is asked for in the question.
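A minimal sketch of the correct answer, assuming transactionsDf is an existing DataFrame whose column transactionDate holds unix epoch seconds:

from pyspark.sql.functions import from_unixtime

formattedDf = transactionsDf.withColumn(
    "transactionDateFormatted",
    from_unixtime("transactionDate", format="MM/dd/yyyy")
)
formattedDf.select("transactionDate", "transactionDateFormatted").show()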

More info: pyspark.sql.functions.from_unixtime ― PySpark 3.1.1 documentation and pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.1 documentation

Static notebook | Dynamic notebook: See test 1,

Question #235

The code block displayed below contains an error. The code block is intended to write DataFrame transactionsDf to disk as a parquet file in location /FileStore/transactions_split, using column storeId as key for partitioning. Find the error.

Code block:

transactionsDf.write.format("parquet").partitionOn("storeId").save("/FileStore/transactions_s plit")A.

  • A . The format("parquet") expression is inappropriate to use here, "parquet" should be passed as first argument to the save() operator and "/FileStore/transactions_split" as the second argument.
  • B . Partitioning data by storeId is possible with the partitionBy expression, so partitionOn should be replaced by partitionBy.
  • C . Partitioning data by storeId is possible with the bucketBy expression, so partitionOn should be replaced by bucketBy.
  • D . partitionOn("storeId") should be called before the write operation.
  • E . The format("parquet") expression should be removed and instead, the information should be added to the write expression like so: write("parquet").

Reveal Solution Hide Solution

Correct Answer: B
B

Explanation:

Correct code block:

transactionsDf.write.format("parquet").partitionBy("storeId").save("/FileStore/transactions_s plit")

More info: partition by – Reading files which are written using PartitionBy or BucketBy in Spark – Stack Overflow

Static notebook | Dynamic notebook: See test 1,

Question #250

spark.sql(statement).drop("value", "storeId", "attributes")

Reveal Solution Hide Solution

Correct Answer: E

Explanation:

This question offers you a wide variety of answers for a seemingly simple task. However, this variety reflects the many ways that one can express a join in PySpark. You also need to understand some SQL syntax to get to the correct answer here.

transactionsDf.createOrReplaceTempView('transactionsDf')

itemsDf.createOrReplaceTempView('itemsDf')

statement = """

SELECT * FROM transactionsDf

INNER JOIN itemsDf

ON transactionsDf.productId==itemsDf.itemId

"""

spark.sql(statement).drop("value", "storeId", "attributes") Correct – this answer uses SQL correctly to perform the inner join and afterwards drops the unwanted columns. This is totally fine. If you are unfamiliar with the triple-quote """ in Python: This allows

you to express strings as multiple lines.

transactionsDf.drop(col('value'), col('storeId')).join(itemsDf.drop(col('attributes')), col('productId')==col('itemId'))

No, this answer option is a trap, since DataFrame.drop() does not accept multiple Column objects (only column names as strings, or a single Column). You could use transactionsDf.drop('value', 'storeId') instead.

transactionsDf.drop("value", "storeId").join(itemsDf.drop("attributes"), "transactionsDf.productId==itemsDf.itemId")

Incorrect. Spark does not evaluate "transactionsDf.productId==itemsDf.itemId" as a valid join expression; this would work if it were not a string.

transactionsDf.drop('value', 'storeId').join(itemsDf.select('attributes'), transactionsDf.productId==itemsDf.itemId)

Wrong, this statement incorrectly uses itemsDf.select instead of itemsDf.drop.

transactionsDf.createOrReplaceTempView('transactionsDf')

itemsDf.createOrReplaceTempView('itemsDf')

spark.sql("SELECT -value, -storeId FROM transactionsDf INNER JOIN itemsDf ON productId==itemId").drop("attributes")

No, here the SQL expression syntax is incorrect. Simply specifying -columnName does not drop a column.
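As a hedged alternative to the SQL approach, the same result could also be produced directly with the DataFrame API (the column names are taken from the question):

# Inner join on the product/item key, then drop the unwanted columns
joinedDf = (transactionsDf
            .join(itemsDf, transactionsDf.productId == itemsDf.itemId, "inner")
            .drop("value", "storeId", "attributes"))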

More info: pyspark.sql.DataFrame.join ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3,

Question #260

transactionsDf.withColumn("result", evaluateTestSuccess(col("storeId")))

Reveal Solution Hide Solution

Correct Answer: A

Explanation:

Recognizing that the UDF specification requires a return type (unless it is a string, which is the default) is important for solving this question. In addition, you should make sure that the generated UDF (evaluateTestSuccessUDF), and not the Python function (evaluateTestSuccess), is applied to column storeId.
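As an illustration only (the full question is not reproduced here), a minimal sketch of that pattern; the function body and the BooleanType return type are assumptions:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType

def evaluateTestSuccess(storeId):
    # Hypothetical logic; the original question defines the actual function.
    return storeId is not None and storeId % 2 == 0

# Declare the return type explicitly when generating the UDF...
evaluateTestSuccessUDF = udf(evaluateTestSuccess, BooleanType())

# ...and apply the generated UDF, not the plain Python function, to column storeId.
transactionsDf.withColumn("result", evaluateTestSuccessUDF(col("storeId")))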

More info: pyspark.sql.functions.udf ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 2, 34.(Databricks import instructions)

Question #261

Which of the following statements about broadcast variables is correct?

  • A . Broadcast variables are serialized with every single task.
  • B . Broadcast variables are commonly used for tables that do not fit into memory.
  • C . Broadcast variables are immutable.
  • D . Broadcast variables are occasionally dynamically updated on a per-task basis.
  • E . Broadcast variables are local to the worker node and not shared across the cluster.

Reveal Solution Hide Solution

Correct Answer: C

Explanation:

Broadcast variables are local to the worker node and not shared across the cluster.

This is wrong because broadcast variables are meant to be shared across the cluster. As such, they are never just local to one worker node, but are available to all worker nodes.

Broadcast variables are commonly used for tables that do not fit into memory.

This is wrong because broadcast variables can only be broadcast precisely because they are small and do fit into memory.

Broadcast variables are serialized with every single task.

This is wrong because they are cached once on every machine in the cluster, which avoids having to serialize them with every single task.

Broadcast variables are occasionally dynamically updated on a per-task basis.

This is wrong because broadcast variables are immutable: they are never updated.

More info: Spark: The Definitive Guide, Chapter 14
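To illustrate the immutability point with a runnable sketch (the lookup table and column names here are made up, not from the question): a broadcast variable is created once on the driver, shipped to every executor, and then only read through .value; there is no API for updating it afterwards.

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

# Small lookup table that comfortably fits into memory.
state_names = {"CA": "California", "NY": "New York"}
bc_states = spark.sparkContext.broadcast(state_names)

@udf(returnType=StringType())
def full_state_name(code):
    # Tasks only read the broadcast value; they cannot modify it.
    return bc_states.value.get(code, "unknown")

df = spark.createDataFrame([("CA",), ("TX",)], ["state"])
df.withColumn("stateName", full_state_name(col("state"))).show()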

Question #262

The code block displayed below contains an error. The code block should count the number of rows that have a predError of either 3 or 6. Find the error.

Code block:

transactionsDf.filter(col('predError').in([3, 6])).count()

  • A . The number of rows cannot be determined with the count() operator.
  • B . Instead of filter, the select method should be used.
  • C . The method used on column predError is incorrect.
  • D . Instead of a list, the values need to be passed as single arguments to the in operator.
  • E . Numbers 3 and 6 need to be passed as string variables.

Reveal Solution Hide Solution

Correct Answer: C

Explanation:

Correct code block:

transactionsDf.filter(col('predError').isin([3, 6])).count()

The isin method is the correct one to use here; the in method does not exist for the Column object.

More info: pyspark.sql.Column.isin ― PySpark 3.1.2 documentation
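A self-contained sketch of the corrected code block, using made-up data (the real transactionsDf is defined in the exam's setup, not here):

from pyspark.sql.functions import col

transactionsDf = spark.createDataFrame(
    [(1, 3), (2, 6), (3, 4)], ["transactionId", "predError"]
)
# Count the rows whose predError is either 3 or 6.
transactionsDf.filter(col('predError').isin([3, 6])).count()  # -> 2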

Question #263

Which of the following code blocks returns a new DataFrame with the same columns as DataFrame transactionsDf, except for columns predError and value which should be removed?

  • A . transactionsDf.drop(["predError", "value"])
  • B . transactionsDf.drop("predError", "value")
  • C . transactionsDf.drop(col("predError"), col("value"))
  • D . transactionsDf.drop(predError, value)
  • E . transactionsDf.drop("predError & value")

Reveal Solution Hide Solution

Correct Answer: B

Explanation:

DataFrame.drop() takes the names of the columns to remove as separate string arguments, which is exactly what option B does; passing a Python list as in option A is not supported.

More info: pyspark.sql.DataFrame.drop ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 2, 58.(Databricks import instructions)
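A quick sketch with made-up columns showing why option B is the right call: drop() takes the column names as separate string arguments and returns a new DataFrame without them.

df = spark.createDataFrame(
    [(1, 0.5, 25.0, 3)], ["transactionId", "predError", "value", "storeId"]
)
df.drop("predError", "value").columns  # -> ['transactionId', 'storeId']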


Question #275

spark.read.options("modifiedBefore", "2029-03-20T05:44:46").schema(schema).load(filePath)

  • A . The attributes array is specified incorrectly, Spark cannot identify the file format, and the syntax of the call to Spark’s DataFrameReader is incorrect.
  • B . Columns in the schema definition use the wrong object type and the syntax of the call to Spark’s DataFrameReader is incorrect.
  • C . The data type of the schema is incompatible with the schema() operator and the modification date threshold is specified incorrectly.
  • D . Columns in the schema definition use the wrong object type, the modification date threshold is specified incorrectly, and Spark cannot identify the file format.
  • E . Columns in the schema are unable to handle empty values and the modification date threshold is specified incorrectly.

Reveal Solution Hide Solution

Correct Answer: D

Explanation:

Correct code block:

schema = StructType([
    StructField("itemId", IntegerType(), True),
    StructField("attributes", ArrayType(StringType(), True), True),
    StructField("supplier", StringType(), True)
])

spark.read.options(modifiedBefore="2029-03-20T05:44:46").schema(schema).parquet(filePath)

This question is more difficult than what you would encounter in the exam. In the exam, for this question type, only one error needs to be identified, not "one or multiple" as in the question.

Columns in the schema definition use the wrong object type, the modification date threshold is specified incorrectly, and Spark cannot identify the file format.

Correct! Columns in the schema definition should use the StructField type. Building a schema from pyspark.sql.types, as here using classes like StructType and StructField, is one of multiple ways of expressing a schema in Spark. A StructType always contains a list of StructFields (see documentation linked below), so nesting StructType inside StructType, as shown in the question, is wrong.

The modification date threshold should be specified by a keyword argument like options(modifiedBefore="2029-03-20T05:44:46"), not by two consecutive non-keyword arguments as in the original code block (see documentation linked below).

Spark cannot identify the file format correctly, because it has to be specified either by using DataFrameReader.format(), as an argument to DataFrameReader.load(), or directly by calling, for example, DataFrameReader.parquet().

Columns in the schema are unable to handle empty values and the modification date threshold is specified incorrectly.

No. If StructField were used for the columns instead of StructType (see above), the third argument would specify whether the column is nullable. The original schema shows that the columns should be nullable, and this is specified correctly by the third argument being True in the schema in the code block. It is correct, however, that the modification date threshold is specified incorrectly (see above).

The attributes array is specified incorrectly, Spark cannot identify the file format, and the syntax of the call to Spark’s DataFrameReader is incorrect.

Wrong. The attributes array is specified correctly, following the syntax for ArrayType (see linked documentation below). That Spark cannot identify the file format is correct (see the correct answer above). In addition, the DataFrameReader is called correctly through the SparkSession spark.

Columns in the schema definition use the wrong object type and the syntax of the call to Spark’s DataFrameReader is incorrect.

Incorrect. The object types in the schema definition are indeed wrong (see above), but the syntax of the call to Spark’s DataFrameReader is correct, so this answer only identifies part of the problem.

The data type of the schema is incompatible with the schema() operator and the modification date threshold is specified incorrectly.

False. The data type of the schema is StructType, which is an accepted data type for the DataFrameReader.schema() method. It is correct, however, that the modification date threshold is specified incorrectly (see the correct answer above).
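For convenience, the corrected code block once more as a self-contained sketch, including the imports it needs; filePath is assumed to point at existing parquet data.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType

schema = StructType([
    StructField("itemId", IntegerType(), True),
    StructField("attributes", ArrayType(StringType(), True), True),
    StructField("supplier", StringType(), True)
])

# modifiedBefore is passed as a keyword argument, and the file format is fixed
# by calling parquet() directly instead of the generic load().
itemsDf = spark.read.options(modifiedBefore="2029-03-20T05:44:46").schema(schema).parquet(filePath)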

Question #276

Which of the following code blocks stores a part of the data in DataFrame itemsDf on executors?

  • A . itemsDf.cache().count()
  • B . itemsDf.cache(eager=True)
  • C . cache(itemsDf)
  • D . itemsDf.cache().filter()
  • E . itemsDf.rdd.storeCopy()

Reveal Solution Hide Solution

Correct Answer: A

Explanation:

Caching means storing a copy of a partition on an executor, so it can be accessed quicker by subsequent operations, instead of having to be recalculated. cache() is a lazily-evaluated method of the DataFrame. Since count() is an action (while filter() is not), it triggers the caching process.

More info: pyspark.sql.DataFrame.cache ― PySpark 3.1.2 documentation, Learning Spark, 2nd Edition, Chapter 7

Static notebook | Dynamic notebook: See test 2, 20.(Databricks import instructions)
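A minimal sketch of this behavior, using a made-up DataFrame (spark.range is just a stand-in for itemsDf):

itemsDf = spark.range(100000).withColumnRenamed("id", "itemId")

itemsDf.cache()    # lazy: only marks the DataFrame for caching, nothing is stored yet
itemsDf.count()    # action: partitions are computed and stored on the executors

itemsDf.filter("itemId > 10").count()  # subsequent operations can now read from the cache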

Question #277

The code block displayed below contains an error. The code block is intended to join DataFrame itemsDf with the larger DataFrame transactionsDf on column itemId. Find the error.

Code block:

transactionsDf.join(itemsDf, "itemId", how="broadcast")

  • A . The syntax is wrong, how= should be removed from the code block.
  • B . The join method should be replaced by the broadcast method.
  • C . Spark will only perform the broadcast operation if this behavior has been enabled on the Spark cluster.
  • D . The larger DataFrame transactionsDf is being broadcasted, rather than the smaller DataFrame itemsDf.
  • E . broadcast is not a valid join type.

Reveal Solution Hide Solution

Correct Answer: E

Explanation:

broadcast is not a valid join type.

Correct! The code block should read transactionsDf.join(broadcast(itemsDf), "itemId"). This would imply an inner join (this is the default in DataFrame.join()), but since the join type is not given in the question, this would be a valid choice.

The larger DataFrame transactionsDf is being broadcasted, rather than the smaller DataFrame itemsDf.

This option does not apply here, since the syntax around broadcasting is incorrect.

Spark will only perform the broadcast operation if this behavior has been enabled on the Spark cluster.

No, it is enabled by default, since the spark.sql.autoBroadcastJoinThreshold property is set to 10 MB by default. If that property were set to -1, broadcast joining would be disabled.

The join method should be replaced by the broadcast method.

No, DataFrame has no broadcast() method.

The syntax is wrong, how= should be removed from the code block.

No, having the keyword argument how= is totally acceptable.

More info: Performance Tuning – Spark 3.1.1 Documentation (https://bit.ly/3gCz34r)
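Putting the corrected version together in one snippet, as quoted in the explanation above:

from pyspark.sql.functions import broadcast

# broadcast() marks the smaller DataFrame; the join itself stays an ordinary (inner) join.
transactionsDf.join(broadcast(itemsDf), "itemId")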


Question #279

print(itemsDf.types)

B. itemsDf.printSchema()

C. spark.schema(itemsDf)

D. itemsDf.rdd.printSchema()

E. itemsDf.print.schema()

Reveal Solution Hide Solution

Correct Answer: B

Explanation:

itemsDf.printSchema()

Correct! Here is an example of what itemsDf.printSchema() shows; you can see the tree-like structure containing both column names and types:

root
 |-- itemId: integer (nullable = true)
 |-- attributes: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- supplier: string (nullable = true)

itemsDf.rdd.printSchema()

No, the DataFrame’s underlying RDD does not have a printSchema() method.

spark.schema(itemsDf)

Incorrect, there is no spark.schema command.

print(itemsDf.columns)

print(itemsDf.dtypes)

Wrong. While the output of this code block contains both column names and column types, the information is not arranged in a tree-like way.

itemsDf.print.schema()

No, DataFrame does not have a print method.

Static notebook | Dynamic notebook: See test 3,
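A short sketch contrasting the options, assuming itemsDf has the schema shown above (the exact output strings may vary):

itemsDf.printSchema()    # prints the tree above to stdout and returns None
print(itemsDf.columns)   # e.g. ['itemId', 'attributes', 'supplier']
print(itemsDf.dtypes)    # e.g. [('itemId', 'int'), ('attributes', 'array<string>'), ('supplier', 'string')]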


Question #281

The code block displayed below contains an error. The code block should return the average of rows in column value grouped by unique storeId. Find the error.

Code block:

transactionsDf.agg("storeId").avg("value")

  • A . Instead of avg("value"), avg(col("value")) should be used.
  • B . The avg("value") should be specified as a second argument to agg() instead of being appended to it.
  • C . All column names should be wrapped in col() operators.
  • D . agg should be replaced by groupBy.
  • E . "storeId" and "value" should be swapped.

Reveal Solution Hide Solution

Correct Answer: D

Explanation:

DataFrame.agg() aggregates over the whole DataFrame without grouping, so to get one average per unique storeId the rows first need to be grouped with groupBy.

Static notebook | Dynamic notebook: See test 1, 30.(Databricks import instructions) (https://flrs.github.io/spark_practice_tests_code/#1/30.html , https://bit.ly/sparkpracticeexams_import_instructions)
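A minimal sketch of the corrected code block implied by answer D, in two equivalent forms:

# Group by unique storeId, then average value per group.
transactionsDf.groupBy("storeId").avg("value")

# Equivalent, using an explicit aggregate expression.
from pyspark.sql.functions import avg
transactionsDf.groupBy("storeId").agg(avg("value"))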

Question #282

Which of the following statements about the differences between actions and transformations is correct?

  • A . Actions are evaluated lazily, while transformations are not evaluated lazily.
  • B . Actions generate RDDs, while transformations do not.
  • C . Actions do not send results to the driver, while transformations do.
  • D . Actions can be queued for delayed execution, while transformations can only be processed immediately.
  • E . Actions can trigger Adaptive Query Execution, while transformation cannot.

Reveal Solution Hide Solution

Correct Answer: E

Explanation:

Actions can trigger Adaptive Query Execution, while transformation cannot.

Correct. Adaptive Query Execution optimizes queries at runtime. Since transformations are evaluated lazily, Spark does not have any runtime information to optimize the query until an action is called. If Adaptive Query Execution is enabled, Spark will then try to optimize the query based on the feedback it gathers while it is evaluating the query.

Actions can be queued for delayed execution, while transformations can only be processed immediately.

No, there is no such concept as "delayed execution" in Spark. Actions cannot be evaluated lazily, meaning that they are executed immediately.

Actions are evaluated lazily, while transformations are not evaluated lazily.

Incorrect, it is the other way around: Transformations are evaluated lazily and actions trigger their evaluation.

Actions generate RDDs, while transformations do not.

No. Transformations change the data and, since RDDs are immutable, generate new RDDs along the way. Actions produce outputs in Python and data types (integers, lists, text files,…) based on

the RDDs, but they do not generate them.

Here is a great tip on how to differentiate actions from transformations: If an operation returns a DataFrame, Dataset, or an RDD, it is a transformation. Otherwise, it is an action. Actions do not send results to the driver, while transformations do.

No. Actions send results to the driver. Think about running DataFrame.count(). The result of this command will return a number to the driver. Transformations, however, do not send results back to

the driver. They produce RDDs that remain on the worker nodes.

More info: What is the difference between a transformation and an action in Apache Spark? | Bartosz Mikulski, How to Speed up SQL Queries with Adaptive Query Execution
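A short sketch of the tip above, assuming a SparkSession named spark:

df = spark.range(10)            # DataFrame

filtered = df.filter("id > 5")  # returns a DataFrame -> transformation, evaluated lazily
n = filtered.count()            # returns a plain Python int -> action, triggers execution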


Question #290


Code block:

itemsDf.withColumnRenamed("itemNameElements", split("itemName"))

  • A . All column names need to be wrapped in the col() operator.
  • B . Operator withColumnRenamed needs to be replaced with operator withColumn and a second argument "," needs to be passed to the split method.
  • C . Operator withColumnRenamed needs to be replaced with operator withColumn and the split method needs to be replaced by the splitString method.
  • D . Operator withColumnRenamed needs to be replaced with operator withColumn and a second argument " " needs to be passed to the split method.
  • E . The expressions "itemNameElements" and split("itemName") need to be swapped.

Reveal Solution Hide Solution

Correct Answer: D

Explanation:

Correct code block:

itemsDf.withColumn("itemNameElements", split("itemName"," "))

Output of code block:

+------+----------------------------------+-------------------+------------------------------------------+
|itemId|itemName                          |supplier           |itemNameElements                          |
+------+----------------------------------+-------------------+------------------------------------------+
|1     |Thick Coat for Walking in the Snow|Sports Company Inc.|[Thick, Coat, for, Walking, in, the, Snow]|
|2     |Elegant Outdoors Summer Dress     |YetiX              |[Elegant, Outdoors, Summer, Dress]        |
|3     |Outdoors Backpack                 |Sports Company Inc.|[Outdoors, Backpack]                      |
+------+----------------------------------+-------------------+------------------------------------------+

The key to solving this question is that the split method definitely needs a second argument here (also see the documentation linked below). Given the values in column itemName in DataFrame itemsDf, this should be a space character " ", which is the character on which the words in the column need to be split.

More info: pyspark.sql.functions.split ― PySpark 3.1.1 documentation

Static notebook | Dynamic notebook: See test 1, 46.(Databricks import instructions)
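As a runnable sketch of the corrected code block (assuming itemsDf as shown in the output above); note that split() interprets its second argument as a regular-expression pattern, which for these item names is simply a single space:

from pyspark.sql.functions import split, col

itemsDf.withColumn("itemNameElements", split(col("itemName"), " ")).show(truncate=False)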


Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #291

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to

transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

  • A . The arguments to the withColumn method need to be reordered.
  • B . The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.
  • C . The copy() operator should be appended to the code block to ensure a copy is returned.
  • D . Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.
  • E . The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Correct code block:

transactionsDf.withColumnRenamed("transactionId", "transactionNumber")

Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.

More info: pyspark.sql.DataFrame.withColumnRenamed ― PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2,

Question #343

transactionsDf.select(count_to_target_udf('predError'))

Reveal Solution Hide Solution

Correct Answer: C

Explanation:

Correct code block:

def count_to_target(target):
    if target is None:
        return
    result = list(range(target))
    return result

count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))

transactionsDf.select(count_to_target_udf('predError'))

Output of correct code block:

+--------------------------+
|count_to_target(predError)|
+--------------------------+
|                 [0, 1, 2]|
|        [0, 1, 2, 3, 4, 5]|
|                 [0, 1, 2]|
|                      null|
|                      null|
|                 [0, 1, 2]|
+--------------------------+

This QUESTION NO: is not exactly easy. You need to be familiar with the syntax around UDFs (user-defined functions). Specifically, in this QUESTION NO: it is important to pass the correct types to the udf method – returning an array of a specific type rather than just a single type means you need to think harder about type implications than usual.

Remember that in Spark, you always pass types in an instantiated way like ArrayType(IntegerType()), not like ArrayType(IntegerType). The parentheses () are the key here – make sure you do not forget those.

You should also pay attention that you actually pass the UDF count_to_target_udf, and not the Python method count_to_target to the select() operator.

Finally, null values are always a tricky case with UDFs. So, take care that the code can handle them correctly.

More info: How to Turn Python Functions into PySpark Functions (UDF) - Chang Hsin Lee - Committing my thoughts to words.

Static notebook | Dynamic notebook: See test 3, 24.(Databricks import instructions)
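
For completeness, a self-contained sketch of the code block above with the required imports (assuming transactionsDf already exists and spark is an active SparkSession) might look like this:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

def count_to_target(target):
    # Returning None for null inputs yields null in the result column
    if target is None:
        return
    return list(range(target))

# Register the Python function as a UDF returning an array of integers
count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))

transactionsDf.select(count_to_target_udf('predError')).show()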

Question #344

Which of the following code blocks shuffles DataFrame transactionsDf, which has 8 partitions, so that it has 10 partitions?

  • A . transactionsDf.repartition(transactionsDf.getNumPartitions()+2)
  • B . transactionsDf.repartition(transactionsDf.rdd.getNumPartitions()+2)
  • C . transactionsDf.coalesce(10)
  • D . transactionsDf.coalesce(transactionsDf.getNumPartitions()+2)
  • E . transactionsDf.repartition(transactionsDf._partitions+2)

Reveal Solution Hide Solution

Correct Answer: B
B

Explanation:

transactionsDf.repartition(transactionsDf.rdd.getNumPartitions()+2)

Correct. The repartition operator is the correct one for increasing the number of partitions.

Calling getNumPartitions() on DataFrame.rdd returns the current number of partitions.

transactionsDf.coalesce(10)

No, after this command transactionsDf will continue to only have 8 partitions. This is because coalesce() can only decrease the number of partitions, but not increase it.

transactionsDf.repartition(transactionsDf.getNumPartitions()+2)

Incorrect, there is no getNumPartitions() method for the DataFrame class.

transactionsDf.coalesce(transactionsDf.getNumPartitions()+2)

Wrong, coalesce() can only be used for reducing the number of partitions and there is no getNumPartitions() method for the DataFrame class.

transactionsDf.repartition(transactionsDf._partitions+2)

No, DataFrame has no _partitions attribute. You can find out the current number of partitions of a DataFrame with the DataFrame.rdd.getNumPartitions() method.

More info: pyspark.sql.DataFrame.repartition ― PySpark 3.1.2 documentation, pyspark.RDD.getNumPartitions ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3, 23.(Databricks import instructions)
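
A minimal sketch of the correct answer in action, assuming transactionsDf already exists with the 8 partitions described in the question:

print(transactionsDf.rdd.getNumPartitions())   # 8 in the scenario of this question
repartitionedDf = transactionsDf.repartition(transactionsDf.rdd.getNumPartitions() + 2)
print(repartitionedDf.rdd.getNumPartitions())  # 10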

Question #352

part-00003-tid-2754546451699747124-10eb85bf-8d91-4dd0-b60b-2f3c02eeecaa-301-1-c000.csv.gz

spark.option("header",True).csv(filePath)

  • A . spark.read.format("csv").option("header",True).option("compression","zip").load(filePath)
  • B . spark.read().option("header",True).load(filePath)
  • C . spark.read.format("csv").option("header",True).load(filePath)
  • D . spark.read.load(filePath)

Reveal Solution Hide Solution

Correct Answer: C
C

Explanation:

The files in directory filePath are partitions of a DataFrame that have been exported using gzip compression. Spark automatically recognizes this situation and imports the CSV files as separate partitions into a single DataFrame. It is, however, necessary to specify that Spark should load the file headers in the CSV with the header option, which is set to False by default.
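
As a hedged sketch (assuming filePath points at the directory holding the gzip-compressed part files listed above; the DataFrame name is arbitrary), the correct answer would be used like this:

# Spark reads all part files in the directory into a single DataFrame;
# gzip compression of the CSV files is detected automatically.
itemsDf = spark.read.format("csv").option("header", True).load(filePath)
itemsDf.printSchema()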

Question #353

Which of the following statements about stages is correct?

  • A . Different stages in a job may be executed in parallel.
  • B . Stages consist of one or more jobs.
  • C . Stages ephemerally store transactions, before they are committed through actions.
  • D . Tasks in a stage may be executed by multiple machines at the same time.
  • E . Stages may contain multiple actions, narrow, and wide transformations.

Reveal Solution Hide Solution

Correct Answer: D
D

Explanation:

Tasks in a stage may be executed by multiple machines at the same time.

This is correct. Within a single stage, tasks do not depend on each other. Executors on multiple machines may execute tasks belonging to the same stage on the respective partitions they are

holding at the same time.

Different stages in a job may be executed in parallel.

No. Different stages in a job depend on each other and cannot be executed in parallel. The nuance is that every task in a stage may be executed in parallel by multiple machines.

For example, if a job consists of Stage A and Stage B, tasks belonging to those stages may not be executed in parallel. However, tasks from Stage A may be executed on multiple machines at the same time, with each machine running it on a different partition of the same dataset. Then, afterwards, tasks from Stage B may be executed on multiple machines at the same time.

Stages may contain multiple actions, narrow, and wide transformations.

No, stages may not contain multiple wide transformations. Wide transformations mean that shuffling is required. Shuffling typically terminates a stage, because data needs to be exchanged across the cluster. This data exchange often causes partitions to change and rearrange, making it impossible to perform tasks in parallel on the same dataset.

Stages ephemerally store transactions, before they are committed through actions.

No, this does not make sense. Stages do not "store" any data. Transactions are not "committed" in Spark.

Stages consist of one or more jobs.

No, it is the other way around: Jobs consist of one or more stages.

More info: Spark: The Definitive Guide, Chapter 15.

Question #354

Which of the following code blocks reads in parquet file /FileStore/imports.parquet as a

DataFrame?

  • A . spark.mode("parquet").read("/FileStore/imports.parquet")
  • B . spark.read.path("/FileStore/imports.parquet", source="parquet")
  • C . spark.read().parquet("/FileStore/imports.parquet")
  • D . spark.read.parquet("/FileStore/imports.parquet")
  • E . spark.read().format('parquet').open("/FileStore/imports.parquet")

Reveal Solution Hide Solution

Correct Answer: D
D

Explanation:

Static notebook | Dynamic notebook: See test 1,
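
A minimal sketch, assuming the parquet file actually exists at that DBFS path and spark is an active SparkSession:

importsDf = spark.read.parquet("/FileStore/imports.parquet")
importsDf.printSchema()  # inspect the inferred schema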

Question #356

Which of the following describes the difference between client and cluster execution modes?

  • A . In cluster mode, the driver runs on the worker nodes, while the client mode runs the driver on the client machine.
  • B . In cluster mode, the driver runs on the edge node, while the client mode runs the driver in a worker node.
  • C . In cluster mode, each node will launch its own executor, while in client mode, executors will exclusively run on the client machine.
  • D . In client mode, the cluster manager runs on the same host as the driver, while in cluster mode, the cluster manager runs on a separate node.
  • E . In cluster mode, the driver runs on the master node, while in client mode, the driver runs on a virtual machine in the cloud.

Reveal Solution Hide Solution

Correct Answer: A
A

Explanation:

In cluster mode, the driver runs on the master node, while in client mode, the driver runs on a virtual machine in the cloud.

This is wrong, since execution modes do not specify whether workloads are run in the cloud or on-premise.

In cluster mode, each node will launch its own executor, while in client mode, executors will exclusively run on the client machine.

Wrong, since in both cases executors run on worker nodes.

In cluster mode, the driver runs on the edge node, while the client mode runs the driver in a worker node.

Wrong - in cluster mode, the driver runs on a worker node. In client mode, the driver runs on the client machine.

In client mode, the cluster manager runs on the same host as the driver, while in cluster mode, the cluster manager runs on a separate node.

No. In both modes, the cluster manager is typically on a separate node, not on the same host as the driver. It only runs on the same host as the driver in local execution mode.

More info: Learning Spark, 2nd Edition, Chapter 1, and Spark: The Definitive Guide, Chapter 15.

Question #357

Which of the following statements about Spark’s configuration properties is incorrect?

  • A . The maximum number of tasks that an executor can process at the same time is controlled by the spark.task.cpus property.
  • B . The maximum number of tasks that an executor can process at the same time is controlled by the spark.executor.cores property.
  • C . The default value for spark.sql.autoBroadcastJoinThreshold is 10MB.
  • D . The default number of partitions to use when shuffling data for joins or aggregations is 300.
  • E . The default number of partitions returned from certain transformations can be controlled by the spark.default.parallelism property.

Reveal Solution Hide Solution

Correct Answer: D
D

Explanation:

The default number of partitions to use when shuffling data for joins or aggregations is 300.

No, the default value of the applicable property spark.sql.shuffle.partitions is 200.

The maximum number of tasks that an executor can process at the same time is controlled by the spark.executor.cores property.

Correct, see below.

The maximum number of tasks that an executor can process at the same time is controlled by the spark.task.cpus property.

Correct, the maximum number of tasks that an executor can process in parallel depends on both properties spark.task.cpus and spark.executor.cores. This is because the available number of slots is calculated by dividing the number of cores per executor by the number of cores per task. For more information specific to this point, check out Spark Architecture | Distributed Systems Architecture.

More info: Configuration – Spark 3.1.2 Documentation
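
As a hedged sketch (assuming an active SparkSession named spark), the defaults discussed above can be inspected directly:

# Default number of shuffle partitions for joins/aggregations: 200
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Default broadcast join threshold corresponds to 10MB
# (the exact string returned may be expressed in bytes)
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))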

Question #358

Which of the following describes a valid concern about partitioning?

  • A . A shuffle operation returns 200 partitions if not explicitly set.
  • B . Decreasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.
  • C . No data is exchanged between executors when coalesce() is run.
  • D . Short partition processing times are indicative of low skew.
  • E . The coalesce() method should be used to increase the number of partitions.

Reveal Solution Hide Solution

Correct Answer: A
A

Explanation:

A shuffle operation returns 200 partitions if not explicitly set.

Correct. 200 is the default value for the Spark property spark.sql.shuffle.partitions. This property determines how many partitions Spark uses when shuffling data for joins or aggregations.

The coalesce() method should be used to increase the number of partitions.

Incorrect. The coalesce() method can only be used to decrease the number of partitions.

Decreasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.

No. For narrow transformations, fewer partitions usually result in a longer overall runtime, if more executors are available than partitions.

A narrow transformation does not include a shuffle, so no data need to be exchanged between executors. Shuffles are expensive and can be a bottleneck for executing Spark workloads.

Narrow transformations, however, are executed on a per-partition basis, blocking one executor per partition. So, it matters how many executors are available to perform work in parallel relative to the number of partitions. If the number of executors is greater than the number of partitions, then some executors are idle while others process the partitions. On the flip side, if the number of executors is smaller than the number of partitions, the entire operation can only be finished after some executors have processed multiple partitions, one after the other. To minimize the overall runtime, one would want to have the number of partitions equal to the number of executors (but not more).

So, for the scenario at hand, increasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.

No data is exchanged between executors when coalesce() is run.

No. While coalesce() avoids a full shuffle, it may still cause a partial shuffle, resulting in data exchange between executors.

Short partition processing times are indicative of low skew.

Incorrect. Data skew means that data is distributed unevenly over the partitions of a dataset. Low skew therefore means that data is distributed evenly.

Partition processing time, the time that executors take to process partitions, can be indicative of skew if some executors take a long time to process a partition, but others do not. However, a short processing time is not per se indicative of low skew: It may simply be short because the partition is small.

A situation indicative of low skew may be when all executors finish processing their partitions in the same timeframe. High skew may be indicated by some executors taking much longer to finish their partitions than others. But the answer does not make any comparison, so by itself it does not provide enough information to make any assessment about skew.

More info: Spark Repartition & Coalesce – Explained and Performance Tuning – Spark 3.1.2 Documentation
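
A small, hedged sketch (using a made-up DataFrame created with spark.range) that illustrates why coalesce() cannot increase the partition count:

df = spark.range(0, 1000, numPartitions=8)        # hypothetical example DataFrame with 8 partitions
print(df.coalesce(10).rdd.getNumPartitions())     # still 8: coalesce() can only reduce partitions
print(df.repartition(10).rdd.getNumPartitions())  # 10: repartition() shuffles and can increase them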

Question #359

Which of the following code blocks selects all rows from DataFrame transactionsDf in which column productId is zero or smaller, or equal to 3?

  • A . transactionsDf.filter(productId==3 or productId<1)
  • B . transactionsDf.filter((col("productId")==3) or (col("productId")<1))
  • C . transactionsDf.filter(col("productId")==3 | col("productId")<1)
  • D . transactionsDf.where("productId"=3).or("productId"<1))
  • E . transactionsDf.filter((col("productId")==3) | (col("productId")<1))

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

This QUESTION NO: targets your knowledge about how to chain filtering conditions. Each filtering condition should be in parentheses. The correct operator for "or" is the pipe character (|) and not the word or. Another operator of concern is the equality operator. For the purpose of comparison, equality is expressed as two equal signs (==).

Static notebook | Dynamic notebook: See test 2,
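
A minimal sketch of the correct answer, assuming transactionsDf exists and contains a productId column; note the col import:

from pyspark.sql.functions import col

filteredDf = transactionsDf.filter((col("productId") == 3) | (col("productId") < 1))
filteredDf.show()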

Question #361

Which of the following describes a difference between Spark’s cluster and client execution modes?

  • A . In cluster mode, the cluster manager resides on a worker node, while it resides on an edge node in client mode.
  • B . In cluster mode, executor processes run on worker nodes, while they run on gateway nodes in client mode.
  • C . In cluster mode, the driver resides on a worker node, while it resides on an edge node in client mode.
  • D . In cluster mode, a gateway machine hosts the driver, while it is co-located with the executor in client mode.
  • E . In cluster mode, the Spark driver is not co-located with the cluster manager, while it is co-located in client mode.

Reveal Solution Hide Solution

Correct Answer: C
C

Explanation:

In cluster mode, the driver resides on a worker node, while it resides on an edge node in client mode.

Correct. The idea of Spark’s client mode is that workloads can be executed from an edge node, also known as gateway machine, from outside the cluster. The most common way to execute Spark

however is in cluster mode, where the driver resides on a worker node.

In practice, in client mode, there are tight constraints about the data transfer speed relative to the data transfer speed between worker nodes in the cluster. Also, any job that is executed in client mode will fail if the edge node fails. For these reasons, client mode is usually not used in a production environment.

In cluster mode, the cluster manager resides on a worker node, while it resides on an edge node in client execution mode.

No. In both execution modes, the cluster manager may reside on a worker node, but it

does not reside on an edge node in client mode.

In cluster mode, executor processes run on worker nodes, while they run on gateway nodes in client mode.

This is incorrect. Only the driver runs on gateway nodes (also known as "edge nodes") in client mode, but not the executor processes.

In cluster mode, the Spark driver is not co-located with the cluster manager, while it is co-located in client mode.

No, in client mode, the Spark driver is not co-located with the cluster manager. The whole point of client mode is that the driver is outside the cluster and not associated with the resource that manages the cluster (the machine that runs the cluster manager).

In cluster mode, a gateway machine hosts the driver, while it is co-located with the executor in client mode.

No, it is exactly the opposite: There are no gateway machines in cluster mode, but in client mode, they host the driver.

Question #362

Which of the following code blocks reads in the JSON file stored at filePath as a DataFrame?

  • A . spark.read.json(filePath)
  • B . spark.read.path(filePath, source="json")
  • C . spark.read().path(filePath)
  • D . spark.read().json(filePath)
  • E . spark.read.path(filePath)

Reveal Solution Hide Solution

Correct Answer: A
A

Explanation:

spark.read.json(filePath)

Correct. spark.read accesses Spark’s DataFrameReader. Then, Spark identifies the file type to be read as JSON type by passing filePath into the DataFrameReader.json() method.

spark.read.path(filePath)

Incorrect. Spark’s DataFrameReader does not have a path method. A universal way to read in files is provided by the DataFrameReader.load() method (link below).

spark.read.path(filePath, source="json")

Wrong. A DataFrameReader.path() method does not exist (see above).

spark.read().json(filePath)

Incorrect. spark.read is a way to access Spark’s DataFrameReader. However, the DataFrameReader is not callable, so calling it via spark.read() will fail.

spark.read().path(filePath)

No, Spark’s DataFrameReader is not callable (see above).

More info: pyspark.sql.DataFrameReader.json ― PySpark 3.1.2 documentation, pyspark.sql.DataFrameReader.load ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3,
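
A minimal sketch, assuming filePath points at a line-delimited JSON file and spark is an active SparkSession:

jsonDf = spark.read.json(filePath)
jsonDf.show(5)  # preview the first few rows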

Question #364

The code block displayed below contains an error. The code block should arrange the rows of DataFrame transactionsDf using information from two columns in an ordered fashion, arranging first by

column value, showing smaller numbers at the top and greater numbers at the bottom, and then by column predError, for which all values should be arranged in the inverse way of the order of items

in column value. Find the error.

Code block:

transactionsDf.orderBy('value', asc_nulls_first(col('predError')))

  • A . Two orderBy statements with calls to the individual columns should be chained, instead of having both columns in one orderBy statement.
  • B . Column value should be wrapped by the col() operator.
  • C . Column predError should be sorted in a descending way, putting nulls last.
  • D . Column predError should be sorted by desc_nulls_first() instead.
  • E . Instead of orderBy, sort should be used.

Reveal Solution Hide Solution

Correct Answer: C
C

Explanation:

Correct code block:

transactionsDf.orderBy('value', desc_nulls_last('predError'))

Column predError should be sorted in a descending way, putting nulls last.

Correct! By default, Spark sorts ascending, putting nulls first. So, the inverse sort of the default sort is indeed desc_nulls_last.

Instead of orderBy, sort should be used.

No. DataFrame.sort() orders data per partition; it does not guarantee a global order. This is why orderBy is the more appropriate operator here.

Column value should be wrapped by the col() operator.

Incorrect. DataFrame.sort() accepts both string and Column objects.

Column predError should be sorted by desc_nulls_first() instead.

Wrong. Since Spark’s default sort order matches asc_nulls_first(), nulls would have to come last when inverted.

Two orderBy statements with calls to the individual columns should be chained, instead of having both columns in one orderBy statement.

No, this would just sort the DataFrame by the very last column, but would not take information from both columns into account, as noted in the question.

More info: pyspark.sql.DataFrame.orderBy ― PySpark 3.1.2 documentation, pyspark.sql.functions.desc_nulls_last ― PySpark 3.1.2 documentation, sort() vs orderBy() in Spark | Towards Data Science

Static notebook | Dynamic notebook: See test 3, 27.(Databricks import instructions)
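
A minimal sketch of the corrected ordering, assuming transactionsDf exists with value and predError columns; note the desc_nulls_last import:

from pyspark.sql.functions import desc_nulls_last

# Ascending by value (Spark's default sort), then descending by predError with nulls last
orderedDf = transactionsDf.orderBy('value', desc_nulls_last('predError'))
orderedDf.show()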

Question #365

Which of the following code blocks removes all rows in the 6-column DataFrame transactionsDf that have missing data in at least 3 columns?

  • A . transactionsDf.dropna("any")
  • B . transactionsDf.dropna(thresh=4)
  • C . transactionsDf.drop.na("",2)
  • D . transactionsDf.dropna(thresh=2)
  • E . transactionsDf.dropna("",4)

Reveal Solution Hide Solution

Correct Answer: B
B

Explanation:

transactionsDf.dropna(thresh=4)

Correct. Note that by only working with the thresh keyword argument, the first how keyword argument is ignored. Also, figuring out which value to set for thresh can be difficult, especially when under pressure in the exam. Here, I recommend you use the notes to create a "simulation" of what different values for thresh would do to a DataFrame. The reason thresh=4 is the correct answer: thresh=N keeps only rows with at least N non-null values, so in a 6-column DataFrame, thresh=4 drops exactly those rows that have missing data in 3 or more columns.

transactionsDf.dropna(thresh=2)

Almost right. See the comment about thresh for the correct answer above.

transactionsDf.dropna("any")

No, this would remove all rows that have at least one missing value.

transactionsDf.drop.na("",2)

No, drop.na is not a proper DataFrame method.

transactionsDf.dropna("",4)

No, this does not work and will throw an error in Spark because Spark cannot understand the first argument.

More info: pyspark.sql.DataFrame.dropna ― PySpark 3.1.1 documentation (https://bit.ly/2QZpiCp)

Static notebook | Dynamic notebook: See test 1,
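
A minimal sketch, assuming transactionsDf is the 6-column DataFrame from the question:

# thresh=4 keeps only rows with at least 4 non-null values, i.e. it drops
# rows that have missing data in 3 or more of the 6 columns.
cleanedDf = transactionsDf.dropna(thresh=4)
cleanedDf.show()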

Question #367

Which of the following describes tasks?

  • A . A task is a command sent from the driver to the executors in response to a transformation.
  • B . Tasks transform jobs into DAGs.
  • C . A task is a collection of slots.
  • D . A task is a collection of rows.
  • E . Tasks get assigned to the executors by the driver.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Tasks get assigned to the executors by the driver.

Correct! Or, in other words: Executors take the tasks that they were assigned to by the driver, run them over partitions, and report their outcomes back to the driver.

Tasks transform jobs into DAGs.

No, this statement disrespects the order of elements in the Spark hierarchy. The Spark driver transforms jobs into DAGs. Each job consists of one or more stages. Each stage contains one or more

tasks.

A task is a collection of rows.

Wrong. A partition is a collection of rows. Tasks have little to do with a collection of rows. If anything, a task processes a specific partition.

A task is a command sent from the driver to the executors in response to a transformation.

Incorrect. The Spark driver does not send anything to the executors in response to a transformation, since transformations are evaluated lazily. So, the Spark driver would send tasks to executors

only in response to actions.

A task is a collection of slots.

No. Executors have one or more slots to process tasks and each slot can be assigned a task.

Question #367

Which of the following describes tasks?

  • A . A task is a command sent from the driver to the executors in response to a transformation.
  • B . Tasks transform jobs into DAGs.
  • C . A task is a collection of slots.
  • D . A task is a collection of rows.
  • E . Tasks get assigned to the executors by the driver.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Tasks get assigned to the executors by the driver.

Correct! Or, in other words: Executors take the tasks that they were assigned to by the driver, run them over partitions, and report the their outcomes back to the driver. Tasks transform jobs into DAGs.

No, this statement disrespects the order of elements in the Spark hierarchy. The Spark driver transforms jobs into DAGs. Each job consists of one or more stages. Each stage contains one or more

tasks.

A task is a collection of rows.

Wrong. A partition is a collection of rows. Tasks have little to do with a collection of rows. If anything, a task processes a specific partition.

A task is a command sent from the driver to the executors in response to a transformation. Incorrect. The Spark driver does not send anything to the executors in response to a transformation, since transformations are evaluated lazily. So, the Spark driver would send tasks to executors

only in response to actions.

A task is a collection of slots.

No. Executors have one or more slots to process tasks and each slot can be assigned a task.

Question #367

Which of the following describes tasks?

  • A . A task is a command sent from the driver to the executors in response to a transformation.
  • B . Tasks transform jobs into DAGs.
  • C . A task is a collection of slots.
  • D . A task is a collection of rows.
  • E . Tasks get assigned to the executors by the driver.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Tasks get assigned to the executors by the driver.

Correct! Or, in other words: Executors take the tasks that they were assigned to by the driver, run them over partitions, and report the their outcomes back to the driver. Tasks transform jobs into DAGs.

No, this statement disrespects the order of elements in the Spark hierarchy. The Spark driver transforms jobs into DAGs. Each job consists of one or more stages. Each stage contains one or more

tasks.

A task is a collection of rows.

Wrong. A partition is a collection of rows. Tasks have little to do with a collection of rows. If anything, a task processes a specific partition.

A task is a command sent from the driver to the executors in response to a transformation. Incorrect. The Spark driver does not send anything to the executors in response to a transformation, since transformations are evaluated lazily. So, the Spark driver would send tasks to executors

only in response to actions.

A task is a collection of slots.

No. Executors have one or more slots to process tasks and each slot can be assigned a task.

Question #367

Which of the following describes tasks?

  • A . A task is a command sent from the driver to the executors in response to a transformation.
  • B . Tasks transform jobs into DAGs.
  • C . A task is a collection of slots.
  • D . A task is a collection of rows.
  • E . Tasks get assigned to the executors by the driver.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Tasks get assigned to the executors by the driver.

Correct! Or, in other words: Executors take the tasks that they were assigned to by the driver, run them over partitions, and report the their outcomes back to the driver. Tasks transform jobs into DAGs.

No, this statement disrespects the order of elements in the Spark hierarchy. The Spark driver transforms jobs into DAGs. Each job consists of one or more stages. Each stage contains one or more

tasks.

A task is a collection of rows.

Wrong. A partition is a collection of rows. Tasks have little to do with a collection of rows. If anything, a task processes a specific partition.

A task is a command sent from the driver to the executors in response to a transformation. Incorrect. The Spark driver does not send anything to the executors in response to a transformation, since transformations are evaluated lazily. So, the Spark driver would send tasks to executors

only in response to actions.

A task is a collection of slots.

No. Executors have one or more slots to process tasks and each slot can be assigned a task.

Question #367

Which of the following describes tasks?

  • A . A task is a command sent from the driver to the executors in response to a transformation.
  • B . Tasks transform jobs into DAGs.
  • C . A task is a collection of slots.
  • D . A task is a collection of rows.
  • E . Tasks get assigned to the executors by the driver.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Tasks get assigned to the executors by the driver.

Correct! Or, in other words: Executors take the tasks that they were assigned to by the driver, run them over partitions, and report the their outcomes back to the driver. Tasks transform jobs into DAGs.

No, this statement disrespects the order of elements in the Spark hierarchy. The Spark driver transforms jobs into DAGs. Each job consists of one or more stages. Each stage contains one or more

tasks.

A task is a collection of rows.

Wrong. A partition is a collection of rows. Tasks have little to do with a collection of rows. If anything, a task processes a specific partition.

A task is a command sent from the driver to the executors in response to a transformation. Incorrect. The Spark driver does not send anything to the executors in response to a transformation, since transformations are evaluated lazily. So, the Spark driver would send tasks to executors

only in response to actions.

A task is a collection of slots.

No. Executors have one or more slots to process tasks and each slot can be assigned a task.

Question #367

Which of the following describes tasks?

  • A . A task is a command sent from the driver to the executors in response to a transformation.
  • B . Tasks transform jobs into DAGs.
  • C . A task is a collection of slots.
  • D . A task is a collection of rows.
  • E . Tasks get assigned to the executors by the driver.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Tasks get assigned to the executors by the driver.

Correct! Or, in other words: Executors take the tasks that they were assigned to by the driver, run them over partitions, and report the their outcomes back to the driver. Tasks transform jobs into DAGs.

No, this statement disrespects the order of elements in the Spark hierarchy. The Spark driver transforms jobs into DAGs. Each job consists of one or more stages. Each stage contains one or more

tasks.

A task is a collection of rows.

Wrong. A partition is a collection of rows. Tasks have little to do with a collection of rows. If anything, a task processes a specific partition.

A task is a command sent from the driver to the executors in response to a transformation. Incorrect. The Spark driver does not send anything to the executors in response to a transformation, since transformations are evaluated lazily. So, the Spark driver would send tasks to executors

only in response to actions.

A task is a collection of slots.

No. Executors have one or more slots to process tasks and each slot can be assigned a task.

Question #367

Which of the following describes tasks?

  • A . A task is a command sent from the driver to the executors in response to a transformation.
  • B . Tasks transform jobs into DAGs.
  • C . A task is a collection of slots.
  • D . A task is a collection of rows.
  • E . Tasks get assigned to the executors by the driver.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Tasks get assigned to the executors by the driver.

Correct! Or, in other words: Executors take the tasks that they were assigned to by the driver, run them over partitions, and report the their outcomes back to the driver. Tasks transform jobs into DAGs.

No, this statement disrespects the order of elements in the Spark hierarchy. The Spark driver transforms jobs into DAGs. Each job consists of one or more stages. Each stage contains one or more

tasks.

A task is a collection of rows.

Wrong. A partition is a collection of rows. Tasks have little to do with a collection of rows. If anything, a task processes a specific partition.

A task is a command sent from the driver to the executors in response to a transformation. Incorrect. The Spark driver does not send anything to the executors in response to a transformation, since transformations are evaluated lazily. So, the Spark driver would send tasks to executors

only in response to actions.

A task is a collection of slots.

No. Executors have one or more slots to process tasks and each slot can be assigned a task.

Question #367

Which of the following describes tasks?

  • A . A task is a command sent from the driver to the executors in response to a transformation.
  • B . Tasks transform jobs into DAGs.
  • C . A task is a collection of slots.
  • D . A task is a collection of rows.
  • E . Tasks get assigned to the executors by the driver.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Tasks get assigned to the executors by the driver.

Correct! Or, in other words: Executors take the tasks that they were assigned to by the driver, run them over partitions, and report the their outcomes back to the driver. Tasks transform jobs into DAGs.

No, this statement disrespects the order of elements in the Spark hierarchy. The Spark driver transforms jobs into DAGs. Each job consists of one or more stages. Each stage contains one or more

tasks.

A task is a collection of rows.

Wrong. A partition is a collection of rows. Tasks have little to do with a collection of rows. If anything, a task processes a specific partition.

A task is a command sent from the driver to the executors in response to a transformation. Incorrect. The Spark driver does not send anything to the executors in response to a transformation, since transformations are evaluated lazily. So, the Spark driver would send tasks to executors

only in response to actions.

A task is a collection of slots.

No. Executors have one or more slots to process tasks and each slot can be assigned a task.

Question #367

Which of the following describes tasks?

  • A . A task is a command sent from the driver to the executors in response to a transformation.
  • B . Tasks transform jobs into DAGs.
  • C . A task is a collection of slots.
  • D . A task is a collection of rows.
  • E . Tasks get assigned to the executors by the driver.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Tasks get assigned to the executors by the driver.

Correct! Or, in other words: Executors take the tasks that they were assigned to by the driver, run them over partitions, and report the their outcomes back to the driver. Tasks transform jobs into DAGs.

No, this statement disrespects the order of elements in the Spark hierarchy. The Spark driver transforms jobs into DAGs. Each job consists of one or more stages. Each stage contains one or more

tasks.

A task is a collection of rows.

Wrong. A partition is a collection of rows. Tasks have little to do with a collection of rows. If anything, a task processes a specific partition.

A task is a command sent from the driver to the executors in response to a transformation. Incorrect. The Spark driver does not send anything to the executors in response to a transformation, since transformations are evaluated lazily. So, the Spark driver would send tasks to executors

only in response to actions.

A task is a collection of slots.

No. Executors have one or more slots to process tasks and each slot can be assigned a task.

Question #367

Which of the following describes tasks?

  • A . A task is a command sent from the driver to the executors in response to a transformation.
  • B . Tasks transform jobs into DAGs.
  • C . A task is a collection of slots.
  • D . A task is a collection of rows.
  • E . Tasks get assigned to the executors by the driver.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

Tasks get assigned to the executors by the driver.

Correct! Or, in other words: Executors take the tasks that they were assigned to by the driver, run them over partitions, and report the their outcomes back to the driver. Tasks transform jobs into DAGs.

No, this statement disrespects the order of elements in the Spark hierarchy. The Spark driver transforms jobs into DAGs. Each job consists of one or more stages. Each stage contains one or more

tasks.

A task is a collection of rows.

Wrong. A partition is a collection of rows. Tasks have little to do with a collection of rows. If anything, a task processes a specific partition.

A task is a command sent from the driver to the executors in response to a transformation. Incorrect. The Spark driver does not send anything to the executors in response to a transformation, since transformations are evaluated lazily. So, the Spark driver would send tasks to executors

only in response to actions.

A task is a collection of slots.

No. Executors have one or more slots to process tasks and each slot can be assigned a task.

Question #383

5

Reveal Solution Hide Solution

Correct Answer: D

Explanation:

The correct code block is:

transactionsDf.filter(col("storeId")==25).take(5)

Any of the options with collect will not work because collect does not take any arguments, and in both cases the argument 5 is given.

The option with toLocalIterator will not work because the only argument to toLocalIterator is prefetchPartitions which is a boolean, so passing 5 here does not make sense.

The option using head will not work because the expression passed to select is not proper syntax. It would work if the expression were col("storeId")==25.

Static notebook | Dynamic notebook: See test 1,
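For illustration, a minimal runnable sketch of the correct code block with a made-up transactionsDf (only the column names come from the question; the data is hypothetical). take(5) returns at most five Row objects to the driver:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[*]").appName("take-demo").getOrCreate()

# Hypothetical data; only the column names transactionId and storeId are taken from the question.
transactionsDf = spark.createDataFrame(
    [(1, 25), (2, 25), (3, 3), (4, 25)], ["transactionId", "storeId"]
)

# filter keeps only rows with storeId == 25; take(5) returns up to 5 rows as a list of Row objects.
rows = transactionsDf.filter(col("storeId") == 25).take(5)
print(rows)

spark.stop()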

Question #385

Which of the following is a problem with using accumulators?

  • A . Only unnamed accumulators can be inspected in the Spark UI.
  • B . Only numeric values can be used in accumulators.
  • C . Accumulator values can only be read by the driver, but not by executors.
  • D . Accumulators do not obey lazy evaluation.
  • E . Accumulators are difficult to use for debugging because they will only be updated once, independent if a task has to be re-run due to hardware failure.

Reveal Solution Hide Solution

Correct Answer: C
C

Explanation:

Accumulator values can only be read by the driver, but not by executors.

Correct. So, for example, you cannot use an accumulator variable for coordinating workloads between executors. The typical, canonical, use case of an accumulator value is to report data, for example for debugging purposes, back to the driver. For example, if you wanted to count values that match a specific condition in a UDF for debugging purposes, an accumulator provides a good way to do that.

Only numeric values can be used in accumulators.

No. While PySpark’s Accumulator only supports numeric values (think int and float), you can define accumulators for custom types via the AccumulatorParam interface (documentation linked below).

Accumulators do not obey lazy evaluation.

Incorrect. Accumulators do obey lazy evaluation. This has implications in practice: when an accumulator is updated inside a transformation, it will not be modified until a subsequent action is run.

Accumulators are difficult to use for debugging because they will only be updated once, independent if a task has to be re-run due to hardware failure.

Wrong. A concern with accumulators is in fact that, under certain conditions, they can be updated more than once for the same task. For example, if a hardware failure occurs after an accumulator variable has been increased but before the task has finished, and Spark relaunches the task on a different worker in response to the failure, the already executed accumulator increases will be repeated.

Only unnamed accumulators can be inspected in the Spark UI.

No. Currently, in PySpark, no accumulators can be inspected in the Spark UI. In the Scala interface of Spark, only named accumulators can be inspected in the Spark UI.

More info: Aggregating Results with Spark Accumulators | Sparkour, RDD Programming Guide – Spark 3.1.2 Documentation, pyspark.Accumulator ― PySpark 3.1.2 documentation, and pyspark.AccumulatorParam ― PySpark 3.1.2 documentation
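As a minimal sketch of the canonical use case described above (the DataFrame, column names, and values are made up for illustration): an accumulator counts matching rows inside executor-side code, and only the driver reads the result. Because of lazy evaluation, the accumulator is only updated once the action (foreach) runs:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("accumulator-demo").getOrCreate()
sc = spark.sparkContext

# Numeric accumulator; custom types would require an AccumulatorParam implementation.
error_count = sc.accumulator(0)

df = spark.createDataFrame([(1, "ok"), (2, "error"), (3, "error")], ["id", "status"])

def count_errors(row):
    # Executors can only add to the accumulator; they cannot read its value.
    if row["status"] == "error":
        error_count.add(1)

# foreach is an action, so this is the point where the accumulator actually gets updated.
df.foreach(count_errors)

# Only the driver can read the accumulated value.
print(error_count.value)  # 2

spark.stop()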

Question #386

The code block displayed below contains an error. The code block should combine data from DataFrames itemsDf and transactionsDf, showing all rows of DataFrame itemsDf that have a matching value in column itemId with a value in column transactionId of DataFrame transactionsDf.

Find the error.

Code block:

itemsDf.join(itemsDf.itemId==transactionsDf.transactionId)

  • A . The join statement is incomplete.
  • B . The union method should be used instead of join.
  • C . The join method is inappropriate.
  • D . The merge method should be used instead of join.
  • E . The join expression is malformed.

Reveal Solution Hide Solution

Correct Answer: A
A

Explanation:

Correct code block:

itemsDf.join(transactionsDf, itemsDf.itemId==transactionsDf.transactionId)

The join statement is incomplete.

Correct! If you look at the documentation of DataFrame.join() (linked below), you see that the very first argument of join should be the DataFrame that should be joined with. This first argument is missing in the code block.

The join method is inappropriate.

No. By default, DataFrame.join() uses an inner join. This method is appropriate for the scenario described in the question.

The join expression is malformed.

Incorrect. The join expression itemsDf.itemId==transactionsDf.transactionId is correct syntax.

The merge method should be used instead of join.

False. There is no DataFrame.merge() method in PySpark.

The union method should be used instead of join.

Wrong. DataFrame.union() merges rows, but not columns as requested in the question.

More info: pyspark.sql.DataFrame.join ― PySpark 3.1.2 documentation, pyspark.sql.DataFrame.union ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3, 44.(Databricks import instructions)
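To make the corrected join concrete, here is a small runnable sketch with made-up itemsDf and transactionsDf (only the column names itemId and transactionId come from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("join-demo").getOrCreate()

# Hypothetical data for illustration only.
itemsDf = spark.createDataFrame([(1, "skateboard"), (2, "helmet")], ["itemId", "itemName"])
transactionsDf = spark.createDataFrame([(1, 25.0), (3, 12.5)], ["transactionId", "amount"])

# The first argument is the DataFrame to join with; the join expression comes second.
joined = itemsDf.join(transactionsDf, itemsDf.itemId == transactionsDf.transactionId)
joined.show()

spark.stop()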

Question #387

Which of the following statements about lazy evaluation is incorrect?

  • A . Predicate pushdown is a feature resulting from lazy evaluation.
  • B . Execution is triggered by transformations.
  • C . Spark will fail a job only during execution, but not during definition.
  • D . Accumulators do not change the lazy evaluation model of Spark.
  • E . Lineages allow Spark to coalesce transformations into stages

Reveal Solution Hide Solution

Correct Answer: B
B

Explanation:

Execution is triggered by transformations.

Correct. Execution is triggered by actions only, not by transformations.

Lineages allow Spark to coalesce transformations into stages.

Incorrect. In Spark, lineage means a recording of transformations. This lineage enables lazy evaluation in Spark.

Predicate pushdown is a feature resulting from lazy evaluation.

Wrong. Predicate pushdown means that, for example, Spark will execute filters as early in the process as possible so that it deals with the least possible amount of data in subsequent transformations, resulting in a performance improvement.

Accumulators do not change the lazy evaluation model of Spark.

Incorrect. In Spark, accumulators are only updated when the query that refers to them is actually executed. In other words, they are not updated if the query is not (yet) executed due to lazy evaluation.

Spark will fail a job only during execution, but not during definition.

Wrong. During definition, due to lazy evaluation, the job is not executed and thus certain errors, for example reading from a non-existing file, cannot be caught. To be caught, the job needs to be executed, for example through an action.
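A minimal sketch of lazy evaluation in action (the local SparkSession and toy data are assumptions for illustration): defining a transformation does not start any work; only the action triggers a job:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[*]").appName("lazy-eval-demo").getOrCreate()

df = spark.createDataFrame([(1, 10), (2, 20), (3, 30)], ["id", "value"])

# A transformation: Spark only records it in the DataFrame's lineage, nothing is executed yet.
filtered = df.filter(col("value") > 15)

# An action: only now does Spark plan and execute a job.
print(filtered.count())  # 2

spark.stop()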

Question #388

Which of the following are valid execution modes?

  • A . Kubernetes, Local, Client
  • B . Client, Cluster, Local
  • C . Server, Standalone, Client
  • D . Cluster, Server, Local
  • E . Standalone, Client, Cluster

Reveal Solution Hide Solution

Correct Answer: B
B

Explanation:

This is a tricky question to get right, since it is easy to confuse execution modes and deployment modes. Even in literature, both terms are sometimes used interchangeably.

There are only three valid execution modes in Spark: client, cluster, and local. Execution modes do not refer to specific frameworks, but to where the Spark infrastructure is located with respect to each other.

In client mode, the driver sits on a machine outside the cluster. In cluster mode, the driver sits on a machine inside the cluster. Finally, in local mode, all Spark infrastructure is started in a single JVM (Java Virtual Machine) on a single computer, which then also includes the driver.

Deployment modes often refer to ways that Spark can be deployed in cluster mode and how it uses specific frameworks outside Spark. Valid deployment modes are standalone, Apache YARN, Apache Mesos, and Kubernetes.

Client, Cluster, Local

Correct, all of these are the valid execution modes in Spark.

Standalone, Client, Cluster

No, standalone is not a valid execution mode. It is a valid deployment mode, though.

Kubernetes, Local, Client

No, Kubernetes is a deployment mode, but not an execution mode.

Cluster, Server, Local

No, Server is not an execution mode.

Server, Standalone, Client

No, standalone and server are not execution modes.

More info: Apache Spark Internals – Learning Journal

Question #389

Which of the following describes characteristics of the Dataset API?

  • A . The Dataset API does not support unstructured data.
  • B . In Python, the Dataset API mainly resembles Pandas’ DataFrame API.
  • C . In Python, the Dataset API’s schema is constructed via type hints.
  • D . The Dataset API is available in Scala, but it is not available in Python.
  • E . The Dataset API does not provide compile-time type safety.

Reveal Solution Hide Solution

Correct Answer: D
D

Explanation:

The Dataset API is available in Scala, but it is not available in Python.

Correct. The Dataset API uses fixed typing and is typically used for object-oriented programming. It is available when Spark is used with the Scala programming language, but not for Python. In Python, you use the DataFrame API, which is based on the Dataset API.

The Dataset API does not provide compile-time type safety.

No. In fact, depending on the use case, the compile-time type safety that the Dataset API provides is an advantage.

The Dataset API does not support unstructured data.

Wrong, the Dataset API supports structured and unstructured data.

In Python, the Dataset API’s schema is constructed via type hints.

No, this is not applicable since the Dataset API is not available in Python.

In Python, the Dataset API mainly resembles Pandas’ DataFrame API.

Wrong. The Dataset API does not exist in Python, only in Scala and Java.

Question #397


  • A . itemsDf.withColumn(‘attributes’, sort_array(col(‘attributes’).desc()))
  • B . itemsDf.withColumn(‘attributes’, sort_array(desc(‘attributes’)))
  • C . itemsDf.withColumn(‘attributes’, sort(col(‘attributes’), asc=False))
  • D . itemsDf.withColumn("attributes", sort_array("attributes", asc=False))
  • E . itemsDf.select(sort_array("attributes"))

Reveal Solution Hide Solution

Correct Answer: D
D

Explanation:

Output of correct code block:

+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[winter, cozy, blue]         |Sports Company Inc.|
|2     |[summer, red, fresh, cooling]|YetiX              |
|3     |[travel, summer, green]      |Sports Company Inc.|
+------+-----------------------------+-------------------+

It can be confusing to differentiate between the different sorting functions in PySpark. In this case, a particularity about sort_array has to be considered: the sort direction is given by its second argument (asc), not by the desc method. Luckily, this is covered in the documentation (link below). Also, to solve this question you need to understand the difference between sort and sort_array: with sort, you cannot sort values inside arrays, and sort is a method of DataFrame, while sort_array is a function in pyspark.sql.functions.

More info: pyspark.sql.functions.sort_array ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 2,32.(Databricks import instructions)
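A runnable sketch of the correct option with a made-up itemsDf (data chosen to reproduce the output shown above); the key point is that the sort direction is the asc argument of sort_array, not a desc() call:

from pyspark.sql import SparkSession
from pyspark.sql.functions import sort_array

spark = SparkSession.builder.master("local[*]").appName("sort-array-demo").getOrCreate()

itemsDf = spark.createDataFrame(
    [
        (1, ["blue", "winter", "cozy"], "Sports Company Inc."),
        (2, ["red", "summer", "fresh", "cooling"], "YetiX"),
        (3, ["green", "summer", "travel"], "Sports Company Inc."),
    ],
    ["itemId", "attributes", "supplier"],
)

# sort_array sorts the elements inside each array; asc=False sorts them in descending order.
itemsDf.withColumn("attributes", sort_array("attributes", asc=False)).show(truncate=False)

spark.stop()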

Question #398

Which is the highest level in Spark’s execution hierarchy?

  • A . Task
  • B . Executor
  • C . Slot
  • D . Job
  • E . Stage

Reveal Solution Hide Solution

Correct Answer: D
D

Explanation:

Job is the highest level in Spark’s execution hierarchy among the listed options. Each action triggers a job, each job is divided into one or more stages, and each stage consists of one or more tasks. Executors and slots are part of the cluster infrastructure that processes tasks; they are not levels of the execution hierarchy.

Question #410

spark.sql ("FROM transactionsDf SELECT predError, value WHERE transactionId % 2 = 2")

F. transactionsDf.filter(col(transactionId).isin([3,4,6]))

Reveal Solution Hide Solution

Correct Answer: D

Explanation:

Output of correct code block:

+---------+-----+
|predError|value|
+---------+-----+
|        6|    7|
|     null| null|
|        3|    2|
+---------+-----+

This is not an easy question to solve. You need to know that % is the modulo operator in Python, and that transactionId % 2 == 0 evaluates to true for every second row. The statement using spark.sql gets it almost right (the modulo operator exists in SQL as well), but % 2 = 2 will never yield true, since modulo 2 is either 0 or 1.

If you have any doubts about SparkSQL and answer options 3 and 4 in this question, check out the notebook I created as a response to a related student question.

Static notebook | Dynamic notebook: See test 1,53.(Databricks import instructions)
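The exact text of the correct option is not reproduced in this excerpt, but based on the explanation it is equivalent to the following sketch (the DataFrame contents are made up, so the output will differ from the one shown above): filter on transactionId % 2 == 0, then select predError and value:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[*]").appName("modulo-demo").getOrCreate()

# Hypothetical data; only the column names come from the question.
transactionsDf = spark.createDataFrame(
    [(1, 3, 4), (2, 6, 7), (3, 3, 2), (4, 4, 9), (5, 1, 1), (6, 3, 2)],
    ["transactionId", "predError", "value"],
)

# % is the modulo operator; transactionId % 2 == 0 matches every second row.
transactionsDf.filter(col("transactionId") % 2 == 0).select("predError", "value").show()

spark.stop()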

Question #411

Which of the following describes a way for resizing a DataFrame from 16 to 8 partitions in the most efficient way?

  • A . Use operation DataFrame.repartition(8) to shuffle the DataFrame and reduce the number of partitions.
  • B . Use operation DataFrame.coalesce(8) to fully shuffle the DataFrame and reduce the number of partitions.
  • C . Use a narrow transformation to reduce the number of partitions.
  • D . Use a wide transformation to reduce the number of partitions.
  • E . Use operation DataFrame.coalesce(0.5) to halve the number of partitions in the DataFrame.

Reveal Solution Hide Solution

Correct Answer: C
C

Explanation:

Use a narrow transformation to reduce the number of partitions.

Correct! DataFrame.coalesce(n) is a narrow transformation, and in fact the most efficient way to resize the DataFrame of all options listed. One would run DataFrame.coalesce(8) to resize the DataFrame.

Use operation DataFrame.coalesce(8) to fully shuffle the DataFrame and reduce the number of partitions.

Wrong. The coalesce operation avoids a full shuffle, but will shuffle data if needed. This answer is incorrect because it says "fully shuffle", which is something the coalesce operation will not do. As a general rule, it reduces the number of partitions with the least possible movement of data.

More info: distributed computing – Spark – repartition() vs coalesce() – Stack Overflow

Use operation DataFrame.coalesce(0.5) to halve the number of partitions in the DataFrame.

Incorrect, since the num_partitions parameter needs to be an integer defining the exact number of partitions desired after the operation.

More info: pyspark.sql.DataFrame.coalesce ― PySpark 3.1.2 documentation

Use operation DataFrame.repartition(8) to shuffle the DataFrame and reduce the number of partitions.

No. The repartition operation will fully shuffle the DataFrame. This is not the most efficient way of reducing the number of partitions of all listed options.

Use a wide transformation to reduce the number of partitions.

No. While possible via the DataFrame.repartition(n) command, the resulting full shuffle is not the most efficient way of reducing the number of partitions.
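A brief sketch (with a hypothetical DataFrame) contrasting the two operations: coalesce(8) merges existing partitions without a full shuffle, while repartition(8) reaches the same partition count via a full shuffle:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("coalesce-demo").getOrCreate()

df = spark.range(1000).repartition(16)
print(df.rdd.getNumPartitions())  # 16

# Narrow transformation: merges partitions locally, avoiding a full shuffle.
print(df.coalesce(8).rdd.getNumPartitions())  # 8

# Wide transformation: same target partition count, but achieved through a full shuffle.
print(df.repartition(8).rdd.getNumPartitions())  # 8

spark.stop()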

Question #417

col(["transactionId", "predError", "value", "f"])

Reveal Solution Hide Solution

Correct Answer: C

Explanation:

Correct code block:

transactionsDf.select(["transactionId", "predError", "value", "f"])

DataFrame.select returns specific columns from the DataFrame and accepts a list as its only argument. Thus, this is the correct choice here. The option using col(["transactionId", "predError", "value", "f"]) is invalid, since col() only accepts a single column name, not a list. Likewise, specifying all columns in a single string like "transactionId, predError, value, f" is not valid syntax.

filter and where filter rows based on conditions, they do not control which columns to return.

Static notebook | Dynamic notebook: See test 2,
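A minimal runnable sketch of the correct code block, with a made-up transactionsDf whose column names match the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("select-demo").getOrCreate()

# Hypothetical data; only the column names come from the question.
transactionsDf = spark.createDataFrame(
    [(1, 3, 4, 7), (2, 6, 7, 3)], ["transactionId", "predError", "value", "f"]
)

# select accepts a list of column names (as well as individual names or Column objects).
transactionsDf.select(["transactionId", "predError", "value", "f"]).show()

spark.stop()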

Question #423

dfDates = dfDates.withColumnRenamed("date", to_datetime("date", "yyyy-MM-dd HH:mm:ss"))

E. dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])

Reveal Solution Hide Solution

Correct Answer: C

Explanation:

This question is tricky. Two things are important to know here:

First, the syntax for createDataFrame: here you need a list of tuples, like so: [(1,), (2,)]. To define a single-item tuple in Python, it is important to put a comma after the item so that Python interprets it as a tuple and not just a normal parenthesis.

Second, you should understand the to_timestamp syntax. You can find out more about it in the documentation linked below.

For good measure, let’s examine in detail why the incorrect options are wrong:

dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])

This code snippet does everything the question asks for, except that the data type of the date column is a string and not a timestamp. When no schema is specified, Spark uses the string data type as the default.

dfDates = spark.createDataFrame(["23/01/2022 11:28:12","24/01/2022 10:58:34"], ["date"])

dfDates = dfDates.withColumn("date", to_timestamp("dd/MM/yyyy HH:mm:ss", "date"))

In the first row of this command, Spark throws the following error: TypeError: Can not infer schema for type: <class ‘str’>. This is because Spark expects to find row information, but instead finds strings. This is why you need to specify the data as tuples. Fortunately, the Spark documentation (linked below) shows a number of examples for creating DataFrames that should help you get on the right track here.

dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])

dfDates = dfDates.withColumnRenamed("date", to_timestamp("date", "yyyy-MM-dd HH:mm:ss"))

The issue with this answer is that the operator withColumnRenamed is used. This operator simply renames a column, but it has no power to modify its actual content. This is why withColumn should be used instead. In addition, the date format yyyy-MM-dd HH:mm:ss does not reflect the format of the actual timestamp: "23/01/2022 11:28:12".

dfDates = spark.createDataFrame(["23/01/2022 11:28:12","24/01/2022 10:58:34"], ["date"])

dfDates = dfDates.withColumnRenamed("date", to_datetime("date", "yyyy-MM-dd HH:mm:ss"))

Here, withColumnRenamed is used instead of withColumn (see above). In addition, the rows are not expressed correctly: they should be written as tuples, using parentheses. Finally, even the date format is off here (see above).

More info: pyspark.sql.functions.to_timestamp ― PySpark 3.1.2 documentation and pyspark.sql.SparkSession.createDataFrame ― PySpark 3.1.1 documentation

Static notebook | Dynamic notebook: See test 2, 38.(Databricks import instructions)
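Based on the explanation, the correct code block is equivalent to the following sketch (a reconstruction, since the original option text is not fully reproduced here): rows are passed as single-element tuples, and to_timestamp receives the column name first and then a format that matches the input data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp

spark = SparkSession.builder.master("local[*]").appName("timestamp-demo").getOrCreate()

# Single-element tuples (note the trailing comma) let Spark infer a one-column schema.
dfDates = spark.createDataFrame(
    [("23/01/2022 11:28:12",), ("24/01/2022 10:58:34",)], ["date"]
)

# withColumn replaces the column's content; the format string matches the input: day/month/year.
dfDates = dfDates.withColumn("date", to_timestamp("date", "dd/MM/yyyy HH:mm:ss"))
dfDates.printSchema()  # the date column is now of type timestamp

spark.stop()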

Question #424

Which of the following is a characteristic of the cluster manager?

  • A . Each cluster manager works on a single partition of data.
  • B . The cluster manager receives input from the driver through the SparkContext.
  • C . The cluster manager does not exist in standalone mode.
  • D . The cluster manager transforms jobs into DAGs.
  • E . In client mode, the cluster manager runs on the edge node.

Reveal Solution Hide Solution

Correct Answer: B
B

Explanation:

The cluster manager receives input from the driver through the SparkContext.

Correct. In order for the driver to contact the cluster manager, the driver launches a SparkContext. The driver then asks the cluster manager for resources to launch executors.

In client mode, the cluster manager runs on the edge node.

No. In client mode, the cluster manager is independent of the edge node and runs in the cluster.

The cluster manager does not exist in standalone mode.

Wrong, the cluster manager exists even in standalone mode. Remember, standalone mode is an easy means to deploy Spark across a whole cluster, with some limitations. For example, in standalone mode, no other frameworks can run in parallel with Spark. The cluster manager is part of Spark in standalone deployments, however, and helps launch and maintain resources across the cluster.

The cluster manager transforms jobs into DAGs.

No, transforming jobs into DAGs is the task of the Spark driver.

Each cluster manager works on a single partition of data.

No. Cluster managers do not work on partitions directly. Their job is to coordinate cluster resources so that they can be requested by and allocated to Spark drivers.

More info: Introduction to Core Spark Concepts • BigData

Question #425

Which of the following statements about DAGs is correct?

  • A . DAGs help direct how Spark executors process tasks, but are a limitation to the proper execution of a query when an executor fails.
  • B . DAG stands for "Directing Acyclic Graph".
  • C . Spark strategically hides DAGs from developers, since the high degree of automation in Spark means that developers never need to consider DAG layouts.
  • D . In contrast to transformations, DAGs are never lazily executed.
  • E . DAGs can be decomposed into tasks that are executed in parallel.

Reveal Solution Hide Solution

Correct Answer: E
E

Explanation:

DAG stands for "Directing Acyclic Graph".

No, DAG stands for "Directed Acyclic Graph".

Spark strategically hides DAGs from developers, since the high degree of automation in Spark means that developers never need to consider DAG layouts.

No, quite the opposite. You can access DAGs through the Spark UI and they can be of great help when optimizing queries manually.

In contrast to transformations, DAGs are never lazily executed.

Incorrect. DAGs represent the execution plan in Spark and, as such, are lazily executed when the driver requests the data processed in the DAG.
