A. spark.read.json(filePath, schema=json_schema)

B. spark.read.json(filePath, schema=StructType.fromJson(json_schema))

C. spark.read.json(filePath, schema=schema_of_json(json_schema))

D. spark.read.json(filePath, schema=spark.read.json(json_schema))

E. spark.read.schema(json_schema).json(filePath)

Answer: B

Explanation:

Spark provides a way to digest JSON-formatted strings as schemas. However, it is not trivial to use. Although slightly above exam difficulty, this question is beneficial to your exam preparation, since it helps you familiarize yourself with the concept of enforcing schemas on data you are reading in – a topic within the scope of the exam.

The first answer that jumps out here is the one that uses spark.read.schema instead of spark.read.json. Looking at the documentation of spark.read.schema (linked below), we notice that the operator expects either pyspark.sql.types.StructType or str as its first argument. While the variable json_schema is a string, the documentation states that the str should be "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". json_schema does not contain a string in this format, so this answer option must be wrong.
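For illustration, here is a minimal sketch of what spark.read.schema expects, using the DDL string from the documentation's own example; spark and filePath are assumed to be the active SparkSession and the file location from the question:

# A DDL-formatted schema string: plain "name TYPE" pairs, not a JSON document.
ddl_schema = "col0 INT, col1 DOUBLE"

# spark.read.schema() accepts this kind of string directly.
df = spark.read.schema(ddl_schema).json(filePath)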

With four potentially correct answers to go, we now look at the schema parameter of spark.read.json() (documentation linked below). Here, too, the schema parameter expects an input of type pyspark.sql.types.StructType or "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". We already know that json_schema does not follow this format, so we should focus on how we can transform json_schema into a pyspark.sql.types.StructType. In doing so, we also eliminate the option where schema=json_schema.
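To illustrate the other accepted input type, here is a minimal sketch of a StructType built programmatically and passed to the schema parameter; the columns id and name are hypothetical, and spark and filePath are again assumed from the question context:

from pyspark.sql.types import StructType, StructField, LongType, StringType

# A schema constructed programmatically; this is the type the schema
# parameter of spark.read.json() accepts besides a DDL string.
schema = StructType([
    StructField("id", LongType(), nullable=True),
    StructField("name", StringType(), nullable=True),
])

df = spark.read.json(filePath, schema=schema)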

The option that includes schema=spark.read.json(json_schema) is also a wrong pick, since spark.read.json returns a DataFrame, not a pyspark.sql.types.StructType.

Ruling out the option that includes schema_of_json(json_schema) is rather difficult. The operator’s documentation (linked below) states that it "[p]arses a JSON string and infers its schema in DDL format". This use case is slightly different from the case at hand: json_schema already is a schema definition, so it does not make sense to "infer" a schema from it. The documentation contains an example use case which helps you understand the difference better: there, you pass the string ‘{a: 1}’ to schema_of_json() and the method infers the DDL-format schema STRUCT<a: BIGINT> from it.

In our case, schema_of_json() would therefore return a schema describing the structure of the JSON schema document itself, instead of the schema that the document defines. So this is not the right answer option.
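The documentation example can be reproduced with a short sketch like the one below. Note that schema_of_json() is a column function, so it is evaluated here through a one-row helper DataFrame; the exact formatting of the returned DDL string varies slightly across Spark versions.

from pyspark.sql.functions import schema_of_json, lit

# schema_of_json() infers a schema FROM a sample data record; it does not
# parse an existing schema definition.
spark.range(1).select(schema_of_json(lit('{"a": 1}')).alias("inferred")).collect()
# -> [Row(inferred='STRUCT<a: BIGINT>')] (formatting differs by version)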

This brings us to the StructType.fromJson() method. It returns a StructType – exactly the type that the schema parameter of spark.read.json expects. Although we could have jumped to the correct answer option earlier, this explanation is deliberately exhaustive, to teach you how to systematically eliminate wrong answer options.
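Putting it all together, a minimal sketch of the correct approach could look like the following. One caveat: StructType.fromJson() expects the parsed (dictionary) form of the JSON document, so the string is first run through json.loads() here. The schema contents and filePath are assumptions for illustration.

import json
from pyspark.sql.types import StructType

# A JSON-formatted schema string, in the format produced by StructType.json().
json_schema = """
{"type": "struct", "fields": [
    {"name": "id", "type": "long", "nullable": true, "metadata": {}},
    {"name": "name", "type": "string", "nullable": true, "metadata": {}}
]}
"""

# fromJson() takes the parsed dictionary and returns a StructType, which is
# exactly what the schema parameter of spark.read.json() expects.
schema = StructType.fromJson(json.loads(json_schema))

df = spark.read.json(filePath, schema=schema)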

More info:

– pyspark.sql.DataFrameReader.schema ― PySpark 3.1.2 documentation

– pyspark.sql.DataFrameReader.json ― PySpark 3.1.2 documentation

– pyspark.sql.functions.schema_of_json ― PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: see test 3.
