Databricks Databricks Certified Professional Data Scientist Databricks Certified Professional Data Scientist Exam Online Training

Question #1

You are creating a regression model with the input income, education and current debt of a customer, what could be the possible output from this model.

  • A . Customer fit as a good
  • B . Customer fit as acceptable or average category
  • C . expressed as a percent, that the customer will default on a loan
  • D . 1 and 3 are correct
  • E . 2 and 3 are correct

Reveal Solution Hide Solution

Question #2

What type of output generated in case of linear regression?

  • A . Continuous variable
  • B . Discrete Variable
  • C . Any of the Continuous and Discrete variable
  • D . Values between 0 and 1

Reveal Solution Hide Solution

Question #3

If E1 and E2 are two events, how do you represent the conditional probability given that E2 occurs given that E1 has occurred?

  • A . P(E1)/P(E2)
  • B . P(E1+E2)/P(E1)
  • C . P(E2)/P(E1)
  • D . P(E2)/(P(E1+E2)

Reveal Solution Hide Solution

Question #4

In which of the scenario you can use the regression to predict the values

  • A . Samsung can use it for mobile sales forecast
  • B . Mobile companies can use it to forecast manufacturing defects
  • C . Probability of the celebrity divorce
  • D . Only 1 and 2
  • E . All 1 ,2 and 3

Reveal Solution Hide Solution

Question #5

You are working as a data science consultant for a gaming company. You have three member team and all other stake holders are from the company itself like project managers and project sponsored, data team etc.

During the discussion project managed asked you that when can you tell me that the model you are using is robust enough, after which step you can consider answer for this question?

  • A . Data Preparation
  • B . Discovery
  • C . Operationalize
  • D . Model planning
  • E . Model building

Reveal Solution Hide Solution

Question #6

Logistic regression is a model used for prediction of the probability of occurrence of an event. It makes use of several variables that may be……

  • A . Numerical
  • B . Categorical
  • C . Both 1 and 2 are correct
  • D . None of the 1 and 2 are correct

Reveal Solution Hide Solution

Question #7

Select the statement which applies correctly to the Naive Bayes

  • A . Works with a small amount of data
  • B . Sensitive to how the input data is prepared
  • C . Works with nominal values

Reveal Solution Hide Solution

Question #8

The figure below shows a plot of the data of a data matrix M that is 1000 x 2.

Which line represents the first principal component?

  • A . yellow
  • B . blue
  • C . Neither

Reveal Solution Hide Solution

Question #9

Refer to Exhibit

In the exhibit, the x-axis represents the derived probability of a borrower defaulting on a loan. Also in the exhibit, the pink represents borrowers that are known to have not defaulted on their loan, and the blue represents borrowers that are known to have defaulted on their loan.

Which analytical method could produce the probabilities needed to build this exhibit?

  • A . Linear Regression
  • B . Logistic Regression
  • C . Discriminant Analysis
  • D . Association Rules

Reveal Solution Hide Solution

Question #10

You are working on a email spam filtering assignment, while working on this you find there is new word e.g. HadoopExam comes in email, and in your solutions you never come across this word before, hence probability of this words is coming in either email could be zero.

So which of the following algorithm can help you to avoid zero probability?

  • A . Naive Bayes
  • B . Laplace Smoothing
  • C . Logistic Regression
  • D . All of the above

Reveal Solution Hide Solution

Question #11

A bio-scientist is working on the analysis of the cancer cells. To identify whether the cell is cancerous or not, there has been hundreds of tests are done with small variations to say yes to the problem. Given the test result for a sample of healthy and cancerous cells, which of the following technique you will use to determine whether a cell is healthy?

  • A . Linear regression
  • B . Collaborative filtering
  • C . Naive Bayes
  • D . Identification Test

Reveal Solution Hide Solution

Question #12

Consider flipping a coin for which the probability of heads is p, where p is unknown, and our goa is to estimate p. The obvious approach is to count how many times the coin came up heads and divide by the total number of coin flips. If we flip the coin 1000 times and it comes up heads 367 times, it is very reasonable to estimate p as approximately 0.367.

However, suppose we flip the coin only twice and we get heads both times. Is it reasonable to estimate p as 1.0? Intuitively, given that we only flipped the coin twice, it seems a bit rash to conclude that the coin will always come up heads, and____________is a way of avoiding such rash conclusions.

  • A . Naive Bayes
  • B . Laplace Smoothing
  • C . Logistic Regression
  • D . Linear Regression

Reveal Solution Hide Solution

Question #13

What is one modeling or descriptive statistical function in MADlib that is typically not provided in a standard relational database?

  • A . Expected value
  • B . Variance
  • C . Linear regression
  • D . Quantiles

Reveal Solution Hide Solution

Question #14

As a data scientist consultant at ABC Corp, you are working on a recommendation engine for the learning resources for end user.

So Which recommender system technique benefits most from additional user preference data?

  • A . Naive Bayes classifier
  • B . Item-based collaborative filtering
  • C . Logistic Regression
  • D . Content-based filtering

Reveal Solution Hide Solution

Question #15

Which of the following technique can be used to the design of recommender systems?

  • A . Naive Bayes classifier
  • B . Power iteration
  • C . Collaborative filtering
  • D . 1 and 3
  • E . 2 and 3

Reveal Solution Hide Solution

Question #16

Regularization is a very important technique in machine learning to prevent over fitting. And Optimizing with a L1 regularization term is harder than with an L2 regularization term because

  • A . The penalty term is not differentiate
  • B . The second derivative is not constant
  • C . The objective function is not convex
  • D . The constraints are quadratic

Reveal Solution Hide Solution

Question #17

In which of the following scenario we can use naTve Bayes theorem for classification

  • A . Classify whether a given person is a male or a female based on the measured features.
    The features include height, weight and foot size.
  • B . To classify whether an email is spam or not spam
  • C . To identify whether a fruit is an orange or not based on features like diameter, color and shape

Reveal Solution Hide Solution

Question #18

In unsupervised learning which statements correctly applies

  • A . It does not have a target variable
  • B . Instead of telling the machine Predict Y for our data X, we’re asking What can you tell me about X?
  • C . telling the machine Predict Y for our data X

Reveal Solution Hide Solution

Question #19

What is the best way to evaluate the quality of the model found by an unsupervised algorithm like k-means clustering, given metrics for the cost of the clustering (how well it fits the data) and its stability (how similar the clusters are across multiple runs over the same data)?

  • A . The lowest cost clustering subject to a stability constraint
  • B . The lowest cost clustering
  • C . The most stable clustering subject to a minimal cost constraint
  • D . The most stable clustering

Reveal Solution Hide Solution

Question #20

Which of the following is not a correct application for the Classification?

  • A . credit scoring
  • B . tumor detection
  • C . image recognition
  • D . drug discovery

Reveal Solution Hide Solution

Question #21

Your customer provided you with 2. 000 unlabeled records three groups.

What is the correct analytical method to use?

  • A . Semi Linear Regression
  • B . Logistic regression
  • C . Naive Bayesian classification
  • D . Linear regression
  • E . K-means clustering

Reveal Solution Hide Solution

Question #22

Which of the following steps you will be using in the discovery phase?

  • A . What all are the data sources for the project?
  • B . Analyze the Raw data and its format and structure.
  • C . What all tools are required, in the project?
  • D . What is the network capacity required
  • E . What Unix server capacity required?

Reveal Solution Hide Solution

Question #23

You are working on a problem where you have to predict whether the claim is done valid or not. And you find that most of the claims which are having spelling errors as well as corrections in the manually filled claim forms compare to the honest claims.

Which of the following technique is suitable to find out whether the claim is valid or not?

  • A . Naive Bayes
  • B . Logistic Regression
  • C . Random Decision Forests
  • D . Any one of the above

Reveal Solution Hide Solution

Question #24

The method based on principal component analysis (PCA) evaluates the features according to

  • A . The projection of the largest eigenvector of the correlation matrix on the initial dimensions
  • B . According to the magnitude of the components of the discriminate vector
  • C . The projection of the smallest eigenvector of the correlation matrix on the initial dimensions
  • D . None of the above

Reveal Solution Hide Solution

Question #25

Which is an example of supervised learning?

  • A . PCA
  • B . k-means clustering
  • C . SVD
  • D . EM
  • E . SVM

Reveal Solution Hide Solution

Question #26

You are designing a recommendation engine for a website where the ability to generate more personalized recommendations by analyzing information from the past activity of a specific user, or the history of other users deemed to be of similar taste to a given user. These resources are used as user profiling and helps the site recommend content on a user-by-user basis. The more a given user makes use of the system, the better the recommendations become, as the system gains data to improve its model of that user.

What kind of this recommendation engine is?

  • A . Naive Bayes classifier
  • B . Collaborative filtering
  • C . Logistic Regression
  • D . Content-based filtering

Reveal Solution Hide Solution

Question #27

RMSE measures error of a predicted

  • A . Numerical Value
  • B . Categorical values
  • C . For booth Numerical and categorical values

Reveal Solution Hide Solution

Question #28

Refer to exhibit

You are asked to write a report on how specific variables impact your client’s sales using a data set provided to you by the client. The data includes 15 variables that the client views as directly related to sales, and you are restricted to these variables only. After a preliminary analysis of the data, the following findings were made: 1. Multicollinearity is not an issue among the variables 2. Only three variables-A, B, and C-have significant correlation with sales You build a linear regression model on the dependent variable of sales with the independent variables of A, B, and C. The results of the regression are seen in the exhibit. You cannot request additional data.

What is a way that you could try to increase the R2 of the model without artificially inflating it?

  • A . Create clusters based on the data and use them as model inputs
  • B . Force all 15 variables into the model as independent variables
  • C . Create interaction variables based only on variables A, B, and C
  • D . Break variables A, B, and C into their own univariate models

Reveal Solution Hide Solution

Question #29

Which technique you would be using to solve the below problem statement? "What is the probability that individual customer will not repay the loan amount?"

  • A . Classification
  • B . Clustering
  • C . Linear Regression
  • D . Logistic Regression
  • E . Hypothesis testing

Reveal Solution Hide Solution

Question #30

Question-3: In machine learning, feature hashing, also known as the hashing trick (by analogy to the kernel trick), is a fast and space-efficient way of vectorizing features (such as the words in a language), i.e., turning arbitrary features into indices in a vector or matrix. It works by applying a hash function to the features and using their hash values modulo the number of features as indices directly, rather than looking the indices up in an associative array.

So what is the primary reason of the hashing trick for building classifiers?

  • A . It creates the smaller models
  • B . It requires the lesser memory to store the coefficients for the model
  • C . It reduces the non-significant features e.g. punctuations
  • D . Noisy features are removed

Reveal Solution Hide Solution

Question #31

You have modeled the datasets with 5 independent variables called A, B, C, D and E having relationships which is not dependent each other, and also the variable A,B and C are continuous and variable D and E are discrete (mixed mode).

Now you have to compute the expected value of the variable let say A, then which of the following computation you will prefer

  • A . Integration
  • B . Differentiation
  • C . Transformation
  • D . Generalization

Reveal Solution Hide Solution

Question #32

Clustering is a type of unsupervised learning with the following goals

  • A . Maximize a utility function
  • B . Find similarities in the training data
  • C . Not to maximize a utility function
  • D . 1 and 2
  • E . 2 and 3

Reveal Solution Hide Solution

Question #33

Under which circumstance do you need to implement N-fold cross-validation after creating a regression model?

  • A . The data is unformatted.
  • B . There is not enough data to create a test set.
  • C . There are missing values in the data.
  • D . There are categorical variables in the model.

Reveal Solution Hide Solution

Question #34

You have data of 10.000 people who make the purchasing from a specific grocery store. You also have their income detail in the data. You have created 5 clusters using this data. But in one of the cluster you see that only 30 people are falling as below 30, 2400, 2600, 2700, 2270 etc."

What would you do in this case?

  • A . You will be increasing number of clusters.
  • B . You will be decreasing the number of clusters.
  • C . You will remove that 30 people from dataset
  • D . You will be multiplying standard deviation with the 100

Reveal Solution Hide Solution

Question #35

In which phase of the data analytics lifecycle do Data Scientists spend the most time in a project?

  • A . Discovery
  • B . Data Preparation
  • C . Model Building
  • D . Communicate Results

Reveal Solution Hide Solution

Question #36

What is the probability that the total of two dice will be greater than 8, given that the first die is a 6?

  • A . 1/3
  • B . 2/3
  • C . 1/6
  • D . 2/6

Reveal Solution Hide Solution

Question #37

Which of the following is a correct example of the target variable in regression (supervised learning)?

  • A . Nominal values like true, false
  • B . Reptile, fish, mammal, amphibian, plant, fungi
  • C . Infinite number of numeric values, such as 0.100, 42.001, 1000.743..
  • D . All of the above

Reveal Solution Hide Solution

Question #38

Which of the following true with regards to the K-Means clustering algorithm?

  • A . Labels are not pre-assigned to each objects in the cluster.
  • B . Labels are pre-assigned to each objects in the cluster.
  • C . It classify the data based on the labels.
  • D . It discovers the center of each cluster.
  • E . It find each objects fall in which particular cluster

Reveal Solution Hide Solution

Question #39

A problem statement is given as below

Hospital records show that of patients suffering from a certain disease, 75% die of it.

What is the probability that of 6 randomly selected patients, 4 will recover?

Which of the following model will you use to solve it?

  • A . Binomial
  • B . Poisson
  • C . Normal
  • D . Any of the above

Reveal Solution Hide Solution

Question #40

A researcher is interested in how variables, such as GRE (Graduate Record Exam scores), GPA (grade point average) and prestige of the undergraduate institution, effect admission into graduate school. The response variable, admit/don’t admit, is a binary variable.

Above is an example of

  • A . Linear Regression
  • B . Logistic Regression
  • C . Recommendation system
  • D . Maximum likelihood estimation
  • E . Hierarchical linear models

Reveal Solution Hide Solution

Question #41

Which of the following statement true with regards to Linear Regression Model?

  • A . Ordinary Least Square can be used to estimates the parameters in linear model
  • B . In Linear model, it tries to find multiple lines which can approximate the relationship between the outcome and input variables.
  • C . Ordinary Least Square is a sum of the individual distance between each point and the fitted line of regression model.
  • D . Ordinary Least Square is a sum of the squared individual distance between each point and the fitted line of regression model.

Reveal Solution Hide Solution

Question #42

Select the correct statement which applies to Supervised learning

  • A . We asks the machine to learn from our data when we specify a target variable.
  • B . Lesser machine’s task to only divining some pattern from the input data to get the target variable
  • C . Instead of telling the machine Predict Y for our data X, we’re asking What can you tell me about X?

Reveal Solution Hide Solution

Question #43

A data scientist wants to predict the probability of death from heart disease based on three risk factors: age, gender, and blood cholesterol level.

What is the most appropriate method for this project?

  • A . Linear regression
  • B . K-means clustering
  • C . Logistic regression
  • D . Apriori algorithm

Reveal Solution Hide Solution

Question #44

Question-34. Stories appear in the front page of Digg as they are "voted up" (rated positively) by the community. As the community becomes larger and more diverse, the promoted stories can better reflect the average interest of the community members.

Which of the following technique is used to make such recommendation engine?

  • A . Naive Bayes classifier
  • B . Collaborative filtering
  • C . Logistic Regression
  • D . Content-based filtering

Reveal Solution Hide Solution

Question #45

What describes a true limitation of Logistic Regression method?

  • A . It does not handle redundant variables well.
  • B . It does not handle missing values well.
  • C . It does not handle correlated variables well.
  • D . It does not have explanatory values.

Reveal Solution Hide Solution

Question #46

Select the choice where Regression algorithms are not best fit

  • A . When the dimension of the object given
  • B . Weight of the person is given
  • C . Temperature in the atmosphere
  • D . Employee status

Reveal Solution Hide Solution

Question #47

Refer to image below

  • A . Option A
  • B . Option B
  • C . Option C
  • D . Option D

Reveal Solution Hide Solution

Question #48

A fruit may be considered to be an apple if it is red, round, and about 3" in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of the

  • A . Presence of the other features.
  • B . Absence of the other features.
  • C . Presence or absence of the other features
  • D . None of the above

Reveal Solution Hide Solution

Question #49

Suppose that the probability that a pedestrian will be tul by a car while crossing the toad at a pedestrian crossing without paying attention to the traffic light is lo be computed. Let H be a discrete random variable taking one value from (Hit. Not Hit). Let L be a discrete random variable taking one value from (Red. Yellow. Green).

Realistically, H will be dependent on L That is, P(H = Hit) and P(H = Not Hit) will take different values depending on whether L is red, yellow or green. A person is. for example, far more likely to be hit by a car when trying to cross while Hie lights for cross traffic are green than if they are red In other words, for any given possible pair of values for Hand L. one must consider the joint probability distribution of H and L to find the probability* of that pair of events occurring together if Hie pedestrian ignores the state of the light Here is a table showing the conditional probabilities of being bit. defending on ibe stale of the lights (Note that the columns in this table must add up to 1 because the probability of being hit oi not hit is 1 regardless of the stale of the light.)

  • A . The marginal probability P(H=Hit) is the sum along the H=Hit row of this joint distribution table, as this is the probability of being hit when the lights are red OR yellow OR green.
  • B . marginal probability that P(H=Not Hit) is the sum of the H=Not Hit row
  • C . marginal probability that P(H=Not Hit) is the sum of the H= Hit row

Reveal Solution Hide Solution

Question #49

Suppose that the probability that a pedestrian will be tul by a car while crossing the toad at a pedestrian crossing without paying attention to the traffic light is lo be computed. Let H be a discrete random variable taking one value from (Hit. Not Hit). Let L be a discrete random variable taking one value from (Red. Yellow. Green).

Realistically, H will be dependent on L That is, P(H = Hit) and P(H = Not Hit) will take different values depending on whether L is red, yellow or green. A person is. for example, far more likely to be hit by a car when trying to cross while Hie lights for cross traffic are green than if they are red In other words, for any given possible pair of values for Hand L. one must consider the joint probability distribution of H and L to find the probability* of that pair of events occurring together if Hie pedestrian ignores the state of the light Here is a table showing the conditional probabilities of being bit. defending on ibe stale of the lights (Note that the columns in this table must add up to 1 because the probability of being hit oi not hit is 1 regardless of the stale of the light.)

  • A . The marginal probability P(H=Hit) is the sum along the H=Hit row of this joint distribution table, as this is the probability of being hit when the lights are red OR yellow OR green.
  • B . marginal probability that P(H=Not Hit) is the sum of the H=Not Hit row
  • C . marginal probability that P(H=Not Hit) is the sum of the H= Hit row

Reveal Solution Hide Solution

Question #49

Suppose that the probability that a pedestrian will be tul by a car while crossing the toad at a pedestrian crossing without paying attention to the traffic light is lo be computed. Let H be a discrete random variable taking one value from (Hit. Not Hit). Let L be a discrete random variable taking one value from (Red. Yellow. Green).

Realistically, H will be dependent on L That is, P(H = Hit) and P(H = Not Hit) will take different values depending on whether L is red, yellow or green. A person is. for example, far more likely to be hit by a car when trying to cross while Hie lights for cross traffic are green than if they are red In other words, for any given possible pair of values for Hand L. one must consider the joint probability distribution of H and L to find the probability* of that pair of events occurring together if Hie pedestrian ignores the state of the light Here is a table showing the conditional probabilities of being bit. defending on ibe stale of the lights (Note that the columns in this table must add up to 1 because the probability of being hit oi not hit is 1 regardless of the stale of the light.)

  • A . The marginal probability P(H=Hit) is the sum along the H=Hit row of this joint distribution table, as this is the probability of being hit when the lights are red OR yellow OR green.
  • B . marginal probability that P(H=Not Hit) is the sum of the H=Not Hit row
  • C . marginal probability that P(H=Not Hit) is the sum of the H= Hit row

Reveal Solution Hide Solution

Question #52

You recommend a movie with three stars but the user loves it (he’d rate it five stars).

So which statement correctly applies?

  • A . In both cases, the contribution to the RMSE is the same
  • B . In both cases, the contribution to the RMSE is the different
  • C . In both cases, the contribution to the RMSE, could varies
  • D . None of the above

Reveal Solution Hide Solution

Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments