Schoolzilla Developer Homework Solution

When accepting user-submitted data in a variety of formats, the data must pass through a set of routines that bring it into a standardized format. Given the sample data below, a number of classes need to be designed to organize these routines into a scalable application structure and to ensure the application functions properly for its users.

Table 1 - Raw Data
Student Number | Math Score | Science Score
1597530        | 100        | 80
2468975        | 85         |
8675309        | blue       | 95

Challenge

Build a tool that accepts data in a variety of formats from Schoolzilla users and then performs the required cleanup and mapping on the Schoolzilla website. The tool should be able to do the following:

  • Map columns from the customer dataset to the columns required for the Schoolzilla data model
  • Display sample rows from the customer dataset
  • Allow values to be changed by the customer
  • Validate that the values entered are appropriate for the Schoolzilla element to which they've been mapped

Question 1: How would you design a solution to this problem?

The solution will make use of a series of Python libraries, most notably the pandas toolkit, which provides fast hash-table-based indexing, in-memory data manipulation, and the ability to handle the "missing data" problem present in the sample data.
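
For reference, the following is a minimal sketch of how the Table 1 sample looks once loaded into pandas; the column names are taken from Table 1, and the blank Science Score is represented as NaN:

import numpy as np
import pandas as pd

# Recreate the Table 1 sample; the blank Science Score becomes NaN.
raw = pd.DataFrame({
    "Student Number": [1597530, 2468975, 8675309],
    "Math Score": ["100", "85", "blue"],
    "Science Score": [80, np.nan, 95],
})

# pandas flags the missing value so it can be handled downstream.
print(raw.isna().sum())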

The programming challenges that need to be overcome in the sample problem are as follows:

Handle the Upload of a User-Submitted File and Confirm Its Validity

To get the data into Schoolzilla, the first step is for the user to upload tabular data. For the purposes of the file upload, the following assumptions are made:

  • The user uploads their raw data files by using a web form which already exists.
  • The user is expected to upload data in Excel format.
  • Each spreadsheet contains column headers in the first row of the spreadsheet.
  • Multiple spreadsheets can be uploaded at one time, but each spreadsheet has data in the same format (i.e. the column headers are identical in placement and number in each of the uploaded files).
  • A web form also exists that confirms a previous upload and asks the user whether he or she is finished uploading or has more files to upload.

The first class will take one or more spreadsheets from a user submission:

class name: UserSubmittedData
argument: uploaded file list
method: initialize ->
  call the validateUpload method
method: validateUpload ->
  if the form submission is not empty ->
    loop over the list of uploaded files with their indices ->
      if the file mimetype is an unaccepted format ->
        add the index of the invalid file to an array
    if the length of the invalid files array is 0 ->
      call the generateDataFrame method
    else ->
      Display a message to the user indicating which file or files are invalid
  else ->
    Display a message to the user asking him or her to upload files containing data
method: generateDataFrame ->
  if a DataFrame does not exist that contains the results of a previous upload ->
    Create a new DataFrame to store the results
  loop through the list of uploaded files ->
    read the file
    if there are no errors ->
      append to the DataFrame
  Call the displayConfirmation method
method: displayConfirmation ->
  display the confirmation form per the final assumption above
method: isFinished ->
  if the user wants to upload more files ->
    call the validateUpload method
  else ->
    return the DataFrame to the application's scope
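
Assuming the existing upload form hands the application a list of file objects that expose a mimetype and a file-like stream (e.g. Werkzeug FileStorage objects), the validation and DataFrame generation steps might look roughly like the following sketch; the function name and accepted-mimetype set are illustrative, not part of an existing API:

import pandas as pd

# Accepted Excel mimetypes, per the Excel-only assumption above.
ACCEPTED_MIMETYPES = {
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}

def build_dataframe(uploaded_files):
    # Reject an empty submission outright.
    if not uploaded_files:
        raise ValueError("Please upload at least one file containing data.")

    # Collect the indices of files with an unaccepted mimetype.
    invalid = [i for i, f in enumerate(uploaded_files)
               if f.content_type not in ACCEPTED_MIMETYPES]
    if invalid:
        raise ValueError(f"Files at positions {invalid} are not Excel spreadsheets.")

    # Column headers are assumed identical across files, so the
    # individual frames can simply be stacked.
    frames = [pd.read_excel(f) for f in uploaded_files]
    return pd.concat(frames, ignore_index=True)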

Map the Data and Establish Cleanup Mechanisms

Assuming all students have a unique ID stored in a single column of the table, that column acts as the unique identifier for the student record and is copied to each resulting row. In the example in Table 1, this information maps cleanly to the first column, but in other cases the ID might be in column 2. The column containing the student ID should therefore be user-defined to accommodate the various locations where it can appear.

Each column that contains data but is not the key ID column is transposed into rows, provided the data is present, falls within the valid range, and is of the correct data type. The application also assumes that duplicates should be removed, as should data that falls outside the valid range.
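
As a rough illustration of the transpose step, assuming the Table 1 sample has been loaded into the raw DataFrame from the earlier sketch and the user has selected "Student Number" as the unique ID column, pandas.melt turns the score columns into rows:

import pandas as pd

# Transpose the wide score columns into one row per (student, assessment).
long_form = pd.melt(
    raw,
    id_vars="Student Number",   # the user-selected unique ID column
    var_name="assessment",      # former column headers become row values
    value_name="score",
)
print(long_form.head(25))       # preview shown to the user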

class name: DataIntegrity
arguments: DataFrame, userSpecifiedData
method: initialize ->
  displayRawDataPreview
  confirmTransformation
  purgeInvalid
method: displayRawDataPreview ->
  load the DataFrame
  load the userSpecifiedData if defined
  allow user to specify the following:
    the column containing data
    the column which acts as the unique ID for the student record
    datatypes for each of the above columns
    valid ranges for each column
  store userSpecifiedData and return to application scope for potential reuse
  call coerceData
method: coerceData ->
  coerce the data using the datatype specified in displayRawDataPreview
  call the transformData method
method: transformData ->
  transpose the data using pandas.melt()
  show the user a preview of the first 25 rows
method: confirmTransformation ->
  if the user says the data is ok ->
    call dedupData
  else ->
    call displayRawDataPreview
method: dedupData ->
  deduplicate data using pandas.DataFrame.drop_duplicates()
  update the DataFrame
method: purgeInvalid ->
  remove invalid data that does not fall within the valid range
  remove NaN results from data
  return DataFrame to application scope
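
A minimal sketch of how coerceData, dedupData, and purgeInvalid might be realized with pandas, assuming the melted long_form frame from the previous sketch and a user-specified valid range; the helper name is illustrative:

import pandas as pd

def clean_scores(long_form, valid_range=(0, 100)):
    df = long_form.copy()

    # coerceData: non-numeric values such as "blue" become NaN.
    df["score"] = pd.to_numeric(df["score"], errors="coerce")

    # dedupData: drop exact duplicate rows.
    df = df.drop_duplicates()

    # purgeInvalid: drop NaN scores and anything outside the valid range.
    df = df.dropna(subset=["score"])
    low, high = valid_range
    df = df[df["score"].between(low, high)]

    return df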

Editing Data

Per the specs for the application, users should be able to add and edit data. To do this, a class will be designed that subclasses the class above, taking advantage of its data deduplication and removal of invalid data.

This class calls a front-end web form that allows users to edit, insert, or delete data, and it handles the corresponding updates to the DataFrame. The methods inherited from the superclass are then called to deduplicate the data, purge the invalid results, and return the updated DataFrame to the application scope.

class name: EditData
inherits from: DataIntegrity
arguments: DataFrame, userSpecifiedData
method: updateData ->
  allow customer to insert, delete, and edit data
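
A sketch of how the subclass might look in Python; DataIntegrity here stands for a hypothetical Python counterpart of the pseudocode class above, assumed to hold the working frame on self.df and to expose dedupData and purgeInvalid methods:

class EditData(DataIntegrity):
    def updateData(self, edits):
        # `edits` is assumed to be a list of (row_index, column, new_value)
        # tuples produced by the front-end edit form.
        for row_index, column, new_value in edits:
            self.df.loc[row_index, column] = new_value

        # Reuse the inherited cleanup before handing the frame back.
        self.dedupData()
        self.purgeInvalid()
        return self.df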

Saving Data

To save the data, it first needs to be validated against the Schoolzilla data model. An abstract class will be created that connects to the database using pyodbc; it is inherited by the DataSchemaValidation class below.

class name: Database
method: connect ->
  open connection
method: __del__ ->
  close connection

class name: DataSchemaValidation
inherits from: Database
arguments: DataFrame, userSpecifiedData
method: validateAgainstSchema ->
  compare the DataFrame built from the user data against the Schoolzilla schema
  if data is valid ->
    return is_valid = true
  else ->
    return is_valid = false
method: executeDbQuery ->
  if is_valid ->
    write the data to the database
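
A sketch of the two classes using pyodbc, where the connection string, table name, and expected column set are placeholders rather than the real Schoolzilla schema:

import pyodbc

class Database:
    # Abstract base holding the pyodbc connection lifecycle.
    def __init__(self, connection_string):
        self.connection_string = connection_string
        self.connection = None

    def connect(self):
        self.connection = pyodbc.connect(self.connection_string)

    def __del__(self):
        if self.connection is not None:
            self.connection.close()

class DataSchemaValidation(Database):
    # Placeholder for the columns required by the Schoolzilla data model.
    EXPECTED_COLUMNS = {"Student Number", "assessment", "score"}

    def __init__(self, connection_string, dataframe):
        super().__init__(connection_string)
        self.df = dataframe
        self.is_valid = False

    def validateAgainstSchema(self):
        self.is_valid = set(self.df.columns) == self.EXPECTED_COLUMNS
        return self.is_valid

    def executeDbQuery(self):
        if not self.is_valid:
            return
        self.connect()
        cursor = self.connection.cursor()
        cursor.executemany(
            "INSERT INTO assessment_results (student_number, assessment, score) "
            "VALUES (?, ?, ?)",
            list(self.df.itertuples(index=False, name=None)),
        )
        self.connection.commit()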

Question 2: How would you verify that your solution continues to perform well?

There are a number of ways to verify the performance of the application over time. Initially, the methods in each class can be benchmarked by calling them repeatedly in order to identify slow methods. These benchmark tests would be performed toward the end of each development cycle to guarantee the performance of each release, and they are designed to proactively identify issues before they reach the user population.
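
As one example of this kind of benchmark, the cleanup routine sketched earlier could be timed with the standard library's timeit module against a representative frame (clean_scores and long_form are the illustrative names used above):

import timeit

# Average the cost of 100 repeated calls to the cleanup routine.
elapsed = timeit.timeit(lambda: clean_scores(long_form), number=100)
print(f"clean_scores: {elapsed / 100:.4f} seconds per call")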

The next way to verify the application is to write unit tests to make sure all of its conditions are satisfied. There are a number of points where the application is expected to prevent the user from proceeding; these include cases where the user attempts to upload invalid data, the mimetypes of the uploaded files are invalid, or the data does not match the Schoolzilla schema. Additional unit tests will be written to ensure that the methods perform according to spec, including making sure the data is successfully transposed by DataIntegrity.transformData and that data types can be coerced by DataIntegrity.coerceData. The success of these methods can be further verified with web driver tests using technologies such as PhantomJS or Selenium.
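
A minimal sketch of two such unit tests, written with the standard unittest module against the hypothetical clean_scores helper from the earlier sketch:

import unittest
import pandas as pd

class CleanupTests(unittest.TestCase):
    def test_non_numeric_scores_are_purged(self):
        frame = pd.DataFrame({
            "Student Number": [8675309],
            "assessment": ["Math Score"],
            "score": ["blue"],
        })
        self.assertTrue(clean_scores(frame).empty)

    def test_out_of_range_scores_are_purged(self):
        frame = pd.DataFrame({
            "Student Number": [1597530],
            "assessment": ["Math Score"],
            "score": [150],
        })
        self.assertTrue(clean_scores(frame).empty)

if __name__ == "__main__":
    unittest.main()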

A final approach is to implement New Relic performance monitoring. This type of monitoring is reactive - i.e. issues must already be affecting the user base before they can be identified. However, this information is invaluable, as it pinpoints the slower parts of the application and helps determine candidates for refactoring.