Schoolzilla Developer Homework Solution
When accepting user-submitted data in a variety of formats, the data must pass through a routine that converts it to a standardized format. Given the sample data below, a number of classes need to be designed to organize these routines into a scalable application structure and ensure the application functions properly for its users.
Table 1: sample customer data

| Student Number | Math Score | Science Score |
|----------------|------------|---------------|
Build a tool to accept data in a variety of formats from Schoolzilla users. Then, perform the required cleanup and mapping on the Schoolzilla website. The tool should be able to do the following:
- Map columns from the customer dataset to the columns required for the Schoolzilla data model
- Display sample rows from the customer dataset
- Allow values to be changed by the customer
- Validate that the values entered are appropriate for the Schoolzilla element to which they've been mapped
Question 1: How would you design a solution to this problem?
The solution to the problem will use a series of Python libraries, most notably pandas, which provides a fast hash table implementation, in-memory data manipulation, and the ability to handle the "missing data" problem present in the sample data.
The programming challenges that need to be overcome in the sample problem are as follows:
Handle the Upload of a User Submitted File and Confirm Its Validity
To get the data into Schoolzilla, the first step is for the user to upload tabular data. For the purposes of the file upload, the following assumptions are made:
- The user uploads their raw data files by using a web form which already exists.
- The user uploads data in Excel format.
- Each spreadsheet contains column headers in the first row of the spreadsheet.
- Multiple spreadsheets can be uploaded at one time, but each spreadsheet must have data in the same format (i.e. the column headers are identical in placement and number across the uploaded files).
- A web form also exists that confirms a previous upload and asks the user whether he or she is finished uploading or has more files to upload.
The first class will take one or more spreadsheets from a user submission:
```
class name: UserSubmittedData
argument: uploaded file list

method: initialize ->
    call the validateUpload method

method: validateUpload ->
    if the form submission is not empty ->
        loop over the list of uploaded files with their indices ->
            if the file mimetype is an unaccepted format ->
                add the index of the invalid file to an array
        if the length of the invalid files array is 0 ->
            call the generateDataFrame method
        else ->
            display a message to the user indicating which file or files are invalid
    else ->
        display a message to the user asking him or her to upload files containing data

method: generateDataFrame ->
    if a DataFrame does not exist that contains the results of a previous upload ->
        create a new DataFrame to store the results
    loop through the list of uploaded files ->
        read the file
        if there are no errors ->
            append to the DataFrame
    call the displayConfirmation method

method: displayConfirmation ->
    display the confirmation form per the final assumption above

method: isFinished ->
    if the user wants to upload more files ->
        call validateUpload
    else ->
        return the DataFrame to the application's scope
```
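The validation and ingestion steps above can be sketched in Python as follows. This is a minimal sketch: the function names, the `(mimetype, payload)` pair representation of an upload, and the set of accepted mimetypes are assumptions standing in for whatever the existing web form actually delivers.

```python
import io

import pandas as pd

# Accepted Excel mimetypes (an assumption; the spec only says Excel).
ACCEPTED_MIMETYPES = {
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def validate_upload(uploaded_files):
    """Return the indices of files whose mimetype is not an accepted format.

    uploaded_files is a list of (mimetype, payload_bytes) pairs -- a
    stand-in for the objects produced by the existing upload form.
    """
    return [
        i for i, (mimetype, _payload) in enumerate(uploaded_files)
        if mimetype not in ACCEPTED_MIMETYPES
    ]


def generate_dataframe(uploaded_files, existing=None):
    """Read each spreadsheet and append it to a single DataFrame."""
    frames = [] if existing is None else [existing]
    for _mimetype, payload in uploaded_files:
        # Headers are assumed to be in the first row of each sheet.
        frames.append(pd.read_excel(io.BytesIO(payload)))
    return pd.concat(frames, ignore_index=True)
```

If `validate_upload` returns a non-empty list, the application would display the invalid-file message instead of calling `generate_dataframe`.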
Map the Data and Establish Cleanup Mechanisms
Assuming all students have a unique ID stored in a single column of the table, this column needs to act as the unique ID for the student and be copied to each resulting row. In the example in Table 1, this information maps cleanly to the first column, but in other cases the ID might be in column 2. The column containing the student ID should therefore be user defined to accommodate the various locations where it can appear.
Columns containing data other than the key ID column are transposed into rows when the data is present, falls within the valid range, and is of the correct data type. The application also assumes that duplicates should be removed, as should data that falls outside the valid range.
```
class name: DataIntegrity
arguments: DataFrame, userSpecifiedData

method: initialize ->
    displayRawDataPreview
    confirmTransformation
    purgeInvalid

method: displayRawDataPreview ->
    load the DataFrame
    load the userSpecifiedData if defined
    allow user to specify the following:
        the column containing data
        the column which acts as the unique ID for the student record
        datatypes for each of the above columns
        valid ranges for each column
    store userSpecifiedData and return to application scope for potential reuse
    call coerceData

method: coerceData ->
    coerce the data using the datatype specified in displayRawDataPreview

method: transformData ->
    transpose the data using pandas.melt()
    show the user a preview of the first 25 rows

method: confirmTransformation ->
    if the user says the data is ok ->
        call dedupData
    else ->
        call displayRawDataPreview

method: dedupData ->
    deduplicate data using pandas.DataFrame.drop_duplicates()
    update the DataFrame

method: purgeInvalid ->
    remove invalid data that does not fall within the valid range
    remove NaN results from the data
    return DataFrame to application scope
```
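The core of the coerce/transform/dedup/purge pipeline can be sketched with pandas. The function name and the `valid_ranges` mapping are illustrative assumptions; only `pandas.melt()` and `drop_duplicates()` come from the design above.

```python
import pandas as pd


def transform_and_clean(df, id_col, value_cols, valid_ranges):
    """Melt score columns into rows, then dedupe and purge invalid values.

    valid_ranges maps each melted column name to an inclusive (low, high)
    pair, as captured from the user in displayRawDataPreview.
    """
    # transformData: transpose the score columns with pandas.melt().
    long_df = pd.melt(df, id_vars=[id_col], value_vars=value_cols,
                      var_name="element", value_name="value")
    # coerceData: unparseable entries become NaN instead of raising.
    long_df["value"] = pd.to_numeric(long_df["value"], errors="coerce")
    # dedupData: drop exact duplicate rows.
    long_df = long_df.drop_duplicates()
    # purgeInvalid: remove NaN results and out-of-range values.
    long_df = long_df.dropna(subset=["value"])
    low = long_df["element"].map(lambda e: valid_ranges[e][0])
    high = long_df["element"].map(lambda e: valid_ranges[e][1])
    return long_df[long_df["value"].between(low, high)].reset_index(drop=True)
```

With the Table 1 layout, `id_col` would be `"Student Number"` and `value_cols` would be `["Math Score", "Science Score"]`, each constrained to a 0-100 range.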
Per the specs for the application, users should be able to add and edit data. To do this, a class will subclass the class above in order to take advantage of its data deduplication and removal of invalid data.
This class calls a front-end web form that allows users to edit, insert, or delete data, and it handles the update of the DataFrame. Methods inherited from the superclass are then called to deduplicate the data, purge the invalid results, and return the updated DataFrame to the application scope.
```
class name: EditData
inherits from: DataIntegrity
arguments: DataFrame, userSpecifiedData

method: updateData ->
    allow customer to insert, delete, and edit data
```
In order to save the data, it needs to be compared against the Schoolzilla data model. An abstract class will be created to connect to the database using pyodbc; it is inherited by the DataSchemaValidation class below.
```
class name: Database

method: connect ->
    open connection

method: __del__ ->
    close connection
```
```
class name: DataSchemaValidation
extends: Database
arguments: DataFrame, userSpecifiedData

method: validateAgainstSchema ->
    compare the DataFrame built from the user data against the Schoolzilla schema
    if the data is valid ->
        return is_valid = true
    else ->
        return is_valid = false

method: executeDbQuery ->
    if is_valid ->
        write the data to the database
```
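The validate-then-write flow can be sketched as below. The `SCHEMA` dict, table name, and column names are hypothetical placeholders; the real Schoolzilla schema would be read from the database rather than hardcoded.

```python
import pandas as pd

# Hypothetical target schema: column name -> required NumPy dtype "kind"
# ('i' = integer, 'O' = object/string, 'f' = float).
SCHEMA = {"student_id": "i", "element": "O", "value": "f"}


def validate_against_schema(df, schema=SCHEMA):
    """validateAgainstSchema: check column names and dtype kinds."""
    if list(df.columns) != list(schema):
        return False
    return all(df[col].dtype.kind == kind for col, kind in schema.items())


def execute_db_query(df, connection_string):
    """executeDbQuery: write the frame only when it validates (sketch)."""
    if not validate_against_schema(df):
        return False
    import pyodbc  # assumed available; table and columns are illustrative
    with pyodbc.connect(connection_string) as conn:
        conn.cursor().executemany(
            "INSERT INTO scores (student_id, element, value) VALUES (?, ?, ?)",
            list(df.itertuples(index=False, name=None)))
        conn.commit()
    return True
```

Checking the dtype *kind* rather than the exact dtype keeps the comparison tolerant of, for example, int32 versus int64.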
Question 2: How would you verify that your solution continues to perform well?
There are a number of ways to verify the performance of the application over time. Initially, the methods in each class can be benchmarked by calling them repeatedly in order to identify slow methods. These benchmark tests would be performed toward the end of the development cycle to guarantee performance for each release, proactively identifying issues before they reach the user population.
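A minimal benchmark harness along these lines can be built on the standard library's `timeit` module. The `benchmark` helper and the 1,000-row workload are illustrative assumptions, not part of the original design.

```python
import timeit

import pandas as pd


def benchmark(func, *args, repeat=5, number=10):
    """Time `number` calls to func, `repeat` times, and return the best
    per-call time in seconds (the usual timeit methodology)."""
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number


# Illustrative workload: melting a 1,000-row frame shaped like Table 1.
df = pd.DataFrame({
    "Student Number": range(1000),
    "Math Score": range(1000),
    "Science Score": range(1000),
})
per_call = benchmark(pd.melt, df, ["Student Number"])
```

Recording `per_call` for each method across releases turns these one-off measurements into a regression check.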
The next way to monitor the performance of the application is to write unit tests to make sure all the conditions of the application are satisfied. There are a number of points where the application is expected to prevent the user from proceeding: when the user attempts to upload invalid data, when the mimetype of an uploaded file is invalid, and when the data does not match the Schoolzilla schema. Additional unit tests will also be written to ensure that the methods perform according to spec, including verifying that the data is successfully transposed by DataIntegrity.transformData and that types can be modified by DataIntegrity.coerceData. The success of these methods can be further verified with web-driver tests using technologies such as PhantomJS or Selenium.
A final method can be achieved by implementing New Relic performance monitoring. This type of monitoring is reactive - i.e. issues need to be present among the user base before they can be identified. However, this type of information is invaluable, as it pinpoints slower parts of the application and helps determine candidates for possible refactoring.