DATA CLEANING & INTEGRATION
Preparing the Data
Chicago Schema
Zip-code Schema
Schools Schema
Remove NULLs
in Chicago Dataset
Removed the null values from the Chicago Area dataset. Only a few variables were affected. Zipcode and School datasets did not have any null values to address.
Drop Irrelevant Attributes
in School Dataset
Irrelevant attributes were removed from the School dataset for ease of use and to limit confusion.
Effects on Data
There was no valuable information lost by the previous actions to clean the data.
Matchers
Given two schemas S and T :
​
The matcher(S, T) = similarity matrix
The similarity matrix holds similarity score(s_i, t_j) pair





Merge of Public School Data and Zipcode Data
We used the Chicago Public Schools Data's attribute, zip-code and Real Estate For Sale by Chicago Zip Code's column zip-code to merge the two tables.
Initial School Data
482 rows
74 columns
Initial Zip-code Data
482 rows
74 columns
Results
Each school in the Public School Dataset has a community area associated with it from the zip-code data.
Final Merged Data
464 rows
44 columns

Jaccard Similarity
A Jaccard Similarity was done on all of the data column names to see if any of the column names looked promising as pairs.
The most promising columns were Per Capita Income's column, CommunityAreaName with the column in Real Estate For Sale by Chicago Zip Code which contains the community area
This enabled us to double check our work and ensure that these two columns would have instances that matched. Then, the scores from the Jaccard Similarity were used to merge the community area names with high similarity scores.


Name Based Matcher
This determines a match based on the similarity between attribute names.
​
Each name is represented as a string so the matcher uses a string similarity metric.
​
The metric for our named-based matcher was Jaccard similarity.

Instance based
determines a match based on how similar the data instances of two given attributes are
​
each attribute is converted to a set of its data instances
the metric for determining similarity is then the overlap of the sets
​
Then, Jaccard overlap was used
Combine matchers
after we had the name-based and instance-based matchers scores we combined them for a final combined score
so a match will only be determined if they have similar names as well as similar data instances