top of page

DATA CLEANING & INTEGRATION 

Preprocessing

Preparing the Data 

Chicago Schema

Zip-code Schema

Schools Schema

Methods

Remove NULLs 

in Chicago Dataset

Removed the null values from the Chicago Area dataset. Only a few variables were affected. Zipcode and School datasets did not have any null values to address.  

Drop Irrelevant Attributes

 in School Dataset

Irrelevant attributes were removed from the School dataset for ease of use and to limit confusion. 

Effects on Data

There was no valuable information lost by the previous actions to clean the data. 

Matchers

Given two schemas S and T : 

​

The matcher(S, T) = similarity matrix

The similarity matrix holds similarity score(s_i, t_j) pair

Screen Shot 2019-12-12 at 10.06.45 PM.pn
Merge
Merge of Public School Data and Zipcode Data

We used the Chicago Public Schools Data's attribute, zip-code and Real Estate For Sale by Chicago Zip Code's column zip-code to merge the two tables.

Initial  School Data

482 rows

74 columns

Initial  Zip-code Data

482 rows

74 columns

Results

Each school in the Public School Dataset has a community area associated with it from the zip-code data.

Final Merged Data

464 rows

44 columns

Jaccard Similarity

Jaccard Similarity

A Jaccard Similarity was done on all of the data column names to see if any of the column names looked promising as pairs.

 

The most promising columns were Per Capita Income's column, CommunityAreaName with the column in Real Estate For Sale by Chicago Zip Code which contains the community area

This enabled us to double check our work and ensure that these two columns would have instances that matched. Then, the scores from the Jaccard Similarity were used to merge the community area names with high similarity scores. 

Name Based Matcher

Name Based Matcher 

This determines a match based on the similarity between attribute names.

​

Each name is represented as a string so the matcher uses a string similarity metric.

​

The metric for our named-based matcher was Jaccard similarity.

Instance based

determines a match based on how similar the data instances of two given attributes are

​

each attribute is converted to a set of its data instances

the metric for determining similarity is then the overlap of the sets

​

Then, Jaccard overlap was used

Instance Based

Combine matchers

after we had the name-based and instance-based matchers scores we combined them for a final combined score

 

so a match will only be determined if they have similar names as well as similar data instances 

Combine Matchers

Final Data

bottom of page