Data Integration

DATA CLEANING & INTEGRATION

Preprocessing

Preparing the Data

Chicago Schema

Zip-code Schema

Schools Schema

Methods

Remove NULLs

in Chicago Dataset

Removed the null values from the Chicago Area dataset. Only a few variables were affected. Zipcode and School datasets did not have any null values to address.

Drop Irrelevant Attributes

in School Dataset

Irrelevant attributes were removed from the School dataset for ease of use and to limit confusion.

Effects on Data

There was no valuable information lost by the previous actions to clean the data.

Matchers

Given two schemas S and T :

The matcher(S, T) = similarity matrix

The similarity matrix holds similarity score(s_i, t_j) pair

Merge

Merge of Public School Data and Zipcode Data

We used the Chicago Public Schools Data's attribute, zip-code and Real Estate For Sale by Chicago Zip Code's column zip-code to merge the two tables.

Initial School Data

482 rows

74 columns

Initial Zip-code Data

482 rows

74 columns

Results

Each school in the Public School Dataset has a community area associated with it from the zip-code data.

Final Merged Data

464 rows

44 columns

Jaccard Similarity

Jaccard Similarity

A Jaccard Similarity was done on all of the data column names to see if any of the column names looked promising as pairs.

The most promising columns were Per Capita Income's column, CommunityAreaName with the column in Real Estate For Sale by Chicago Zip Code which contains the community area

This enabled us to double check our work and ensure that these two columns would have instances that matched. Then, the scores from the Jaccard Similarity were used to merge the community area names with high similarity scores.

Name Based Matcher

Name Based Matcher

This determines a match based on the similarity between attribute names.

Each name is represented as a string so the matcher uses a string similarity metric.

The metric for our named-based matcher was Jaccard similarity.

Instance based

determines a match based on how similar the data instances of two given attributes are

each attribute is converted to a set of its data instances

the metric for determining similarity is then the overlap of the sets

Then, Jaccard overlap was used

Instance Based

Combine matchers

after we had the name-based and instance-based matchers scores we combined them for a final combined score

so a match will only be determined if they have similar names as well as similar data instances

Combine Matchers

Caruso

Magliente

Palegar

Poyer

DATA CLEANING & INTEGRATION

Preparing the Data

Chicago Schema

Zip-code Schema

Schools Schema

Remove NULLs

in Chicago Dataset

Drop Irrelevant Attributes

in School Dataset

Effects on Data

Matchers

Merge of Public School Data and Zipcode Data

Initial School Data

482 rows

74 columns

Initial Zip-code Data

482 rows

74 columns

Results

Final Merged Data

464 rows

44 columns

Jaccard Similarity

Name Based Matcher

Instance based

Combine matchers

Final Data

Caruso

Magliente

Palegar

Poyer

LINKS

CONTRIBUTORS