Contextual Spatial Outlier Detection with Metric Learning

Project description

This project is conducted by Guanjie Zheng, Susan L. Brantley and Zhenhui (Jessie) Li from Pennsylvania State University.

The project is motivated by a real-world environmental problem raised by hydraulic fracturing. Hydraulic fracturing (or “fracking”) is a revolutionary well stimulation technique for shale gas extraction, but it has spawned controversy over environmental contamination. If methane from gas wells leaks extensively, this greenhouse gas can affect drinking water wells and contribute to global warming. Recently, some geologists have raised the concern that fracking can pollute nearby groundwater and air.

We therefore propose to use data mining techniques to find anomalous water or air samples, which may indicate potential leakage. We further extend this to a general contextual spatial outlier detection problem that applies to many spatial datasets.

We developed a contextual spatial outlier detection technique, using both spatial and non-spatial contextual attributes to find contextual neighbors. We combine the attributes using robust metric learning.
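As a rough illustration of the idea of contextual neighbors (a sketch, not the project's implementation), the Python snippet below scores each sample by how far its behavioral attribute lies from the median of its nearest neighbors in contextual-attribute space. It assumes fixed, hand-picked attribute weights in place of the metric that the project learns with robust metric learning. The function name contextual_outlier_scores, the parameter k, the weights, and the synthetic data are all hypothetical.

```python
import math
import random
import statistics


def contextual_outlier_scores(context, behavior, weights, k=5):
    """Score each sample by how far its behavioral value lies from the
    median behavioral value of its k nearest contextual neighbors."""
    n = len(context)
    dims = range(len(weights))

    # Standardize each contextual attribute so the weights are comparable.
    columns = [[row[d] for row in context] for d in dims]
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    z = [[(row[d] - means[d]) / stds[d] for d in dims] for row in context]

    def distance(a, b):
        # Weighted Euclidean distance. The weights here are an assumption
        # standing in for a learned metric.
        return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

    scores = []
    for i in range(n):
        # Rank every other sample by contextual distance to sample i.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: distance(z[i], z[j]))
        neighbor_values = [behavior[j] for j in others[:k]]
        # The neighbor median tolerates a few anomalous neighbors.
        scores.append(abs(behavior[i] - statistics.median(neighbor_values)))
    return scores


# Synthetic demo data (hypothetical): behavior depends on the first attribute.
rng = random.Random(0)
context = [[rng.random(), rng.random()] for _ in range(60)]
behavior = [10 * c[0] + rng.gauss(0, 0.1) for c in context]
behavior[17] += 50  # inject one anomalous sample

scores = contextual_outlier_scores(context, behavior, weights=[1.0, 1.0])
print(scores.index(max(scores)))  # index of the highest-scoring sample
```

In this toy setup the injected sample should receive the highest score, because its behavioral value is far from that of samples with similar context. A learned metric would replace the fixed weights, so that attributes that better explain the behavioral attribute count more when neighbors are chosen.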

We conducted extensive experiments on five real datasets, and our method outperforms the baseline on all of them.

Dataset

We used five datasets to demonstrate that our method performs better than the baseline. These datasets are:

Water:

This dataset contains 1,645 samples of the methane concentration in groundwater in Pennsylvania. Each sample has 11 contextual attributes (i.e., features) describing the sampling location and nearby emission sources. The behavioral attribute is the methane concentration measured in the groundwater sample, which ranges from 1 to 46,500 ppb. The outliers found in this dataset could potentially indicate shale gas leakage problems.

Air:

This dataset contains 34,100 samples of the methane concentration in the atmosphere, collected in the United States. There are 30 contextual attributes describing the meteorology, geography, and emission sources near the sampling locations. As in the Water dataset, the behavioral attribute is the methane concentration, here measured in the atmosphere. The methane concentration values range from 0 to 1,420 ppb.

Zillow:

This dataset contains 1,511 house sale records in State College, Pennsylvania, from 2014 to 2016. The sold price ranges from $100,000 to $975,000. The contextual attributes describing the properties include latitude, longitude, square footage, and year built (7 attributes in total). We use the most recent sold price as the behavioral attribute. This data was collected through the Zillow API.

El Nino:

This dataset contains 93,935 samples collected in the equatorial Pacific. The contextual attributes are oceanographic and surface meteorological variables (6 contextual attributes in total), and the behavioral attribute is sea surface temperature. This dataset was downloaded from the UCI repository.

Hydro:

This dataset contains 308 records describing the relationship between the shape of a ship and the residuary resistance that the ship bears in water. The longitudinal position of the center of buoyancy and five shape parameters of the ship are used as contextual attributes. The behavioral attribute is the residuary resistance. This dataset was downloaded from the UCI repository.

Code

The code is here. Please run "bash runexp.sh" in the experiment/outlier_detection/python folder.