Extremely Large Scale Graph Representation Learning in Practice

Speaker

Hongxia Yang, Alibaba Group

Time

2018-07-16 14:00:00 ~ 2018-07-16 15:30:00

Location

SEIEE-3-412

Host

Weinan Zhang

Abstract

Extremely large-scale graphical models have been playing an increasingly important role in big data companies. In particular, graph inference combined with deep learning has achieved successful interim results in many of Alibaba's business scenarios. The data of the Alibaba ecosystem is extremely rich and varied, covering shopping, travel, entertainment, and payment. We are developing a new generation of graph learning platform that can efficiently perform inference and analysis on billions of nodes and billions of edges. In this talk, I will share two related works that have been accepted by IJCAI and KDD 2018, respectively:

1. Network representation learning (RL) aims to map the nodes of a network into a low-dimensional vector space while preserving the inherent properties of the network. Though network RL has been intensively studied, most existing works focus on either network structure or node attribute information. In this paper, we propose a novel framework, named ANRL, to incorporate both the network structure and node attribute information in a principled way. Specifically, we propose a neighbor enhancement autoencoder to model the node attribute information, which reconstructs a node's target neighbors instead of the node itself. To capture the network structure, an attribute-aware skip-gram model is designed based on the attribute encoder to formulate the correlations between each node and its direct or indirect neighbors. We conduct extensive experiments on six real-world networks, including two social networks, two citation networks and two user behavior networks. The results empirically show that ANRL can achieve relatively significant gains in node classification and link prediction tasks.

2. The e-commerce era is witnessing a rapid increase in mobile Internet users. Major e-commerce companies nowadays see billions of mobile accesses every day. Hidden in these records are valuable user behavioral characteristics such as shopping preferences and browsing patterns. To extract this knowledge from the huge dataset, we first need to link records to the corresponding mobile devices. This Mobile Access Records Resolution (MARR) problem is confronted with two major challenges: (1) device identifiers and other attributes in access records might be missing or unreliable; (2) the dataset contains billions of access records from millions of devices. To the best of our knowledge, as this is a novel and challenging industrial problem of the mobile Internet, no existing method has been developed to resolve entities using mobile device identifiers at such a massive scale. To address these issues, we propose a SParse Identifier-linkage Graph (SPI-Graph) accompanied by abundant mobile device profiling data to accurately match mobile access records to devices. Furthermore, two versions (unsupervised and semi-supervised) of the Parallel Graph-based Record Resolution (PGRR) algorithm are developed to effectively exploit the advantages of large-scale server clusters comprising more than 1,000 computing nodes. We empirically show the superior performance of the PGRR algorithms on a very challenging and sparse real data set containing 5.28 million nodes and 31.06 million edges.

Bio

Dr. Hongxia Yang is a Senior Staff Data Scientist and Director at Alibaba Group. Her interests span the areas of Bayesian statistics, time series analysis, spatial-temporal modeling, survival analysis, machine learning, data mining and their applications to problems in business analytics and big data. Ongoing projects in her team include a huge dynamic multi-level heterogeneous graphical model for a user profiling system, a large-scale distributed knowledge graph and its efficient inference for a data-enabling platform, and a general ensemble prediction framework for various revenue and cost forecasts, among several others. She previously worked as a Principal Data Scientist at Yahoo! Inc. and as a Research Staff Member at the IBM T.J. Watson Research Center, and received her PhD in Statistics from Duke University in 2010. She has published over 30 papers in top conferences and journals, holds 9 filed or to-be-filed US patents, and serves as an associate editor for Applied Stochastic Models in Business and Industry. She became an Elected Member of the International Statistical Institute (ISI) in 2017.
