Assistant Professor
Canada CIFAR AI Chair
School of Computer Science
McGill University
Email:
CV ·
Google Scholar Profile ·
DBLP
About
I am an Assistant Professor and Canada CIFAR AI Chair in the School of Computer Science at McGill University and at Mila  Quebec AI Institute.I finished my Ph.D. in Computer and Information Science at the University of Pennsylvania in 2020, advised by Prof. Mayur Naik. I received my M.S. in computer science from Vanderbilt University in 2014, before which I obtained my B.E. (with Honors) from Nankai Unversity in 2011. I spent the summer of 2019 as a research scientist intern at DeepMind working with Pushmeet Kohli in the Robust AI team.
News
 Sep. 2021 Sever's work on Symbol Grounding for SATNet is accepted (as a splotlight) by NeurIPS 2021. Congrats Sever!
 Sep. 2021 Joint work on Scalable Differentiable Reasoning is accepted by NeurIPS 2021.
 July 2021 Zhaoyu's work on using Deep Constrative Learning for Theorem Proving is accepted (as an oral) by ICML21 SSL workshop.
 July 2021 Joint work on Datadriven approaches for Inductive Generalization is accepted by FMCAD 2021.
 April 2021 NSERC Discovery Grants application is awarded!
 April 2021 NSERC Discovery Launch Supplement for Early Career Researchers is also award!

Jan. 2021 Appointed as one of 29 New Canada CIFAR AI Chairs.
Interests
My research interests span programming languages, software engineering, security, and artificial intelligence.
My research studies deep learning and reinforcement learning techniques, as well as, classic numerical, statistical, and probabilistic methods
for addressing program reasoning challenges that span verification, synthesis, and testing.
I am also interested in designing logicinspired neural architectures and neurosymbolic systems to tackle broader learning and reasoning challenges,
especially, interpretable, dataefficient, and algorithmic learning.
Artifacts
I enjoy building usable artifacts that validate my research ideas. All tools and datasets that have been developed during my research are opensource and available here.
Research
My research to date has focused on improving software quality by advancing the stateoftheart in static analysis, verification, synthesis, and testing using AIbased techniques ranging from symbolic reasoning, constraint solving, statistical, and probabilistic models, to recent approaches for deep learning and reinforcement learning. At the same time, these program reasoning challenges serve as a rich resource of inspirations for me to advance AI techniques, e.g. scalable inference, logical rule learnging, representation learning, and reinforcement learning with extremely sparse rewards.
Realworld programs are large and complex, which involves many third party libraries that may lack specification or even source code, and new features or patches are being introduced regularly. Plus wellknown limitations such as undecidability, static analysis tools are forced to make approximations, which in turn result in large numbers of false alarms. My research combines statistical and logical methods to address these complexities and uncertainties.

Automatically learning API specifications from "big code". [USENIX Security'16]
This work uses statistical methods to infer API specifications from complex software systems like Linux kenerl, OpenSSL library, PHP and Python. The resulting opensource prototype APISan has found 76 previously unknown API misuse bugs in these systems. APISan does not need access to API's source code or binary; instead, it observes how and in which context APIs are used. 
Improving analysis accuracy by learning from user feedback.
[OOPSLA'17,
MAPL17,
PLDI'19]
A sound static analysis produces false alarms that are unavoidable by the analysis designer. The analysis user, who usually implements the code, has much more information than the analysis designer. Incorporating feedback from the users opens up a new dimension to improve the analysis. However, user can occasionally make mistakes. The insight is to reduce rulebased static analysis into probabilistic models, such as Markov logic network and Bayesian network, so that even noisy feedback can improve analysis accuracy significantly.
Inference and learning are two key components of machine learning applications. After reducing static analysis into probabilistic models, user feedback is incorporated by performing Maximum A Posteriori probability estimate, or MAP inference. Offtheshelf inference engines cannot scale to static analysis applications. My research improves inference by a systematic constraint solving process. Furthermore, inference helps to learn rule weights but not rules that usually designed by experts. To learn logical rules from data, my research proposes a scalable logic program synthesis framework, which not only improves static analysis but also applies to other domains like bigdata analytics and software defined networks.

Scalable inference via systematic constraint solving.
[CP'16,
CAV'17]
MAP inference of Markov logic network is essentially a MaxSAT problem, which can be solved in two phases: first, ground logical rules into a set of weighted constraints; then solve weighted constraints by a MaxSAT solver. However, naively doing so will be extremely inefficient. Both phases can be optimized. From the perspective of application, constraints do not have to be solved entirely, which allows us to have a combined topdown (exploits laziness) and bottomup (exploits eagerness) grounding. From the perspective of the underlying solver, which is invoked many times instead of only once, so that incremental solving is feasible and efficient. 
Synthesizing rich declarative logic programs.
[FSE'18]
Declarative logic programs, in particular Datalog programs, have witnessed promising applications in various domains including static analysis due to its rich expressivity, and also for the exact same reason, its synthesis is a fairly challenging task. This work proposes a systematic template augmentation technique, which enables synthesis of arbitrary rules but without significantly enlarging the search space. Furthermore, we synergistically combines a bidirectional search over version space with active learning so that efficient search is achieved and the number of required examples is minimized. 
Scalable synthesis via numerical relaxation.
[ML4P'18, IJCAI'19]
Inspired by the fact that numerical relaxation technique is widely used and remarkably successful in optimization problems, this work proposes a numerical relaxation technique for logic programs, which is fundamentally different from traditional search based approaches. This work demonstrates that gradientbased and stochastic approaches can be used to learn logic programs very efficiently.
Deep learning sparks a remarkable revolution in many challenging fields of science and engineering. Deep learning combined with reinforcement learning significantly outperforms human experts on video and board games by directly learning from raw pixels and selfplay. However, can this technique help to solve programming challenges? My thesis work aims to answer this very question, which involves addressing two fundamental AI challenges: representation learning of structured data with rich semantics, and the interplay between neural networks and symbolic reasoning.
 Program verification by deep reinforcement learning. [NeurIPS'18, CAV'20]
This work concerns attacking the fundamental challenge of automated software verification, which is finding a strong enough loop invariant for a given loop. We develop an endtoend learning framework, Code2Inv, which takes a piece of code as input, then automatically learns and improves a neural policy by interacting with a checker, and finally produces an invariant so that the verification can be done successfully. To achive so, we prose a novel representation learning technique for programs so that the semantics (rather than syntax) can be captured by neural networks, and two reward mechansims so that reinforcement learning could make progress from extremely sparse reward.  Metalearning on synthesis. [ICLR'19]
This work investigates the capability of deep learning and reinforcement learning in a much broader domain, namely syntax guided program synthesis (SyGuS), where each synthesis task has its own domain specific language, making manually designing heuristics for each task prohibitive. This also makes learning across different tasks extremely difficult as generalizing across different languages (i.e. learn from one language and then test on a different language) is near the limit of machine learning and has been rarely explored so far. Our attempt in this metalearning direction and demonstrates quite promising results — the learned neural representation as well as policy can be transferred to (i.e. help to accelerate) synthesis tasks that use similar but different grammars.  Learningaided reasoning. [FMCAD'21]
Though endtoend learning has significant potential as shown by my recent research, it is impractical to replace classic approaches with endtoend learning for various reasons including interpretability, training difficulty, and perhaps more importantly data inefficiency. Instead, deep learning should be integrated with classic approaches so that there is no need to learn everything from scratch. I envision that the future program reasoning techniques and the underlying constraint solving techniques (e.g. SAT, SMT) will be equipped with a learning component, which can be automatically and continuously improved. One of my ongoing work is introducing learningbased techniques into the stateoftheart software verification and testing tool chain such as SeaHorn and KLEE.
Unartificial intelligence, like human intelligence, often directly process perceptual data (e.g., images, sounds, handwritten texts). Although these data do have interesting structures and rich semantics, automatically extracting them out remains an unsolved challenge. Do we (as homo sapiens) need to solve this challenge? The answer is unfortunately yes, even if every homo sapien masters Computational Thinking, e.g., becomes an excellent algorithm designer and professional programmer, because algorithm design and programming will only work after some "seeminly easy" manual preprocessing. This line of my research is to tackle this "seemingly easy" part jointly with algorithmic learning (i.e., learning efficient and interpretable algorithms).
Deep neural networks, especially convolutational neural networks (CNN), have achieved remarkable successes in processing images. Beyond object recognition, images often contain rich semantics (e.g., physics laws, traffic laws, a lovely story, etc.). Neural approaches like CNNs are unlikely to capture such rich semantics. We believe either more specialized (e.g., logic insipired) architectures or a neurosymboliic design which directly incorporates a logical component are needed. SGSANet (NeurIPS'21a) is a prototype that learn and solve constraints from raw pixels in an endtoend fashion, which is possible due to one key component called Symbol Grounding, which explicitly encourages (but without direct supervision) the model to associate a group of pixels with a unique symbol. Scallop (NeurIPS'21b) is another prototype that integerates neural components with probabilistic logical rules. Scallop enables logical reasoning over raw pixels as well as an external knowledge base, and some immediate benefits are interpretable predictions and great data efficiency.
Publications

Techniques for Symbol Grounding with SATNet.
Sever Topan, David Rolnick, Xujie Si
NeurIPS 2021 [paper, slides, artifact]
Spotlight

Scallop: From Probabilistic Deductive Databases to Scalable Differentiable Reasoning.
Jiani Huang, Ziyang Li, Binghong Chen, Karan Samel, Mayur Naik, Le Song, Xujie Si
NeurIPS 2021 [paper, code]

Datadriven Optimization of Inductive Generalization.
Nham Le, Xujie Si, Arie Gurfinkel
FMCAD 2021 [paper, slides, artifact] 
Code2Inv: A Deep Learning Framework for Program Verification.
Xujie Si, Aaditya Naik, Hanjun Dai, Mayur Naik, Le Song
CAV 2020 [paper, slides, artifact, demo] 
Synthesizing Datalog Programs using Numerical Relaxation.
Xujie Si, Mukund Raghothaman, Kihong Heo, Mayur Naik
IJCAI 2019 [paper, slides, poster, artifact]

Learning a MetaSolver for SyntaxGuided Program Synthesis.
Xujie Si, Yuan Yang, Hanjun Dai, Mayur Naik, Le Song
ICLR 2019 [paper, poster, artifact]

Continuously Reasoning about Programs via Differential Bayesian Inference.
Kihong Heo, Mukund Raghothaman, Xujie Si, Mayur Naik
PLDI 2019 [paper, slides, artifact, video abstract]
ACM SIGPLAN Distinguished Paper Award

SyntaxGuided Synthesis of Datalog Programs.
Xujie Si, Woosuk Lee, Richard Zhang, Aws Albarghouthi, Paris Koutris, Mayur Naik
FSE 2018 [paper, slides, artifact]

Learning Loop Invariants for Program Verification.
Xujie Si, Hanjun Dai, Mukund Raghothaman, Mayur Naik, Le Song
NeurIPS 2018 [paper, slides, poster, artifact, video abstract]
Spotlight

Maximum Satisfiability in Software Analysis: Applications and Techniques
Mayur Naik, Xujie Si, Xin Zhang, Radu Grigore
VMCAI Invited Tutorial 2018 [abstract, slides]

Effective Interactive Resolution of Static Analysis Alarms
Xin Zhang, Radu Grigore, Xujie Si, Mayur Naik
OOPSLA 2017 [paper] 
Maximum Satisfiability in Software Analysis: Applications and Techniques
Xujie Si, Xin Zhang, Radu Grigore, Mayur Naik
CAV 2017 [paper, slides]
Invited Tutorial

APISan: Sanitizing API Usages through Semantic Crosschecking
Insu Yun, Changwoo Min, Xujie Si, Yeongjin Jang, Taesoo Kim, Mayur Naik
USENIX Security 2016 [paper, slides, code]
CSAW Best Applied Research Paper Finalist

On Incremental CoreGuided MaxSAT Solving
Xujie Si, Xin Zhang, Vasco Manquinho, Mikolas Janota, Alexey Ignatiev, Mayur Naik
CP 2016 [paper, slides, code]
Workshop

Graph Contrastive Pretraining for Effective Theorem Reasoning. (Oral Presentation)
Zhaoyu Li, Binghong Chen, Xujie Si
ICML 2021 Workshop: SelfSupervised Learning for Reasoning and Perception 
Difflog: Beyond Deductive Methods in Program Analysis
Mukund Raghothaman, Sulekha Kulkarni, Richard Zhang, Xujie Si, Kihong Heo, Woosuk Lee, Mayur Naik
Machine Learning for Programming Workshop 2018 [paper] 
Combining the Logical and the Probabilistic in Program Analysis
Xin Zhang, Xujie Si, Mayur Naik
ACM SIGPLAN Workshop on Machine Learning and Programming Languages 2017
[paper, slides]
Teaching
 Fall 2021: COMP597 Automated Reasoning with Machine Learning
 Winter 2021: COMP302 Programming Languages & Paradigms
Service

MAPL'20, APLAS'21, PLDI'22, Program Committee

NeurIPS'20, AAAI'21, ICLR'21, NeurIPS'21, AAAI'22, ICLR'22, Reviewer

ICFP'19, CAV'19, CAV'20, Artifact Evaluation Committee

PLDI'18, '19, Student Volunteer CoChair

POPL'16, Student Volunteer