2019 TVM and Deep Learning Compilation Conference: Morning Keynote & Session 1


LUIS CEZE: All right. Good morning, everyone. Good morning. I see some more
people coming in. Please take your– take a seat. Thank you all for coming in this
crisp Seattle typical winter day. So it’s really,
really a pleasure to welcome you to the second TVM
and Deep Learning Compilation Conference. My name is Luis Ceze. I’m going to be your
MC in the keynote. And we’re going to
have a fun keynote. It’s not going to be just one
person talking for an hour. We’re going to like super
high-frequency context switches here. And first, I wanted
to say that we’re super-pleased in how the
TVM-related events are going. This is actually our
third event at UW. This is the second conference. The first one was
a small workshop. We also had a couple of
meetups in the Bay Area and in Shanghai this year that
were also pretty well attended. And this year, we had 200 people
attending, which is pretty– it’s pretty great. It’s great to see. And seeing the growth, I’m so
glad– if you look to the right there, there is a stadium. That’s where we put in a
reservation for the stadium for the next meetup next year. OK. So I wanted to reflect just
a bit on why we are here. So in the machine
learning era as opposed to the software era where
we just write code and run, the way we solve
problems today, we collect data and
model templates. We train these models in
the fastest possible machine you can find, the most
expensive typically. And then you want
to run it on a– you run inference on a fast
enough and cheap machine. But if you look at the
trends in terms of inference, so there is interesting data
that [INAUDIBLE] from Purdue and Micron have been collecting
on the trends of model sizes in terms of parameters and
also computational cost. It’s growing super-fast,
both model size and computational cost. It’s kind of amazing,
mind-boggling to see models not just
with hundreds of millions of parameters but also
approaching 50 to 100 billion ops for one single inference. So that’s inference. But if you look at
training, OpenAI showed this plot, and other people have been showing it as well. But I want to add a twist to it. So what this plot shows is the computational cost of training state-of-the-art machine learning models. Well, it’s interesting here that the trend is roughly that computational cost in number of ops is growing roughly 10X per year. And now let’s look at the one at the top right there, AlphaGo Zero. So if you were to train that on EC2 today, it would cost
about $1.5 million. Now to put that in
perspective, let’s see what you can buy with $1.5 million. So here’s one– [LAUGHTER] –here’s one home
in the Bay Area. Here’s what– so
now you can choose. You either train that– oops. So you either train that
model, or you buy this house. So I’ll let you make this– make a choice. So we can do a
show of hands here. Who would train the model? All right. OK. Who would buy the house? Oh, interesting. More people would
train the model. All right. Cool. Then we had the
right crowd here. But I’m sure you’ve
been following all of the comments people have
been making about the carbon footprint of AI. I think AI’s probably getting too much of a bad rap on this. It’s not just AI. A lot of interesting computational problems have significant carbon footprints. But AI specifically, being so popular, is getting quite a bit of
attention because of its energy cost. But with all this aside,
it gets even more serious. This is for the computer
architects in the room here. You’ve seen this
plot many times. And what this is showing is
that even though we’re still getting more and more
transistors on a chip, we have single-thread performance tapering for a while. And even the number of cores is starting to taper, as well as maximum frequency and the typical power that you can afford to
pump into a single part. Interesting time. If you were to plot the
computational cost of machine learning, it’s
growing much faster than anything else you see here. So there is a big gap that we need to somehow bridge between
these two exponentials. So I think it’s fair to say that
the impact of machine learning would be limited if we don’t
squeeze as much efficiency as we can. And it’s really– that really
means optimizing models, optimizing software and
hardware is really, really key. And with that in mind,
I mean, there’s this perfect storm
brewing right now. So we have a Cambrian
explosion of useful models that we care about, a bunch of
requirements on top of them– these models to be useful. They have to have the right
cost, the right latency, power requirements. Security and privacy
is now getting more and more important. And we have a second Cambrian
explosion of hardware targets. So this is because of
Dennard scaling limitations and, let’s say, Moore’s law approaching its limits. There are just a lot of incentives to do hard [INAUDIBLE]. And all of the
major IT companies are building specialized
hardware for machine learning. And there are quite a few
startups, like almost one every other week, building new
chips for machine learning. And then in the
middle of this all, bridging these models and
these hardware back ends, there is this growing–
fast growing but also quickly fragmenting software
ecosystem on top of this. That leads to a really
interesting scenario where we have this
growing set of– what we’re showing here is a really nice plot that Thierry Moreau put together, showing some of the major deep-learning-focused software stack packages that have appeared. And it’s kind of amazing that Halide, which was really transformational in thinking about how we decouple the schedule from the program, started only six years ago. And then, in 2017, we had over
a dozen major packages released. And now it’s kind of
slowing down a bit. Hopefully, that’s because
there’s some consolidation going on. And people are learning
a lot from this diversity and focusing on what
actually brings most impact. But a lot of these require
a lot of hand-tuning. So despite all of these
efforts in bridging the gap between
models and hardware, you have to do significant
amount of hand-tuning. So having full automation
will be really, really nice. And that’s really what we
set off to do with TVM. So TVM has been around for,
let’s say, almost three years now. And it’s going strong. And here’s one
picture of what we see as a current dominant deep
learning system landscape, all the way from
orchestrators on the top that orchestrate execution
of inference and training in the cloud and in frameworks. Deep learning compilers
that all of the major– a lot of the major hardware
and system providers are building their
own compilers. And then, specifically,
the kernel libraries built by hardware manufacturers
that are often hand-tuned. And then a bunch of
hardware at the bottom. What we set off
to do with TVM is to see if we can actually bridge
from framework down to hardware directly with as much
automation as possible. So we definitely want
to contrast the fact that a lot of the
effort today in getting models to run efficiently on hardware
has been done largely in a platform-specific way. If we can automate the process
of going from framework to how things run in the
actual hardware, we can actually increase reach,
decrease platform dependence, more performance
portability and so on. And a key to doing that the
way we see it is really, I think, this Drawing Hands from Escher shows how we think about
this, is that we see– using machine learning for
better machine learning systems is key to do
this because the design complexity got to a
point where, I mean, we can’t do without significant automation and really, really good tools to reason about large parameter
spaces and so on. So that includes things like: how do you do model optimization strategies and find the right parameters; efficient operator implementations, like what AutoTVM does– you’re going to hear more about this today– data communication patterns; model-hardware co-tuning; even searching for efficient hardware designs.
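As a rough illustration of what that automation looks like in practice, here is a minimal AutoTVM-style tuning sketch in TVM’s Python API of roughly this era; the Relay module `mod` and its `params` are assumed to come from a frontend import, and exact names can vary between TVM versions:

    import tvm
    from tvm import autotvm, relay

    target = "llvm"
    # Extract tunable tasks (e.g. conv2d schedules) from an imported Relay model.
    tasks = autotvm.task.extract_from_program(
        mod["main"], target=target, params=params, ops=(relay.op.nn.conv2d,))

    measure_option = autotvm.measure_option(
        builder=autotvm.LocalBuilder(),
        runner=autotvm.LocalRunner(number=10, repeat=1, timeout=10))

    for task in tasks:
        tuner = autotvm.tuner.XGBTuner(task)  # ML cost model guides the search
        tuner.tune(n_trial=1000, measure_option=measure_option,
                   callbacks=[autotvm.callback.log_to_file("tuning.log")])

    # Compile the model using the best schedules found by the search.
    with autotvm.apply_history_best("tuning.log"):
        with relay.build_config(opt_level=3):
            graph, lib, params = relay.build(mod, target=target, params=params)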
I think it’s fair to say that we’re getting close to being able to go from a model straight down to a hardware design
that serves that model really, really well. So now let’s just reflect a bit
on what happened this past year in the [INAUDIBLE]. And then we’re going to
do a context switch here. So in this past year, we
increased model coverage quite a bit. Now with the close PyTorch
integration and the RelayVM, which you’re going to
hear more about today, we’re able to now run models
like BERT and single shot detection. So this is at the very top. We increase model
coverage quite a bit. But also more hardware backend. We have major hardware back
ends in production now– Cortex-M and RISC-V,
and even some DSP cores we’re going to
hear more about today. And then, in between, we
also added more optimizations in the stack, things like
automatic quantization and more data layout
optimizations as well. But I would say that one of the
things that is most exciting is the fact that we’re putting
quite a bit of effort now as a community in usability. This is more tutorials. We ran a tutorial at
CRC this year that had– was very well attended. And we put up quite
a bit of effort in having tutorials that
are easy to run and follow. And we recorded videos people
have been using and so on. But community
development in general has been going super-well. So [INAUDIBLE] is going to
talk more about this later in the keynote session here. But from last year,
there was 70% growth in the number of
contributors to the project. And today, there are just
under 300 people actually writing code, which is really
great both from academia and industry. And then it’s being
hard– it’s being hardened by industrial users. We are very, very
thankful to [INAUDIBLE] partners that contributed code and also tried it out to see what works and what doesn’t and helped us harden it. We also incubated as
an Apache project. We find it super-important,
especially for more widespread adoption. So incubating as
Apache enables– gives independent
governance, not just a bunch of people sitting
in a room somewhere here. It’s actually the people that are invested in its success, that use it and depend on it, who have a say in where the project goes. But also, most
importantly, it allows competitors to collaborate. And I think this is
super-important, especially when you’re talking about
open-source compilation optimization stacks. So with that, I would say like a
big thank you to the community. We really, really–
like nothing that happened would be possible
without everyone’s contributions. So this is really,
really, really key. So thank you. With that, now, starting
from the hardware, we’re going to pass the
token to Jeff Gehlhaar who is at Qualcomm and just fresh
off of the plane from Hawaii where they had some
fun announcements. JEFF GEHLHAAR: Yeah. Thank you. Thank you, Luis. Thank you always. Again, I’ll echo. I was here last year. And it just feels like
there’s so many more people and so much more excitement
about what we’re doing. OK. Real quickly. Who am I? VP technology at Qualcomm. I run Qualcomm’s AI
software project. I recognize a lot of
faces in the room. And what I wanted to do is
just give you a broad overview of kind of what we do, how we
think about AI at Qualcomm, and then also a little bit
about the problem as Luis just highlighted– the how do you– how do you deal
with hand-tuning all these different architectures? So maybe I’ll do this first
to put it into context. You can see the recorded videos. We announced two new flagship
parts yesterday in Hawaii, actually starting on Tuesday,
the Snapdragon 5G 765 fully integrated– you
can see these parts later. I have them with me– and our flagship part,
the 865, also a 5G device. You can read about
the rest of it online. But when we think about what are
we going to do with these parts and in particular with AI in
general and then specifically, TVM, we announced a cloud
part to Luis’s comment about what Open AI said
about performance, scaling, and power. We’re not the only ones
who understand that. But we have a history in high
performance, low power compute. And so we think we can
bring something to it. And then 5G to connect
it, and then AI at the edge. So we’re thinking
about the world in a way where
everything’s connected and everything’s running AI
and making everybody’s lives better. And the way we do that is we do
have a Qualcomm research team. We have, of course, a product
team that builds chips and does AI and builds networks. So we’re a lifecycle company. I think the one thing that
makes Qualcomm maybe different than some other
companies is I’ll say, I’ve been there a long time. We never invent just
a piece of paper. We always build a
prototype of it. So when our customers
come and say, well, how is this going to
work on this piece of hardware or this piece of
software, we’ve generally tried it at some point, even if
it’s not a commercial offering from us. And so you see that
the problem might start as a research problem
like in some sense TVM did. Or it might start as
a platform problem, like a customer has a problem. Or it might start
with, hey, we’re trying to solve a
problem with AI. And you can see more of
that if you follow up on YouTube the talks from the
Qualcomm Snapdragon Summit. In terms of products
to kind of orient it, Luis already talked about the
diversity of software products. And I think we’re a
little bit to blame. So you’ll see us working
to try to fix that. But the thing we
started out with was the neural processing SDK. It’s still the most popular
way to get onto Snapdragon. And frankly, there’s over a
billion AI enabled devices– a billion AI enabled devices
powered by Snapdragon. The second thing is Android
and an API from Google– Google’s partner. And then I’ll talk
about this in a second. And you’ll hear more about it
in the technical talk, Qualcomm Hexagon neural networks. And, of course, that
diagram on the right really illustrates
the fragmentation that you get when you
have unbounded innovation like we have right now. So let’s zoom in on
that diagram real quick. And then I will leave
the rest of the time for the next speaker. So here’s kind of the stack. It’s obviously simplified a
little bit, diverse hardware. And the question is,
OK, so what about– what about this piece right here? This is where we hand-tune
today, hand-build for the most part all of these libraries,
all of these operators, as you know, for neural networks. And specifically, you’ll hear
a little bit more about this later– Hexagon and AI library. So this is the kind of key
math library, if you will, that runs on top of Hexagon. And it’s highly tuned
for our DSP architecture. And it currently
supports about 100 hand-tuned,
highly-crafted operators. And very successful,
lots of handset manufacturers have launched
all kinds of devices with that stack. But what’s the problem? It’s handwritten. It’s optimized. There are three variations
of the instruction set. It goes into gazillions
of these kinds of chips, each with their own little
variation of TCM size and clock size and bus arrangement. It’s a lot of variables. And we’ve written
these extensions for the Vector Processor
and the Tensor Accelerator. So for both of those– the vector unit and the tensor unit– you have to write code. And it turns out that they have
slightly different instruction sets. And so you have all this
inherent complexity. Incredible demand. Our customers, a
number one thing is, can I have a new operator? Like I would say the vast
majority of our feature requests run like, can
I have a new operator? And when can I have it? Like, tomorrow. So obviously, that’s a problem. And so the idea is Hexagon
is super power efficient. It’s super-flexible. But I’ll leave it
to the Hexagon guys in the room who are
here from Austin. It’s not the easiest
little machine to program. And so getting
the most out of it really takes kind
of specialty craft. And that’s why we’re
interested in TVM. So we see TVM as key
to access on Hexagon. And I got a question
over breakfast. Are you going to use
it for everything? How does it fit in? I’ll make just a
little shout-out to what we announced in Hawaii. We are announcing
user-defined operator. So this is the ability to
plug into our frameworks operators that are written by
our customers, by our partners. You’ll be able to do it
with OpenCL on the GPU. But notably, you’ll also be
able to do it on Hexagon. And so TVM is part of that story, where you’ll be able to use TVM to write operators.
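To give a flavor of what that looks like, here is a minimal sketch of declaring and scheduling an operator in TVM’s Python API of roughly this vintage (the flat tvm.* names shown here later moved under tvm.te); the kernel is a generic vector add rather than anything Hexagon-specific:

    import tvm

    # Describe WHAT to compute: elementwise add of two length-n vectors.
    n = tvm.var("n")
    A = tvm.placeholder((n,), name="A")
    B = tvm.placeholder((n,), name="B")
    C = tvm.compute((n,), lambda i: A[i] + B[i], name="C")

    # Describe HOW to compute it: the schedule is decided separately,
    # so the same operator can be retargeted and retuned per device.
    s = tvm.create_schedule(C.op)
    xo, xi = s[C].split(C.op.axis[0], factor=64)
    s[C].vectorize(xi)

    # Generate code for a target ("llvm" here as a stand-in).
    fadd = tvm.build(s, [A, B, C], target="llvm", name="vector_add")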
There’ll be more details later in the day. Key innovations. Just to kind of recap so you
[INAUDIBLE] key takeaways. We’re a leader in AI silicon. Right here, I’ve got
the leading chipsets in the world respectfully
for those of you in the room who are also
semiconductor manufacturers. On the hardware side, Hexagon hardware is high performance at low power. But to be plain, it’s not the
easiest machine to program. And that’s why we’re here. We’ve really had great success
with things like Halide before TVM. So domain-specific
languages for, say, computer vision like Halide
have been big standouts for us. And notably, big manufacturers
that you would be familiar with have used Halide with Hexagon
to do their CD processing. So we look forward to the same
sort of experience with TVM. And I didn’t talk a lot
about Qualcomm AI research. You’ll hear a little bit
more of a shout-out later. But a lot of the innovation
we’re doing with Qualcomm AI research is focused on
things like quantization, but also on things like
hardware-specific kernel optimizations, domain search,
and that kind of thing. And I’m super-excited
to hear what everybody else in the community
is doing in the same space. If you think there’s a chance
for further collaboration, I can definitely connect
you to the research guys. So with that, I will leave
it to the next speaker. LUIS CEZE: All right. [INAUDIBLE] [APPLAUSE] Oh. Thank you. So now that we
heard from Qualcomm, we’re going to hear from
Yida Wang from AWS who also just announced a new
hardware in the last few days. So, Yida, thank you. Thanks for coming. YIDA WANG: All right. Thank you, Luis. Thank you, Jeff. So I think their presentation
are really nice, and the slides look very good, which makes me kind of nervous because in the next three slides, you are going to see pure text. So bear with me. So my name is Yida Wang. I’m from Amazon. I would like to share with you the story of how TVM is used at AWS. So before that, I
would like to give you a little bit setup about AWS
AI, so what we are doing. So basically, at
AWS, we are providing the broadest and most
complete set of machine learning capabilities to
different kind of customers. So no matter which domain
you are from, so for example, if you are just an
AI user, you are welcome to use our AI
services on the top. So you don’t need to know
much about the technology. You just need to use
it to recognize a face, to recognize as a text, or to
translate the text to speech, so on and so forth. And if you are
data scientist, you are welcome to use our
software suite called Amazon’s SageMaker, which
can help you churn a model, tune a model, and also do
auto-tuning and do faster inference, things like this. And lastly, if you are a
machine learning practitioner, you would like to build
blocks by yourself. We provide you all the
mainstream frameworks and infrastructures that you
can play with by yourself in whatever way you want. So as a result, we
are proud to say that more machine
learning happens on AWS than anywhere else. And more specifically,
you can see that 81% of the deep learning
in the cloud happen on AWS. So you can see that
this is very critical. It is very critical to us
to have good performance on deep learning and have a
high performance deep learning compiler will be
very important to us. So now, I would like to
talk to you about how TVM plays a role in AWS. So because we are
also here last year, so I would just like to
focus on mostly what’s happening in the past year. So first, TVM and AWS, we
do it as a cloud service. Last year, our engineering
manager, Vin Sharma, talked to you about
Amazon SageMaker Neo here. So our slogan is
train model once, and you can run everywhere
with up to 2X speedup. So the service is online. For a year now, we
have quite some users running their models– optimizing and
comparing their models for different kinds of hardware. So next is the solution. So for SageMaker Neo, you can treat it as more like a generic solution: you give us the model, give us a platform, and we try to automatically optimize and compile it for you on the cloud.
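Under the hood, that generic flow looks roughly like the following TVM sketch; the MXNet model object `block`, its input shape, and the targets are placeholders, and API details vary by TVM version:

    import tvm
    from tvm import relay
    from tvm.contrib import graph_runtime

    # Import a trained model once ...
    mod, params = relay.frontend.from_mxnet(block, {"data": (1, 3, 224, 224)})

    # ... then compile it for whichever hardware the customer picks.
    target = "llvm -mcpu=skylake-avx512"   # or "cuda", an ARM triple, etc.
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build(mod, target=target, params=params)

    module = graph_runtime.create(graph, lib, tvm.cpu())
    module.set_input(**params)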
Then the solution here means more like a specific solution. So if you have a very specific
request, you can talk to us. We may be able to do
something for you. Like we can run a
number of models or a number of AWS
Amazon EC2 instance in the fastest performance. And also we do not limit
ourselves on the cloud. We also do things early age. So one example is like we run– we optimize the Alexa wake word
model on all the Amazon Echo smart speakers. So now, if you don’t
know now, now you would say that
like at home, when you say Alexa, if you give
you a response, actually, it’s powered by TVM. The third thing
is we also do not limit ourself inside Amazon. So we also actively talk to
a number of external device makers, including Qualcomm
to see if we can collaborate with people outside to do better
things with those two vendor makers. And also we do it as
a research project. That’s just like papers. We are scientists. Some of us are PhDs. So PhDs are trained
for writing papers. So if you do not
write papers, you feel that life is not complete. They are kidding. This is not a main reason. So the main reason here is
that a deep learning compiler is more like
uncharted territory. So a lot of things that
we would like to do, there is no existing solution. So in order to keep pushing
forward, we would have to– and we are pleased to do some
research to work on the things that we are interested in. And we would like to share
our results with the broader public. So then we write papers. And last, TVM is a compiler. We treat it as a compiler. So Luis just mentioned,
we launched AWS Inferentia powered EC2 instance
just a couple days ago. So for Inferentia,
we are using TVM to– we are building our
compiler on top of TVM. So this is AWS. How about the other way around? What AWS doing in
the TVM community? So first, we are
very proud to be one of the major
contributors as a group. In the community, we joined
effort from the very beginning, so almost three years. And in the past year, we have
leading and collaborating in a few major features here. I just give you some name here. It might be incomplete, and I’m
not going to dive into details. But the point is
that we are working. And the last thing is like
besides the technology part, we are also doing
service in a community. We have two PMC members, eight
committers, and 14 reviewers. And I strongly believe
that the number is growing. Also we do activity
participation and leadership. In the community,
we host meetup, and we lead some
version release. We actively talk to people. We actually participate
in the discussion forum. And so we are very happy to
see the community growing. And I am also very happy
to see so many people today here in the room. So that’s it from my side. Thank you. [APPLAUSE] LUIS CEZE: So now that
we heard from hardware and from both companies,
building hardware and building systems, we’re going to hear
about a new company, Jason. Jason Knight. JASON KNIGHT: Thanks, Luis. So, yeah, I’m here from OctoML. But before talking
about that, I want to start with a prediction. And as any good prediction
should be, it’s a bold one. So if N is the number of people
building machine learning models– and by building, I mean
training, deploying, et cetera; can be anything around
machine learning models– and M is the number of
software developers, then today, N is
much smaller than M. But I argue that it will be
and should be the case that N, it’ll be much larger than M
as we go into the future here. And why is that? One reason is that if you boil
it down to the fundamentals, building software
requires learning a rigid set of semantics
to describe computations for a computer, whereas
machine learning is a much more natural and flexible
way of teaching a system to do some task by showing it
examples or giving it a reward signal. And so why is this not the case
today that it seems simple? Like I give examples
to my kids or my dog. And it learns. And why everyone can do that. But why are so few people
training machine learning models? And the reason is that
deep learning deployment should be easy, but it’s not. It’s not today. It requires a large amount
of complexity and expertise. And we want everyone to be able
to handle and deploy and build these deep learning models. And so why is it complex? Why is there pain here? It’s across the stack– model ingestion, conversion
from one framework to another, software
fragmentation, performance estimation and comparison. How do I– what is it go
to perform on this device versus this device. I have to set up new
tools stacks, et cetera. There’s a Cartesian product
of models, frameworks, and hardware, and the
support across those. If you’re doing CodeGen
there’s a whole raft of options that I have
to learn and figure out what I want to do, et
cetera, et cetera, et cetera. And so coming back to this
vision of deep learning deployment should be
easy for everyone, TVM is core to
making that happen. It’s the core foundation
of techniques and compiler automation that
allows the building blocks of building this future. But it’s only the
first step to doing so. And so what are we doing about
it, this problem of where the complexity threshold
is too high today for more people to be
benefiting from deep learning, machine learning? The first thing is we’re
strengthening the core, so reinvesting and then
continuing to invest in the TVM open-source
project and ecosystem. And then I’m going to talk more
about that in the next slide. But this is the
robustness and resiliency of the project, accessibility
to new developers, the community and model coverage, et cetera. And OctoML is a company which– I’m going to get to– invests in TVM. And so you’ll see several
talks today by OctoML members. Tianqi Chen will be
talking about Unified IR and many other things. Jared will be talking
about dynamic execution. Logan, micro TVM. And then I’ll be talking
about OctoML as an investment in the ecosystem later today. And then also, Josh
in the back, has been working on transformer
improvements in the TVM ecosystem and Ziheng on
automatic quantization, which we’re not talking about today. But coming back to what
we’re doing about this, so in addition to
investing in the core, we decided that we
needed to form a company to continue building upon
the foundation of TVM. And so I’m going to
be talking about that more in terms of what the
first steps are later. So OctoML is a company
designed to make simple, secure, efficient
deployment of machine learning models more accessible
for anyone, everywhere. And we do that by driving
TVM ecosystem in community and adoption and by
expanding the set of users who can deploy machine
learning models by layering these services, automations,
and integrations on top. And so OctoML in the
Apache TVM ecosystem, kind of as Jeff and Yida
mentioned earlier, it’s a synergistic
relationship between the two. And so this is us as
of today, the Octonauts we like to call ourselves. Yes, it is a children’s
show if you have kids, which is even cooler. And yes, so we’re all around. So feel free to grab one
of us and talk and chat. But come to the
presentation later today, I think 2:30-ish to learn
about our first product, the Octomizer. And a little teaser, it’s a
software as a service product offering around
TVM running across cloud, on-prem, et cetera. And then you’ll see more later. And feel free to reach
out and follow us on the various channels, and
reach out directly as well. So thanks. [APPLAUSE] LUIS CEZE: Thank you. Octonauts, yeah. All right. So now one more context switch. You heard me mentioning
broadening model coverage earlier today. And this is a lot to
do with the work that is going on the relay IR that
Zach Tatlock– hi, Zach– who not only being a great member of
the community, our evangelist, is also a human megaphone. Do you need a microphone? You probably don’t need– ZACHARY TATLOCK: I don’t
think we need a microphone. LUIS CEZE: So that’s right. All right. ZACHARY TATLOCK: OK. LUIS CEZE: All right. ZACHARY TATLOCK: Thanks. So I just wanted to
quickly talk about some of the things going on in
relay over the past year. To get context, hop in the
Wayback Machine and teleport back. And for those of you
who were here last year, you saw Jared Roesch, my
student there in the back, giving a talk about the
motivation, the need for relay. Now you can’t really
see in this slide. It’s a little bit blurry. So just to recap,
what’s going on is that increasingly,
state of the art models depend on more and more
programming features. They want data structures
like lists, trees, and graphs. They want control flow–
branches, recursion. And they want all of
that to work together. So they want whole-program
optimizations and analyses that allow each of these things
to contribute to one another. AUDIENCE: It’s just
because [INAUDIBLE].. ZACHARY TATLOCK:
Oh, we’re recording. LUIS CEZE: [INAUDIBLE]
recorded, yeah, yeah. ZACHARY TATLOCK: Oh, OK. I guess I’m not enough to
reach the entire internet. But, yeah. Yeah. And we see this in the
way that frameworks continue to evolve and change. So these features get added,
but sort of one at a time. And they often don’t work
smoothly with each other. So we have this challenge. We want to basically
have expressiveness, have everything work together,
and have performance. And the tension is that you
usually can’t have both. Normally, when you
get expressiveness, you lose some level
of performance. So we decided to take
a crack at it anyway, despite the fact that
we’re a small team of PhDs in a university setting. And thought, well, you know,
when I think performance, I think Haskell. So why don’t we start with
functional programming? Now the– I don’t know
if it means there’s not a lot of Haskell hackers here. But there’s a lot to be said
for taking this approach. So one thing is that you get
a nice, clean semantics, which does ease analysis
and optimization. You also get all of
these programming features that people want
to use to express data in their models. We have recursion. We have control flow. We have data types. And it’s easy to do
mathematical analysis for things like adding automatic
differentiation. So we built this. But, of course, it
wasn’t immediately clear that we’re going to get
the other side of the goal. Are we going to be able
to retain performance? And so I’m really
happy to report that after a year of
really intense engineering, the answer is yes. We haven’t lost anything. This is a low cost for adding
these programming features. Across a range of traditional,
vision models, we match NNVM performance. So that’s great. Sort of achievement unlocked. We wanted to have
both these things– have these high-level
programming features and retain the performance. And we did that. But there’s a
couple of questions that you might be
sitting with still. One is, how does that work? How do we have those
high-level programming features and still maintain the
excellent performance of NNVM? And the second is
kind of so what? We’re just matching what
NNVM can already do. So really quickly,
I want to give you some intuition about answers
to both those questions. But please do feel
free to come find me or any of the Relay students. Stephen’s back here as well. Jared’s wandering around. And I think, Logan and
others are around who– very excited to talk
about the project. So first, how do we
get that performance? So we have a few major tricks. One is that we have a really
fancy type system that allows us to write inference
code that takes advantage of shape and informs the
optimizer about memory layout– available memory
layout transformations. And so this fancy
type system actually helps basically form
an analysis that lets us do many, many
key optimizations. The second thing is
that we have really good partial evaluation– a whole
program partial evaluator which is able to
compile away most of the abstraction that isn’t
necessary for any given model. So you can write your
program at a high level. But the partial
evaluator is going to be able to remove the
overhead of fancier programming features that you don’t need
for this particular execution. But maybe the key thing– one of
things we’re most excited about and that you can be a part of– is that really offers
a very extensible, composable optimization
pass framework. So it’s easy to add new
optimization passes. And they compose really nicely. In fact, there are
Python bindings. So for many optimizations,
you can write them in one screen full of Python. And what’s neat is that these
optimizations work together. So as you can see,
no one optimization is responsible for performance
gains across all models. For some models, one
optimization is key. And other models,
it doesn’t matter. But because we have
this unified approach, these all work together
in a single framework. So this is how we
keep the performance. But so what? What’s new? Well, what’s exciting
is that because it’s easy to add these
transformations and because we have such
an expressive language, it’s easy to support
new kinds of models. So now we have support
for RNNs and LSTMs in TVM that are really difficult or
impossible to encode directly in NNVN but are natural in
Relay because we have recursion, because we have loops, because
we have all these features. And it’s not just adding
new kinds of models. Those same transformations
also make it easier to add support for new
kinds of back ends. I can do the sort of
high-level transformations necessary to massage my
model into a form that fits a particular
target or accelerator. And so we get a lot of benefit
from having these extensible easy to add optimizations. So oh, sorry, one thing I
actually want to highlight. Zhi actually issued
a PR recently. Yeah. And it’s not just actually
the optimizations. It’s also easy to
plug in a new codegen to target new accelerators like
this coming out from Qualcomm and other of our partners. So what’s the takeaway? Well, the high-level thing I
really want to emphasize today is that what I’ve been
talking about so far is sort of research
progress on Relay. But really, it’s not just
a research prototype. Relay is really become
production ready. And you know it’s true
because Tianqi said here in the upcoming release notes, Relay’s been merged
into mainline. It’s publicly available. It’s extensible. We hope it’s really easy
for you to add stuff– add your own passes, target
your own accelerators, add new back ends,
add new optimizations. We’ve put a lot of effort into
the tutorials and documentation to make that possible. So the big takeaway– the action
item is to not be a stranger. The Relay community, just like
the broader TVM community, is super-vibrant and
welcoming and helpful. And we would really love to
have more people involved writing Relay passes,
optimizations, and analyses. So hopefully, next year,
maybe somebody– one of you will be up here talking about
the amazing things you did in Relay over the past year. And we’re really
looking forward to that. Thank you. [APPLAUSE] LUIS CEZE: OK. Thank you. That was awesome. All right. And now, we are going to end
the keynote session with the one and only Tianqi Chen. Hi, Tianqi. It’s your turn. And I would say that a lot
of us are here probably because you started a
lot of these projects. So thank you. TIANQI CHEN: Thanks, Luis. LUIS CEZE: All right. TIANQI CHEN: Yeah. So I’m going to briefly
to talk about what we have done in the past and
will be up to in the next year. But, of course, all
these things wouldn’t happen without the help from
all of our community members. And to recap, in the current
deep learning framework, TVM mainly tries to automate
the hand-optimized version of the libraries. And why do we need
such automation? So if we look at traditional
deep learning frameworks, all those data flow graphs
need to offload those colored nodes onto low level libraries. Like in the case of
Nvidia, you use cuDNN. But the problem is that for
each of the colored node, you have to assign engineers
to go and optimize it. And there are definitely
more than three engineers. As a matter of fact, there
will be a huge team who go and optimize all the
each kind of operators that people care about. Now that if you
try new operators or if you want to do new
system optimizations like fuse the node together into
this new blue node, it will give you
better performance. But then people have to
write this new blue node for that operator. That means more
engineer resources. And if you try to count all
the possible combinatorial combinations from
operator fusion, you can compose it
simply does not scale– does not scale. And if you want to multiply that
by the amount of engineering resources do you
want to apply, it’s a very engineering
intensive process. That is why we want to use a
new approach, a machine learning based approach that
tries to replace many of the human effort that bridges
the high-level computational model by a machine learning
based program optimizer that automatically generate code to
optimize low-level hardware. So why do we need automation? There are several arguments. And today, I think it’s still
not very clear, in the sense that you could ask a huge bunch of teams to go hand-optimize code. And for many big companies,
that is the way to go. But on the other hand,
after several years on trying to bring automation
to machine learning, we start to see the trend. The trend is that if you talk
about traditional models, like ResNet, all those models
that the engineers care about. They will just go and optimize those models. Usually those hand-written libraries are very well optimized. And by using automation, we can compete, though maybe not win by a very large margin.
talk to a customer, usually they will say that I
want to run a model on my car. And that model looks
different from ResNet. And of those emerging models,
because the engineers are not spending their effort
focusing on those target, machine learning can usually
bring you a huge win. And you will hear
more of the talks later today talking about
the successful stories about running TVM
on their model. You can try to do an exercise
to think about for every model people talk about, think about
is that an emerging model? Is that a model that MLPerf
or the standard ML benchmarks include? And usually, you will
see there are a lot of interesting emerging models. Certainly, even on
standard benchmark models, we can still be
very competitive. The reason is that
it’s not like we are trying to compete with
human on the same problem because for human,
it’s usually– when we optimize
deep learning models, we will pick a specific
settings, like a specific data layout, specific operators. However, by using
machine learning, we opens up more
space [INAUDIBLE] optimization by trying different
fusing patterns, different data layout. And that gives us
another different angle for doing increment. And finally, by
using automation, it give us more
automation and help us to port the same machine
learning [INAUDIBLE] across all the hardware
back ends as many of you will hear about today. So to recap, the
TVM stack contains two-layer optimization. The high-level optimization,
that’s the Relay IR that Zach just talked about that helps
us to do graph-level fusion, layout optimization, and Tensor
Expression level IR that help us to do low-level
decisions, like how do we– how do we do operator
fusion and make use of the shared [INAUDIBLE] GPUs. I’m going to switch gears. We’ll talk about a few
highlights from the last year that I see as a common
theme in the community. This is by no means
a comprehensive list. There are so many
things going on. So I’m just trying
to pick some of them. So the first trend
that we see is that there is a trend on
trying to get in more dynamism. So as specifically, when
we talk of deep learning a few years ago, we think about
static computational graphs that tries to– you can construct your
computation in a graph. And then you go and let your
model run this static graph. However, increasingly,
we start to see programs that have most constructions. Specifically, when
you are trying to do LLP or other
tasks, you might need to introduce
construction cycle loops or conditional control flows. In terms of a data
set, traditionally, in computer vision, we always
talk about a single tensor input with a known shape that
makes it easy for analysis. However, as we start to
move machine learning to more target, we
start to tackle– we start to see more kind
of data structures emerging like the passing trees, the
[INAUDIBLE],, and other data structure. That means that we need to
extend our round time compiler framework to be able to support
all these new constructions both in terms of the
program structure that we have as well
as the data structure that we support that
might go beyond tensors. Jerry is going to
talk more about today. But I’m going to try to give
you a sense of overview. One of the most exciting project
in the community this year is called Relay
Virtual Machine that tries to extend the
runtime environment to be able to tackle the programs
that contain control flow and the other things. Really, the Relay virtual machine is a stack-based virtual machine that allows us to execute many of the programs that we have. And another interesting thing is that we have runtime data structures like ADTs and tuples that allow us to easily add new data structures [INAUDIBLE] without having to change the system and compiler.
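As a small sketch of what that enables (API names as of roughly this release, and details vary by version), here is a recursive Relay function built from Python and executed on the Relay VM:

    import tvm
    from tvm import relay

    # sum_to(i, acc): add i + (i-1) + ... + 1 onto acc, via recursion.
    mod = relay.Module()
    sum_to = relay.GlobalVar("sum_to")
    i = relay.var("i", shape=(), dtype="int32")
    acc = relay.var("acc", shape=(), dtype="int32")
    body = relay.If(
        relay.greater(i, relay.const(0, "int32")),
        relay.Call(sum_to, [relay.subtract(i, relay.const(1, "int32")),
                            relay.add(acc, i)]),
        acc)
    mod[sum_to] = relay.Function([i, acc], body)
    mod["main"] = relay.Function([], relay.Call(
        sum_to, [relay.const(10, "int32"), relay.const(0, "int32")]))

    # The VM executor handles the control flow and recursion at runtime.
    ex = relay.create_executor("vm", mod=mod, ctx=tvm.cpu(), target="llvm")
    print(ex.evaluate()())   # expect 55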
The second trend we are going to see is that as we are trying
to build machine learning onto many of the devices,
besides putting machine learning onto the
server devices, a we are also starting
to see our trend on trying to put machine
learning into a tiny devices. This can mean my mobile phones. But it can also mean something
like your thermometer or something that runs on
your camera or other cases. And one of the challenges of
getting the machine learning onto those tiny devices is
that many of those devices don’t necessarily have a robust
software support, not even the operators or
operation system support. Micro TVM, which Logan
will talk more about today, later, it’s a project that
tries to bridge that gap. And by providing bare-metal
support of TVM runtime that allows you to run
on any of the devices that support a J-TAG interface,
which literally means most of the devices out there. And the third highlight
of community improvement in the last year is a
better core infrastructure. So in TVM, we have been working
on the project for around three years now. And while these
core infrastructures are necessary to interact
with your compilation command, they are crucial, actually, in
bringing better performance. This includes things like
building better integer analysis, simplification
so that we don’t have to spend a lot of
time calculating the index when doing test
computation, which actually consumes a huge amount of time. And another very
recent introduction is a unified runtime
object protocol. So traditionally in
TVM, we have objects like AST nodes, NDArrays,
tensor, closure, and modules. And each of them have their
own runtime data structure. And we are going to unify them
into a single runtime object. The advantage of
this new protocol is that allows us to
expose those runtime object to many other languages directly
and allow us to directly access those objects
from Relay VM runtime so that in the future,
we can support more richer data types in our
machine learning workload. Last but not least, one of
the major design goal of TVM is always trying to be able
to support new specialized accelerators. And when we’re talking about
new specialized accelerators, one of the major
challenge we are facing is, how do we support the
increasing complex tensor computation instruction set as
the low accelerator provide. Traditionally, when
you write programs, you only have to write
scalar version of program, You write for
loops, describe how do you compute each element. That’s very flexible and easy. When you start to try to
write vectorized programs, such as those in Neo,
in ARM, or AVX-512, you have to structure
program a bit in order to make use
of those vector unit. Modern TPUs or
other accelerators have high definition or tensor
instruction that contains– that can be a matrix
vector– matrix, matrix– or like a high-dimensional
memory load with certain padding or
selection properties. It’s very challenging
to build system to support all those
emerging tensor instructions. That’s why we want to
introduce tensorization as a call primitive in TVM. The idea is that
not only we want to ask users to describe
compute specifications. We also want to use the same
IR to describe the hardware behavior of the hardware. And the idea of tensorization
is trying to mix and match these two things. And by able to allow the
high-level program to translate into the low-level
accelerated instruction that is supported
by those hardware. Tensorization not only
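Concretely, the tensorize scheduling primitive works roughly like the sketch below, written in the style of the public tensorize tutorial; `gemv_update` is a hypothetical micro-kernel name standing in for a real hardware intrinsic:

    import tvm

    def intrin_gemv(m, l):
        # What the intrinsic computes: c[i] = sum_k a[k] * b[i, k]
        a = tvm.placeholder((l,), name="a")
        b = tvm.placeholder((m, l), name="b")
        k = tvm.reduce_axis((0, l), name="k")
        c = tvm.compute((m,), lambda i: tvm.sum(a[k] * b[i, k], axis=k), name="c")
        Ab = tvm.decl_buffer(a.shape, a.dtype, name="A", offset_factor=1, strides=[1])
        Bb = tvm.decl_buffer(b.shape, b.dtype, name="B", offset_factor=1,
                             strides=[tvm.var("s1"), 1])
        Cb = tvm.decl_buffer(c.shape, c.dtype, name="C", offset_factor=1, strides=[1])

        def intrin_func(ins, outs):
            # How to lower it: emit a call to the hardware's micro-kernel.
            ib = tvm.ir_builder.create()
            aa, bb = ins
            cc = outs[0]
            ib.emit(tvm.call_extern("int32", "gemv_update",
                                    cc.access_ptr("w"), aa.access_ptr("r"),
                                    bb.access_ptr("r"), m, l, bb.strides[0]))
            return ib.get()

        with tvm.build_config(offset_factor=1):
            return tvm.decl_tensor_intrin(c.op, intrin_func, binds={a: Ab, b: Bb, c: Cb})

    # A larger workload, with one 16-wide tile mapped onto the intrinsic.
    N, M, L = 1024, 512, 64
    A = tvm.placeholder((N, L), name="A")
    B = tvm.placeholder((M, L), name="B")
    k = tvm.reduce_axis((0, L), name="k")
    C = tvm.compute((N, M), lambda i, j: tvm.sum(A[i, k] * B[j, k], axis=k), name="C")

    s = tvm.create_schedule(C.op)
    yo, yi = s[C].split(C.op.axis[1], factor=16)
    s[C].tensorize(yi, intrin_gemv(16, L))
    print(tvm.lower(s, [A, B, C], simple_mode=True))

The same mechanism is what lets a schedule target a tensor-style matrix unit instead of the extern call shown here.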
Tensorization not only helps us to support new TPU-like [INAUDIBLE]. But it also helps us to support things like the Volta GPU, where the Volta Tensor Core [INAUDIBLE] is actually something that looks very much like the tensor [INAUDIBLE] instructions. So this is a new result that [INAUDIBLE] will talk more about today. And as you might notice, on emerging workloads like those needed in transformers,
that Nvidia provide. Talking about [INAUDIBLE]
is one of the most important components of the
TVM stack is VTA, which Terry here worked on. And since last year,
we have been moving VTA to Chisel release,
which we called VTA 2.0. And there’s an exciting
new feature coming in. So nowadays, when we’re
talking about TVM, we talk about how do we optimize
machine learning workloads for the current power. But there’s an even more
interesting question to us, in a sense, given that
machine learning hardware is a fast, evolving field,
how do we go and optimize for our future hardwares that
is not even yet [INAUDIBLE] out. So in TSIM last year, we have
this new infrastructure– in the TVM last year, we have
this new infrastructure of TSIM that allows us to directly
take a hardware description language such as
Verilog or Chisel, going [INAUDIBLE] later
and directly connect that to the TVM runtime. That means that we can
do end-to-end hardware cosimulation and use all the
TVM to optimize your machine learning program even before
your hardware tapes out. And we believe this is
a very exciting future. And that’s something that
we are looking forward– the community– to collaborate
on and pushing forward. So far, I have been touched
on only a few points about the community’s major
focus in the last year. Let me talk a bit about
where we are going. Of course, it is by no means
a comprehensive list. So I’m only talking
about something that I think that are
going to be important. So the first thing is
that as we are starting to see machine learning
deploying to more devices, we are starting to see a huge,
wide spectrum of runtimes and devices [INAUDIBLE] deploy. This might include being
able to interpret my runtime into certain NPU drivers, such
as a Hexagon or other ARMs NPU or HTTP other devices. It might also mean
integrating certain runtime into external runtimes
when user need them, for example, TensorRT or
[INAUDIBLE] TensorFlow runtime when some operator is
not supported by TVM. We want to try to
build a unified runtime interface in the
TVM runtime that allows us to expose all
those different runtimes behind the same interface. The idea is that by subclassing all those runtime modules into what we call the TVM runtime module, and exposing a limited set of functions, including a Get function that gives you the PackedFunc protocol that TVM supports, as well as the binary serialization format, we are able to build a runtime that can package and export all those machine learning runtimes in a single interface. So by using a unified
runtime, we’ll be able to have a composed module exported to a single shared library, load that from any of your front ends, including Python, JavaScript, Java, and Go, and also get automatic support for all the benefits that the TVM runtime gives you, such as RPC support for remote profiling and auto-tuning.
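In practice, from the Python side this looks roughly like the sketch below; `graph`, `lib`, and `params` are assumed to come from relay.build, the device address is made up, and exact module-loading names differ across TVM versions:

    import tvm
    from tvm import rpc
    from tvm.contrib import graph_runtime

    # Package the compiled module(s) into one shared library.
    lib.export_library("deploy.so")

    # Load it back locally (any language frontend can do the same) ...
    loaded = tvm.module.load("deploy.so")

    # ... or ship it to a remote device over RPC for profiling and tuning.
    remote = rpc.connect("192.168.0.42", 9090)
    remote.upload("deploy.so")
    rlib = remote.load_module("deploy.so")
    rt = graph_runtime.create(graph, rlib, remote.cpu(0))
    rt.set_input(**params)
    rt.run()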
The second thing that the community is actively pushing on is actually improving our IRs. So as you heard from Zach today, we have an IR effort called Relay, which is a high-level functional IR. And historically, we also
inherited IR from Halide for a low-level
code optimizations. And that’s everyone
trying to rethink how we can codesign
the IR together to build a unified
IR structure that can benefit both the high-level
and low-level optimization. This still a very early stage. But at a high level,
what we want is we want a unified
module that can give us different functional variants. So a function could
be a Relay function that have a high level– that have a high
level [INAUDIBLE].. But their function is
also be able to call into a low level what
we call TE function that allows us to do low-level
scheduling optimizations. By having this unified IR,
it give us several benefit. First of all, it will simplify
the general compilation flow. So under a new flow,
there will be a– there will be a two major
functional variance. So we can just import
high-level models into our IR module that
are mainly Relay function. And then doing high-level
optimizations here, lower that to a
low-level function that could be a
TE function, which is a near low-level Tensor
IR, or external function that can be optimized by
external compiler. And we can do
autoschedule on that IR. And then CodeGen interface will
hand a middle level IR module onto a runtime
module that we will– that we can then go and deploy. There are certain
benefits on mixing different functional variants in
the same IR in the same module. For example, we could
define a Relay function that causing to those
low-level functions and still being able to
do some cooptimizations across the function boundary. One of the major thing that I
actually want to investigate on is actually improving
the accessibility of TVM. This is one of
lessons we learned from the major
different frameworks, like PyTorch, in a sense that
by enabling more people to be able to use machine
learning frameworks– and in this case, machine
and compiler framework– we’ll be able to increase
the rate of innovation. The reason is that the area
of machine learning, compiler research is still
wide open field. And we believe that the
most important thing is to trying to improve– trying to allow more
people to innovate faster. That’s why I want to bring
first-class Python support to our new IR– unified IR infrastructure. The idea is that for
every IR that you can express in a TVM will allow
you to express a Python that I call hybrid script that
can directly increase [INAUDIBLE] as a secondary tag
text format for the Python IR. And you will be able to directly
write passes and manipulate IR data structures in Python. The reason is that that will
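The existing hybrid script frontend already gives a flavor of this direction; a minimal sketch with the circa-2019 API (the unified-IR Python text format described here goes further than this):

    import tvm

    @tvm.hybrid.script
    def outer_product(a, b):
        # This Python body is parsed into TVM IR rather than executed directly.
        c = output_tensor((100, 99), "float32")
        for i in range(a.shape[0]):
            for j in range(b.shape[0]):
                c[i, j] = a[i] * b[j]
        return c

    a = tvm.placeholder((100,), name="a")
    b = tvm.placeholder((99,), name="b")
    c = outer_product(a, b)                 # a Tensor backed by the hybrid IR
    s = tvm.create_schedule(c.op)
    print(tvm.lower(s, [a, b, c], simple_mode=True))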
The reason is that that will allow us to easily accelerate innovation, including integrating your favorite machine learning method for doing auto-scheduling. And by no means do we want
to stick to Python. So when your method
is product-ready, we allow you to easily shift
to C++ and push a button to be able to build a more robust
compiler that can be reused across the languages. One of the major
thing we want to do is we want to rethink the
low-level IR structure. There are several
things we want to do, including transforming the
current way of transformation to use low-level function
as a basic element of transformations and making
transformations– schedule transformations as
optimization passes and bringing better
tensorization support. Another thing that we
want to be able to enable is trying to think
about how we can interpolate with other
open-source ecosystem. TVM is open-source project. And one of the most important
things of open-source project is we can try to leverage what
existing open-source ecosystem brings us. This includes trying
to be able to make use of existing ecosystem
like LVM and trying to bring– trying to build
interpolations from existing IRs into our version
of IR, including things like TorchScript or
like MLIR TensorFlow variant. And we will also try to
include an external function packaging [INAUDIBLE] that
allows us to introduce external IRs as a component
in the TVM IR module so that allows us to do better
customized packaging and better customized CodeGen that
allows us to combine those modules together. Last but not least,
automation is always a fundamental piece of TVM. And we always want to try
to improve this component. Currently, most of
a TVM automation focus on the low-level IR
optimization [INAUDIBLE] all of TVM. We want to enlarge that
into across the stack by enabling automation, not only
for the low-level IR but also for the back end
component as well as the high-level optimization
such as automatic quantization layout and other things and
combine this optimization with the low-level optimization. These are a lot of things
that we want to work on. And we are going to
spend a lot of effort collectively as a
community to work on these points in coming year. Of course, this is by no
means a comprehensive list. And there are a lot of
other exciting things that computer committee
is going to work on. And many of you are
going to hear many of them giving talks today. So finally, I want to give a few
highlights about the community growth. So TVM, we have been incubating
as Apache TVM for around a year now. And as Luis said, one of
the most important things about Apache is this open,
independent governance that not only allows
us to open source code and share the source code. It also means that we want
to open development, which means that all the
development activities in TVM are public available. The RCs and discussions are
available in the discuss forum and mailing list so that anyone
who want to join the development can jump into the
community and collaborate. And finally, and
most important thing that brings by a project
committee is open governance. That means that whoever
contribute heavily into TVM will be granted
as a commitership and be able jointly lead
the committee together. You will hear about many talks
from our committers and TMC members today. And we’re really looking
forward to welcoming you as a part of community. And in terms of
growth, we’ve already mentioned that we are getting
70% growth since last TVM conference, that hopefully
we can keep that growth rate. And every month,
there are more than 50 also contributing over 140 PRs. And there are more
than 1,000 forum posts. So I would say like the
community is really growing. And it’s really an
exciting time to join. So with that, I’ll
pass token to Luis to introduce you to the
last part of the program. Thank you. LUIS CEZE: All right. Thank you. [APPLAUSE] All right. Well, with that, I’d love
to thank the SAMPL sponsors. Like a lot of the work that
is done on the research side at UW is being generously supported by several sponsors from
academia and industry. And notice that there’s some
space in this [INAUDIBLE] interested in sponsoring us. You know, you can
fit more logos there. But all right. With that, let me just tell
you how the rest of the day is going to go. So you’re going to have now two
20 minute talks on going deeper on how TVM is being used
at AWS from Yida and Zhi and then TVM at Facebook by
Andrew Tulloch and Bram Wasti. And then we’re going to
have a break before we talk about compilers in the end. With that, I’m going to
context switch to Yida and Zhi as the next speakers. So thank you. But in case, just a
couple of logistics. There’s bathrooms in
that corner, right there. Also there is water
bottles there. There’s fountains that
make it easy for you to reuse your water bottle. And also if you
are speaking, make sure you use a microphone so
our friends on YouTube streaming and recording would
actually capture the audio. All right. Cool. Let me just go back here. [INTERPOSING VOICES] PRESENTER: Just give people– I mean the setup. [INAUDIBLE] I just give people,
I mean, the setup. [INTERPOSING VOICES] LUIS CEZE: OK. So [INAUDIBLE]
Yida back on stage to go deeper on TVM at AWS. So thank you. YIDA WANG: Thank you. Hello, everyone again. So I’m Yida. So I would like to discuss with
you the deep learning compiler work at AWS, together with my colleague Zhi. So first, let me bring back the first slide of my keynote with some more details. This is a reduced snapshot of what we are doing at AWS AI. As I told you, there are three levels, and here I just try to expand it a little bit. But it is still a reduced snapshot, for the sake of space; if you are interested in the complete picture, please refer to our CEO Andy Jassy's keynote at re:Invent, delivered earlier this week. So as you can tell
from this snapshot, performance is really important. So imagine that we are able to use a deep learning compiler (in this case we choose TVM) to optimize performance from the bottom, across different kinds of hardware and different frameworks. Then, consequently, we'll be able to do better on the second level and ultimately also better on the third level, the AI services. So a deep learning compiler is really important to us. As I just said, AWS's choice of deep learning compiler is TVM. And because this is a TVM developer conference, I'm going to skip the introduction about what TVM is. And I'm going to
overwhelm you with this instead. This is basically what we are doing at AWS. It might not be the complete list; I tried to summarize things. You can see that we go from training to inference, from [INAUDIBLE] to sparsity, and we do all kinds of optimizations from the graph level to the tensor level, and so on. We also work on accelerators. And on top of all those components, we are also building a service called SageMaker Neo, and we run that service using all the techniques I just described. And I remember that
last year, I kicked off my talk with a slide of portraits of all our team members. Today I'm not able to do that because we do not have enough space, meaning that our team is growing. And also, a spoiler: the very last item of the very last line says we are hiring. So if you are interested, come talk to us. The last thing about this slide
is the color coding. The blue items are what we will cover in this tech session. The orange ones are things our colleagues will cover in the rest of today. And the red ones are things we will not discuss in the presentations, but feel free to come to us if you are interested in any of them. So the first part I would like
to talk about is the QNN dialect. This is mostly work done by my colleague Animesh Jain. Unfortunately, he is not able to make it today because he went back to India to get married. He managed to convince me that his wedding ceremony is a little bit, well, more important than this. So now I come to present
on his behalf. The first thing I want to put upfront is that we are talking about consuming and processing a prequantized model, meaning the model is already quantized; we just want to use TVM to consume it, compile it, and run it. In terms of high-level system design, there are two options. One is to just keep adding new operators from scratch. That might not be good from a system perspective, because you are reinventing a lot of things and not reusing anything already provided by the amazing TVM software stack. So we go with the second option, which is to define a dialect called QNN, for Quantized Neural Network, to encapsulate the work we are doing. Then we can reuse all the Relay passes and the low-level optimizations that the TVM stack already provides. This is much better than
reinventing everything from scratch. So what is the QNN dialect? It is basically a set of operators that correspond directly to the quantized operators in the frameworks, like quantized Conv2D and so on. In the QNN dialect, we lower those QNN operators into a number of Relay operators, and after that we can enjoy all the benefits provided by the original TVM stack. So let me give you
a quick example. This is QNN quantize. Given a data type, a zero point, and a scale (I am not going to talk about quantization itself; I am assuming you are somewhat familiar with what quantization is), we define this QNN quantize operator and lower it into a bunch of vanilla Relay operators like divide, round, cast, and clip, so that the rest of the stack can process it. Another example is Conv2D: if you give us the desired output data type, the input zero point, and the kernel zero point, we can again lower this QNN operator into a bunch of Relay operators. Note that this handles asymmetric quantization; if the zero points are 0, all the operators after the nn.conv2d can be eliminated. So that is basically the idea of the QNN dialect, and below is a rough sketch of what the quantize lowering looks like.
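A minimal sketch, assuming a recent TVM release with Relay, of expressing a QNN-style quantize with plain Relay operators (divide, round, add, clip, cast), which is essentially what the dialect lowers into; the scale and zero-point values are made up for illustration:

```python
import tvm
from tvm import relay

def quantize_to_int8(x, scale, zero_point):
    # q = cast(clip(round(x / scale) + zero_point, -128, 127), "int8")
    scaled = relay.divide(x, relay.const(scale, "float32"))
    shifted = relay.add(relay.round(scaled), relay.const(float(zero_point), "float32"))
    clipped = relay.clip(shifted, a_min=-128.0, a_max=127.0)
    return relay.cast(clipped, "int8")

x = relay.var("x", shape=(1, 8), dtype="float32")
mod = tvm.IRModule.from_expr(relay.Function([x], quantize_to_int8(x, 0.05, 3)))
print(mod)  # shows the vanilla Relay ops the QNN quantize decomposes into
```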
We have some preliminary results as a proof of concept: we consumed a bunch of TFLite models in TVM and ran them, and you can see we get some reasonable speedup here. Note that we use batch size 1. And also, I do not put it here, but I want to assure you that we get accuracy comparable to the float32 models. This is, again, asymmetric quantization; if you do it in a symmetric way, we can get much better speedup, since, as I just showed you, a lot of the computation can be eliminated. So this is the QNN
dialect so far. How can we make it better? If you look at this workflow, what we want to do is consume a prequantized model using the QNN dialect, run some QNN-specific passes, and then use the existing Relay and TVM workflow to run it. The components with a green check are things we have already done. The yellow one, converting MXNet models to the QNN dialect, is work in progress. And you will see that a lot of other things are not done yet. For example, there is a lot of layout-related optimization we haven't done, which means that if we do it, we can get even better speedup than what I showed on the last slide. Also, we have only run this on x86 CPUs, so how about other devices and platforms? That would also be important and interesting to do. So if you're interested, we are very eager to work with you on those components that are not done yet, and together we can make it better. So this is about
the QNN dialect. The next thing I would
like to share with you is about two new
Amazon EC2 instances that we just launched
this week, and how we use TVM to make them better. This is work done by Hongbin, Yizhi, and Haichen, who I think are all in this room, and a lot of people from Annapurna Labs. Annapurna Labs is an Israeli chip company that we acquired a few years ago. The first instance I would like to talk about is the Amazon EC2 Inf1 instance. It is powered by AWS Inferentia, an in-house chip for machine learning inference that we have built over the past few years. Using this instance, we can enjoy lower latency, higher throughput, and better cost-performance compared to another instance called G4, which is powered by the Nvidia Tesla T4 GPU. On this instance, we can get up to 2,000 teraops at sub-millisecond latency. And the instance is integrated with all the popular machine learning frameworks, like TensorFlow, PyTorch, and MXNet. Anyone want to guess why we can integrate with those frameworks? It's through Relay. So this is about the
instance itself; now a little bit about the Inferentia chip. I'm not going to dive into too much detail, but feel free to come to me if you have questions or are interested in knowing more. First, we have four NeuronCores per chip, so we need to do a lot of customized tensor-level optimization to make full use of the chip. Secondly, the chip has a two-stage memory hierarchy: a large on-chip cache plus commodity DRAM. So obviously we need to very carefully and proactively manage data movement between the different kinds of memory to enjoy the available compute power. And third, we have a fast chip-to-chip interconnect using a specialized communication protocol. Because of this, we are able to do model parallelism in a pipeline, basically pipeline parallelism, so that we can increase the throughput of machine learning inference without losing much latency. So this is the first instance –
the EC2 instance that I would like to share with you. The second one is the set of Amazon EC2 M6g, R6g, and C6g instances. These are powered by ARM-based CPUs, an in-house ARM-based processor called AWS Graviton2. As the name suggests, there is an AWS Graviton processor that we announced last year; compared to that, we get much better compute power, memory capacity, and overall performance. And compared to the current x86-based instances, we get a better price-per-performance advantage. This Graviton2 processor is a general-purpose CPU, so we run a lot of general-purpose benchmarks on it, like memcached, to show that it's powerful. But we also want to run machine learning inference on top of it. The first thing I would like to say is that a general-purpose CPU is also capable of doing machine learning or deep learning inference, and actually a lot of inference jobs already happen on general-purpose CPUs. In our ATC '19 paper, we showed that you are able to run machine learning [INAUDIBLE] model inference very efficiently on these kinds of general-purpose CPUs. So compared to M5, which is powered by x86 CPUs, M6g actually does faster
machine learning model inference at a lower price. Note that this is better absolute performance, not just better price-performance. To do this, we use TVM to lower the model all the way down to ARM NEON instructions. And we always go down to the assembly level and inspect the assembly code of the computationally intensive kernels, to make sure we are fully utilizing the compute power this CPU provides. A rough sketch of what compiling for such a CPU looks like with TVM is below.
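This is not AWS's production pipeline, just a sketch assuming a recent TVM release; the toy model and the exact target flags are illustrative, and API names differ slightly across TVM versions:

```python
import numpy as np
import tvm
from tvm import relay

# Toy network: a single dense layer standing in for a real model.
data = relay.var("data", shape=(1, 512), dtype="float32")
weight = relay.var("weight", shape=(256, 512), dtype="float32")
mod = tvm.IRModule.from_expr(relay.Function([data, weight], relay.nn.dense(data, weight)))
params = {"weight": np.random.rand(256, 512).astype("float32")}

# Cross-compile for a 64-bit ARM CPU; NEON support is requested via -mattr.
target = "llvm -mtriple=aarch64-linux-gnu -mattr=+neon"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# Inspect the generated AArch64 assembly of the compiled kernels.
print(lib.get_lib().get_source("asm")[:2000])
# lib.export_library("model_arm.so", cc="aarch64-linux-gnu-gcc")  # ship to the ARM box
```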
So that is the new ARM-based instance we just launched. The last thing I
would like to talk about before handing over
to my colleague Zhi, is the tutorial. Just now, Luis said that we spent a lot of time this year improving the tutorials, and I was part of that effort; I was in the FCIC tutorial as well. But unfortunately, we still think it's not enough. Why is that? Check this out. This is a typical conversation that happens to us, and maybe to you as well. We have customers, users, or new hires who are new to this field. They come to us and ask questions like, how do I use TVM to do something? TVM is very powerful, so it can do this and that; a lot of things can be done using TVM. But how? So they come to us. We are very enthusiastic when people want to use TVM, so we say: you can do the following, step one, step two, step three, which is never going to be easy. Then the customers try it and say, "well, cool," which really means it's too complicated, I cannot remember it; is there any tutorial? Of course we have tutorials, right? So I say, yeah, check this out, and check this for this, and that for that, because our tutorials are kind of fragmented; they're everywhere. And this kind of conversation just iterates. Eventually, they may come to us and say, I failed; TVM is really, really hard to get started with, to use, to deploy, [INAUDIBLE]. They are frustrated. We are frustrated. And the other
evidence I collected from the discussion forum is that you see a lot of threads about fairly basic things: how to start using Relay, confusion about some very simple entry-level functions, "there's no speedup," "is there any beginner guide to contributing?" Things like this. Which makes us think maybe we need to come up with a more systematic tutorial. Luckily, we have some
experience with this kind of thing already. My colleagues Alex, Aston, Zack, and Mu, the four of them, over the past few years wrote an open-source book called Dive Into Deep Learning. It is an interactive deep learning book with code, [INAUDIBLE], and discussion, so that you can run it. It's totally open on GitHub, and you can run it on your laptop in Jupyter notebook fashion. It is available in multiple languages. In Chinese there's already a printed book, a number one bestseller online, and the English version also became available during re:Invent. And if you happen to go to Vancouver, [INAUDIBLE] next week, it's going to be available there as well. So this book is very successful, it has so many stars, and we get good adoption at a
bunch of top universities. We just list a few names here [INAUDIBLE]; they are using this book to teach their deep learning courses. So this is what we have done, and we are thinking about whether we can extend this kind of work to deep learning compilers. So here is the deep
learning D2L compiler effort that we are putting together. But it’s very preliminary. Let me show you something. So what we want to do
is to provide a systematic tutorial for beginners who want to use TVM. I highlight "use" here: it's not for people who would like to be developers of TVM, just for people who want to use it. And more broadly, we want to make it for anyone who would like to take a deep learning compiler 101 class. That's the goal. Inherited from the D2L book, we would like to make it in a Python notebook fashion, so it's runnable on your laptop, or in this case maybe Colab, because we need a GPU to show some things, or maybe even other platforms, so you can run it online. And obviously we will make it available on EC2 instances that you can use. A very preliminary 0.1 is already released. There are 22 sections covering very basic stuff: getting started, how to install TVM, getting your hands dirty with some quick things, and some basic operator-level optimization. So it's far, far away from [INAUDIBLE]. And so we call for contributors. The point here is
that we are hoping that anyone who is interested in doing this kind of thing will join us. It's going to be open source, and you are not going to get money from it, but you are going to be able to help people. A heads-up, though: based on the experience we had with D2L, if you commit to it, this is going to be very time consuming. But if you're interested, just come talk to me at any point.
over to my colleague Zhi. Thank you. [APPLAUSE] LUIS CEZE: Does
anybody have questions before we go to [INAUDIBLE]? Questions? YIDA WANG: Maybe instead
of a question, [INAUDIBLE].. LUIS CEZE: Sounds good. ZHI CHEN: OK. My name is Zhi. I’m going to talk
something more later. Yida just covered some of the TVM-related work we've been doing at Amazon. I'm going to talk about the high-level picture of Neo, and also about two more pieces of work we've been doing at Amazon for TVM. One of them is still being actively worked on, and the other has been merged. So let's first talk about SageMaker Neo. The goal of Neo is to let customers do model training once and then run it anywhere. Neo currently takes
different inputs. We can take algorithms from SageMaker (these are some traditional machine learning algorithms), and we can also take models trained in popular frameworks, as well as models from XGBoost. SageMaker Neo takes these inputs and does compilation and optimization. After the optimization is done, we can deploy the compiled models on various hardware targets like ARM, Cadence, Intel, Nvidia, RISC-V, and [INAUDIBLE]. As [INAUDIBLE], TVM can do most of this work as well, because sitting at the heart of Neo is TVM. Another good thing is that with SageMaker Neo we give users a very quick one-click compilation, so that we can help users simplify the effort of model compilation and deployment. So this is the compilation
flow for the Neo service. A customer comes to Neo and opens the SageMaker console page. Then they bring their own models (these models could be trained by SageMaker as well) and provide some compilation configurations: you want to compile for a certain target, or compile a certain model with a certain input, and also things like where you want to store your compiled model in an S3 bucket. With these configurations, you can click on the compilation button to kick off the compilation job. After the
compilation is done, you get the artifacts. For TVM, we have three artifacts: one is an SO file, one is the params file, and another is a graph JSON. We take these three files. Of course, for other models like [INAUDIBLE], we may have different artifacts. A rough sketch of producing the TVM artifacts is below.
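A minimal sketch, assuming a recent TVM release (API names have shifted between versions), of producing the three TVM deployment artifacts mentioned here for a toy model; file names are illustrative:

```python
import numpy as np
import tvm
from tvm import relay

x = relay.var("x", shape=(1, 64), dtype="float32")
w = relay.var("w", shape=(32, 64), dtype="float32")
mod = tvm.IRModule.from_expr(relay.Function([x, w], relay.nn.dense(x, w)))
params = {"w": np.random.rand(32, 64).astype("float32")}

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)

lib.export_library("deploy.so")                           # compiled operators (SO file)
with open("deploy.json", "w") as f:                       # graph structure
    f.write(lib.get_graph_json())
with open("deploy.params", "wb") as f:                    # serialized weights
    f.write(tvm.runtime.save_param_dict(lib.get_params()))
```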
Then we save them into the S3 bucket you specified before. And later on, the customers can directly
download these compiled models. And they can take them to
deploy by themselves anywhere on the specified target. You can also launch endpoints to initiate the [INAUDIBLE] image for inference in the cloud. So this is the major compilation flow. Of course, there is also a lot of related work around EC2 instances that we're not going to talk about here. SageMaker Neo [INAUDIBLE] TVM, and we also contribute our code back to the community. Neo is an open-source project that is designed to be a multivendor project, with every participating organization granted ownership in the project's governance. We look forward to working with device vendors, hardware processor makers, and academic labs to have them bring their own compilation [INAUDIBLE] and runtimes into TVM. Sitting behind Neo there's a Deep Learning Runtime, called DLR, which is also open source. It has a bunch of runtimes [INAUDIBLE]: we have the TVM runtime, we have Treelite for XGBoost, and we have some other runtimes like TFLite, so that we can get better model coverage. We are also talking to other vendors and accelerator makers about bringing their own runtimes into DLR, so that Neo's runtime can cover even more models. So there's another way that
you can bring your runtime to TVM, and to Neo, and this is something we encourage. We encourage users and hardware vendors to bring their own CodeGen tools to TVM so that we can have their compiler on the TVM stack, and then have TVM generate artifacts that are compatible with the TVM runtime, so we can initialize and launch them later. This is pretty convenient for users and customers. That brings us to this topic:
bring your own CodeGen to TVM. This work is still being actively worked on; Cody and I are working on it, and Jared also had a prototype in the very beginning stages. So let's take a look
at why we want this. So suppose you are a vendor, and
you are making some processors or accelerators. And these accelerators,
they work very well on some popular models or popular operators, like Conv2D, MaxPool, ReLU, that kind of operator. And even with that limited set of operators, you may have covered a large body of models, like image classification models: ResNet-50 and things like that. But more models are emerging every day, and you will probably find there are models you cannot support, because some of their operators are not supported by your hardware. For example, if you come
to SSD models, you may have some operators like
non-maximum suppression, operators that have control flow, with some complicated control logic in them. These are things your device cannot support. You could do something very expensive, like redesign your hardware, or add some special handling for those operators to make them work, but that is very costly. To this end, TVM is going to be your good friend, because we can ask TVM to do some fancy work for you. You can ask TVM to split your whole graph, a Relay program, into segments. Some of them you can just offload to your hardware or device, and you can leave TVM to do code generation for the parts that you cannot handle. So essentially, TVM takes care of the operators or the parts of the network that you cannot support, and the backbone network, like ResNet-50, still runs on your device. So this is the high-level
idea of why we need this. Now let's take a look at what TVM looks like after we integrate this into the TVM stack. At the top of this
diagram, you will see we still have the Relay program, converted from a front-end framework by a parser. But now, instead of running optimization and compilation directly, we add an extra step, which is to annotate your program and partition it into different segments. These two passes can be transparent to users: they can just provide a list of operators using a template we're going to provide. Users can also do more complicated work, like annotating the program themselves by writing their own passes, which is a little more involved. But anyway, by doing this you get a partitioned program, and then you can send that partitioned program through optimization and compilation as the usual TVM process would. A rough sketch of what the annotate-and-partition flow looks like is below.
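The work described in this talk was still in progress at the time; as a rough illustration of the flow that later landed upstream, this sketch uses TVM's annotate-and-partition passes with a hypothetical external backend named "my_accel" (the exact callback signature varies across TVM versions):

```python
import tvm
from tvm import relay

# Declare which operators the hypothetical "my_accel" backend supports.
# (In some TVM versions this callback receives the call expression; in
# older ones it receives attrs and args; check your version's docs.)
@tvm.ir.register_op_attr("nn.conv2d", "target.my_accel")
def _conv2d_supported(expr):
    return True

data = relay.var("data", shape=(1, 3, 224, 224))
weight = relay.var("weight", shape=(16, 3, 3, 3))
out = relay.nn.relu(relay.nn.conv2d(data, weight, padding=(1, 1)))
mod = tvm.IRModule.from_expr(relay.Function([data, weight], out))

# Annotate supported ops, merge adjacent regions, and partition the graph;
# the conv2d ends up in an external function tagged for "my_accel", while
# everything else stays on TVM's own code generation path.
mod = relay.transform.AnnotateTarget("my_accel")(mod)
mod = relay.transform.MergeCompilerRegions()(mod)
mod = relay.transform.PartitionGraph()(mod)
print(mod)
```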
Then, at the end of compilation or optimization, we have to invoke the different CodeGen tools. Previously we only had [INAUDIBLE] in TVM; now you can bring your own CodeGen at this layer. We can then largely reuse existing TVM code to do serialization and related work for you, which also eases your job here. And then after
this CodeGen stage, we can take the resulting blob and send it to the TVM runtime for execution. Sometimes you may need your own runtime, for example TensorRT or DNNL, which have their own execution engines, but we can make them compatible with TVM so they can be invoked directly from the TVM runtime. And sometimes you don't even need a runtime engine; you can probably just have a DSO or a C wrapper, which can be linked or compiled together with the TVM runtime. So that's very convenient. So that's what you are
going to do, essentially. Because we take care of the front-end part, the partitioning and annotation, for you, most of the work you need to do is to integrate your CodeGen tool and runtime into TVM. It depends on your application or use case. Sometimes, if you have an engine, as I mentioned before, you need to generate an artifact that can be lowered or saved through a TVM runtime module, and then we can invoke it directly later on. Or, if you don't have an engine, you can just produce C source code that wraps your API, and that is compatible with the TVM CodeGen. The good thing here is that you don't need to expose your IP to anyone; you just need to provide some API wrappers to access and invoke it. And for the runtime, most likely we can reuse most of the TVM runtime logic, and you only need to customize a small piece. For example, if you have a JSON file, you want an interpreter for that JSON, which is pretty simple to write and also a pretty thin layer. Then you can invoke it directly through packed functions. So that's the other work
we've been doing in TVM, which is the pass manager. We know that Relay features a bunch, not even a bunch, probably [INAUDIBLE], of optimization passes, and how to orchestrate these passes and systematically manage them is quite challenging, because otherwise a pass developer would have to spend a lot of time figuring out how to apply them correctly. This motivates us to have a systematic way to manage these passes in TVM, or in Relay, and this could later be extended to the unified TVM IR. So we take some ideas
from traditional compilers and also deep
learning frameworks. A traditional compiler like LLVM also has a pass manager. The job of the pass manager is the following: it makes the pass developer's life a little bit easier, because the pass writer can just focus on the skeleton of the pass and rely on the pass manager to apply it at each granularity of the program. The pass manager also maintains pass information, like the dependencies between passes, so that dependencies can be automatically resolved, and it keeps some information up to date so that you don't need to reapply passes if the program state hasn't changed. In deep learning frameworks we have similar ideas: PyTorch and Keras have Sequential, and Gluon has Block. These constructs allow the developers
to stack a bunch of layers together. And then you can use a bunch of
them to build up a large model. So this allows a very
flexible, customized pipeline. So we take ideas from both and build a pass manager in Relay, so that we get the advantages of both the traditional compiler and the deep learning framework constructs like Sequential. And this is how it works. We have a Relay program converted from a front-end parser, and then a compilation flow: we take this program through optimization using the pass manager. Depending on what you want to optimize, there are different granularities you can work at. The first one is
some optimizations that probably need some
global information from the whole module. Like you want to add
a program or like you want to delete the
program into module. For example, we have some passes
like Lambda lifting or inline. This kind of
passes, you may need the whole global information. Or you probably just
want to add or remove some expressions
in the function. In this case, you just
need the local information in the program– in the function. And then you can rely
on this– you can just have a scattering of this pass. And you can rely
on the pass manager to apply it for each
pass in the whole module. So some of the examples are
like fusion, cost funding, [INAUDIBLE]. And built upon these two passes
or these two types of passes, we can have sequential
passes, which is going help you to apply
on a sequence of passes. And this is going to
help you to optimize your own customized pipeline. And we can have an example here. yes. And with these
optimized passes, you can then apply the
optimized program and take it to
the Relay runtime. So this is the example. You can have different passes. And you can use different
configuration file– configurations to apply for it. And you will have
We also have a tutorial for this; please go check it out. The following are the remaining talks from Amazon. Haichen is going to talk
about some dynamic execution and the virtual machine. And Yao is going to talk
about dynamic models with graph dispatching. And also, Cody is going
to share some ideas about improving AutoTVM
efficiency by schedule sharing. And Yuwei is going to talk
about optimizing sparse and graph kernels through TVM. And there are some takeaways. First, industry needs an open, standard compiler for deep learning, and Amazon is actively working on the TVM stack for this. We are eager to collaborate with the community, so you can talk to us and grab some of us; they're sitting over there, and we have around 10 people here and probably over there as well. And last but not least, we're hiring, again. Please just write to Yida, me, or our manager, Vin. OK. That's the [INAUDIBLE] outline. LUIS CEZE: Thank you. ZHI CHEN: With
that, [INAUDIBLE].. [APPLAUSE] LUIS CEZE: Do you have any
questions for Yida or Zhi? [INAUDIBLE] questions? No? OK. All right. So [INAUDIBLE] see somebody [INAUDIBLE] after. You have one more minute to ask questions while we [INAUDIBLE] a context switch here. ZHI CHEN: Or you can just come to us later [INAUDIBLE]. LUIS CEZE: OK. All right. Thank you guys again. All right. Cool. [APPLAUSE] All right. Now we go from everything yellow with AWS to everything blue with Facebook. Thank you for coming,
Andrew and Bram. BRAM WASTI: Wonderful. LUIS CEZE: This is
the new description. So [INAUDIBLE]. Everything OK there? BRAM WASTI: It looks good. LUIS CEZE: All right. [INAUDIBLE] here. And make sure you guys use the
microphone so we can– it gets recorded. BRAM WASTI: OK. OK. LUIS CEZE: All good? ANDREW TULLOCH: This
one [INAUDIBLE].. LUIS CEZE: Both
of them are wired. Thank you guys
for coming, again. All right. Excellent. ANDREW TULLOCH: Hi everyone. It’s really great to be here. As a lot of the other presenters
have said, in the past year, there’s just been this huge
growth in the TVM community. And it’s really exciting. So at Facebook, we’ve
been contributing to TVM for the past
year and a half or so. And it’s just been a
really awesome experience. So I’m presenting here
with my colleague, Bram. And I’m going to talk
about TVM for machine learning systems at
Facebook and talk about an end-to-end example
of how TVM and model system codesign was really helpful in
shipping one of our new speech synthesis models. Bram’s going to talk
about sparsity and modern deep learning models
and talk about PyTorch, [INAUDIBLE],, and the TVM
integration we’ve done there. Later this afternoon, Ansha
and Hao, my other colleagues from Facebook, are going
to talk about applying TVM in our ads ranking stack. So with that, I want to give
some context for how at least I and a couple others think
about TVM for machine learning systems. So broadly speaking,
five years ago, when I started working on deep learning, we were using tools like Theano or Caffe or Lua Torch. These systems were a lot simpler and probably a lot easier to use, I think, or at least a lot easier to implement. We mostly cared about these old Tesla GPUs, the Kepler generation. We had fairly static and limited control flow and computational patterns. And performance was pretty far from the roofline, as the jumps we got from [INAUDIBLE] and [INAUDIBLE] and so on on existing hardware [INAUDIBLE]. There were lots of easy Pareto-efficient improvements we could make to our models and our systems. But the world looks
really different today. We have so many more
applications and so many more domains we care about. The low-hanging fruits
in our compiler stacks have been increasingly
been picked. We have so many more targets. We just heard about a 2
petaflop system from AWS. And we’re also hearing about and
using these very low power 100 megaflop and 10 megaflop
embedded devices. And so as our applications
become even more pervasive and real-time– everything
from CV to ASR to MT to TTS and so on– our models are also
becoming far more demanding of our frameworks. Through tools like
architecture search and so on, we really directly optimize
for the machine roofline in a lot of cases. And so achieving roofline for
a wide diversity of models matters a lot more. And so that’s why I think
a really key feature of TVM in the compiler stack– echoing what we’ve
heard before– that achieving peak performance
across a whole range of architectures is
arguably impossible to do without this
compiler technology. So at Facebook
specifically, we still see it as a huge demand for
flexibility, portability, and performance. So taking PyTorch
specifically, one of the main frameworks we use at Facebook: it's far less constrained and more expressive than previous frameworks, which is awesome from a developer-experience standpoint but introduces a lot more
challenges on the ML back end side. So we have conceptually between
500 and 1,000 tensor operations in the PyTorch
Tensor library which makes porting it
to new back ends and enabling new hardware
even more challenging. There’s a huge set of
programs that users construct and a huge
range of platforms that we want to deploy on. And achieving close to
roofline performance for this diversity of
hardware is a real challenge. So I want to go through an
end-to-end example from earlier this year of a production–
a real-world production application where TVM was
a really important part in the whole stack. So I think this helps
demonstrate a bunch of– a nice application
of the TVM stack. Might convince some of
the haters and skeptics in the audience. But it also touches
on some tricks, like sparsity and
so on, which Bram’s going to go into more detail. The domain is speech
synthesis or TTS. So in the last few years,
since [INAUDIBLE] WaveNet, WaveRNN, and so on, these
neural autoregressive models have displaced
previous approaches for state of the art. So the high-level
block diagram looks a bit like this on the right. We have our acoustic model
which generates features from the text. And then we have this
autoregressive net which generates the sample waveform. So the autoregressive
part here means that we have this strict
sequential dependency where the output of the model at
time t is fed back as an input at time t plus 1. So similar to challenges
with scaling up distributed synchronous
SGD, the latency constraint due to the sequential
nature of the problem is a real challenge, particularly
in the speech domain, where we're generally sampling at, say, 24,000 samples a second, which is an implied latency budget of roughly 40 microseconds per sample. For folks familiar with, say, standard ResNets on CPUs or GPUs, this is a couple of orders of magnitude smaller. So taking a WaveRNN-style
model as an example, so this is about a year old. How this works is that we
feed our acoustic features into a stack of RNNs, then
some fully connected layers. And then we sample
from the softmax. So earlier, this team, one
of our groups at Facebook had built an incredible
sounding synthesizer that we really wanted to ship
to replace our existing stack. But the latency
challenges in this model were insane, nearly
two orders of magnitude of where we needed to be. So even a back of the envelope
speed of light analysis, assuming we use
all our arithmetic units on our hardware, we still
wouldn’t be able to get there. So this was a real challenge. But approaching this from
a joint software/hardware modeling perspective
was really powerful. And so I’ll just step through
a couple of layers in the stack where we used TVM. So there’s a couple of obvious
low-hanging fruit here. So the first is getting rid
of the framework overheads. So in our typical machine
learning frameworks, every operator you execute,
every function call has some overhead, which
the framework introduces. This is typically on the order of hundreds of nanoseconds to a microsecond or two. So given this roughly 20-microsecond budget and half a dozen to a dozen ops, we'd be spending our entire budget just on framework overhead. Whole-graph compilation with tools like Relay is a really great answer to this. Given that this problem has entirely static shapes and static data flow, we can almost entirely eliminate that overhead using the existing TVM graph runtime; a rough sketch of that flow is below.
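This is only a sketch of the general idea with a toy model, assuming a recent TVM release (the module was called graph_runtime in older versions): compile a static-shape graph once and invoke the whole thing with a single call, so per-operator framework overhead largely disappears.

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor  # "graph_runtime" in older TVM releases

x = relay.var("x", shape=(1, 128), dtype="float32")
w = relay.var("w", shape=(128, 128), dtype="float32")
y = relay.nn.relu(relay.nn.dense(x, w))
mod = tvm.IRModule.from_expr(relay.Function([x, w], y))
params = {"w": np.random.rand(128, 128).astype("float32")}

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)

m = graph_executor.GraphModule(lib["default"](tvm.cpu(0)))
m.set_input("x", np.random.rand(1, 128).astype("float32"))
m.run()                                  # one call runs the whole compiled, fused graph
print(m.get_output(0).numpy().shape)     # .asnumpy() on older TVM
```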
The other area where whole-graph compilation is incredibly beneficial right off the bat is operator fusion, as we've heard before. Given that this sampling
operates at batch size 1, almost all our operators
are memory-bandwidth bound. So all the fusion examples
are incredibly efficient here. The optimized code generation
for stuff like GEMV is also incredibly helpful here. But this helps 10%, 20%. And there’s a whole
bunch more to go. So the next direction to look in
is this model software codesign in stuff like sparsity. So sparsity is helpful
for a couple of reasons, both important in
this application. So it reduces the number
of FLOPS executed, which given our previous
speed of light analysis makes this problem a
lot more approachable. The second area
where it helps a lot is reducing the cache
footprint of these models since we reduced the size
of our weight matrices which dominate the
cache footprint. Why this helps a
lot in this case is it allows us to keep our
models resident entirely in stuff like L1 or across the
per-core L1 or L2 data caches on our CPUs, which is incredibly
important to avoid expensive reads from L3 or main memory. Generally speaking,
block sparsity and structured and unstructured sparsity are, we think, incredibly important, and they are going to be a really dominant feature. I imagine we'll hear
a lot more about it this afternoon from AWS. And we’ll also hear a lot
more about it next year. It’s becoming– I think it’s
a really big opportunity for a lot of our ML systems
and our ML compilers to take advantage of these. And we can go further
with tools that reduce precision storage
for our weights as well. Once we go through
all this, we end up spending a surprising
amount of time in our transcendental
computation. So normally, a lot
of the intuition we build up in a
bunch of these domains is that nonlinearities
like ReLU, sigmoid, Swish, and so on are
pretty trivial parts of our computation. Once we optimize the
dense math and so on, we end up spending
a lot of time here. So in a lot of
back ends with TVM, we end up invoking external
functions like [INAUDIBLE] and so on for our
transcendental functions. And so what we really want– we spend a lot of time here. These aren’t cleanly vectorized. We might spend a
bunch of time in say high– very high precision
polynomial evaluation or a bunch of conditionals. And so what we really want is to
express this entirely in TVM IR and fuse it– vectorize
it and so on entirely. So with a couple of
modifications to Relay, we’re able to express these
kind of approximations. And not only that, by
expressing them pure in Relay, we can also optimize
the polynomial degrees and coefficients with
tools like [INAUDIBLE] to allow us to trade off
performance accuracy. So putting it all
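This is not the production approximation, just a toy numpy illustration of the idea: fit a short polynomial to sigmoid on a bounded range so the nonlinearity becomes a handful of branch-free multiply-adds that vectorize cleanly once expressed in IR.

```python
import numpy as np

xs = np.linspace(-6.0, 6.0, 2001)
sigmoid = 1.0 / (1.0 + np.exp(-xs))

# Degree-7 least-squares fit over the range the model actually sees.
coeffs = np.polyfit(xs, sigmoid, deg=7)

def sigmoid_poly(x):
    # Horner evaluation: a few fused multiply-adds, no branches or libm calls.
    y = np.zeros_like(x) + coeffs[0]
    for c in coeffs[1:]:
        y = y * x + c
    return np.clip(y, 0.0, 1.0)

print(np.max(np.abs(sigmoid_poly(xs) - sigmoid)))  # max absolute error of the fit
```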
Putting it all together, this is actually a pretty small project in the scheme of things. We build a couple of new Relay primitives for sparsity and so on; we add a couple of lines to Relay and TVM to express these new primitives; we introduce a few knobs into the TVM multithreading runtime; and we use tools like AutoTVM to guide our architecture search. We can do all of this in a couple of days, which makes for an incredibly productive toolchain. The end-to-end results look something like this, where we start out
with this very high latency model. And we can apply whole
graph compilation, apply unstructured sparsity,
then structured sparsity, reduce the precision of our storage, and then approximate our transcendentals. Zooming in on the really relevant part of the graph, we can see how all these optimizations stack together. So this is a really great result for us; we were able to ship it. And not only that, we can do this without a single line of architecture-specific code. We can validate it against alternative approaches, such as hand-implementing the whole thing in C or assembly intrinsics, and we can beat that by substantial margins on held-out architectures, so to speak.
of the portability of TVM is that we get real-time
on mobile for free. So we can apply this with
modified architectures or modified platforms,
reuse 99% of the code. And so this is a
really great case where we can deliver on the
promises of performance, portability, and
generalizability. So with that, I’ll
hand over to Bram to talk about the
use of sparsity in modern deep
learning compilers. Thanks. [APPLAUSE] BRAM WASTI: Hi. I’m Bram. And I’m just going to be
talking about some research ideas and just things that
we’re interested in at Facebook with regard to sparsity. And then I’m going to
switch gears and talk about the PyTorch integration of TVM, from more of a programming-languages angle. So sparsity has
been around forever. L1 regularization
naturally leads to this. It’s almost an artifact
that you get for free. But in the past, I think it
was mostly used for accuracy. You weren’t really getting
any performance opportunities from it. But what we found is that
during inference time, sparsity can actually really,
really help your model. There are more complex loss terms you can embed into your model to encourage it, and this has been explored with different optimization techniques going back a couple of years, with some success for convolutional neural nets. There are also some interesting
papers coming out these days about sparsity that are
a little more complicated than just get as many zeros
in the model as possible. There’s this lottery
ticket hypothesis coming from, I think,
MIT, which basically states that using standard
pruning techniques, you can get subnetworks
that are just better. You can then reset the weights back to their original values and retrain the model with just that subnetwork, and it trains equally as fast, if not faster, which is a really interesting and nonintuitive result. This just indicates that sparsity is pretty much something we should be looking into; you get it almost for free in a lot of these papers. What was found with the lottery ticket hypothesis has been very influential, and we're continuing along
that path at Facebook. But there are other
things that you can do to get sparsity in the model. There are two
factorization techniques I’d like to talk
about, one of which was released by Open AI for
their Sparse Transformers. They basically factorize
the model into– sorry– factorize a dense
layer into two sparse layers. So it’s like a two-step process. When you multiply by
one sparse matrix, then another sparse matrix,
you effectively recover all of the
connections from that original dense matrix. There’s obviously a whole
bunch of different types of factorizations. You can explore this link
to get more information on exactly what these two are. But this has been
really influential work when it comes to
Sparse Transformers because you can
save a lot of FLOPS by having these now
easily optimized memory-bound operations. Another interesting bit of
factorization techniques that came out is butterfly matrices. These approximate different types of transformations through sequential applications of sparse matrices. They use the canonical example of FFTs: they interpret the FFT as a recursive algorithm and factor it into a repeated application of sparse matrices. They then go on to say, hey,
if you can do this with FFTs, you can probably do this with
[INAUDIBLE] transformations. And it’s a really nice
write-up that I highly recommend you check
out because you get all of these sparse
matrices, which can then be used at inference
time later or maybe even during training time. And this is the stuff we’re
interested in at Facebook. So PyTorch has support
for a lot of this stuff. We do have a pruning API, as well as a tutorial that hasn't quite landed yet, but there is a pull request, and I urge you to check these out. There's a whole bunch of techniques that you don't need
to rebuild yourself. There are random, L1, and even arbitrary-Ln techniques that you can use to get zeros into your models. But there are also types of structure that you can induce in the model. Having zeros placed randomly is not always that useful when it comes to performance. You can instead have zeros on certain channels, so that a channel gets eliminated entirely and you can just run things faster. Or you can have blocks of values, as we saw with Andrew's work; block sparsity lets you get pretty substantial wins. You can also use your own custom masks. All of this pruning work was done by Michela Paganini at Facebook. A minimal sketch of the pruning API is below.
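A minimal sketch of PyTorch's pruning utilities (torch.nn.utils.prune); the layer sizes, amounts, and mask pattern are made up for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 256)
# Unstructured: zero out 90% of the weights with the smallest L1 magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.9)

# Structured: remove 50% of output channels (rows), ranked by L2 norm.
layer2 = nn.Linear(256, 256)
prune.ln_structured(layer2, name="weight", amount=0.5, n=2, dim=0)

# Custom mask: force whatever block pattern you want.
mask = torch.ones(256, 256)
mask[:, 128:] = 0
layer3 = nn.Linear(256, 256)
prune.custom_from_mask(layer3, name="weight", mask=mask)

print((layer.weight == 0).float().mean())  # roughly 0.9 of the entries are now zero
```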
So let's talk about performance. Here we can see some work done by Aleks Zi and Jongsoo Park. It's currently written in asmjit; it doesn't need to be, we could use TVM. This is just a deep dive into why we're getting performance and how that is
actually achieved. What we do is embed the weights directly into the code. As you can see in the assembly here, not everything is being fmadd-ed; there's an almost random-looking structure of fmadds and broadcasts. That's because anything that would multiply out to a zero can simply be skipped: you unroll the loops entirely, you eliminate a whole bunch of fmadds and loads, and you get perf from that. A toy sketch of the skip-the-zero-blocks idea is below.
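A toy numpy sketch (not the asmjit kernel) of why block sparsity wins: only nonzero blocks of 8 consecutive weights along the reduction dimension are stored and multiplied, so whole runs of loads and multiply-adds disappear while the result stays the same.

```python
import numpy as np

BLOCK = 8
rng = np.random.default_rng(0)

dense = rng.standard_normal((256, 256)).astype(np.float32)
mask = rng.random((256, 256 // BLOCK)) < 0.2        # keep ~20% of the blocks
dense *= np.repeat(mask, BLOCK, axis=1)              # zero out the rest

# "Compile" the sparse weight: keep only the surviving blocks for each row.
blocks = [[(j, dense[i, j * BLOCK:(j + 1) * BLOCK])
           for j in range(256 // BLOCK) if mask[i, j]]
          for i in range(256)]

def sparse_gemv(x):
    y = np.zeros(256, dtype=np.float32)
    for i, row in enumerate(blocks):
        for j, w in row:                              # only ~20% of blocks are touched
            y[i] += w @ x[j * BLOCK:(j + 1) * BLOCK]
    return y

x = rng.standard_normal(256).astype(np.float32)
print(np.allclose(sparse_gemv(x), dense @ x, atol=1e-4))  # same result as the dense gemv
```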
Skipping all of these loads and multiply-adds actually produced some really
interesting results. We saw that with 90%
unstructured sparsity, we get a 2.3x win for a batch
size 1, 256 by 256 [INAUDIBLE].. This is actually
a very common size that we were using in a model. So this was really,
really nice to see. But if we reduce the
sparsity to only 80% and add a bit of structure, we
can get substantially better results. This comes from the fact that
we’re just loading vectors. And if those loads
are contiguous, you can skip more
contiguous zeros. It’s pretty obvious stuff
when you think about it. But not everyone is
going to necessarily have that full picture
of, how do you actually train a model to do this? That actually turns
out to be quite hard. So there’s a lot of
interactions that need to happen
between researchers and then also the folks who are
targeting certain architectures and know that a width of 8 is
preferable to a width of 4. Or maybe even a width
of 16 is better. We saw that, actually, you
get super-linear results here, which is kind of
cool, because it's one-fifth the number of weights. But because you're now basically in L1 cache, you can do a bit better. We see 70 effective
gigaflops over the 11 that we were getting
earlier, which, I mean, this is on some random CPUs. So these numbers are not
really representative. But the relative scale
here is pretty impressive, which is that it’s six times
faster basically for free if you can get a model
trained that way. So again, PyTorch has
training mechanisms that are kind of plug and play. You can drop these in
and train sparse models. So I'd recommend checking that out and then using techniques like this to get the inference speed that you want. So concluding on this
idea of sparsity, there’s a new interface
or interaction that is starting to happen
between model developers and then the folks who
are writing compilers. Sparsity is pretty
easy to achieve, and you get the performance. But model developers
aren’t going to necessarily know how much performance. And this is a new concept. The visibility into
what is good or what is bad when training a model
is now weight-dependent rather than just the sizes. So tuning various
hyperparameters there might not actually
do much for you. Instead, you’re going to have to
either figure out regularizers that naturally lead
to this block sparsity or just manually embed
the block sparsity and see if you can train
the model from there. This kind of visibility can
be solved in a bunch of ways. And there are a lot
of ideas about this. Potentially, you
can just– putting the performance
in a loss function would be kind of interesting. But this is something
that I think needs to just happen more frequently. It’s like an interaction
between folks who are more interested in the
research of training models and then talking to and really
understanding the implications of those models that
are being trained and how fast they
can potentially run during inference time. So this is the new challenge I
think that we’re definitely facing at Facebook. So shifting gears
a bit, I’d like to talk about how PyTorch
and TVM are being integrated. So this is now– I’m going to jump more over
to programming languages and how we’re doing things at
the PyTorch flexibility level and still being able to leverage
a lot of the performance wins we get in TVM. So this repository
here at pytorch/tvm lowers TorchScript
graphs into Relay. A lot of the work here is done
by Kimish, Lingyi, Wanchao, and Yinghai. A bunch of others
also have contributed. And you’ll probably want
to check out this blog if you’re really
interested, which has a little write-up
on how this was done. We can see the results
are really good. This is a bit old. It’s using ResNets,
and it’s just obvious that if you drop into a
substantially faster back end that is compiling
things, you get wins. And the way we did this
was by using TorchScript. TorchScript is the static interface to PyTorch, where you lower Python directly into an IR that can be manipulated; a tiny example is below.
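A tiny sketch of TorchScript: scripting a Python function, control flow included, into an IR you can inspect and manipulate. The function itself is made up for illustration; graphs like this are what pytorch/tvm lowers into Relay.

```python
import torch

@torch.jit.script
def step(x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
    y = torch.tanh(x @ h.t())
    if bool(y.sum() > 0):       # data-dependent control flow survives scripting
        y = torch.relu(y)
    return y

print(step.graph)  # the TorchScript IR for this function
```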
But even though you can lower things into an IR that can be manipulated, that doesn't mean you can optimize it very easily. It's not fun at all. The workloads being run
are not too complicated. You just need maybe a bit
of control flow, which is always a great thing to see. But when you have
a language that can change the
types of variables and then modify its
own functions as they run on the fly, it’s really
hard to prove things and then optimize them. So TorchScript was developed
to have this portability, but we are trying to get performance from it. We have two techniques that we're using that I'd like to talk about, by which you can go from this generic Python runtime to a very fast, much more optimizable interface that we can then lower into Relay, and then let Relay and TVM do all their crazy magic to get performance. The first way, which I think a lot of folks have concluded is an interesting and easy way to get this going, is the lazy tensor. Instead of executing operations immediately, you just accumulate either a tape or a graph of the operations and execute them as late as possible. At some point you want to execute them; you now have a whole tape or graph, which can be converted into some IR, lowered, and compiled. And if you don't want to compile it again, because it's going to be pretty much the same workload every time, you just look it up in a cache. A toy sketch of the idea is below.
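This is a toy sketch of the lazy-tensor idea, not Facebook's implementation: record operations on a tape instead of executing them, "compile" the whole tape at materialization time, and cache the compiled result keyed by the trace.

```python
from typing import Callable, Dict, List, Tuple

Tape = Tuple[Tuple[str, Tuple[str, ...]], ...]
_cache: Dict[Tape, Callable[[], str]] = {}

class LazyTensor:
    def __init__(self, tape: List, name: str):
        self.tape, self.name = tape, name

    def add(self, other: "LazyTensor") -> "LazyTensor":
        out = LazyTensor(self.tape, "t%d" % len(self.tape))
        self.tape.append(("add", (self.name, other.name, out.name)))
        return out                       # nothing executed yet, only recorded

def compile_tape(tape: Tape) -> Callable[[], str]:
    # Stand-in for "convert to IR, lower, compile"; here we just pretty-print.
    program = "; ".join("%s = add(%s, %s)" % (o, a, b) for _, (a, b, o) in tape)
    return lambda: program

def materialize(t: LazyTensor) -> str:
    key = tuple(t.tape)                  # the trace identifies the workload
    if key not in _cache:                # compile only on a cache miss
        _cache[key] = compile_tape(key)
    return _cache[key]()                 # otherwise reuse the compiled artifact

tape: List = []
a, b = LazyTensor(tape, "a"), LazyTensor(tape, "b")
print(materialize(a.add(b).add(a)))      # two recorded adds, compiled once
```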
This has proved to be pretty useful. We have a PR up, not landed yet, that doesn't really show all the perf wins because a lot of this stuff is internal, but it does work, and you can use it if you are interested. There are a lot of limitations, though. If you have very weird
dynamic graphs that have a ton of control
flow, lazy tensors will not capture any
of those semantics. You’re always going
to be recompiling. And you’re not even getting
a perf win at that point. You’re just slowing
everything down. Similarly, if you’re only
compiling some of the time, you’re going to
randomly hit cliffs where, oop, your workload
hasn't been compiled before, and now everything is slow. So sudden changes and various interesting deployment details often make lazy tensors not feasible. So instead, we thought of the
idea of a profiling executor. Sorry for the
graphic on the side not being very informative
or even readable. But basically,
what we’re showing is an IR that has
inserted guard nodes. And I’ll talk about
that in a sec. The profiling executor executes everything immediately, as you normally would; it's just like PyTorch in terms of its interface, but it's recording everything. It accumulates statistics about shapes, constants, dtypes, even the variability of the shapes. So if the leading
dimension changes over the course of the
program, it’ll record that. If the tail dimensions
aren’t changing, it’ll say, hey, this
is pretty much static. This information is super-useful, because you can then insert guard nodes into the IR that say: I've assumed that most of the time this leading dimension is going to change, while this third dimension is always going to remain static, and the dtype is going to be, say, float (not constant, or else the first dimension wouldn't be changing). And if that assumption isn't true, fall back into the interpreter, do the slow path, whatever; that's OK. But in the case where it is true, we can now optimize a subgraph, and this subgraph can be handed off to Relay, optimized a crazy amount, and then used far more often than everything else. A toy sketch of the guard-and-specialize idea is below.
What we've now isolated is a very common workload that we can then optimize with control flow and a whole bunch of other things, while still keeping the flexibility and maintainability that PyTorch offers developers: in the worst case, you fall back into an interpreter that can still do everything it used to. With Python, we found that this
is like one of the best ways to actually get perf for the
things that we care about. We don’t want to
hinder researchers who are developing really
complicated models and changing functions on the fly. We just want to let
them do their thing and still, in the common case,
get a lot of performance. There are some downsides. You need to run
a couple of times before you get performance. So when it comes
to deployment, you may want to do like
crazy caching things or ship the precompiled
version or even ship the pretransformed version. And then the IR itself
is very complicated. If you start dumping these
things as we can see, there’s a lot of guard nodes. There’s a whole bunch of stuff
that you need to analyze. So if you still want to do
whole-graph optimization, it becomes very, very difficult, though not infeasible. Cool. So, a couple of next steps. We're really excited about
the performance of TVM. We've been using it internally for a whole bunch of different things. We're working to more tightly integrate PyTorch and TVM. And I think this is going to
be a big thing going forward, like TVM is clearly
the way to go. It’s got a whole bunch of
different abstractions that are– especially the
ones based on Halide– very useful and convenient. You can get things
done very quickly. As Andrew mentioned
earlier, it only takes a week to get
massive performance wins. So we’re really
excited about that. And I’d like to give a big
thanks to the community, Andrew as well, and I guess open
up for some questions for the next five minutes. [APPLAUSE] LUIS CEZE: Question
before we break? AUDIENCE: Do you think we’re
headed for a huge [INAUDIBLE]? So sparsity seems to
be very effective. Do you think we’re heading
for a future where you would– down the line, you would never
do anything that’s not sparse? BRAM WASTI: So you
never do anything that– AUDIENCE: Do you
think we’re heading for– headed for a future
where we would never do anything that’s not sparse? ANDREW TULLOCH: I
wouldn't say that. I mean, today quantization is very effective in almost all cases, but still, it's not like everything uses it [INAUDIBLE], not like we always use quantized models everywhere, or that all our hardware only supports quantized models. So I'd say it's more like where quantization was in 2015 or so, before a bunch of our back ends and a bunch of our hardware started supporting it as easily. I think that's more the analogy. BRAM WASTI: Yeah. I think when it comes to
even defining sparsity, I like to think of it more as being zero-aware. I think in the future everything will be zero-aware, but not everything is going to have zeros. AUDIENCE: Hi. I noticed you
didn't mention Glow. And I'm just curious, you know, not to, [LAUGHTER] not to be religious about it, but how do TVM and Glow
fit into the ecosystem? And how are they different
in your perspective? [LAUGHTER] BRAM WASTI: TVM targets
so many back ends. Glow has been like an
accelerator/compiler. So I think, especially
the work that Andrew and I have been doing is largely: we've optimized something on x86, and then folks are like, hey, we want to ship this to phones. Then, boom, we need to have it on ARM now. TVM fills that gap
very, very well for us. So that’s kind of how
it’s been filling out the role at Facebook, I think? ANDREW TULLOCH:
Yeah, we'll probably have more to announce on that. Sorry, we'll have more to announce on that. But yeah, we're sort of converging those two, which is a net benefit. AUDIENCE: A question
about how we now see more and more deep learning frameworks, such as PyTorch and MXNet and even TensorFlow, trying to align with the concept of a compiler like TVM. How do you see the concept of a unified model representation like ONNX: is it still relevant, or even contributing to the TVM concept of compiling deep learning models? Thank you. BRAM WASTI: Yeah, it's
an interesting question. I think the ideas are
somewhat orthogonal. ONNX is more of a serialization format, and I think it is still definitely compatible with these ideas. I don't have much experience dealing with ONNX or interacting with it, but I think when it comes to a unified representation that can be shipped around, it is still a very useful tool. AUDIENCE: So the integration
has support for training? BRAM WASTI: There is a PR out that you can check out that adds this support. It is a work in progress. But because of the various ways you can now extract IR from PyTorch, using LazyTensor as an example, you basically just get a compute graph, even if it is a training compute graph, and all of that can be run by TVM. LUIS CEZE: [INAUDIBLE]
one more question here. AUDIENCE: I think sparsity
is very interesting. But modern networks use depthwise convolution and pointwise convolution, and those are very difficult to regularize with L1 regularization or [INAUDIBLE] regularization. Have you run any performance experiments to see what gain can be achieved on those networks with depthwise convolution and pointwise convolution? ANDREW TULLOCH: So I think
a paper worth checking out, if you haven't, was worked on by one of our old colleagues, who is now at Google. It's called something like Efficient Sparse ConvNets, submitted to ICLR this year (oop, next year), and it demonstrates, in the case of, say, MobileNet V2, getting to 80% or 90% sparsity on the dense layers and achieving something like a factor of two or three reduction in effective FLOPS, or effective performance, on x86. So it's definitely achievable on traditional depthwise-separable factorized models, like MobileNet V1 and V2 and V3. So is that responsive, or– OK. Thanks. LUIS CEZE: All right. Let's thank our speakers again. ANDREW TULLOCH: Thank you. [APPLAUSE]
