A Quick Look: An Operator's Journey Through a Deep Learning Framework

Source: CSDN blog | 2022-06-17 08:47:33

Written by | Zhao Luyang

An operator ("op" for short) is a basic operation in deep learning. Every deep learning framework contains hundreds of ops, which carry out all kinds of numerical and tensor computations.

In deep learning, we build networks by stacking building blocks such as nn.Module, while ops are the more fundamental recipes and raw materials from which those blocks are made.


For example, consider the following demo network:

import oneflow as torch

class TinyModel(torch.nn.Module):
    def __init__(self):
        super(TinyModel, self).__init__()
        self.linear1 = torch.nn.Linear(100, 200)
        self.activation = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(200, 10)
        self.softmax = torch.nn.Softmax()

    def forward(self, x):
        x = self.linear1(x)
        x = self.activation(x)
        x = self.linear2(x)
        x = self.softmax(x)
        return x

tinymodel = TinyModel()
print("The model:")
print(tinymodel)

Structurally, this network is assembled from nn.Module blocks such as Linear, ReLU, and Softmax. At a lower level, however, these nn.Module blocks are themselves pieced together from individual basic ops (here including matmul, relu, and softmax), and that is what actually implements the functionality. So for an existing op in OneFlow, how does a call travel from the Python layer to the C++ layer, get routed through the framework, and finally execute? This article takes

output = flow.relu(input)

as an example and walks through the complete journey of an op from Python to C++ execution.

First, an overview of the flow: binding -> functor -> dispatch -> interpreter -> VM -> kernel compute.

The sections below trace each stage in detail at the source-code level.

1

Binding

Here, binding refers to binding Python and C++ code together. We usually build networks, train models, and perform all kinds of operations by calling functions from Python. In reality, those functions are often just a thin wrapper at the Python layer, and the underlying implementation is written in C++. So how does the Python -> C++ call happen? That is what Python/C++ binding is for.

In deep learning framework implementations, function bindings can be written either with Python's native C API or with pybind11; OneFlow uses both. For example, the tensor.xxx methods involved in

oneflow/api/python/framework/tensor.cpp

oneflow/api/python/framework/tensor_functions.cpp

are bound through the Python C API, while the many flow.xxx methods defined in

oneflow/core/functional/functional_api.yaml

are bound through pybind11. We will not go into the Python C API or pybind11 in depth here; refer to their documentation for usage details (a minimal generic pybind11 sketch also follows below):

https://docs.python.org/zh-cn/3.8/c-api/index.html

https://pybind11.readthedocs.io/en/stable/index.html
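To make the binding idea concrete, here is a generic pybind11 sketch (this is not OneFlow's actual binding code; the module and function names are made up for illustration). It exposes a C++ function to Python so that, after building the extension, mymod.relu_scalar(-1.0) calls straight into C++:

#include <pybind11/pybind11.h>

// A trivial C++ function we want to call from Python.
double relu_scalar(double x) { return x > 0.0 ? x : 0.0; }

// Defines a Python extension module named "mymod" with one bound function.
PYBIND11_MODULE(mymod, m) {
  m.def("relu_scalar", &relu_scalar, "Scalar relu implemented in C++");
}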

Now let's return to flow.relu. The flow.relu we call at the Python layer actually invokes oneflow._C.relu, which is defined in

python/oneflow/__init__.py

The _C prefix indicates that the implementation lives in the underlying C++. Similar to PyTorch, OneFlow also defines a set of interface-export and code-generation rules based on .yaml files. For example, in functional_api.yaml we can see the function signature of the exported relu interface:

- name: "relu" signature: "tensor (tensor x, bool inplace=false) => relu" bind_python: true

From the yaml definition we can see that flow._C.relu takes two parameters, a tensor and a bool, is bound to the C++ Relu method, and returns a tensor. In fact, when OneFlow is compiled, the script

tools/functional/generate_functional_api.py

is executed to parse functional_api.yaml and generate code, dynamically producing the C++ .h and .cpp files

build/oneflow/core/functional/functional_api.yaml.h

build/oneflow/core/functional/functional_api.yaml.cpp

and the generated .cpp file calls the corresponding functor to complete the function call at the C++ level. Again taking flow._C.relu as an example, its functor is defined in oneflow/core/functional/impl/activation_functor.cpp:

class ReluFunctor {
 public:
  ReluFunctor() { op_ = CHECK_JUST(one::OpBuilder("relu").Input("x", 1).Output("y", 1).Build()); }
  Maybe<Tensor> operator()(const std::shared_ptr<Tensor>& x, bool inplace) const {
    ...
  }

 private:
  std::shared_ptr<OpExpr> op_;
};

ReluFunctor is registered via

ONEFLOW_FUNCTION_LIBRARY(m) {
  m.add_functor<impl::ReluFunctor>("Relu");
  ...
}

which registers the functor as a functional interface. Once registered, flow._C.relu at the Python layer is bound to "Relu". The same function can also be called directly from C++ via functional::Relu, as sketched below.
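A minimal sketch of that direct C++ usage (not taken from the source; the header path is an assumption, the code is assumed to sit inside namespace oneflow, and the signature follows the yaml export "Tensor (Tensor x, Bool inplace=False) => Relu"):

#include "oneflow/core/functional/functional.h"

// Applies relu twice, out of place; one::functional::Relu behaves like an ordinary C++
// function returning Maybe<Tensor>, so errors propagate through JUST().
Maybe<one::Tensor> ReluTwice(const std::shared_ptr<one::Tensor>& x) {
  std::shared_ptr<one::Tensor> y = JUST(one::functional::Relu(x, /*inplace=*/false));
  return one::functional::Relu(y, /*inplace=*/false);
}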

2

Functor

The functor is not only the core of the Python -> C++ interaction, it is also the first stop for op invocation and for inferring and checking input arguments. At the functor layer, an op typically performs various checks on the input tensors (shape, dtype, number of dimensions, element count, and so on) and parses and handles any op-specific logic. The relu functor code is as follows:

class ReluFunctor {
 public:
  ReluFunctor() { op_ = CHECK_JUST(one::OpBuilder("relu").Input("x", 1).Output("y", 1).Build()); }
  Maybe<Tensor> operator()(const std::shared_ptr<Tensor>& x, bool inplace) const {
    if (inplace) {
      JUST(CheckInplaceValid(x));
      std::shared_ptr<TensorTuple> outputs = std::make_shared<TensorTuple>(1);
      outputs->at(0) = x;
      JUST(OpInterpUtil::Dispatch(*op_, {x}, outputs.get(), AttrMap{}));
      return outputs->at(0);
    } else {
      return OpInterpUtil::Dispatch<Tensor>(*op_, {x});
    }
  }

 private:
  std::shared_ptr<OpExpr> op_;
};

As you can see, ReluFunctor is fairly simple. It defines a private member

std::shared_ptr<OpExpr> op_;

This op_ is the relu op to be executed, constructed via OpBuilder. Inside the functor's operator(), execution takes one of two branches depending on whether inplace was requested, and ultimately OpInterpUtil::Dispatch() hands the op, the input tensors, and the attributes over to the interpreter.
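To show the same pattern with an op attribute involved, here is a hypothetical functor for a made-up op "my_scaled_relu" with an "alpha" attribute. This op does not exist in OneFlow; the MutableAttrMap::SetAttr usage mirrors what real functors such as LeakyRelu do, but treat the sketch as illustrative rather than authoritative:

class MyScaledReluFunctor {
 public:
  MyScaledReluFunctor() {
    op_ = CHECK_JUST(one::OpBuilder("my_scaled_relu").Input("x", 1).Output("y", 1).Build());
  }
  Maybe<Tensor> operator()(const std::shared_ptr<Tensor>& x, double alpha) const {
    MutableAttrMap attrs;
    JUST(attrs.SetAttr<double>("alpha", alpha));  // the attribute rides along to the interpreter
    return OpInterpUtil::Dispatch<Tensor>(*op_, {x}, attrs);
  }

 private:
  std::shared_ptr<OpExpr> op_;
};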

3

Dispatch

After an op finishes its checks and logic handling in the functor, it is usually dispatched through OpInterpUtil::Dispatch(), whose destination is the interpreter, where the op is processed further. In oneflow/core/framework/op_interpreter/op_interpreter_util.h we can see several overloaded Dispatch templates:

class OpInterpUtil {
 public:
  template<typename T>
  static Maybe<T> Dispatch(const OpExpr& op_expr, const TensorTuple& inputs, const AttrMap& attrs) {
    return Dispatch<T>(op_expr, inputs, OpExprInterpContext(attrs));
  }

  template<typename T>
  static Maybe<T> Dispatch(const OpExpr& op_expr, const TensorTuple& inputs) {
    return Dispatch<T>(op_expr, inputs, OpExprInterpContext(AttrMap{}));
  }

  template<typename T>
  static Maybe<T> Dispatch(const OpExpr& op_expr, const TensorTuple& inputs,
                           const OpExprInterpContext& ctx);

  static Maybe<void> Dispatch(const OpExpr& op_expr, const TensorTuple& inputs,
                              TensorTuple* outputs, const AttrMap& attrs) {
    return Dispatch(op_expr, inputs, outputs, OpExprInterpContext(attrs));
  }

  static Maybe<void> Dispatch(const OpExpr& op_expr, const TensorTuple& inputs,
                              TensorTuple* outputs) {
    return Dispatch(op_expr, inputs, outputs, OpExprInterpContext(AttrMap{}));
  }

  static Maybe<void> Dispatch(const OpExpr& op_expr, const TensorTuple& inputs,
                              TensorTuple* outputs, const OpExprInterpContext& ctx);

These overloads handle the different combinations of inputs, outputs, and OpExprInterpContext. The OpExprInterpContext is the context the op needs in the interpreter: it may carry the attributes required for the op's computation (for example, kernel_size and padding for a conv2d op), as well as descriptive information such as device, SBP, and parallelism. All of these Dispatch overloads eventually funnel into:

/* static */ Maybe<void> OpInterpUtil::Dispatch(const OpExpr& op_expr, const TensorTuple& inputs,
                                                TensorTuple* outputs,
                                                const OpExprInterpContext& ctx) {
  return JUST(GetInterpreter(inputs, ctx, op_expr))->Apply(op_expr, inputs, outputs, ctx);
}

At this point dispatch is done; the rest is handed over to the interpreter.

4

Interpreter

Get Interpreter

Let's first look at GetInterpreter, which simply fetches the interpreter that will be responsible for the op's subsequent execution. With the check-related logic omitted, the main code (in oneflow/core/framework/op_interpreter/op_interpreter_util.cpp) is:

Maybe<AutogradInterpreter> GetInterpreter(const TensorTuple& inputs, const OpExprInterpContext& ctx,
                                          const OpExpr& op_expr) {
  static const auto& g_lazy_interpreter = BuildLazyInterpreter();
  static const auto& g_eager_consistent_interpreter = BuildEagerInterpreter(/*is_mirrored=*/false);
  static const auto& g_eager_mirrored_interpreter = BuildEagerInterpreter(/*is_mirrored=*/true);
  if (!LazyMode::is_enabled()) {
    if (inputs.empty()) {
      if (ctx.parallel_desc.has_value()) {
        JUST(ctx.nd_sbp);
        CHECK_OR_RETURN(!ctx.device.has_value());
        return g_eager_consistent_interpreter;
      } else {
        CHECK_OR_RETURN(!ctx.nd_sbp.has_value());
        return g_eager_mirrored_interpreter;
      }
    } else {
      if (inputs.at(0)->is_consistent()) {
        ...
        return g_eager_consistent_interpreter;
      } else {
        ...
        return g_eager_mirrored_interpreter;
      }
    }
    UNIMPLEMENTED_THEN_RETURN();
  }
  return g_lazy_interpreter;
}

From this logic we can see that interpreters broadly fall into an eager interpreter and a lazy interpreter, and the eager interpreter is further split into eager mirrored and eager consistent variants. Concretely, there are the following three subclass implementations:

EagerMirroredInterpreter

EagerConsistentInterpreter

LazyInterpreter

In ordinary eager mode (whether on a single device or with DDP), execution goes through the EagerMirroredInterpreter logic; outside ordinary eager mode, when SBP and placement are set on the input tensors, execution enters EagerConsistentInterpreter; and in lazy mode (when using nn.Graph) it enters LazyInterpreter.

Next, let's look at how these three interpreters are built:

std::shared_ptr<AutogradInterpreter> BuildEagerInterpreter(const bool& is_mirrored) {
  std::shared_ptr<OpExprInterpreter> internal;
  if (is_mirrored) {
    internal = std::make_shared<EagerMirroredInterpreter>();
  } else {
    internal = std::make_shared<EagerConsistentInterpreter>();
  }
  return std::make_shared<AutogradInterpreter>(internal);
}

std::shared_ptr<AutogradInterpreter> BuildLazyInterpreter() {
  auto internal = std::make_shared<LazyInterpreter>();
  return std::make_shared<AutogradInterpreter>(internal);
}

As we can see, once constructed, each of the three interpreters is used as the private member internal to build an AutogradInterpreter, and an AutogradInterpreter is what is ultimately returned.

class AutogradInterpreter {
 public:
  AutogradInterpreter() = delete;
  AutogradInterpreter(const std::shared_ptr<OpExprInterpreter>& internal) : internal_(internal) {}
  virtual ~AutogradInterpreter() = default;

  Maybe<void> Apply(const OpExpr& op_expr, const TensorTuple& inputs, TensorTuple* outputs,
                    const AttrMap& attrs) const {
    return Apply(op_expr, inputs, outputs, OpExprInterpContext(attrs));
  }

  Maybe<void> Apply(const OpExpr& op_expr, const TensorTuple& inputs, TensorTuple* outputs) const {
    return Apply(op_expr, inputs, outputs, OpExprInterpContext(AttrMap{}));
  }

  Maybe<void> Apply(const OpExpr& op_expr, const TensorTuple& inputs, TensorTuple* outputs,
                    const OpExprInterpContext& ctx) const;

 private:
  std::shared_ptr<OpExprInterpreter> internal_;
};

Apply()

From the above we know that EagerMirroredInterpreter, EagerConsistentInterpreter, and LazyInterpreter are all wrapped in an AutogradInterpreter shell, and Apply is triggered through the AutogradInterpreter. As the name suggests, AutogradInterpreter is mainly concerned with autograd: in eager mode it inserts, for each forward op node, the corresponding node used to compute gradients in the backward pass.

Let's look at this code; the role of each key part is explained in the comments:

Maybe<void> AutogradInterpreter::Apply(const OpExpr& op_expr, const TensorTuple& inputs,
                                       TensorTuple* outputs, const OpExprInterpContext& ctx) const {
  // Decide whether gradients need to be computed: if we are inside a GradMode scope and grad was
  // not disabled when the op was registered, requires_grad is derived from the inputs'
  // requires_grad attribute; if any input tensor has requires_grad == true, gradients are needed.
  bool requires_grad = false;
  if (autograd::GradMode::is_enabled() && !JUST(op_expr.IsGradDisabled())) {
    requires_grad =
        std::any_of(inputs.begin(), inputs.end(),
                    [](const std::shared_ptr<Tensor>& tensor) { return tensor->requires_grad(); });
  }
// This block is a bit ugly because stride && view support was added to OneFlow only recently,
// while most ops have not yet registered stride inference and do not yet support non-contiguous
// input tensors, so the inputs of those ops are forced to be contiguous here.
// NOTE: if this op not support stride, then need to tensor->contiguous()
#define HANDLE_NON_CONTIGUOUS_INPUT(tensor_tuple_ptr)                                       \
  TensorTuple tmp_inputs;                                                                   \
  if (!LazyMode::is_enabled() && !JUST(op_expr.SupportNonContiguous())) {                   \
    tmp_inputs.resize(inputs.size());                                                       \
    for (size_t i = 0; i < inputs.size(); i++) { tmp_inputs[i] = inputs[i]->contiguous(); } \
    tensor_tuple_ptr = &tmp_inputs;                                                         \
  }

  const TensorTuple* inputs_ptr = &inputs;
  HANDLE_NON_CONTIGUOUS_INPUT(inputs_ptr);

  // This is where the actual interpreter execution happens.
  {
    autograd::AutoGradMode mode(false);
    JUST(internal_->Apply(op_expr, *inputs_ptr, outputs, ctx));
  }

  // For ops in eager mode with requires_grad == true, insert a backward node (AddNode) for
  // autograd; the node carries the backward gradient computation (backward_fn).
  // lazy mode will construct backward compute graph in passes, so disable autograd if lazy mode.
  std::shared_ptr<OpExprGradClosure> grad_closure(nullptr);
  if (requires_grad && !LazyMode::is_enabled()) {
    grad_closure = JUST(op_expr.GetOrCreateOpGradClosure());
    auto backward_fn = std::make_shared<BackwardFunction>();
    backward_fn->body = [=](const TensorTuple& out_grads, TensorTuple* in_grads,
                            bool create_graph) -> Maybe<void> {
      autograd::AutoGradMode mode(create_graph);
      JUST(grad_closure->Apply(out_grads, in_grads));
      return Maybe<void>::Ok();
    };
    backward_fn->status = [=]() { return grad_closure->state()->SavedTensors().size() > 0; };
    JUST(GetThreadLocalAutogradEngine()->AddNode(op_expr.op_type_name() + "_backward", backward_fn,
                                                 *inputs_ptr, outputs));
  }
  // update outputs autograd meta
  // NOTE: if requires_grad is true, we will create a new autograd meta for each output
  // in `AddBackwardFuncPtr` to support inplace operation, so the update should after
  // `AddBackwardFuncPtr`
  for (auto& output : *outputs) {
    output->set_is_leaf(inputs_ptr->size() == 0 || !requires_grad);
    ...
    if (!output->requires_grad()) {
      JUST(output->set_requires_grad(
          requires_grad && IsSupportRequireGradDataType(output->dtype()->data_type())));
    }
  }
  // Capture the forward inputs and outputs; the backward computation may need them.
  if (requires_grad && !LazyMode::is_enabled()) {
    // capture inputs and outputs after `AddBackwardFuncPtr` because of that grad function
    // node has been attached to them.
    JUST(grad_closure->Capture(*inputs_ptr, *outputs, ctx));
  }
  return Maybe<void>::Ok();
}

That is quite a lot of logic, so let's focus on the key part. For a simple op like relu, we only need to pay attention to this piece:

// This is where the actual interpreter execution happens.
{
  autograd::AutoGradMode mode(false);
  JUST(internal_->Apply(op_expr, *inputs_ptr, outputs, ctx));
}

Here, still taking flow.relu as the example, since this is plain eager mode the call actually lands in EagerInterpreter's Apply method:

Maybe<void> EagerInterpreter::Apply(const OpExpr& op_expr, const TensorTuple& inputs,
                                    TensorTuple* outputs, const OpExprInterpContext& ctx) const {
#define APPLY_IF(op_type)                                              \
  if (const auto* op = dynamic_cast<const op_type##Expr*>(&op_expr)) { \
    return ApplyImpl(*op, inputs, outputs, ctx);                       \
  }

  APPLY_IF(UserOp);
  APPLY_IF(VariableOp);
  APPLY_IF(CastToMirroredOp);
  APPLY_IF(CastFromMirroredOp);
  APPLY_IF(ConsistentToConsistentOp);
  APPLY_IF(CastToConsistentOp);
  APPLY_IF(CastFromConsistentOp);
  APPLY_IF(DistributeSplitOp);
  APPLY_IF(DistributeCloneOp);
  APPLY_IF(DistributeConcatOp);
  APPLY_IF(DistributeAddOp);
  APPLY_IF(FunctionOp);
  APPLY_IF(SelectTopNOp)
#undef APPLY_IF

  OF_UNIMPLEMENTED() << "The type " << op_expr.op_type_name()
                     << " has not been supported in EagerInterpreter::Apply.";
}

Here, the APPLY_IF macro adds a branch for each kind of op expression. For most users, the ops involved are of the UserOp type, so execution actually enters this branch:

if (const auto* op = dynamic_cast<const UserOpExpr*>(&op_expr)) {
  return ApplyImpl(*op, inputs, outputs, ctx);
}

Next, look at EagerMirroredInterpreter::ApplyImpl, located in

oneflow/core/framework/op_interpreter/eager_mirrored_op_interpreter.cpp

Maybe<void> EagerMirroredInterpreter::ApplyImpl(const UserOpExpr& op_expr,
                                                const TensorTuple& inputs, TensorTuple* outputs,
                                                const OpExprInterpContext& ctx) const {
  return NaiveInterpret(op_expr, inputs, outputs, ctx);
}

Its final implementation is NaiveInterpret.

NaiveInterpret

In short, NaiveInterpret mainly does the following:

Check that the devices of the input tensors are consistent

Create the output tensors

Infer and check shape/stride/dtype for the output tensors

Build the op execution instruction and dispatch it to the VM (virtual machine)

A simplified version of the code is as follows:

Maybe<void> NaiveInterpret(const UserOpExpr& user_op_expr, const TensorTuple& inputs,
                           const Symbol<Device>& default_device, TensorTuple* outputs,
                           const OpExprInterpContext& ctx) {
  const auto& attrs = ctx.attrs;
  std::shared_ptr<EagerBlobObjectList> input_eager_blob_objects =
      std::make_shared<EagerBlobObjectList>(inputs.size());
  // check devices
  for (int i = 0; i < inputs.size(); i++) {
    const auto& input_device = JUST(inputs.at(i)->device());
    if (i > 0) {
      CHECK_OR_RETURN(*default_device == *input_device)
          << Error::RuntimeError()
          << "Expected all tensors to be on the same device, but found at least two devices, "
          << default_device->ToString() << " (positional 0) and " << input_device->ToString()
          << " (positional " << i << ")!";
    }
    input_eager_blob_objects->at(i) = JUST(inputs.at(i)->eager_blob_object());
  }
  // make output tensors
  std::shared_ptr<EagerBlobObjectList> output_eager_blob_objects =
      std::make_shared<EagerBlobObjectList>(outputs->size());
  auto* output_tensor_metas = ThreadLocalDefaultOutputMutTensorMetas(outputs->size());
  for (int i = 0; i < outputs->size(); i++) {
    if (!outputs->at(i)) {
      const auto& tensor_impl = std::make_shared<EagerMirroredTensorImpl>();
      outputs->at(i) = std::make_shared<MirroredTensor>(tensor_impl);
      output_tensor_metas->at(i) = tensor_impl->mut_tensor_meta();
    } else {
      bool has_eager_blob_object = JUST(outputs->at(i)->has_eager_blob_object());
      CHECK_OR_RETURN(has_eager_blob_object);
      output_eager_blob_objects->at(i) = JUST(outputs->at(i)->eager_blob_object());
    }
  }
  Symbol<Stream> stream;
  bool need_check_mem_case = true;

  // infer devices
  ...

  // infer shapes strides dtype
  ...

  // Build the op execution instruction and dispatch it to the VM.
  JUST(PhysicalRun([&](InstructionsBuilder* builder) -> Maybe<void> {
    return builder->LocalCallOpKernel(kernel, input_eager_blob_objects, output_eager_blob_objects,
                                      ctx, stream);
  }));
  return Maybe<void>::Ok();
}

The interpreter's final destination is the virtual machine (VM). The VM is one of OneFlow's more distinctive designs and there is a lot to it, so we will not expand on it here :) Roughly speaking, once dispatched to the VM, the op enters a task execution queue, where it waits for the VM to schedule and execute it.
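As a purely conceptual analogy (this is not OneFlow's VM, just a toy illustration of the queue-and-schedule idea described above):

#include <functional>
#include <queue>

// A toy "VM": instructions are queued when dispatched and executed when scheduled.
struct TinyVm {
  std::queue<std::function<void()>> instructions;

  // Called by the "interpreter" side: enqueue an op execution instruction.
  void Receive(std::function<void()> instruction) { instructions.push(std::move(instruction)); }

  // Called by the scheduler: drain the queue and run each instruction.
  // OneFlow's real VM does this asynchronously with its own scheduling policy.
  void Schedule() {
    while (!instructions.empty()) {
      instructions.front()();
      instructions.pop();
    }
  }
};

In the real system, an "instruction" carries the kernel, the input/output blob objects, and the stream, as in the LocalCallOpKernel instruction built above.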

5

Compute

After the interpreter dispatches the op execution instruction to the VM and the scheduling logic has processed it, the instruction is eventually triggered and executed in

oneflow/core/eager/opkernel_instruction_type.cpp

The core code is as follows:

static inline void OpKernelCompute(LocalCallOpKernelPhyInstrOperand* operand,
                                   DeviceCtx* device_ctx, user_op::OpKernelState* state,
                                   const user_op::OpKernelCache* cache) {
  auto* opkernel = operand->mut_opkernel();
  auto* compute_ctx =
      opkernel->UpdateComputeContext(operand->inputs().get(), operand->outputs().get(),
                                     operand->consistent_tensor_infer_result().get(), device_ctx);
  ...
  operand->user_opkernel()->Compute(compute_ctx, state, cache);
  opkernel->UpdateComputeContext(nullptr, nullptr, nullptr, nullptr);
}

Here, the call

operand->user_opkernel()->Compute(compute_ctx, state, cache);

triggers the actual execution of the op kernel. Generally speaking, an op's kernel is dispatched to a different implementation depending on the device, and these implementations usually live in:

oneflow/user/kernels/xxx_kernel.cpp

oneflow/user/kernels/xxx_kernel.cu

The relu op here is a bit special: it is implemented with a primitive (primitives are another distinctive design in OneFlow, with good abstraction and composability). Concretely, this UnaryPrimitive is the combination of the elementwise-unary template and a UnaryFunctor. The call chain is as follows:

UnaryPrimitiveKernel

class UnaryPrimitiveKernel final : public user_op::OpKernel, public user_op::CudaGraphSupport {
 public:
  OF_DISALLOW_COPY_AND_MOVE(UnaryPrimitiveKernel);
  UnaryPrimitiveKernel() = default;
  ~UnaryPrimitiveKernel() = default;

  using PrimitiveFactoryFuncType = std::function<std::unique_ptr<ep::primitive::ElementwiseUnary>(
      user_op::KernelComputeContext*)>;

  UnaryPrimitiveKernel(const std::string& output_name, const std::string& input_name,
                       PrimitiveFactoryFuncType fn)
      : output_name_(output_name),
        input_name_(input_name),
        primitive_factory_func_(std::move(fn)) {}

 private:
  using user_op::OpKernel::Compute;
  void Compute(user_op::KernelComputeContext* ctx) const override {
    auto primitive = primitive_factory_func_(ctx);
    CHECK(primitive);

    const user_op::Tensor* input_tensor = ctx->Tensor4ArgNameAndIndex(input_name_, 0);
    ...
    const int64_t elem_cnt = input_shape.elem_cnt();

    if (elem_cnt != 0) {
      primitive->Launch(ctx->stream(), input_tensor->dptr(), output_tensor->mut_dptr(), elem_cnt);
    }
  }
  bool AlwaysComputeWhenAllOutputsEmpty() const override { return false; }

  std::string output_name_;
  std::string input_name_;
  PrimitiveFactoryFuncType primitive_factory_func_;
};

ep::primitive::ElementwiseUnary

template<UnaryOp unary_op, typename Src, typename Dst>
class ElementwiseUnaryImpl : public ElementwiseUnary {
 public:
  OF_DISALLOW_COPY_AND_MOVE(ElementwiseUnaryImpl);
  ElementwiseUnaryImpl(Scalar attr0, Scalar attr1) : attr0(attr0), attr1(attr1) {}
  ~ElementwiseUnaryImpl() override = default;

  void Launch(Stream* stream, const void* src_ptr, void* dst_ptr, size_t count) override {
    CpuStream* cpu_stream = stream->As<CpuStream>();

    Dst* dst = reinterpret_cast<Dst*>(dst_ptr);
    const Src* src = reinterpret_cast<const Src*>(src_ptr);
    auto functor = UnaryFunctor<DeviceType::kCPU, unary_op, Dst, Src>(attr0, attr1);
    cpu_stream->ParallelFor(0, count, [functor, src, dst](int64_t begin, int64_t end) {
      for (int64_t i = begin; i < end; i++) { dst[i] = functor(src[i]); }
    });
  }

 protected:
  Scalar attr0, attr1;
};

UnaryFunctor

This UnaryFunctor is specialized into different concrete functor implementations according to the unary op type. For the relu op, the implementation is located in

oneflow/core/ep/common/primitive/unary_functor.h:

template<DeviceType device, typename Dst, typename Src>
struct UnaryFunctor<device, UnaryOp::kRelu, Dst, Src> {
  UnaryFunctor(Scalar attr0, Scalar attr1) {}

  OF_DEVICE_FUNC Dst operator()(Src src) const {
    const Src zero_val = static_cast<Src>(0.0);
    if (src <= zero_val) {
      return static_cast<Dst>(zero_val);
    } else {
      return static_cast<Dst>(src);
    }
  }
};
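Putting the last two pieces together, here is a tiny self-contained toy (ordinary C++, not OneFlow code) showing what the Launch-plus-functor pattern boils down to for relu on CPU:

#include <cstddef>
#include <iostream>
#include <vector>

// Toy counterpart of the UnaryFunctor specialization above.
struct ToyReluFunctor {
  float operator()(float src) const { return src <= 0.0f ? 0.0f : src; }
};

// Toy counterpart of ElementwiseUnaryImpl::Launch: apply the functor to every element.
template<typename Functor>
void LaunchElementwise(const float* src, float* dst, std::size_t count, Functor functor) {
  for (std::size_t i = 0; i < count; ++i) { dst[i] = functor(src[i]); }
}

int main() {
  std::vector<float> x{-1.0f, 0.5f, 2.0f};
  std::vector<float> y(x.size());
  LaunchElementwise(x.data(), y.data(), x.size(), ToyReluFunctor{});
  for (float v : y) { std::cout << v << " "; }  // prints: 0 0.5 2
  return 0;
}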

At this point we have completed an op's journey from Python to C++. In detail it is fairly involved, but viewed as a whole the flow is quite simple: setting aside the binding and the VM scheduling machinery, the main process consists of just four stages: functor -> dispatch -> interpreter -> kernel compute.

When implementing or adding a new op, we usually do not need to care about the dispatch and interpreter stages in the middle. We only need to focus on the parts strongly tied to that op: the argument and op-logic checks at the functor level, and the actual op computation in the kernel compute part.
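For the kernel compute side, a user kernel in oneflow/user/kernels usually looks roughly like the sketch below. This is a hypothetical CPU float kernel for the made-up "my_scaled_relu" op from the earlier functor sketch; the registration follows the common REGISTER_USER_KERNEL pattern, but the exact builder and accessor names are assumptions that vary across OneFlow versions:

class MyScaledReluCpuKernel final : public user_op::OpKernel {
 public:
  MyScaledReluCpuKernel() = default;
  ~MyScaledReluCpuKernel() override = default;

 private:
  void Compute(user_op::KernelComputeContext* ctx) const override {
    const user_op::Tensor* x = ctx->Tensor4ArgNameAndIndex("x", 0);
    user_op::Tensor* y = ctx->Tensor4ArgNameAndIndex("y", 0);
    const double alpha = ctx->Attr<double>("alpha");
    const float* x_ptr = x->dptr<float>();
    float* y_ptr = y->mut_dptr<float>();
    const int64_t elem_cnt = x->shape().elem_cnt();
    // Elementwise "scaled relu": alpha * x for positive inputs, 0 otherwise.
    for (int64_t i = 0; i < elem_cnt; i++) {
      y_ptr[i] = x_ptr[i] > 0.0f ? static_cast<float>(alpha) * x_ptr[i] : 0.0f;
    }
  }
  bool AlwaysComputeWhenAllOutputsEmpty() const override { return false; }
};

REGISTER_USER_KERNEL("my_scaled_relu")
    .SetCreateFn<MyScaledReluCpuKernel>()
    .SetIsMatchedHob((user_op::HobDeviceType() == DeviceType::kCPU)
                     && (user_op::HobDataType("y", 0) == GetDataType<float>::value));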

(Reference code:

https://github.com/oneflow-inc/oneflow/commit/1dbdf8faed988fa7fd1a9034a4d79d5caf18512d)

