5.4. Using Packing
5.4.1. Background
The IPU currently supports static graphs only: the input shapes of a model must be fixed, and dynamic shapes cause the model to be recompiled. In practice, however, and especially in natural language processing applications, the sequence length of the model input is often dynamic. The conventional workaround is to pad all variable-length inputs to the max sequence length before feeding them to the model, but this padding introduces a large amount of useless computation and drags down the effective utilization of the hardware. On the IPU, Packing can be used to support dynamic sequence lengths and improve compute utilization.
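As a rough illustration of how much compute padding wastes (our own arithmetic, borrowing the 7-request example from the next section):

# Back-of-the-envelope sketch (not part of the original example): padding
# 7 requests of lengths 1..7 to max_seq_len=8 computes 56 token positions,
# of which only 28 carry real data.
lengths = [1, 2, 3, 4, 5, 6, 7]
max_seq_len = 8
valid = sum(lengths)                    # 28 useful token positions
padded = len(lengths) * max_seq_len     # 56 computed token positions
print(f"utilization with padding: {valid / padded:.0%}")  # 50%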
5.4.2. Packing and Unpacking
Packing and Unpacking are best explained with an example. Suppose the maximum model input length is 8 and the batch size is 4, and that there are currently 7 requests of batch size 1 with lengths ranging from 1 to 7, where 0 denotes invalid padded data. Packing and Unpacking then proceed as shown in the figure below:
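The same example can also be written down as a small sketch (our illustration, assuming a first-fit placement; the runtime section below shows that PopRT also offers a next_fit algorithm):

import numpy as np

batch_size, max_seq_len = 4, 8
lengths = [1, 2, 3, 4, 5, 6, 7]  # the 7 requests from the example

# Packing: place each request into the first row with enough free space
# (first-fit, longest requests first).
rows = np.zeros((batch_size, max_seq_len), dtype=np.int32)  # 0 = padding
used = [0] * batch_size  # tokens already occupied per row
segments = []  # (row, offset, length), needed later for Unpacking
for n in sorted(lengths, reverse=True):
    row = next(r for r in range(batch_size) if used[r] + n <= max_seq_len)
    rows[row, used[row]:used[row] + n] = n  # mark tokens with request length
    segments.append((row, used[row], n))
    used[row] += n

print(rows)
# [[7 7 7 7 7 7 7 1]
#  [6 6 6 6 6 6 2 2]
#  [5 5 5 5 5 3 3 3]
#  [4 4 4 4 0 0 0 0]]

# Unpacking: slice each request back out using the recorded segments.
unpacked = [rows[r, o:o + n] for r, o, n in segments]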
5.4.3. Transformer-based NLP Models
Since its introduction in 2017, the Transformer architecture has been applied to an ever wider range of fields, spreading from NLP to ASR, CV, DLRM and more. A Transformer consists of an Encoder and a Decoder; this section is only concerned with the Encoder, whose structure is shown in the figure below:
Taking BERT as an example, the input of the Transformer Encoder usually has shape (batch_size, seq_len, hidden_size). Inside the Encoder, every module except Multi-Head Attention computes along the last dimension only, so for those modules Packing eliminates the useless computation. Multi-Head Attention, however, computes the correlations between tokens, so unless the mask is modified it must run on Unpacked data, and the data must be Packed again once the attention computation completes. The flow can be expressed with the following pseudocode:
packed_input from host
activation = packed_input
for encoder in encoders:
    Unpacking
    Attention
    Packing
    Add & LayerNorm
    Feed-Forward
    Add & LayerNorm
    Update activation
Unpacking
unpacked_output to host
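To make the Unpacking/Attention/Packing step concrete, here is a toy, self-contained numpy sketch (our illustration, not the IPU implementation; the (row, offset, length) segment metadata is an assumption of this sketch). Each request attends only to its own tokens, never to the other request sharing its packed row:

import numpy as np

np.random.seed(0)
hidden = 4
packed = np.random.rand(2, 8, hidden).astype(np.float32)  # 2 packed rows
segments = [(0, 0, 5), (0, 5, 3), (1, 0, 8)]  # (row, offset, length)


def attention(x):
    # single-head scaled dot-product attention over one sequence
    scores = x @ x.T / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ x


# Unpacking -> per-request attention -> Packing, as in the pseudocode above
repacked = np.zeros_like(packed)
for row, off, n in segments:
    seq = packed[row, off:off + n]
    repacked[row, off:off + n] = attention(seq)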
5.4.4. How to Use Packing
This section uses Bert-Base-Squad as an example. The OS used here is Ubuntu 20.04 with Python 3.8.15. See examples/packed_bert_example for the complete example.
Download the model
Before downloading the model, install the dependencies:
pip install torch==1.10.0
pip install transformers[onnx]==4.25.1
Then download the model:
python -m transformers.onnx --model=csarron/bert-base-uncased-squad-v1 . --feature question-answering
Convert the model
The model downloaded by the command above does not have position_ids among its inputs. When Packing is used on the IPU, the inputs are packed on the host first, so position_ids needs to be added to the model inputs. The code is as follows:
# Copyright (c) 2023 Graphcore Ltd. All rights reserved.
import argparse
import copy
import os

import onnx

# Download model from huggingface
# - python -m transformers.onnx --model=csarron/bert-base-uncased-squad-v1 . --feature question-answering
# reference: https://huggingface.co/csarron/bert-base-uncased-squad-v1


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Preprocess Bert-Squad Model')
    parser.add_argument(
        '--input_model', type=str, default='', help='path of input model'
    )
    args = parser.parse_args()

    if not os.path.exists(args.input_model):
        parser.print_usage()
        raise FileNotFoundError(f'Unable to find model : {args.input_model}')

    model = onnx.load(args.input_model)

    # for packed bert, we need to export position_ids to model's input
    # step 1: remove unneeded nodes
    rm_node_names = [
        'Shape_7',
        'Gather_9',
        'Add_11',
        'Unsqueeze_12',
        'Slice_14',
        'Constant_8',
        'Constant_10',
        'Constant_13',
    ]
    rm_nodes = []
    for node in model.graph.node:
        if node.name in rm_node_names:
            rm_nodes.append(node)

    assert len(rm_node_names) == len(rm_nodes)

    for node in rm_nodes:
        model.graph.node.remove(node)

    # step 2: add position_ids to model's input
    position_ids = copy.deepcopy(model.graph.input[0])
    position_ids.name = 'position_ids'
    model.graph.input.append(position_ids)

    for node in model.graph.node:
        if node.op_type == 'Gather' and node.name == 'Gather_18':
            node.input[1] = position_ids.name

    print('Save preprocessed model to bert_base_squad_pos.onnx')
    onnx.save(model, 'bert_base_squad_pos.onnx')
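Assuming the script above is saved as add_position_ids.py (the file name is our choice; the example directory may use a different one), and given that the download step exports model.onnx to the current directory, it can be run with:

python add_position_ids.py --input_model model.onnx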
Generate the model without packing:
poprt \
    --input_model bert_base_squad_pos.onnx \
    --output_model squad_bert_base_bs16_sl256.onnx \
    --precision fp16 \
    --input_shape input_ids=16,256 attention_mask=16,256 token_type_ids=16,256 position_ids=16,256
Generate the packing model:
poprt \
    --input_model bert_base_squad_pos.onnx \
    --output_model squad_bert_base_bs16_sl256_pack.onnx \
    --precision fp16 \
    --input_shape input_ids=16,256 attention_mask=16,256 token_type_ids=16,256 position_ids=16,256 \
    --pack_args max_valid_num=40 segment_max_size=256
Here, max_valid_num specifies the maximum batch size after Unpacking, and segment_max_size the maximum sequence length.
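As a rough sanity check of these values (our own arithmetic, not from the original example):

# With 16 rows of length 256 and an average request length of 128 (the
# defaults used by the benchmark below), about 32 requests fit into one
# packed batch, so max_valid_num=40 leaves headroom for shorter requests.
batch_size, seq_len, avg_len = 16, 256, 128
print(batch_size * seq_len // avg_len)  # -> 32, below max_valid_num (40)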
Run the model
Run the model with the following command:
python packed_bert_example.py \
    --model_with_packing_unpack_repack squad_bert_base_bs16_sl256_pack.onnx \
    --model_without_packing squad_bert_base_bs16_sl256.onnx
The complete code is as follows:
# Copyright (c) 2023 Graphcore Ltd. All rights reserved.
import argparse
import csv
import os
import queue
import sys
import tempfile
import time

from multiprocessing.pool import ThreadPool

import numpy as np
import packing_utils

from sklearn.metrics import mean_absolute_error

from poprt import runtime
from poprt.backend import get_session

np.random.seed(2023)
INPUT_IDS = "input_ids"
POSITION_IDS = "position_ids"
ATTENTION_MASK = "attention_mask"
TOKEN_TYPE_IDS = "token_type_ids"
UNPACK_INFO = "unpack_info"
OUTPUT2 = "start_logits"
OUTPUT1 = "end_logits"


class BertInputs(object):
    def __init__(
        self,
        input_ids,
        attention_mask,
        token_type_ids,
        position_ids,
        unpack_info,
        input_len,
    ):
        self.input_ids = input_ids
        self.attention_mask = attention_mask
        self.token_type_ids = token_type_ids
        self.position_ids = position_ids
        self.input_len = input_len
        self.unpack_info = unpack_info


def get_synthetic_data(args):
    input_len = np.random.normal(
        args.avg_seq_len, args.avg_seq_len, size=args.dataset_size
    ).astype(np.int32)
    input_len = np.clip(input_len, 1, args.max_seq_len)

    datasets = []
    for s_len in input_len:
        input_ids = np.random.randint(0, args.emb_size, (s_len)).astype(np.int32)

        attention_mask = np.ones(s_len).astype(np.int32)
        token_type_ids = np.random.randint(0, 2, (s_len)).astype(np.int32)

        position_ids = np.arange(s_len).astype(np.int32)
        unpack_info = np.zeros(args.max_valid_num).astype(np.int32)

        feature = BertInputs(
            input_ids, attention_mask, token_type_ids, position_ids, unpack_info, s_len
        )
        datasets.append(feature)

    return datasets


def dump_results(model_name, results):
    fieldnames = [OUTPUT1, OUTPUT2]
    filename = os.path.basename(model_name)[:-4] + 'csv'
    with open(filename, 'w') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        for result in results:
            dict_name2list = {
                OUTPUT1: result[OUTPUT1],
                OUTPUT2: result[OUTPUT2],
            }
            writer.writerow(dict_name2list)


## create batched inputs and pad samples to max_seq_len
def padding_data(datasets, index, args):
    feed_dicts = {}
    feed_dicts[INPUT_IDS] = np.zeros(
        (args.batch_size, args.max_seq_len), dtype=np.int32
    )
    feed_dicts[ATTENTION_MASK] = np.zeros(
        (args.batch_size, args.max_seq_len), dtype=np.int32
    )
    feed_dicts[POSITION_IDS] = np.zeros(
        (args.batch_size, args.max_seq_len), dtype=np.int32
    )
    feed_dicts[TOKEN_TYPE_IDS] = np.zeros(
        (args.batch_size, args.max_seq_len), dtype=np.int32
    )

    for i in range(args.batch_size):
        input_len = datasets[index].input_len
        feed_dicts[INPUT_IDS][i][:input_len] = datasets[index].input_ids
        feed_dicts[ATTENTION_MASK][i][:input_len] = datasets[index].attention_mask
        feed_dicts[POSITION_IDS][i][:input_len] = datasets[index].position_ids
        feed_dicts[TOKEN_TYPE_IDS][i][:input_len] = datasets[index].token_type_ids
        index = index + 1
    return feed_dicts


# online pack: samples fed to the IPU can fill up to the maximum number
# of batches in each run
def run_packing_model_with_pack_runner_unpack_repack(args, datasets):
    tmpdir = tempfile.TemporaryDirectory()
    # export popef for PackRunner
    get_session(
        args.model_with_packing_unpack_repack,
        1,
        "poprt",
        output_dir=tmpdir.name,
        export_popef=True,
    ).load()
    config = runtime.PackRunnerConfig(
        timeout_microseconds=args.timeout_microseconds,
        # max_valid_num=args.max_valid_num,
        # dynamic_input_name=args.dynamic_input_name,
    )

    popef_path = tmpdir.name + '/executable.popef'
    # popef_path = "/popconverter/examples/packed_bert_example/executable.popef"
    pack_runner = runtime.PackRunner(popef_path, config)

    result_queue = queue.Queue()
    results = []
    start_time = time.time()
    for i in range(args.dataset_size):
        feed_dicts = {
            INPUT_IDS: datasets[i].input_ids,
            ATTENTION_MASK: datasets[i].attention_mask,
            TOKEN_TYPE_IDS: datasets[i].token_type_ids,
            POSITION_IDS: datasets[i].position_ids,
            # unpack_info should be hidden from user in the future
            UNPACK_INFO: np.zeros(args.max_valid_num).astype(np.int32),
        }
        out_dict = {
            OUTPUT1: np.zeros([args.max_seq_len]).astype(np.float16),
            OUTPUT2: np.zeros([args.max_seq_len]).astype(np.float16),
        }
        future = pack_runner.executeAsync(feed_dicts, out_dict)
        result_queue.put((future, out_dict))
    result_queue.put((None, None))
    while True:
        future, out_dict = result_queue.get()
        if future is None:
            break
        future.wait()
        results.append(out_dict)
    end_time = time.time()

    tput = args.dataset_size / (end_time - start_time)
    latency_ms = (end_time - start_time) / args.dataset_size
    print(
        f"[Pack Online Unpack Repack] Throughput: {tput} samples/s, Latency : {latency_ms * 1000} ms"
    )

    if args.dump_results:
        dump_results(
            "online_unpack_repack" + args.model_with_packing_unpack_repack, results
        )

    tmpdir.cleanup()
    return results


# offline pack: samples fed to the IPU can fill up to the maximum number
# of batches in each run; the model contains pack / unpack ops
def run_packing_model_with_model_runner(args, datasets, model_path, across_rows):
    run_queue = queue.Queue()
    start_time = time.time()
    index = 0
    for i in range(0, args.dataset_size):
        transfer = packing_utils.pack_data(
            datasets,
            index,
            args.batch_size,
            seq_len=256,
            max_valid_num=args.max_valid_num,
            segment_num=1,
            across_rows=across_rows,
        )

        run_queue.put(transfer)
        index = transfer.count
        if index == args.dataset_size:
            break
    run_queue.put(None)
    duration_of_packing = time.time() - start_time
    mean_latency_of_packing_us = duration_of_packing * 1e6 / args.dataset_size

    print(f"Mean latency of packing data: {mean_latency_of_packing_us} us/sam")
    print(f"Total latency of packing data: {duration_of_packing} s")

    sess = get_session(model_path, 1, "poprt").load()

    pool = ThreadPool(processes=1)

    def execute(feed_dicts, valid_num):
        outputs = sess.run([OUTPUT1, OUTPUT2], feed_dicts)
        res = []
        if across_rows:
            for i in range(valid_num):
                res1 = outputs[0][i].copy().tolist()
                res2 = outputs[1][i].copy().tolist()
                res.append({OUTPUT1: res1, OUTPUT2: res2})
        else:
            outlen = len(outputs[0][0])
            for index in range(len(feed_dicts[ATTENTION_MASK])):
                start = 0
                arr = np.array(feed_dicts[ATTENTION_MASK][index])
                while start < outlen and arr[start] > 0:
                    arr = arr - 1
                    zero_num = len(arr) - np.count_nonzero(arr)
                    out1 = [0] * outlen
                    out2 = [0] * outlen
                    out1[:zero_num] = outputs[0][index][start : start + zero_num]
                    out2[:zero_num] = outputs[1][index][start : start + zero_num]
                    res.append({OUTPUT1: out1, OUTPUT2: out2})
                    start += zero_num
        return res

    asy_results = []

    total_start_time = time.time()
    while True:
        input_data = run_queue.get()
        if input_data is None:
            break

        feed_dicts = {
            INPUT_IDS: input_data.data[INPUT_IDS],
            ATTENTION_MASK: input_data.data[ATTENTION_MASK],
            TOKEN_TYPE_IDS: input_data.data[TOKEN_TYPE_IDS],
            POSITION_IDS: input_data.data[POSITION_IDS],
            # unpack_info should be hidden from user in the future
            UNPACK_INFO: input_data.unpack_info,
        }
        if not across_rows:
            feed_dicts.pop(UNPACK_INFO)

        valid_num = len(input_data.specs)
        async_result = pool.apply_async(execute, (feed_dicts, valid_num))
        asy_results.append(async_result)

    results = []
    for asy in asy_results:
        for res in asy.get():
            results.append(res)
    total_end_time = time.time()

    tput = len(results) / (total_end_time - total_start_time)
    latency = (total_end_time - total_start_time) / len(results)
    if across_rows:
        print(
            f"[Pack Offline Unpack Repack] Throughput: {tput} samples/s, Latency: {latency * 1000} ms"
        )
    else:
        print(
            f"[Pack Offline AttentionMask] Throughput: {tput} samples/s, Latency: {latency * 1000} ms"
        )

    if args.dump_results:
        dump_results("offline_" + model_path, results)

    return results


# online pack: samples fed to the IPU can fill up to the maximum number
# of batches in each run; the model only adds an AttentionMask op in this mode
def run_packing_model_with_pack_runner_attention_mask(args, datasets, algo):
    tmpdir = tempfile.TemporaryDirectory()
    # export popef for PackRunner
    get_session(
        args.model_with_packing_attention_mask,
        1,
        "poprt",
        output_dir=tmpdir.name,
        export_popef=True,
    ).load()
    config = runtime.PackRunnerConfig(
        timeout_microseconds=args.timeout_microseconds,
        max_valid_num=args.max_valid_num,
        dynamic_input_name=args.dynamic_input_name,
    )

    if algo == "next_fit":
        config.algorithom = runtime.PackAlgorithm.next_fit
    else:
        config.algorithom = runtime.PackAlgorithm.first_fit

    config.enable_input_single_row_mode("attention_mask")
    popef_path = tmpdir.name + '/executable.popef'
    # popef_path = "/popconverter/examples/packed_bert_example/executable.popef"
    pack_runner = runtime.PackRunner(popef_path, config)

    result_queue = queue.Queue()
    results = []
    start_time = time.time()
    for i in range(args.dataset_size):
        feed_dicts = {
            INPUT_IDS: datasets[i].input_ids,
            ATTENTION_MASK: datasets[i].attention_mask,
            TOKEN_TYPE_IDS: datasets[i].token_type_ids,
            POSITION_IDS: datasets[i].position_ids,
        }
        out_dict = {
            OUTPUT1: np.zeros([args.max_seq_len]).astype(np.float16),
            OUTPUT2: np.zeros([args.max_seq_len]).astype(np.float16),
        }
        future = pack_runner.executeAsync(feed_dicts, out_dict)
        result_queue.put((future, out_dict))
    result_queue.put((None, None))
    while True:
        future, out_dict = result_queue.get()
        if future is None:
            break
        future.wait()
        results.append(out_dict)
    end_time = time.time()

    tput = args.dataset_size / (end_time - start_time)
    latency_ms = (end_time - start_time) / args.dataset_size
    print(
        f"[Pack Online AttentionMask({algo})] Throughput: {tput} samples/s, Latency : {latency_ms * 1000} ms"
    )

    if args.dump_results:
        dump_results(
            "online_attention_mask_"
            + algo
            + "_"
            + args.model_with_packing_attention_mask,
            results,
        )

    tmpdir.cleanup()
    return results


# no pack: pad each line with 0 if the input is not long enough;
# the number of samples equals the batch size in every run
def run_original_model_with_model_runner(args, datasets):
    run_queue = queue.Queue()
    start_time = time.time()
    for i in range(0, args.dataset_size, args.batch_size):
        feed_dicts = padding_data(datasets, i, args)
        run_queue.put((args.batch_size, feed_dicts))
    run_queue.put((0, None))
    duration_of_padding_s = time.time() - start_time

    mean_latency_of_padding_us = duration_of_padding_s * 1e6 / args.dataset_size
    print(f"Mean latency of padding data: {mean_latency_of_padding_us} us/sam")
    print(f"Total latency of padding data: {duration_of_padding_s} s")

    sess = get_session(args.model_without_packing, 1, "poprt").load()

    asy_results = []

    def execute(feed_dicts, valid_num):
        outputs = sess.run([OUTPUT1, OUTPUT2], feed_dicts)
        res = []
        for i in range(valid_num):
            res1 = outputs[0][i].copy().tolist()
            res2 = outputs[1][i].copy().tolist()
            res.append({OUTPUT1: res1, OUTPUT2: res2})
        return res

    # execute
    pool = ThreadPool(processes=1)
    total_start_time = time.time()
    while True:
        valid_num, feed_dicts = run_queue.get()
        if feed_dicts is None:
            break
        async_result = pool.apply_async(execute, (feed_dicts, valid_num))
        asy_results.append(async_result)
    results = []
    for asy in asy_results:
        for res in asy.get():
            results.append(res)
    total_end_time = time.time()

    tput = len(results) / (total_end_time - total_start_time)
    latency = (total_end_time - total_start_time) / len(results)

    if args.dump_results:
        dump_results("original_" + args.model_without_packing, results)

    print(f"[Original] Throughput: {tput} samples/s, Latency: {latency * 1000} ms")

    return results


def calculate_mae(expected_results, output_results, datasets, enable_debug):
    assert len(datasets) == len(expected_results)
    assert len(datasets) == len(output_results)
    maes = []
    zipped_data = zip(datasets, expected_results, output_results)
    for i, (data, expected, output) in enumerate(zipped_data):
        np.testing.assert_equal(len(expected), len(output))
        input_len = data.input_len
        output_1_mae = mean_absolute_error(
            expected[OUTPUT1][:input_len], output[OUTPUT1][:input_len]
        )
        output_2_mae = mean_absolute_error(
            expected[OUTPUT2][:input_len], output[OUTPUT2][:input_len]
        )
        maes.append([i, output_1_mae, output_2_mae])

    k = 10 if len(datasets) > 10 else len(datasets)

    def print_topk(k, out_name, out_index):
        for i in range(1, k + 1):
            print(f"Sample: {maes[-i][0]}, {out_name} mae : {maes[-i][out_index]}")

    if enable_debug:
        maes.sort(key=lambda e: e[1])
        print(f"\n***** Top {k} mae of output: {OUTPUT1} *****")
        print_topk(k, OUTPUT1, 1)

        maes.sort(key=lambda e: e[2])
        print(f"\n***** Top {k} mae of output: {OUTPUT2} *****")
        print_topk(k, OUTPUT2, 2)

    print(f"{OUTPUT1} average mae: {np.mean(maes, axis=0)[1]}")
    print(f"{OUTPUT2} average mae: {np.mean(maes, axis=0)[2]}")


def main():
    parser = argparse.ArgumentParser(description='packed bert-base-squad')
    parser.add_argument(
        '--avg_seq_len', type=int, default=128, help='average sequence length of input'
    )
    parser.add_argument(
        '--batch_size', type=int, default=16, help='batch size of model'
    )
    parser.add_argument('--dump_results', action='store_true', help='dump results')
    parser.add_argument(
        '--dynamic_input_name', type=str, default=INPUT_IDS, help='dynamic input name'
    )
    parser.add_argument(
        '--emb_size', type=int, default=30522, help='word embedding table size'
    )
    parser.add_argument(
        '--enable_debug', action='store_true', help='enable output debug info'
    )
    parser.add_argument(
        '--iterations', type=int, default=100, help='number of batches to run'
    )
    parser.add_argument(
        '--max_seq_len', type=int, default=256, help='max sequence length of input'
    )
    parser.add_argument(
        '--max_valid_num', type=int, default=40, help='max valid num for pack'
    )
    parser.add_argument(
        '--model_without_packing', help='model without pack, unpack, repack op'
    )
    parser.add_argument(
        '--model_with_packing_unpack_repack',
        help='model with pack, unpack, repack op converted by PopRT',
    )
    parser.add_argument(
        '--model_with_packing_attention_mask',
        help='model with AttentionMask op converted by PopRT',
    )
    parser.add_argument(
        '--timeout_microseconds',
        type=int,
        default=15000,
        help='timeout in microseconds',
    )

    args = parser.parse_args()
    args.dataset_size = args.iterations * args.batch_size

    # generate synthetic dataset
    datasets = get_synthetic_data(args)
    original_result = run_original_model_with_model_runner(args, datasets)

    offline_pack_result_unpack_repack = run_packing_model_with_model_runner(
        args, datasets, args.model_with_packing_unpack_repack, True
    )
    online_pack_result_unpack_repack = run_packing_model_with_pack_runner_unpack_repack(
        args, datasets
    )

    offline_pack_result_attention_mask = run_packing_model_with_model_runner(
        args, datasets, args.model_with_packing_attention_mask, False
    )
    online_pack_result_attention_mask_first_fit = (
        run_packing_model_with_pack_runner_attention_mask(args, datasets, "first_fit")
    )
    online_pack_result_attention_mask_next_fit = (
        run_packing_model_with_pack_runner_attention_mask(args, datasets, "next_fit")
    )
    # compare the results
    print("\nCompare results between original and online pack (with unpack repack)")
    calculate_mae(
        original_result, online_pack_result_unpack_repack, datasets, args.enable_debug
    )
    print("\nCompare results between offline and online pack with unpack repack op")
    calculate_mae(
        offline_pack_result_unpack_repack,
        online_pack_result_unpack_repack,
        datasets,
        args.enable_debug,
    )

    print(
        "\nCompare results between original and online_first_fit pack with attention_mask op"
    )
    calculate_mae(
        original_result,
        online_pack_result_attention_mask_first_fit,
        datasets,
        args.enable_debug,
    )
    print(
        "\nCompare results between original and online_next_fit pack with attention_mask op"
    )
    calculate_mae(
        original_result,
        online_pack_result_attention_mask_next_fit,
        datasets,
        args.enable_debug,
    )

    print(
        "\nCompare results between offline and online_next_fit pack with attention_mask op"
    )
    calculate_mae(
        offline_pack_result_attention_mask,
        online_pack_result_attention_mask_next_fit,
        datasets,
        args.enable_debug,
    )


if __name__ == "__main__":
    sys.exit(main())
When the run completes, it prints output similar to the following:
[Original] Throughput: 1860.9792005501781 samples/s, Latency: 0.5373515188694 ms
....
[Pack Offline] Throughput: 2830.8140869025283 samples/s, Latency: 0.3532552719116211 ms
....
[Pack Online] Throughput: 2782.587696947809 samples/s, Latency : 0.3593777120113373 ms
....
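In this sample run, packing raises throughput from roughly 1861 samples/s to roughly 2831 samples/s (offline pack) and 2783 samples/s (online pack), about a 1.5x improvement over the padded baseline, with a corresponding reduction in per-sample latency.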