5.4. Using Packing

5.4.1. Background

The IPU currently supports only static graphs: the input shapes of a model must be fixed, and a dynamic shape triggers recompilation of the model. In practice, however, and especially in natural language processing applications, the sequence length of the model input is often dynamic. The conventional way to handle this is to pad all variable-length inputs to the max sequence length before feeding them to the model. This introduces a large amount of wasted computation, so the effective utilization of the available compute is low. On the IPU, Packing can be used to support dynamic sequence lengths and improve compute utilization.

5.4.2. Packing and Unpacking

The following example illustrates what Packing and Unpacking are. Suppose the maximum model input length is 8 and the batch size is 4, and there are currently 7 requests of batch size 1 with lengths ranging from 1 to 7, where 0 denotes padded (invalid) data. Packing and Unpacking then work as shown in the figure below:

../_images/packing_unpacking.png

Fig. 5.5 Packing and Unpacking
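
The scheme in the figure can be sketched in a few lines of NumPy (a toy illustration only, not the PopRT implementation; the first-fit-decreasing placement and all variable names are assumptions):

import numpy as np

max_seq_len, batch_size = 8, 4
requests = [np.full(n, n, dtype=np.int32) for n in range(1, 8)]  # lengths 1..7

# Packing: place the longest request first, into the first row that still has
# enough free slots; 0 marks unused (padded) positions.
packed = np.zeros((batch_size, max_seq_len), dtype=np.int32)
used = [0] * batch_size
slots = {}  # request index -> (row, offset, length), kept for unpacking
for i in sorted(range(len(requests)), key=lambda i: -len(requests[i])):
    req = requests[i]
    row = next(r for r in range(batch_size) if used[r] + len(req) <= max_seq_len)
    packed[row, used[row] : used[row] + len(req)] = req
    slots[i] = (row, used[row], len(req))
    used[row] += len(req)

# Unpacking: restore one zero-padded row per request.
unpacked = np.zeros((len(requests), max_seq_len), dtype=np.int32)
for i, (row, off, length) in slots.items():
    unpacked[i, :length] = packed[row, off : off + length]

print(packed)
# [[7 7 7 7 7 7 7 1]
#  [6 6 6 6 6 6 2 2]
#  [5 5 5 5 5 3 3 3]
#  [4 4 4 4 0 0 0 0]]

All 7 requests fit into a single (4, 8) batch, whereas padding each request to length 8 would have needed two batches.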

5.4.3. Transformer-based NLP Models

Since it was proposed in 2017, the Transformer architecture has been applied to an ever-wider range of fields, expanding from its origins in NLP to ASR, CV, DLRM and more. A Transformer consists of an Encoder and a Decoder; this section focuses only on the Encoder. The structure of the Transformer Encoder is shown in the figure below:

../_images/transformer_encoder.png

Fig. 5.6 Transformer Encoder

Taking Bert as an example, the input shape of the Transformer Encoder is typically (batch_size, seq_len, hidden_size). Within the Encoder, all modules except Multi-Head Attention compute only along the last dimension, so for these modules Packing can be used to eliminate wasted computation. The Multi-Head Attention module, however, computes correlations between tokens, so without modifying the mask it must run after Unpacking, and the data is Packed again once the Multi-Head Attention computation completes. The computation flow can be expressed with the following pseudocode:

packed_input from host
activation = packed_input
for encoder in encoders:
    Unpacking
    Attention
    Packing
    Add & LayerNorm
    Feed-Forward
    Add & LayerNorm
    Update activation
Unpacking
unpacked_output to host
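
Why this split works can be shown with a minimal NumPy sketch (illustrative only; the toy feed_forward and the shapes are assumptions, not PopRT ops). Position-wise modules apply the same function to every token slot independently, so they commute with packing; attention mixes token positions within a sequence, so each sequence must occupy its own row:

import numpy as np

hidden = 4
rng = np.random.default_rng(0)
W = rng.standard_normal((hidden, hidden))

def feed_forward(x):  # toy position-wise module: acts on the last dim only
    return np.maximum(x @ W, 0.0)

# two sequences of lengths 3 and 2, packed into one row of 5 token slots
seq_a = rng.standard_normal((3, hidden))
seq_b = rng.standard_normal((2, hidden))
packed = np.concatenate([seq_a, seq_b])  # (5, hidden)

# packed and unpacked layouts give identical results for position-wise modules
np.testing.assert_allclose(feed_forward(packed)[:3], feed_forward(seq_a))

# attention does not commute with packing: a softmax over the packed row would
# let seq_a tokens attend to seq_b tokens, hence Unpacking before Attention
# (or, equivalently, a block-diagonal attention mask) and Packing afterwards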

5.4.4. How to Use Packing

This section uses Bert-Base-Squad as an example. The OS used here is Ubuntu 20.04 with Python 3.8.15. For the complete example, see examples/packed_bert_example.

Download the model

Before downloading the model, install the dependency packages:

pip install torch==1.10.0
pip install transformers[onnx]==4.25.1

Then download the model with the following command:

python -m transformers.onnx --model=csarron/bert-base-uncased-squad-v1 . --feature question-answering

Convert the model

The model downloaded by the command above does not include position_ids among its inputs, but when using Packing on the IPU the inputs must first be packed on the host, so position_ids needs to be added to the model inputs. The code is as follows:

Listing 5.3 add_position_ids.py
# Copyright (c) 2023 Graphcore Ltd. All rights reserved.
import argparse
import copy
import os

import onnx

# Download model from huggingface
# - python -m transformers.onnx --model=csarron/bert-base-uncased-squad-v1 . --feature question-answering
# reference: https://huggingface.co/csarron/bert-base-uncased-squad-v1


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Preprocess Bert-Squad Model')
    parser.add_argument(
        '--input_model', type=str, default='', help='path of input model'
    )
    args = parser.parse_args()

    if not os.path.exists(args.input_model):
        parser.print_usage()
        raise FileNotFoundError(f'Unable to find model : {args.input_model}')

    model = onnx.load(args.input_model)

    # for packed bert, we need to export position_ids to model's input
    # step 1: remove unneeded nodes
    rm_node_names = [
        'Shape_7',
        'Gather_9',
        'Add_11',
        'Unsqueeze_12',
        'Slice_14',
        'Constant_8',
        'Constant_10',
        'Constant_13',
    ]
    rm_nodes = []
    for node in model.graph.node:
        if node.name in rm_node_names:
            rm_nodes.append(node)

    assert len(rm_node_names) == len(rm_nodes)

    for node in rm_nodes:
        model.graph.node.remove(node)

    # step 2: add position_ids to model's input
    position_ids = copy.deepcopy(model.graph.input[0])
    position_ids.name = 'position_ids'
    model.graph.input.append(position_ids)

    for node in model.graph.node:
        if node.op_type == 'Gather' and node.name == 'Gather_18':
            node.input[1] = position_ids.name

    print('Save preprocessed model to bert_base_squad_pos.onnx')
    onnx.save(model, 'bert_base_squad_pos.onnx')

Download add_position_ids.py
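
Assuming the export command above wrote model.onnx to the current directory, the script can be run as follows; it saves the result as bert_base_squad_pos.onnx:

python add_position_ids.py --input_model model.onnx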

Generate the model without packing:

poprt \
    --input_model bert_base_squad_pos.onnx \
    --output_model squad_bert_base_bs16_sl256.onnx \
    --precision fp16 \
    --input_shape input_ids=16,256 attention_mask=16,256 token_type_ids=16,256 position_ids=16,256

Generate the model with packing:

poprt \
    --input_model bert_base_squad_pos.onnx \
    --output_model squad_bert_base_bs16_sl256_pack.onnx \
    --precision fp16 \
    --input_shape input_ids=16,256 attention_mask=16,256 token_type_ids=16,256 position_ids=16,256 \
    --pack_args max_valid_num=40 segment_max_size=256

Here, max_valid_num specifies the maximum batch size after Unpacking, and segment_max_size specifies the maximum sequence length.
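
As a back-of-the-envelope estimate of what packing buys (assumed numbers matching the defaults of the example script below, not measurements):

batch_size, max_seq_len, max_valid_num = 16, 256, 40
avg_valid_len = 128  # --avg_seq_len default in packed_bert_example.py

slots = batch_size * max_seq_len                   # 4096 token slots per batch
padded = batch_size * avg_valid_len / slots        # ~50% useful work w/o packing
packed = min(max_valid_num * avg_valid_len, slots) / slots  # close to 100%
print(f"utilization: padded ~{padded:.0%}, packed up to ~{packed:.0%}")

With an average valid length of 128, a padded batch does useful work on only about half of its token slots, while a packed batch can approach full utilization. This is consistent with the roughly 1.5x throughput gain shown at the end of this section.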

Run the model

Run the model with the following command (in addition to the two models passed here, packed_bert_example.py also accepts --model_with_packing_attention_mask for the AttentionMask variant of the packed model):

python packed_bert_example.py \
    --model_with_packing_unpack_repack squad_bert_base_bs16_sl256_pack.onnx \
    --model_without_packing squad_bert_base_bs16_sl256.onnx

The complete code is as follows:

Listing 5.4 packed_bert_example.py
# Copyright (c) 2023 Graphcore Ltd. All rights reserved.
import argparse
import csv
import os
import queue
import sys
import tempfile
import time

from multiprocessing.pool import ThreadPool

import numpy as np
import packing_utils

from sklearn.metrics import mean_absolute_error

from poprt import runtime
from poprt.backend import get_session

np.random.seed(2023)
INPUT_IDS = "input_ids"
POSITION_IDS = "position_ids"
ATTENTION_MASK = "attention_mask"
TOKEN_TYPE_IDS = "token_type_ids"
UNPACK_INFO = "unpack_info"
OUTPUT2 = "start_logits"
OUTPUT1 = "end_logits"


class BertInputs(object):
    def __init__(
        self,
        input_ids,
        attention_mask,
        token_type_ids,
        position_ids,
        unpack_info,
        input_len,
    ):
        self.input_ids = input_ids
        self.attention_mask = attention_mask
        self.token_type_ids = token_type_ids
        self.position_ids = position_ids
        self.input_len = input_len
        self.unpack_info = unpack_info


def get_synthetic_data(args):
    input_len = np.random.normal(
        args.avg_seq_len, args.avg_seq_len, size=args.dataset_size
    ).astype(np.int32)
    input_len = np.clip(input_len, 1, args.max_seq_len)

    datasets = []
    for s_len in input_len:
        input_ids = np.random.randint(0, args.emb_size, (s_len)).astype(np.int32)

        attention_mask = np.ones(s_len).astype(np.int32)
        token_type_ids = np.random.randint(0, 2, (s_len)).astype(np.int32)

        position_ids = np.arange(s_len).astype(np.int32)
        unpack_info = np.zeros(args.max_valid_num).astype(np.int32)

        feature = BertInputs(
            input_ids, attention_mask, token_type_ids, position_ids, unpack_info, s_len
        )
        datasets.append(feature)

    return datasets


def dump_results(model_name, results):
    fieldnames = [OUTPUT1, OUTPUT2]
    filename = os.path.basename(model_name)[:-4] + 'csv'
    with open(filename, 'w') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        for result in results:
            dict_name2list = {
                OUTPUT1: result[OUTPUT1],
                OUTPUT2: result[OUTPUT2],
            }
            writer.writerow(dict_name2list)


# create batched inputs and pad samples to max_seq_len
def padding_data(datasets, index, args):
    feed_dicts = {}
    feed_dicts[INPUT_IDS] = np.zeros(
        (args.batch_size, args.max_seq_len), dtype=np.int32
    )
    feed_dicts[ATTENTION_MASK] = np.zeros(
        (args.batch_size, args.max_seq_len), dtype=np.int32
    )
    feed_dicts[POSITION_IDS] = np.zeros(
        (args.batch_size, args.max_seq_len), dtype=np.int32
    )
    feed_dicts[TOKEN_TYPE_IDS] = np.zeros(
        (args.batch_size, args.max_seq_len), dtype=np.int32
    )

    for i in range(args.batch_size):
        input_len = datasets[index].input_len
        feed_dicts[INPUT_IDS][i][:input_len] = datasets[index].input_ids
        feed_dicts[ATTENTION_MASK][i][:input_len] = datasets[index].attention_mask
        feed_dicts[POSITION_IDS][i][:input_len] = datasets[index].position_ids
        feed_dicts[TOKEN_TYPE_IDS][i][:input_len] = datasets[index].token_type_ids
        index = index + 1
    return feed_dicts


# online pack: the samples fed to the IPU can reach the maximum number of
# batches in each running turn
def run_packing_model_with_pack_runner_unpack_repack(args, datasets):
    tmpdir = tempfile.TemporaryDirectory()
    # export popef for PackRunner
    get_session(
        args.model_with_packing_unpack_repack,
        1,
        "poprt",
        output_dir=tmpdir.name,
        export_popef=True,
    ).load()
    config = runtime.PackRunnerConfig(
        timeout_microseconds=args.timeout_microseconds,
        # max_valid_num=args.max_valid_num,
        # dynamic_input_name=args.dynamic_input_name,
    )

    popef_path = tmpdir.name + '/executable.popef'
    # popef_path = "/popconverter/examples/packed_bert_example/executable.popef"
    pack_runner = runtime.PackRunner(popef_path, config)

    result_queue = queue.Queue()
    results = []
    start_time = time.time()
    for i in range(args.dataset_size):
        feed_dicts = {
            INPUT_IDS: datasets[i].input_ids,
            ATTENTION_MASK: datasets[i].attention_mask,
            TOKEN_TYPE_IDS: datasets[i].token_type_ids,
            POSITION_IDS: datasets[i].position_ids,
            # unpack_info should be hidden from user in the future
            UNPACK_INFO: np.zeros(args.max_valid_num).astype(np.int32),
        }
        out_dict = {
            OUTPUT1: np.zeros([args.max_seq_len]).astype(np.float16),
            OUTPUT2: np.zeros([args.max_seq_len]).astype(np.float16),
        }
        future = pack_runner.executeAsync(feed_dicts, out_dict)
        result_queue.put((future, out_dict))
    result_queue.put((None, None))
    while True:
        future, out_dict = result_queue.get()
        if future is None:
            break
        future.wait()
        results.append(out_dict)
    end_time = time.time()

    tput = args.dataset_size / (end_time - start_time)
    latency_ms = (end_time - start_time) / args.dataset_size
    print(
        f"[Pack Online Unpack Repack] Throughput: {tput} samples/s, Latency : {latency_ms * 1000} ms"
    )

    if args.dump_results:
        dump_results(
            "online_unpack_repack" + args.model_with_packing_unpack_repack, results
        )

    tmpdir.cleanup()
    return results


# offline pack: the samples fed to the IPU can reach the maximum number of
# batches in each running turn; model with pack / unpack ops
def run_packing_model_with_model_runner(args, datasets, model_path, across_rows):
    run_queue = queue.Queue()
    start_time = time.time()
    index = 0
    for i in range(0, args.dataset_size):
        transfer = packing_utils.pack_data(
            datasets,
            index,
            args.batch_size,
            seq_len=256,
            max_valid_num=args.max_valid_num,
            segment_num=1,
            across_rows=across_rows,
        )

        run_queue.put(transfer)
        index = transfer.count
        if index == args.dataset_size:
            break
    run_queue.put(None)
    duration_of_packing = time.time() - start_time
    mean_latency_of_packing_us = duration_of_packing * 1e6 / args.dataset_size

    print(f"Mean latency of packing data: {mean_latency_of_packing_us} us/sam")
    print(f"Total latency of packing data: {duration_of_packing} s")

    sess = get_session(model_path, 1, "poprt").load()

    pool = ThreadPool(processes=1)

    def execute(feed_dicts, valid_num):
        outputs = sess.run([OUTPUT1, OUTPUT2], feed_dicts)
        res = []
        if across_rows:
            for i in range(valid_num):
                res1 = outputs[0][i].copy().tolist()
                res2 = outputs[1][i].copy().tolist()
                res.append({OUTPUT1: res1, OUTPUT2: res2})
        else:
            outlen = len(outputs[0][0])
            for index in range(len(feed_dicts[ATTENTION_MASK])):
                start = 0
                arr = np.array(feed_dicts[ATTENTION_MASK][index])
                while start < outlen and arr[start] > 0:
                    arr = arr - 1
                    zero_num = len(arr) - np.count_nonzero(arr)
                    out1 = [0] * outlen
                    out2 = [0] * outlen
                    out1[:zero_num] = outputs[0][index][start : start + zero_num]
                    out2[:zero_num] = outputs[1][index][start : start + zero_num]
                    res.append({OUTPUT1: out1, OUTPUT2: out2})
                    start += zero_num
        return res

    asy_results = []

    total_start_time = time.time()
    while True:
        input_data = run_queue.get()
        if input_data is None:
            break

        feed_dicts = {
            INPUT_IDS: input_data.data[INPUT_IDS],
            ATTENTION_MASK: input_data.data[ATTENTION_MASK],
            TOKEN_TYPE_IDS: input_data.data[TOKEN_TYPE_IDS],
            POSITION_IDS: input_data.data[POSITION_IDS],
            # unpack_info should be hidden from user in the future
            UNPACK_INFO: input_data.unpack_info,
        }
        if not across_rows:
            feed_dicts.pop(UNPACK_INFO)

        valid_num = len(input_data.specs)
        async_result = pool.apply_async(execute, (feed_dicts, valid_num))
        asy_results.append(async_result)

    results = []
    for asy in asy_results:
        for res in asy.get():
            results.append(res)
    total_end_time = time.time()

    tput = len(results) / (total_end_time - total_start_time)
    latency = (total_end_time - total_start_time) / len(results)
    if across_rows:
        print(
            f"[Pack Offline Unpack Repack] Throughput: {tput} samples/s, Latency: {latency * 1000} ms"
        )
    else:
        print(
            f"[Pack Offline AttentionMask] Throughput: {tput} samples/s, Latency: {latency * 1000} ms"
        )

    if args.dump_results:
        dump_results("offline_" + model_path, results)

    return results


# online pack: the samples fed to the IPU can reach the maximum number of
# batches in each running turn; the model only adds an AttentionMask op in this mode
def run_packing_model_with_pack_runner_attention_mask(args, datasets, algo):
    tmpdir = tempfile.TemporaryDirectory()
    # export popef for PackRunner
    get_session(
        args.model_with_packing_attention_mask,
        1,
        "poprt",
        output_dir=tmpdir.name,
        export_popef=True,
    ).load()
    config = runtime.PackRunnerConfig(
        timeout_microseconds=args.timeout_microseconds,
        max_valid_num=args.max_valid_num,
        dynamic_input_name=args.dynamic_input_name,
    )

    if algo == "next_fit":
        config.algorithom = runtime.PackAlgorithm.next_fit
    else:
        config.algorithom = runtime.PackAlgorithm.first_fit

    config.enable_input_single_row_mode("attention_mask")
    popef_path = tmpdir.name + '/executable.popef'
    # popef_path = "/popconverter/examples/packed_bert_example/executable.popef"
    pack_runner = runtime.PackRunner(popef_path, config)

    result_queue = queue.Queue()
    results = []
    start_time = time.time()
    for i in range(args.dataset_size):
        feed_dicts = {
            INPUT_IDS: datasets[i].input_ids,
            ATTENTION_MASK: datasets[i].attention_mask,
            TOKEN_TYPE_IDS: datasets[i].token_type_ids,
            POSITION_IDS: datasets[i].position_ids,
        }
        out_dict = {
            OUTPUT1: np.zeros([args.max_seq_len]).astype(np.float16),
            OUTPUT2: np.zeros([args.max_seq_len]).astype(np.float16),
        }
        future = pack_runner.executeAsync(feed_dicts, out_dict)
        result_queue.put((future, out_dict))
    result_queue.put((None, None))
    while True:
        future, out_dict = result_queue.get()
        if future is None:
            break
        future.wait()
        results.append(out_dict)
    end_time = time.time()

    tput = args.dataset_size / (end_time - start_time)
    latency_ms = (end_time - start_time) / args.dataset_size
    print(
        f"[Pack Online AttentionMask({algo})] Throughput: {tput} samples/s, Latency : {latency_ms * 1000} ms"
    )

    if args.dump_results:
        dump_results(
            "online_attention_mask_"
            + algo
            + "_"
            + args.model_with_packing_attention_mask,
            results,
        )

    tmpdir.cleanup()
    return results


# no pack: pad each row with 0 if the input is not long enough;
# the number of samples equals the batch size in every running turn
def run_original_model_with_model_runner(args, datasets):
    run_queue = queue.Queue()
    start_time = time.time()
    for i in range(0, args.dataset_size, args.batch_size):
        feed_dicts = padding_data(datasets, i, args)
        run_queue.put((args.batch_size, feed_dicts))
    run_queue.put((0, None))
    duration_of_padding_s = time.time() - start_time

    mean_latency_of_padding_us = duration_of_padding_s * 1e6 / args.dataset_size
    print(f"Mean latency of padding data: {mean_latency_of_padding_us} us/sam")
    print(f"Total latency of padding data: {duration_of_padding_s} s")

    sess = get_session(args.model_without_packing, 1, "poprt").load()

    asy_results = []

    def execute(feed_dicts, valid_num):
        outputs = sess.run([OUTPUT1, OUTPUT2], feed_dicts)
        res = []
        for i in range(valid_num):
            res1 = outputs[0][i].copy().tolist()
            res2 = outputs[1][i].copy().tolist()
            res.append({OUTPUT1: res1, OUTPUT2: res2})
        return res

    # execute
    pool = ThreadPool(processes=1)
    total_start_time = time.time()
    while True:
        valid_num, feed_dicts = run_queue.get()
        if feed_dicts is None:
            break
        async_result = pool.apply_async(execute, (feed_dicts, valid_num))
        asy_results.append(async_result)
    results = []
    for asy in asy_results:
        for res in asy.get():
            results.append(res)
    total_end_time = time.time()

    tput = len(results) / (total_end_time - total_start_time)
    latency = (total_end_time - total_start_time) / len(results)

    if args.dump_results:
        dump_results("original_" + args.model_without_packing, results)

    print(f"[Original] Throughput: {tput} samples/s, Latency: {latency * 1000} ms")

    return results


def calculate_mae(expected_results, output_results, datasets, enable_debug):
    assert len(datasets) == len(expected_results)
    assert len(datasets) == len(output_results)
    maes = []
    zipped_data = zip(datasets, expected_results, output_results)
    for i, (data, expected, output) in enumerate(zipped_data):
        np.testing.assert_equal(len(expected), len(output))
        input_len = data.input_len
        output_1_mae = mean_absolute_error(
            expected[OUTPUT1][:input_len], output[OUTPUT1][:input_len]
        )
        output_2_mae = mean_absolute_error(
            expected[OUTPUT2][:input_len], output[OUTPUT2][:input_len]
        )
        maes.append([i, output_1_mae, output_2_mae])

    k = 10 if len(datasets) > 10 else len(datasets)

    def print_topk(k, out_name, out_index):
        for i in range(1, k + 1):
            print(f"Sample: {maes[-i][0]}, {out_name} mae : {maes[-i][out_index]}")

    if enable_debug:
        maes.sort(key=lambda e: e[1])
        print(f"\n***** Top {k} mae of output: {OUTPUT1} *****")
        print_topk(k, OUTPUT1, 1)

        maes.sort(key=lambda e: e[2])
        print(f"\n***** Top {k} mae of output: {OUTPUT2} *****")
        print_topk(k, OUTPUT2, 2)

    print(f"{OUTPUT1} average mae: {np.mean(maes, axis=0)[1]}")
    print(f"{OUTPUT2} average mae: {np.mean(maes, axis=0)[2]}")


def main():
    parser = argparse.ArgumentParser(description='packed bert-base-squad')
    parser.add_argument(
        '--avg_seq_len', type=int, default=128, help='average sequence length of input'
    )
    parser.add_argument(
        '--batch_size', type=int, default=16, help='batch size of model'
    )
    parser.add_argument('--dump_results', action='store_true', help='dump results')
    parser.add_argument(
        '--dynamic_input_name', type=str, default=INPUT_IDS, help='dynamic input name'
    )
    parser.add_argument(
        '--emb_size', type=int, default=30522, help='word embedding table size'
    )
    parser.add_argument(
        '--enable_debug', action='store_true', help='enable output debug info'
    )
    parser.add_argument(
        '--iterations', type=int, default=100, help='number of batches to run'
    )
    parser.add_argument(
        '--max_seq_len', type=int, default=256, help='max sequence length of input'
    )
    parser.add_argument(
        '--max_valid_num', type=int, default=40, help='max valid num for pack'
    )
    parser.add_argument(
        '--model_without_packing', help='model without pack, unpack, repack op'
    )
    parser.add_argument(
        '--model_with_packing_unpack_repack',
        help='model with pack, unpack, repack op converted by PopRT',
    )
    parser.add_argument(
        '--model_with_packing_attention_mask',
        help='model with AttentionMask op converted by PopRT',
    )
    parser.add_argument(
        '--timeout_microseconds',
        type=int,
        default=15000,
        help='timeout in microseconds',
    )

    args = parser.parse_args()
    args.dataset_size = args.iterations * args.batch_size

    # generate synthetic dataset
    datasets = get_synthetic_data(args)
    original_result = run_original_model_with_model_runner(args, datasets)

    offline_pack_result_unpack_repack = run_packing_model_with_model_runner(
        args, datasets, args.model_with_packing_unpack_repack, True
    )
    online_pack_result_unpack_repack = run_packing_model_with_pack_runner_unpack_repack(
        args, datasets
    )

    offline_pack_result_attention_mask = run_packing_model_with_model_runner(
        args, datasets, args.model_with_packing_attention_mask, False
    )
    online_pack_result_attention_mask_first_fit = (
        run_packing_model_with_pack_runner_attention_mask(args, datasets, "first_fit")
    )
    online_pack_result_attention_mask_next_fit = (
        run_packing_model_with_pack_runner_attention_mask(args, datasets, "next_fit")
    )
    # compare the results
    print("\nCompare results between original and online pack (with unpack repack)")
    calculate_mae(
        original_result, online_pack_result_unpack_repack, datasets, args.enable_debug
    )
    print("\nCompare results between offline and online pack with unpack repack op")
    calculate_mae(
        offline_pack_result_unpack_repack,
        online_pack_result_unpack_repack,
        datasets,
        args.enable_debug,
    )

    print(
        "\nCompare results between original and online_first_fit pack with attention_mask op"
    )
    calculate_mae(
        original_result,
        online_pack_result_attention_mask_first_fit,
        datasets,
        args.enable_debug,
    )
    print(
        "\nCompare results between original and online_next_fit pack with attention_mask op"
    )
    calculate_mae(
        original_result,
        online_pack_result_attention_mask_next_fit,
        datasets,
        args.enable_debug,
    )

    print(
        "\nCompare results between offline and online_next_fit pack with attention_mask op"
    )
    calculate_mae(
        offline_pack_result_attention_mask,
        online_pack_result_attention_mask_next_fit,
        datasets,
        args.enable_debug,
    )


if __name__ == "__main__":
    sys.exit(main())

Download packed_bert_example.py

When the run completes, it will print output similar to the following:

[Original] Throughput: 1860.9792005501781 samples/s, Latency: 0.5373515188694 ms
....
[Pack Offline] Throughput: 2830.8140869025283 samples/s, Latency: 0.3532552719116211 ms
....
[Pack Online] Throughput: 2782.587696947809 samples/s, Latency : 0.3593777120113373 ms
....