5.5. Packing

5.5.1. Background

The IPU currently supports only static graphs: the input shapes of a model must be fixed, and a change of shape triggers recompilation. In real applications, however, and especially in natural language processing, the input sequence length is often dynamic. The conventional workaround is to pad all variable-length inputs to the maximum sequence length before feeding them to the model, but this introduces a large amount of useless computation and lowers the effective utilization of the hardware. On the IPU, packing can be used to support dynamic sequence lengths and raise compute utilization.

5.5.2. Packing and unpacking

The following example illustrates packing and unpacking. Assume the maximum input length of the model is 8 and the batch size is 4, and that there are currently 7 requests of batch size 1 with lengths ranging from 1 to 7, where 0 denotes padded, invalid data. Packing and unpacking then work as shown in the figure below:

../_images/packing_unpacking.png

Fig. 5.6 Packing and unpacking
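
To make the figure concrete, here is a minimal NumPy sketch of the same idea, using the shapes above (max length 8, batch size 4, request lengths 1 to 7). The first-fit-decreasing placement and the slots bookkeeping are illustrative assumptions, not the PopRT implementation. Padding these 7 requests would occupy 7 * 8 = 56 slots for 28 valid tokens (50% utilization), while packing fits the same 28 tokens into a 4 x 8 batch (87.5%):

import numpy as np

max_seq_len, batch_size = 8, 4
requests = [np.full(n, n, dtype=np.int32) for n in range(1, 8)]  # lengths 1..7

packed = np.zeros((batch_size, max_seq_len), dtype=np.int32)
used = [0] * batch_size            # number of valid tokens already in each row
slots = [None] * len(requests)     # (row, offset, length) per request

# First-fit decreasing: place longer requests first so that all of them fit.
for idx in sorted(range(len(requests)), key=lambda i: -len(requests[i])):
    req = requests[idx]
    for row in range(batch_size):
        if used[row] + len(req) <= max_seq_len:
            packed[row, used[row] : used[row] + len(req)] = req
            slots[idx] = (row, used[row], len(req))
            used[row] += len(req)
            break

# Unpacking reverses the bookkeeping and recovers every original request.
unpacked = [packed[row, off : off + n] for row, off, n in slots]
assert all(np.array_equal(a, b) for a, b in zip(requests, unpacked))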

5.5.3. Transformer-based NLP models

Since it was proposed in 2017, the Transformer architecture has spread from its original home in NLP to fields such as ASR, CV and DLRM. A Transformer consists of an Encoder and a Decoder; this section is only concerned with the Encoder, whose structure is shown below:

../_images/transformer_encoder.png

Fig. 5.7 Transformer Encoder

Taking BERT as an example, the input of the Transformer Encoder usually has shape (batch_size, seq_len, hidden_size). Within the Encoder, every module except Multi-Head Attention computes only along the last dimension, so for those modules packing removes the useless computation. Multi-Head Attention, however, computes correlations between tokens, so unless the attention mask is modified it must run on unpacked data: the activations are unpacked before attention and packed again once attention has completed. The computation flow can be expressed in pseudocode as follows:

packed_input from host
activation = packed_input
for encoder in encoders:
    Unpacking
    Attention
    Packing
    Add & LayerNorm
    Feed-Forward
    Add & LayerNorm
    Update activation
Unpacking
unpacked_output to host
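
Why attention is the only module that needs unpacking can be shown with a small NumPy sketch (illustrative only; it assumes the per-request (row, offset, length) slots recorded during packing, and does not reflect how the PopRT Pack/Unpack ops are implemented). Element-wise modules such as LayerNorm and the feed-forward block operate on the last dimension only and can run directly on the packed (batch_size, seq_len, hidden_size) tensor, while attention must be restricted to one request at a time so that tokens from different requests do not attend to each other:

import numpy as np

def attention(x):
    # Toy single-head self-attention over one request; x: (length, hidden).
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.exp(scores - scores.max(-1, keepdims=True))
    weights = scores / scores.sum(-1, keepdims=True)
    return weights @ x

def attention_on_packed(packed, slots):
    # packed: (batch_size, seq_len, hidden); slots: one (row, offset, length)
    # entry per request, recorded when the inputs were packed on the host.
    out = np.zeros_like(packed)
    for row, off, n in slots:
        seq = packed[row, off : off + n]          # unpack a single request
        out[row, off : off + n] = attention(seq)  # tokens attend only within it
    return out                                    # the result stays packed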

5.5.4. How to use packing

This section uses Bert-Base-Squad as an example; the OS used is Ubuntu 20.04 with Python 3.8.15. For the complete example, refer to packed_bert_example.

Download the model

Before downloading the model, install the dependencies:

pip install torch==1.10.0
pip install transformers[onnx]==4.25.1

Then download the model with the following command:

python -m transformers.onnx --model=csarron/bert-base-uncased-squad-v1 . --feature question-answering

Convert the model

The model downloaded by the command above does not include position_ids among its inputs. When packing is used on the IPU, the inputs must first be packed on the host, so position_ids has to be added to the model inputs. The code is as follows:

Listing 5.3 add_position_ids.py
# Copyright (c) 2023 Graphcore Ltd. All rights reserved.
import argparse
import copy
import os

import onnx

# Download model from huggingface
# - python -m transformers.onnx --model=csarron/bert-base-uncased-squad-v1 . --feature question-answering
# reference: https://huggingface.co/csarron/bert-base-uncased-squad-v1


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Preprocess Bert-Squad Model')
    parser.add_argument(
        '--input_model', type=str, default='', help='path of input model'
    )
    args = parser.parse_args()

    if not os.path.exists(args.input_model):
        parser.print_usage()
        raise FileNotFoundError(f'Unable to find model: {args.input_model}')

    model = onnx.load(args.input_model)

    # for packed bert, we need to export position_ids to model's input
    # step 1: remove unneeded nodes
    rm_node_names = [
        'Shape_7',
        'Gather_9',
        'Add_11',
        'Unsqueeze_12',
        'Slice_14',
        'Constant_8',
        'Constant_10',
        'Constant_13',
    ]
    rm_nodes = []
    for node in model.graph.node:
        if node.name in rm_node_names:
            rm_nodes.append(node)

    assert len(rm_node_names) == len(rm_nodes)

    for node in rm_nodes:
        model.graph.node.remove(node)

    # step 2: add position_ids to model's input
    position_ids = copy.deepcopy(model.graph.input[0])
    position_ids.name = 'position_ids'
    model.graph.input.append(position_ids)

    for node in model.graph.node:
        if node.op_type == 'Gather' and node.name == 'Gather_18':
            node.input[1] = position_ids.name

    print('Save preprocessed model to bert_base_squad_pos.onnx')
    onnx.save(model, 'bert_base_squad_pos.onnx')

Download add_position_ids.py

Generate the model without packing:

poprt \
    --input_model bert_base_squad_pos.onnx \
    --output_model squad_bert_base_bs16_sl256.onnx \
    --precision fp16 \
    --input_shape input_ids=16,256 attention_mask=16,256 token_type_ids=16,256 position_ids=16,256

Generate the model with packing:

poprt \
    --input_model bert_base_squad_pos.onnx \
    --output_model squad_bert_base_bs16_sl256_pack.onnx \
    --precision fp16 \
    --input_shape input_ids=16,256 attention_mask=16,256 token_type_ids=16,256 position_ids=16,256 \
    --pack_args max_valid_num=40 segment_max_size=256

Here, max_valid_num specifies the maximum batch size after unpacking, i.e. the maximum number of sequences one packed batch may contain, and segment_max_size specifies the maximum sequence length. For example, with an input shape of 16x256 a packed batch provides 16 * 256 = 4096 token slots, and max_valid_num=40 caps the number of sequences packed into it at 40.

Run the model

Run the models with the following command (the script also accepts --model_with_packing_attention_mask for the AttentionMask-based packing variant):

python packed_bert_example.py \
    --model_with_packing_unpack_repack squad_bert_base_bs16_sl256_pack.onnx \
    --model_without_packing squad_bert_base_bs16_sl256.onnx

The complete code is shown below:

Listing 5.4 packed_bert_example.py
# Copyright (c) 2023 Graphcore Ltd. All rights reserved.
import argparse
import csv
import os
import queue
import sys
import tempfile
import threading
import time

from multiprocessing.pool import ThreadPool
from queue import Queue

import numpy as np
import packing_utils

from sklearn.metrics import mean_absolute_error

from poprt import runtime
from poprt.backend import get_session

np.random.seed(2023)
INPUT_IDS = "input_ids"
POSITION_IDS = "position_ids"
ATTENTION_MASK = "attention_mask"
TOKEN_TYPE_IDS = "token_type_ids"
UNPACK_INFO = "unpack_info"
OUTPUT2 = "start_logits"
OUTPUT1 = "end_logits"


class BertInputs(object):
    def __init__(
        self,
        input_ids,
        attention_mask,
        token_type_ids,
        position_ids,
        unpack_info,
        input_len,
    ):
        self.input_ids = input_ids
        self.attention_mask = attention_mask
        self.token_type_ids = token_type_ids
        self.position_ids = position_ids
        self.input_len = input_len
        self.unpack_info = unpack_info


def get_synthetic_data(args):
    input_len = np.random.normal(
        args.avg_seq_len, args.avg_seq_len, size=args.dataset_size
    ).astype(np.int32)
    input_len = np.clip(input_len, 1, args.max_seq_len)

    datasets = []
    for s_len in input_len:
        input_ids = np.random.randint(0, args.emb_size, (s_len)).astype(np.int32)

        attention_mask = np.ones(s_len).astype(np.int32)
        token_type_ids = np.random.randint(0, 2, (s_len)).astype(np.int32)

        position_ids = np.arange(s_len).astype(np.int32)
        unpack_info = np.zeros(args.max_valid_num).astype(np.int32)

        feature = BertInputs(
            input_ids, attention_mask, token_type_ids, position_ids, unpack_info, s_len
        )
        datasets.append(feature)

    return datasets


def dump_results(model_name, results):
    fieldnames = [OUTPUT1, OUTPUT2]
    filename = os.path.basename(model_name)[:-4] + 'csv'
    with open(filename, 'w') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        for result in results:
            dict_name2list = {
                OUTPUT1: result[OUTPUT1],
                OUTPUT2: result[OUTPUT2],
            }
            writer.writerow(dict_name2list)


# create batched inputs and pad each sample to max_seq_len
def padding_data(datasets, index, args):
    feed_dicts = {}
    feed_dicts[INPUT_IDS] = np.zeros(
        (args.batch_size, args.max_seq_len), dtype=np.int32
    )
    feed_dicts[ATTENTION_MASK] = np.zeros(
        (args.batch_size, args.max_seq_len), dtype=np.int32
    )
    feed_dicts[POSITION_IDS] = np.zeros(
        (args.batch_size, args.max_seq_len), dtype=np.int32
    )
    feed_dicts[TOKEN_TYPE_IDS] = np.zeros(
        (args.batch_size, args.max_seq_len), dtype=np.int32
    )

    for i in range(args.batch_size):
        input_len = datasets[index].input_len
        feed_dicts[INPUT_IDS][i][:input_len] = datasets[index].input_ids
        feed_dicts[ATTENTION_MASK][i][:input_len] = datasets[index].attention_mask
        feed_dicts[POSITION_IDS][i][:input_len] = datasets[index].position_ids
        feed_dicts[TOKEN_TYPE_IDS][i][:input_len] = datasets[index].token_type_ids
        index = index + 1
    return feed_dicts


# online pack: in each run, the number of samples fed to the IPU can reach
# the maximum that fits in a batch
def run_packing_model_with_pack_runner_unpack_repack(args, datasets):
    tmpdir = tempfile.TemporaryDirectory()
    # export popef for PackRunner
    get_session(
        args.model_with_packing_unpack_repack,
        1,
        "poprt",
        output_dir=tmpdir.name,
        export_popef=True,
    ).load()
    config = runtime.PackRunnerConfig(
        timeout_microseconds=args.timeout_microseconds,
        # max_valid_num=args.max_valid_num,
        # dynamic_input_name=args.dynamic_input_name,
    )

    popef_path = tmpdir.name + '/executable.popef'
    # popef_path = "/popconverter/examples/packed_bert_example/executable.popef"
    pack_runner = runtime.Runner(popef_path, config)

    result_queue = queue.Queue()
    results = []
    start_time = time.time()
    for i in range(args.dataset_size):
        feed_dicts = {
            INPUT_IDS: datasets[i].input_ids,
            ATTENTION_MASK: datasets[i].attention_mask,
            TOKEN_TYPE_IDS: datasets[i].token_type_ids,
            POSITION_IDS: datasets[i].position_ids,
            # unpack_info should be hidden from user in the future
            UNPACK_INFO: np.zeros(args.max_valid_num).astype(np.int32),
        }
        out_dict = {
            OUTPUT1: np.zeros([args.max_seq_len]).astype(np.float16),
            OUTPUT2: np.zeros([args.max_seq_len]).astype(np.float16),
        }
        future = pack_runner.execute_async(feed_dicts, out_dict)
        result_queue.put((future, out_dict))
    result_queue.put((None, None))
    while True:
        future, out_dict = result_queue.get()
        if future is None:
            break
        future.wait()
        results.append(out_dict)
    end_time = time.time()

    tput = args.dataset_size / (end_time - start_time)
    latency_ms = (end_time - start_time) / args.dataset_size
    print(
        f"[Pack Online Unpack Repack] Throughput: {tput} samples/s, Latency : {latency_ms * 1000} ms"
    )

    if args.dump_results:
        dump_results(
            "online_unpack_repack" + args.model_with_packing_unpack_repack, results
        )

    tmpdir.cleanup()
    return results


# offline pack: in each run, the number of samples fed to the IPU can reach
# the maximum that fits in a batch; the model contains pack/unpack/repack ops
def run_packing_model_with_model_runner(args, datasets, model_path, across_rows):
    run_queue = queue.Queue()
    start_time = time.time()
    index = 0
    for i in range(0, args.dataset_size):
        transfer = packing_utils.pack_data(
            datasets,
            index,
            args.batch_size,
            seq_len=256,
            max_valid_num=args.max_valid_num,
            segment_num=1,
            across_rows=across_rows,
        )

        run_queue.put(transfer)
        index = transfer.count
        if index == args.dataset_size:
            break
    run_queue.put(None)
    duration_of_packing = time.time() - start_time
    mean_latency_of_padding_us = duration_of_packing * 1e6 / args.dataset_size

    print(f"Mean latency of packing data: {mean_latency_of_padding_us} us/sam")
    print(f"Total latency of packing data: {duration_of_packing} s")

    sess = get_session(model_path, 1, "poprt").load()

    pool = ThreadPool(processes=1)

    def execute(feed_dicts, valid_num):
        outputs = sess.run([OUTPUT1, OUTPUT2], feed_dicts)
        res = []
        if across_rows:
            for i in range(valid_num):
                res1 = outputs[0][i].copy().tolist()
                res2 = outputs[1][i].copy().tolist()
                res.append({OUTPUT1: res1, OUTPUT2: res2})
        else:
            outlen = len(outputs[0][0])
            for index in range(len(feed_dicts[ATTENTION_MASK])):
                start = 0
                arr = np.array(feed_dicts[ATTENTION_MASK][index])
                while start < outlen and arr[start] > 0:
                    arr = arr - 1
                    zero_num = len(arr) - np.count_nonzero(arr)
                    out1 = [0] * outlen
                    out2 = [0] * outlen
                    out1[:zero_num] = outputs[0][index][start : start + zero_num]
                    out2[:zero_num] = outputs[1][index][start : start + zero_num]
                    res.append({OUTPUT1: out1, OUTPUT2: out2})
                    start += zero_num
        return res

    asy_results = []

    total_start_time = time.time()
    while True:
        input_data = run_queue.get()
        if input_data is None:
            break

        feed_dicts = {
            INPUT_IDS: input_data.data[INPUT_IDS],
            ATTENTION_MASK: input_data.data[ATTENTION_MASK],
            TOKEN_TYPE_IDS: input_data.data[TOKEN_TYPE_IDS],
            POSITION_IDS: input_data.data[POSITION_IDS],
            # unpack_info should be hidden from user in the future
            UNPACK_INFO: input_data.unpack_info,
        }
        if not across_rows:
            feed_dicts.pop(UNPACK_INFO)

        valid_num = len(input_data.specs)
        async_result = pool.apply_async(execute, (feed_dicts, valid_num))
        asy_results.append(async_result)

    results = []
    for asy in asy_results:
        for res in asy.get():
            results.append(res)
    total_end_time = time.time()

    tput = len(results) / (total_end_time - total_start_time)
    latency = (total_end_time - total_start_time) / len(results)
    if across_rows:
        print(
            f"[Pack Offline Unpack Repack] Throughput: {tput} samples/s, Latency: {latency*1000} ms"
        )
    else:
        print(
            f"[Pack Offline AttentionMask] Throughput: {tput} samples/s, Latency: {latency*1000} ms"
        )

    if args.dump_results:
        dump_results("offline_" + model_path, results)

    return results


# online pack: in each run, the number of samples fed to the IPU can reach
# the maximum that fits in a batch; the model only adds an AttentionMask op
def run_packing_model_with_pack_runner_attention_mask(args, datasets, algo):
    tmpdir = tempfile.TemporaryDirectory()
    # export popef for PackRunner
    get_session(
        args.model_with_packing_attention_mask,
        1,
        "poprt",
        output_dir=tmpdir.name,
        export_popef=True,
    ).load()
    config = runtime.PackRunnerConfig(
        timeout_microseconds=args.timeout_microseconds,
        max_valid_num=args.max_valid_num,
        dynamic_input_name=args.dynamic_input_name,
    )

    if algo == "next_fit":
        config.algorithm = runtime.PackAlgorithm.next_fit
    else:
        config.algorithm = runtime.PackAlgorithm.first_fit

    config.enable_input_single_row_mode("attention_mask")
    popef_path = tmpdir.name + '/executable.popef'
    # popef_path = "/popconverter/examples/packed_bert_example/executable.popef"
    pack_runner = runtime.Runner(popef_path, config)

    result_queue = queue.Queue()
    results = []
    start_time = time.time()
    for i in range(args.dataset_size):
        feed_dicts = {
            INPUT_IDS: datasets[i].input_ids,
            ATTENTION_MASK: datasets[i].attention_mask,
            TOKEN_TYPE_IDS: datasets[i].token_type_ids,
            POSITION_IDS: datasets[i].position_ids,
        }
        out_dict = {
            OUTPUT1: np.zeros([args.max_seq_len]).astype(np.float16),
            OUTPUT2: np.zeros([args.max_seq_len]).astype(np.float16),
        }
        future = pack_runner.execute_async(feed_dicts, out_dict)
        result_queue.put((future, out_dict))
    result_queue.put((None, None))
    while True:
        future, out_dict = result_queue.get()
        if future is None:
            break
        future.wait()
        results.append(out_dict)
    end_time = time.time()

    tput = args.dataset_size / (end_time - start_time)
    latency_ms = (end_time - start_time) / args.dataset_size
    print(
        f"[Pack Online AttentionMask({algo})] Throughput: {tput} samples/s, Latency : {latency_ms * 1000} ms"
    )

    if args.dump_results:
        dump_results(
            "online_attention_mask_"
            + algo
            + "_"
            + args.model_with_packing_attention_mask,
            results,
        )

    tmpdir.cleanup()
    return results


def latency_distribution_with_pack_runner_attention_mask(args, datasets, algo):
    tmpdir = tempfile.TemporaryDirectory()
    # export popef for PackRunner
    get_session(
        args.model_with_packing_attention_mask,
        1,
        "poprt",
        output_dir=tmpdir.name,
        export_popef=True,
    ).load()
    config = runtime.PackRunnerConfig(
        timeout_microseconds=args.timeout_microseconds,
        max_valid_num=args.max_valid_num,
        dynamic_input_name=args.dynamic_input_name,
    )

    if algo == "next_fit":
        config.algorithm = runtime.PackAlgorithm.next_fit
    else:
        config.algorithm = runtime.PackAlgorithm.first_fit

    config.enable_input_single_row_mode("attention_mask")
    popef_path = tmpdir.name + '/executable.popef'
    # popef_path = "/popconverter/examples/packed_bert_example/executable.popef"
    pack_runner = runtime.Runner(popef_path, config)

    sample_num = args.batch_size * args.iterations
    clients = int(args.batch_size * 3.5)
    count_percent = 0.6

    q = Queue()

    def perf_count(model_runner, iteration):
        durations = []
        for i in range(sample_num):
            start_time = time.time()
            feed_dicts = {
                INPUT_IDS: datasets[i].input_ids,
                ATTENTION_MASK: datasets[i].attention_mask,
                TOKEN_TYPE_IDS: datasets[i].token_type_ids,
            }
            out_dict = {
                OUTPUT1: np.zeros([args.max_seq_len]).astype(np.float16),
                OUTPUT2: np.zeros([args.max_seq_len]).astype(np.float16),
            }
            pack_runner.execute(feed_dicts, out_dict)
            end_time = time.time()
            durations.append((start_time, end_time))
        # exclude the first and last samples from the latency statistics
        ignored_samples = int(sample_num * (1 - count_percent) / 2)
        durations = durations[ignored_samples:-ignored_samples]
        q.put(durations, timeout=10)

    thp = [
        threading.Thread(target=perf_count, args=(pack_runner, args.iterations))
        for _ in range(clients)
    ]
    for t in thp:
        t.start()
    for t in thp:
        t.join()

    durations_from_th = []
    while not q.empty():
        durations_from_th += q.get()
    max_timestamp = max(y for _, y in durations_from_th)
    min_timestamp = min(x for x, _ in durations_from_th)
    tput = clients * (sample_num * count_percent) / (max_timestamp - min_timestamp)
    times_range = [y - x for x, y in durations_from_th]

    times_range.sort()
    tail_latency = round(times_range[int(len(times_range) * 0.99)] * 1000, 2)
    avg_latency = round(sum(times_range) / len(times_range) * 1000, 2)

    print(
        f"Throughput: {tput} samples/s, Average Latency: {avg_latency}ms, P99 latency: {tail_latency}ms."
    )
    return tail_latency, avg_latency


# no pack: pad each row with 0 when the input is not long enough;
# the number of samples equals the batch size in every run
def run_original_model_with_model_runner(args, datasets):
    run_queue = queue.Queue()
    start_time = time.time()
    for i in range(0, args.dataset_size, args.batch_size):
        feed_dicts = padding_data(datasets, i, args)
        run_queue.put((args.batch_size, feed_dicts))
    run_queue.put((0, None))
    duration_of_padding_s = time.time() - start_time

    mean_latency_of_padding_us = duration_of_padding_s * 1e6 / args.dataset_size
    print(f"Mean latency of padding data: {mean_latency_of_padding_us} us/sam")
    print(f"Total latency of padding data: {duration_of_padding_s} s")

    sess = get_session(args.model_without_packing, 1, "poprt").load()

    asy_results = []

    def execute(feed_dicts, valid_num):
        outputs = sess.run([OUTPUT1, OUTPUT2], feed_dicts)
        res = []
        for i in range(valid_num):
            res1 = outputs[0][i].copy().tolist()
            res2 = outputs[1][i].copy().tolist()
            res.append({OUTPUT1: res1, OUTPUT2: res2})
        return res

    # execute
    pool = ThreadPool(processes=1)
    total_start_time = time.time()
    while True:
        valid_num, feed_dicts = run_queue.get()
        if feed_dicts is None:
            break
        async_result = pool.apply_async(execute, (feed_dicts, valid_num))
        asy_results.append(async_result)
    results = []
    for asy in asy_results:
        for res in asy.get():
            results.append(res)
    total_end_time = time.time()

    tput = len(results) / (total_end_time - total_start_time)
    latency = (total_end_time - total_start_time) / len(results)

    if args.dump_results:
        dump_results("original_" + args.model_without_packing, results)

    print(f"[Original] Throughput: {tput} samples/s, Latency: {latency *1000} ms")

    return results


def calculate_mae(expected_results, output_results, datasets, enable_debug):
    assert len(datasets) == len(expected_results)
    assert len(datasets) == len(output_results)
    maes = []
    zipped_data = zip(datasets, expected_results, output_results)
    for i, (data, expected, output) in enumerate(zipped_data):
        np.testing.assert_equal(len(expected), len(output))
        input_len = data.input_len
        output_1_mae = mean_absolute_error(
            expected[OUTPUT1][:input_len], output[OUTPUT1][:input_len]
        )
        output_2_mae = mean_absolute_error(
            expected[OUTPUT2][:input_len], output[OUTPUT2][:input_len]
        )
        maes.append([i, output_1_mae, output_2_mae])

    k = 10 if len(datasets) > 10 else len(datasets)

    def print_topk(k, out_name, out_index):
        for i in range(1, k + 1):
            print(f"Sample: {maes[-i][0]}, {out_name} mae : {maes[-i][out_index]}")

    if enable_debug:
        maes.sort(key=lambda e: e[1])
        print(f"\n***** Top {k} mae of output: {OUTPUT1} *****")
        print_topk(k, OUTPUT1, 1)

        maes.sort(key=lambda e: e[2])
        print(f"\n***** Top {k} mae of output: {OUTPUT2} *****")
        print_topk(k, OUTPUT2, 2)

    print(f"{OUTPUT1} average mae: {np.mean(maes,axis=0)[1]}")
    print(f"{OUTPUT2} average mae: {np.mean(maes,axis=0)[2]}")


def main():
    parser = argparse.ArgumentParser(description='packed bert-base-squad')
    parser.add_argument(
        '--avg_seq_len', type=int, default=128, help='average sequence length of input'
    )
    parser.add_argument(
        '--batch_size', type=int, default=16, help='batch size of model'
    )
    parser.add_argument('--dump_results', action='store_true', help='dump results')
    parser.add_argument(
        '--dynamic_input_name', type=str, default=INPUT_IDS, help='dynamic input name'
    )
    parser.add_argument(
        '--emb_size', type=int, default=30522, help='word embedding table size'
    )
    parser.add_argument(
        '--enable_debug', action='store_true', help='enable output debug info'
    )
    parser.add_argument(
        '--iterations', type=int, default=100, help='number of batches to run'
    )
    parser.add_argument(
        '--max_seq_len', type=int, default=256, help='max sequence length of input'
    )
    parser.add_argument(
        '--max_valid_num', type=int, default=40, help='max valid num for pack'
    )
    parser.add_argument(
        '--model_without_packing', help='model without pack, unpack, repack op'
    )
    parser.add_argument(
        '--model_with_packing_unpack_repack',
        help='model with pack, unpack, repack op converted by PopRT',
    )
    parser.add_argument(
        '--model_with_packing_attention_mask',
        help='model with AttentionMask op converted by PopRT',
    )
    parser.add_argument(
        '--timeout_microseconds',
        type=int,
        default=15000,
        help='timeout in microseconds',
    )

    args = parser.parse_args()
    args.dataset_size = args.iterations * args.batch_size

    # generate synthetic dataset
    datasets = get_synthetic_data(args)
    original_result = run_original_model_with_model_runner(args, datasets)

    offline_pack_result_unpack_repack = run_packing_model_with_model_runner(
        args, datasets, args.model_with_packing_unpack_repack, True
    )
    online_pack_result_unpack_repack = run_packing_model_with_pack_runner_unpack_repack(
        args, datasets
    )

    offline_pack_result_attention_mask = run_packing_model_with_model_runner(
        args, datasets, args.model_with_packing_attention_mask, False
    )
    online_pack_result_attention_mask_first_fit = (
        run_packing_model_with_pack_runner_attention_mask(args, datasets, "first_fit")
    )
    online_pack_result_attention_mask_next_fit = (
        run_packing_model_with_pack_runner_attention_mask(args, datasets, "next_fit")
    )
    latency_distribution_with_pack_runner_attention_mask(args, datasets, "first_fit")

    # compare the results
    print("\nCompare results between original and online pack(with unpack repack)")
    calculate_mae(
        original_result, online_pack_result_unpack_repack, datasets, args.enable_debug
    )
    print("\nCompare results between offline and online pack with unpack repack op")
    calculate_mae(
        offline_pack_result_unpack_repack,
        online_pack_result_unpack_repack,
        datasets,
        args.enable_debug,
    )

    print(
        "\nCompare results between original and online_first_fit pack with attention_mask op"
    )
    calculate_mae(
        original_result,
        online_pack_result_attention_mask_first_fit,
        datasets,
        args.enable_debug,
    )
    print(
        "\nCompare results between original and online_next_fit pack with attention_mask op"
    )
    calculate_mae(
        original_result,
        online_pack_result_attention_mask_next_fit,
        datasets,
        args.enable_debug,
    )

    print(
        "\nCompare results between offline and online_next_fit pack with attention_mask op"
    )
    calculate_mae(
        offline_pack_result_attention_mask,
        online_pack_result_attention_mask_next_fit,
        datasets,
        args.enable_debug,
    )


if __name__ == "__main__":
    sys.exit(main())

Download packed_bert_example.py

After the run completes, output similar to the following is printed:

[Original] Throughput: 1860.9792005501781 samples/s, Latency: 0.5373515188694 ms
....
[Pack Offline Unpack Repack] Throughput: 2830.8140869025283 samples/s, Latency: 0.3532552719116211 ms
....
[Pack Online Unpack Repack] Throughput: 2782.587696947809 samples/s, Latency : 0.3593777120113373 ms
....