3 posts for '2021/10/13'

  1. 2021.10.13 i.MX 8M PLUS tensorflow NPU
  2. 2021.10.13 i.MX 8M PLUS
  3. 2021.10.13 2.7.0-rc with opencl
embeded/i.mx 8m plus 2021. 10. 13. 14:38

I downloaded LF_v5.10.52-2.1.0_images_IMX8MPEVK.zip, wrote the image to an SD card,

and after booting found that the paths are a bit different.

It ships tensorflow 2.5.0.. so is it usable?

# cd /usr/bin/tensorflow-lite-2.5.0/examples
# ./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite
STARTING!
Log parameter values verbosely: [0]
Graph: [mobilenet_v1_1.0_224_quant.tflite]
Use VXdelegate : [0]
Loaded model mobilenet_v1_1.0_224_quant.tflite
The input model file size (MB): 4.27635
Initialized session in 1.807ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=4 first=167959 curr=162606 min=162606 max=167959 avg=164253 std=2159

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=162727 curr=163003 min=162308 max=163308 avg=162758 std=190

Inference timings in us: Init: 1807, First inference: 167959, Warmup (avg): 164253, Inference (avg): 162758
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=2.51562 overall=8.64062

# ./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --use_nnapi=true
STARTING!
Log parameter values verbosely: [0]
Graph: [mobilenet_v1_1.0_224_quant.tflite]
Use NNAPI: [1]
NNAPI accelerators available: [vsi-npu]
Use VXdelegate : [0]
Loaded model mobilenet_v1_1.0_224_quant.tflite
INFO: Created TensorFlow Lite delegate for NNAPI.
Explicitly applied NNAPI delegate, and the model graph will be completely executed by the delegate.
The input model file size (MB): 4.27635
Initialized session in 4.183ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=4649626

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=360 first=2665 curr=2733 min=2632 max=2783 avg=2715.67 std=16

Inference timings in us: Init: 4183, First inference: 4649626, Warmup (avg): 4.64963e+06, Inference (avg): 2715.67
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=2.59766 overall=30.1836
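
To compare the two runs above, here is a minimal Python sketch that parses the "Inference timings in us" summary lines and computes the NPU speedup. The two strings are copied from the logs above; the function names are my own, not part of benchmark_model.

```python
# Summary lines copied verbatim from the two benchmark_model runs above.
CPU_LINE = ("Inference timings in us: Init: 1807, First inference: 167959, "
            "Warmup (avg): 164253, Inference (avg): 162758")
NPU_LINE = ("Inference timings in us: Init: 4183, First inference: 4649626, "
            "Warmup (avg): 4.64963e+06, Inference (avg): 2715.67")


def parse_timings(line):
    """Turn the summary line into {field name: microseconds as float}."""
    prefix = "Inference timings in us: "
    body = line[line.index(prefix) + len(prefix):]
    timings = {}
    for field in body.split(", "):
        name, value = field.rsplit(": ", 1)
        timings[name] = float(value)  # float() also accepts 4.64963e+06
    return timings


def speedup(cpu, npu, key="Inference (avg)"):
    """Ratio of CPU time to NPU time for one field."""
    return cpu[key] / npu[key]


if __name__ == "__main__":
    cpu, npu = parse_timings(CPU_LINE), parse_timings(NPU_LINE)
    print("steady-state speedup: %.1fx" % speedup(cpu, npu))
    print("first NPU inference: %.2f s" % (npu["First inference"] / 1e6))
```

For these logs the steady-state ratio comes out at roughly 60x, while the first NPU inference alone takes about 4.6 seconds.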

 

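Trying label_image.. I don't know what warm up actually does, but invoke() itself finishes quickly.

Something before it seems to take a long time, though: the whole run takes more than 4 seconds longer than running on the CPU alone.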

# time ./label_image -w 1
INFO: Loaded model ./mobilenet_v1_1.0_224_quant.tflite
INFO: resolved reporter
INFO: invoked
INFO: average time: 43.865 ms
INFO: 0.764706: 653 military uniform
INFO: 0.121569: 907 Windsor tie
INFO: 0.0156863: 458 bow tie
INFO: 0.0117647: 466 bulletproof vest
INFO: 0.00784314: 835 suit

real    0m0.142s
user    0m0.385s
sys     0m0.020s

# time ./label_image -w 1 -a 1
INFO: Loaded model ./mobilenet_v1_1.0_224_quant.tflite
INFO: resolved reporter
INFO: Created TensorFlow Lite delegate for NNAPI.
INFO: Use NNAPI acceleration.
INFO: Applied NNAPI delegate.
INFO: invoked
INFO: average time: 2.797 ms
INFO: 0.768627: 653 military uniform
INFO: 0.105882: 907 Windsor tie
INFO: 0.0196078: 458 bow tie
INFO: 0.0117647: 466 bulletproof vest
INFO: 0.00784314: 835 suit

real    0m4.748s
user    0m4.648s
sys     0m0.092s
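
The gap between "average time" and `real` above suggests timing the first call separately from the rest. A minimal sketch of such a harness in Python, assuming `invoke` is any zero-argument callable (for example a wrapper around an interpreter's invoke call); the function name and the injectable `clock` parameter are my own.

```python
import time


def time_invocations(invoke, runs, clock=time.perf_counter):
    """Call `invoke` `runs` times; return (first_ms, steady_avg_ms).

    steady_avg_ms is the mean of every call after the first,
    or None when there was only one call.
    """
    if runs < 1:
        raise ValueError("runs must be >= 1")
    durations_ms = []
    for _ in range(runs):
        start = clock()
        invoke()
        durations_ms.append((clock() - start) * 1000.0)
    first_ms = durations_ms[0]
    rest = durations_ms[1:]
    steady_avg_ms = sum(rest) / len(rest) if rest else None
    return first_ms, steady_avg_ms


if __name__ == "__main__":
    # Stand-in workload; replace with the real inference call.
    first, steady = time_invocations(lambda: sum(range(10000)), runs=5)
    print("first: %.3f ms, steady: %.3f ms" % (first, steady))
```

Reporting the two numbers separately keeps a one-off initialization cost from being hidden inside, or inflating, the average.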

 

The commands below seem to come from an older document written for version 2.1.0.

$ cd /usr/bin/tensorflow-lite-2.1.0/examples
$ ./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite
$ ./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --use_nnapi=true

./lbl_img -i grace_hopper.bmp -l labels.txt -w 1
./lbl_img -i grace_hopper.bmp -l labels.txt -w 1 -a 1

[Link : https://www.mouser.com/pdfDocs/AN12964.pdf]

 

 

+

Damn it(?), the flags it accepts don't match the help output?!

# ./label_image --help
ERROR: usage: ./label_image <flags>
Flags:
        --num_threads=1                 int32   optional        number of threads used for inference on CPU.
        --max_delegated_partitions=0    int32   optional        Max number of partitions to be delegated.
        --min_nodes_per_partition=0     int32   optional        The minimal number of TFLite graph nodes of a partition that has to be reached for it to be delegated.A negative value or 0 means to use the default choice of each delegate.
        --num_threads=1                 int32   optional        number of threads used for inference on CPU.
        --max_delegated_partitions=0    int32   optional        Max number of partitions to be delegated.
        --min_nodes_per_partition=0     int32   optional        The minimal number of TFLite graph nodes of a partition that has to be reached for it to be delegated.A negative value or 0 means to use the default choice of each delegate.
        --use_xnnpack=false             bool    optional        use XNNPack
        --use_nnapi=false               bool    optional        use nnapi delegate api
        --nnapi_execution_preference=   string  optional        execution preference for nnapi delegate. Should be one of the following: fast_single_answer, sustained_speed, low_power, undefined
        --nnapi_execution_priority=     string  optional        The model execution priority in nnapi, and it should be one of the following: default, low, medium and high. This requires Android 11+.
        --nnapi_accelerator_name=       string  optional        the name of the nnapi accelerator to use (requires Android Q+)
        --disable_nnapi_cpu=true        bool    optional        Disable the NNAPI CPU device
        --nnapi_allow_fp16=false        bool    optional        Allow fp32 computation to be run in fp16

 

    static struct option long_options[] = {
        {"accelerated", required_argument, nullptr, 'a'},
        {"allow_fp16", required_argument, nullptr, 'f'},
        {"count", required_argument, nullptr, 'c'},
        {"verbose", required_argument, nullptr, 'v'},
        {"image", required_argument, nullptr, 'i'},
        {"labels", required_argument, nullptr, 'l'},
        {"tflite_model", required_argument, nullptr, 'm'},
        {"profiling", required_argument, nullptr, 'p'},
        {"threads", required_argument, nullptr, 't'},
        {"input_mean", required_argument, nullptr, 'b'},
        {"input_std", required_argument, nullptr, 's'},
        {"num_results", required_argument, nullptr, 'r'},
        {"max_profiling_buffer_entries", required_argument, nullptr, 'e'},
        {"warmup_runs", required_argument, nullptr, 'w'},
        {"gl_backend", required_argument, nullptr, 'g'},
        {"hexagon_delegate", required_argument, nullptr, 'j'},
        {"xnnpack_delegate", required_argument, nullptr, 'x'},
        {nullptr, 0, nullptr, 0}};

[Link : https://github.com/tensorflow/tensorflow/blob/v2.5.0/tensorflow/lite/examples/label_image/label_image.cc]
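
Since `--help` does not list the short options, a small lookup built from the `long_options` table above helps when reading commands like `-a 1 -w 0 -p 1 -c 4`. This is only a sketch: the mapping is transcribed from the upstream v2.5.0 source, and the binary in the NXP image may have been built from a modified tree.

```python
# Short option -> long name, transcribed from long_options[] above.
SHORT_TO_LONG = {
    "a": "accelerated", "f": "allow_fp16", "c": "count", "v": "verbose",
    "i": "image", "l": "labels", "m": "tflite_model", "p": "profiling",
    "t": "threads", "b": "input_mean", "s": "input_std", "r": "num_results",
    "e": "max_profiling_buffer_entries", "w": "warmup_runs",
    "g": "gl_backend", "j": "hexagon_delegate", "x": "xnnpack_delegate",
}


def decode_flags(args):
    """Map ['-a', '1', '-w', '0'] to {'accelerated': '1', 'warmup_runs': '0'}."""
    decoded = {}
    tokens = iter(args)
    for token in tokens:
        if len(token) != 2 or not token.startswith("-"):
            raise ValueError("expected a short option, got %r" % token)
        name = SHORT_TO_LONG.get(token[1])
        if name is None:
            raise ValueError("unknown option %r" % token)
        value = next(tokens, None)  # every option in the table takes a value
        if value is None:
            raise ValueError("option %r is missing its value" % token)
        decoded[name] = value
    return decoded


if __name__ == "__main__":
    print(decode_flags("-a 1 -w 0 -p 1 -c 4".split()))
```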

 

+

So.. how was the library built to make that possible?

# ldd label_image
        linux-vdso.so.1 (0x0000ffffa0989000)
        libtensorflow-lite.so.2.5.0 => /usr/lib/libtensorflow-lite.so.2.5.0 (0x0000ffffa05ab000)
        libm.so.6 => /lib/libm.so.6 (0x0000ffffa0501000)
        libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x0000ffffa032a000)
        libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x0000ffffa0305000)
        libc.so.6 => /lib/libc.so.6 (0x0000ffffa0190000)
        /lib/ld-linux-aarch64.so.1 (0x0000ffffa0957000)
        libtim-vx.so => /usr/lib/libtim-vx.so (0x0000ffffa00c7000)
        libdl.so.2 => /lib/libdl.so.2 (0x0000ffffa00b1000)
        libpthread.so.0 => /lib/libpthread.so.0 (0x0000ffffa0082000)
        librt.so.1 => /lib/librt.so.1 (0x0000ffffa006a000)
        libovxlib.so.1.1.0 => /usr/lib/libovxlib.so.1.1.0 (0x0000ffff9fcd1000)
        libOpenVX.so.1 => /usr/lib/libOpenVX.so.1 (0x0000ffff9fa7e000)
        libVSC.so => /usr/lib/libVSC.so (0x0000ffff9eae2000)
        libGAL.so => /usr/lib/libGAL.so (0x0000ffff9e91b000)
        libArchModelSw.so => /usr/lib/libArchModelSw.so (0x0000ffff9e8f3000)
        libNNArchPerf.so => /usr/lib/libNNArchPerf.so (0x0000ffff9e8d0000)
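
To pick the accelerator-related libraries out of `ldd` output like the above, a rough Python sketch follows. The prefix list is my own guess at which of these belong to the OpenVX/NPU driver stack, based only on their names in this output; it is not taken from NXP documentation.

```python
# Assumption: these name prefixes mark the OpenVX / NPU driver stack.
ACCEL_PREFIXES = ("libtim-vx", "libovxlib", "libOpenVX", "libVSC",
                  "libGAL", "libArchModelSw", "libNNArchPerf")

# A few lines copied from the ldd output above.
SAMPLE = """\
        linux-vdso.so.1 (0x0000ffffa0989000)
        libtensorflow-lite.so.2.5.0 => /usr/lib/libtensorflow-lite.so.2.5.0 (0x0000ffffa05ab000)
        libm.so.6 => /lib/libm.so.6 (0x0000ffffa0501000)
        libtim-vx.so => /usr/lib/libtim-vx.so (0x0000ffffa00c7000)
        libOpenVX.so.1 => /usr/lib/libOpenVX.so.1 (0x0000ffff9fa7e000)
"""


def parse_ldd(output):
    """Return {library name: resolved path} for 'name => path (addr)' lines."""
    libs = {}
    for line in output.splitlines():
        if "=>" not in line:
            continue  # vdso and the dynamic loader have no '=>' part
        name, rest = line.split("=>", 1)
        libs[name.strip()] = rest.split("(")[0].strip()
    return libs


def accelerator_libs(libs):
    """Names from `libs` that start with one of the assumed prefixes."""
    return [name for name in libs if name.startswith(ACCEL_PREFIXES)]


if __name__ == "__main__":
    print(accelerator_libs(parse_ldd(SAMPLE)))
```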

 

 

+

The PRELU operator itself appears to be supported, so is the output size mismatch the cause?

INFO: Use NNAPI acceleration.
WARNING: Operator RESIZE_BILINEAR (v3) refused by NNAPI delegate: Operator refused due performance reasons.
INFO: Applied NNAPI delegate.
W [vsi_nn_op_eltwise_setup:178]Output size mismatch, expect 917504, but got 50176
E [setup_node:448]Setup node[52] PRELU fail
W [vsi_nn_op_eltwise_setup:178]Output size mismatch, expect 917504, but got 50176
E [setup_node:448]Setup node[52] PRELU fail
ERROR: NN API returned error ANEURALNETWORKS_BAD_DATA at line 4151 while running computation.

ERROR: Node number 56 (TfLiteNnapiDelegate) failed to invoke.

ERROR: Failed to invoke tflite!

[Link : https://www.nxp.com/docs/en/user-guide/IMX-MACHINE-LEARNING-UG.pdf]

 

 

+

In the code, warm up is just a single invoke() call, and that call takes about 4649 ms.

Running inference once without warm up takes roughly the same time.

root@imx8mpevk:/usr/bin/tensorflow-lite-2.5.0/examples# time ./label_image -a 1 -w 0 -p 1 -c 1
INFO: Loaded model ./mobilenet_v1_1.0_224_quant.tflite
INFO: resolved reporter
INFO: Created TensorFlow Lite delegate for NNAPI.
INFO: Use NNAPI acceleration.
INFO: Applied NNAPI delegate.
INFO: invoked
INFO: average time: 4649.78 ms
INFO: 0.768627: 653 military uniform
INFO: 0.105882: 907 Windsor tie
INFO: 0.0196078: 458 bow tie
INFO: 0.0117647: 466 bulletproof vest
INFO: 0.00784314: 835 suit

real    0m4.757s
user    0m4.655s
sys     0m0.096s
root@imx8mpevk:/usr/bin/tensorflow-lite-2.5.0/examples# time ./label_image -a 1 -w 0 -p 1 -c 4
INFO: Loaded model ./mobilenet_v1_1.0_224_quant.tflite
INFO: resolved reporter
INFO: Created TensorFlow Lite delegate for NNAPI.
INFO: Use NNAPI acceleration.
INFO: Applied NNAPI delegate.
INFO: invoked
INFO: average time: 1164.36 ms
INFO: 0.768627: 653 military uniform
INFO: 0.105882: 907 Windsor tie
INFO: 0.0196078: 458 bow tie
INFO: 0.0117647: 466 bulletproof vest
INFO: 0.00784314: 835 suit

real    0m4.768s
user    0m4.663s
sys     0m0.092s
root@imx8mpevk:/usr/bin/tensorflow-lite-2.5.0/examples# time ./label_image -a 1 -w 0 -p 1 -c 10000
INFO: Loaded model ./mobilenet_v1_1.0_224_quant.tflite
INFO: resolved reporter
INFO: Created TensorFlow Lite delegate for NNAPI.
INFO: Use NNAPI acceleration.
INFO: Applied NNAPI delegate.
INFO: invoked
INFO: average time: 3.30189 ms
INFO: 0.768627: 653 military uniform
INFO: 0.105882: 907 Windsor tie
INFO: 0.0196078: 458 bow tie
INFO: 0.0117647: 466 bulletproof vest
INFO: 0.00784314: 835 suit

real    0m33.128s
user    0m7.516s
sys     0m1.590s
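
The three runs above can be cross-checked with a bit of arithmetic. Assuming "average time" is the total time divided by the count, and that only the first invoke pays the warm-up cost (both assumptions, though the `-c 1` and `-c 4` numbers fit them), the steady-state time per inference can be back-solved:

```python
def steady_state_ms(avg_ms, count, first_ms):
    """Mean time of runs 2..count, assuming run 1 alone took `first_ms`."""
    if count < 2:
        raise ValueError("need at least 2 runs to separate out the first one")
    return (avg_ms * count - first_ms) / (count - 1)


# "average time" from the -c 1 run above, used as the cost of the first invoke.
FIRST_MS = 4649.78

# (count, reported average in ms) from the -c 4 and -c 10000 runs above.
RUNS = [(4, 1164.36), (10000, 3.30189)]

if __name__ == "__main__":
    for count, avg_ms in RUNS:
        print("-c %d: %.2f ms per steady-state inference"
              % (count, steady_state_ms(avg_ms, count, FIRST_MS)))
```

Both runs land between about 2.5 and 2.9 ms, in line with the 2.7 to 2.8 ms that benchmark_model and `label_image -w 1` report.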

 

Inference seems to go through OpenVX, and the guide says the result of the initial graph processing can be saved to storage.

11.3 Hardware accelerators warmup time
For both Arm NN and TensorFlow Lite, the initial execution of model inference takes longer time, because of the model graph initialization needed by the GPU/NPU hardware accelerator. The initialization phase is known as warmup. This time duration can be decreased for subsequent application that runs by storing on disk the information resulted from the initial OpenVX graph processing. The following environment variables should be used for this purpose:
VIV_VX_ENABLE_CACHE_GRAPH_BINARY: flag to enable/disable OpenVX graph caching
VIV_VX_CACHE_BINARY_GRAPH_DIR: set location of the cached information on disk
For example, set these variables on the console in this way:
export VIV_VX_ENABLE_CACHE_GRAPH_BINARY="1"
export VIV_VX_CACHE_BINARY_GRAPH_DIR=`pwd`

[Link : https://www.nxp.com/docs/en/user-guide/IMX-MACHINE-LEARNING-UG.pdf]
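
The same two variables can also be set per process instead of exporting them in the shell. A minimal Python sketch, where only the variable names come from the guide quoted above; the helper names and the cache directory in the usage note are hypothetical, and I have not verified how much the cache shortens warm up on this board.

```python
import os
import subprocess


def cached_graph_env(cache_dir, base_env=None):
    """Copy an environment and add the two OpenVX graph-cache variables."""
    env = dict(os.environ if base_env is None else base_env)
    env["VIV_VX_ENABLE_CACHE_GRAPH_BINARY"] = "1"
    env["VIV_VX_CACHE_BINARY_GRAPH_DIR"] = os.path.abspath(cache_dir)
    return env


def run_with_graph_cache(cmd, cache_dir):
    """Run `cmd` (a list of arguments) with graph caching enabled."""
    os.makedirs(cache_dir, exist_ok=True)
    return subprocess.run(cmd, env=cached_graph_env(cache_dir), check=True)
```

Usage would look like `run_with_graph_cache(["./label_image", "-a", "1", "-w", "0"], "vx_cache")`, run twice so the second run can be compared against the first.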

Posted by 구차니
embeded/i.mx 8m plus 2021. 10. 13. 11:44

Huh? Last time I looked, I didn't think the 8M PLUS had a Cortex-M core?!?!

[Link : https://www.nxp.com/products/processors-and-microcontrollers/arm-processors/i-mx-applications-processors/i-mx-8-processors/i-mx-8m-plus-arm-cortex-a53-machine-learning-vision-multimedia-and-industrial-iot:IMX8MPLUS]

 

Hmm.. I guess my eyes were just playing tricks on me -_ㅠ

[Link : https://www.nxp.com/products/processors-and-microcontrollers/arm-processors/i-mx-applications-processors/i-mx-8-processors:IMX8-SERIES]

 

Anyway, I went to try out this board that has been lying around(?) the office and,

whoa.. why are so many debug ports detected? In my case, the Linux console showed up on COM27.

[Link : https://www.nxp.com/design/development-boards/i-mx-evaluation-and-development-boards/evaluation-kit-for-the-i-mx-8m-plus-applications-processor:8MPLUSLPD4-EVK]

 

So this is the slip of paper that came in the package -_-

Then who do the first and second ports belong to!?

Four UART connections will appear on the PC, the third port for the Cortex-A53 core and the fourth for Cortex-M7 core system debugging.

[Link : https://www.nxp.com/docs/en/quick-reference-guide/8MPLUSEVKQSG.pdf]

 

Machine Learning is listed under Project - Tutorial

[Link : https://www.nxp.com/document/guide/getting-started-with-the-i-mx-8m-plus-evk:GS-iMX-8M-Plus-EVK]

 

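The i.MX 8M PLUS supports the full feature set, but

to try the NPU it looks like I have to do something with eIQ.

It also has a Cortex-M7 (said to work either standalone or collaboratively), so maybe that is meant to be used as a kind of accelerator?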

 

TFLite

[Link : https://www.nxp.com/design/software/development-software/eiq-ml-development-environment/eiq-inference-with-tensorflow-lite:eIQTensorFlowLite]

 

TFLite for MCU

[Link : https://www.nxp.com/design/software/development-software/eiq-ml-development-environment/eiq-inference-with-tensorflow-lite-micro:EIQ-TFLITE-MICRO]

 

Clicking the download link above sends me somewhere odd(?)

[Link : https://mcuxpresso.nxp.com/en/welcome] This seems to be needed to use the Cortex-M7. Eclipse-based?

[Link : https://source.codeaurora.org/external/imx/imx-manifest]

 

Ooh, i.MX 8M Plus!!

Cortex-A / GPU / NPU, ooh...

[Link : https://www.nxp.com/docs/en/user-guide/IMX-MACHINE-LEARNING-UG.pdf]

 

 

+

After downloading the image, its contents are laid out as follows.

If that is too much bother, writing fsl-image-validation-imx-imx8mmevk.sdcard to an SD card and booting from it should do.

 

imx_m4_demos contains bin files, but how do I load and run them?

 

Without MCUXpresso, is the only option to put the file on the SD card and run it directly from U-Boot?

4.2 Run applications using U-Boot

This section describes how to run applications using an SD card and pre-built U-Boot image for i.MX processor.
  1. Following the steps from section 2—Embedded Linux of this Getting Started guide, prepare an SD card with a pre-built U-Boot + Linux image from the Linux BSP package for the i.MX 8M Plus processor. If you have already loaded the SD card with a Linux image, you can skip this step.
  2. Insert the SD card in the host computer (Linux or Windows) and copy the application image (for example hello_world.bin) to the FAT partition of the SD card.
  3. Safely remove the SD card from the PC.
  4. Insert the SD card to the target board. Make sure to use the default boot SD slot and double check the Boot switch setup.
  5. Connect the DEBUG UART connector on the board to the PC through USB cable. The Windows OS installs the USB driver automatically, and the Ubuntu OS will find the serial devices as well.
    See Connect USB debug cable section in Out of box for more instructions on serial communication applications.
  6. Open a second terminal on the i.MX8M Plus EVK board’s second enumerated serial port. This is the Cortex®-M7’s serial console. Set the speed to 115200 bit/s, data bits 8, 1 stop bit (115200, 8N1), no parity.
  7. Power up the board and stop the boot process by pressing any key before the U-Boot countdown reaches zero. At the U-Boot prompt on the first terminal, type the following commands.
    => fatload mmc 0:1 0x48000000 hello_world.bin
    => cp.b 0x48000000 0x7e0000 0x20000
    => bootaux 0x7e0000
    These commands copy the image file from the first partition of the SD card into the Cortex®-M7’s TCM and releases the Cortex®-M7 from reset.
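
One thing worth checking before copying a demo to the SD card: the `cp.b` command above copies a fixed 0x20000 bytes, so a larger image would be cut off. A small Python sketch of that check on the host; the function names are my own, and whether 0x20000 matches the usable TCM size is not something I have confirmed.

```python
import os

# Byte count passed to cp.b in the quoted U-Boot steps (128 KiB).
COPY_LENGTH = 0x20000


def fits_copy_window(size_bytes, window=COPY_LENGTH):
    """True if a non-empty image of `size_bytes` fits in the copied range."""
    return 0 < size_bytes <= window


def check_image(path, window=COPY_LENGTH):
    """Return the file size, or raise if cp.b would truncate the image."""
    size = os.path.getsize(path)
    if not fits_copy_window(size, window):
        raise ValueError("%s is %d bytes, copy window is %d bytes"
                         % (path, size, window))
    return size
```

Calling `check_image("hello_world.bin")` before step 2 catches an oversized binary on the PC rather than on the board.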

 

 

Can't this be accessed from Linux through /sys or something similar?

Would writing the wic file with win32diskimager work?

[Link : https://www.nxp.com/docs/en/user-guide/IMX_LINUX_USERS_GUIDE.pdf]

[Link : https://www.nxp.com/part/8MPLUSLPD4-EVK#/]

 

Select imx8m quad in MCUXpresso and build?

[Link : https://www.embeddedartists.com/wp-content/uploads/2019/03/iMX8M_Working_with_Cortex-M.pdf]

Posted by 구차니

I tried building it on a Raspberry Pi 4, and it finished in a way that is hard to call either a failure or a success.

 

$ git clone https://github.com/tensorflow/tensorflow.git
$ cd tensorflow/
$ git checkout v2.7.0-rc0

$ mkdir ../tflite_build
$ cd ../tflite_build
$ cmake ../tensorflow/tensorflow/lite/ -DTFLITE_ENABLE_GPU=ON

[Link : https://www.tensorflow.org/lite/guide/build_cmake]

 

The problem is that the rpi 3's VideoCore IV has a user-made OpenCL implementation,

but nothing equivalent is out yet for the rpi 4's VideoCore VI, so I probably can't use it ㅠㅠ

/home/pi/work/tflite_build/opencl_headers/CL/cl_version.h:34:104: note: #pragma message: cl_version.h: CL_TARGET_OPENCL_VERSION is not defined. Defaulting to 220 (OpenCL 2.2)
 #pragma message("cl_version.h: CL_TARGET_OPENCL_VERSION is not defined. Defaulting to 220 (OpenCL 2.2)")

[Link : https://github.com/doe300/VC4CL/issues/86]

[Link : https://github.com/Idein/py-videocore6]

[Link : https://forums.raspberrypi.com/viewtopic.php?t=312646]

Posted by 구차니