MLCD-Embodied 🤖

MLCD-Embodied is comparable to GPT-4V in embodied capabilities and also shows strong general capabilities. Detailed evaluation results are shown below.

Embodied Ability Evaluation: Performance in RoboVQA and OpenEQA

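| Benchmark | Metric / Category | MLCD-Embodied-7B | LLaVA OneVision-7B | GPT-4V | RoboMamba |
|---|---|---|---|---|---|
| RoboVQA | BLEU1 | 73.16 | 38.12 | - | 54.9 |
| RoboVQA | BLEU2 | 66.39 | 33.56 | - | 44.2 |
| RoboVQA | BLEU3 | 60.61 | 31.76 | - | 39.5 |
| RoboVQA | BLEU4 | 56.56 | 30.97 | - | 36.3 |
| OpenEQA | OBJECT-STATE-RECOGNITION | 71.83 | - | 63.2 | - |
| OpenEQA | OBJECT-RECOGNITION | 49.46 | - | 43.4 | - |
| OpenEQA | FUNCTIONAL-REASONING | 54.38 | - | 57.4 | - |
| OpenEQA | SPATIAL-UNDERSTANDING | 48.64 | - | 33.6 | - |
| OpenEQA | ATTRIBUTE-RECOGNITION | 67.08 | - | 57.2 | - |
| OpenEQA | WORLD-KNOWLEDGE | 53.87 | - | 50.7 | - |
| OpenEQA | OBJECT-LOCALIZATION | 43.06 | - | 42.0 | - |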
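For readers unfamiliar with the BLEU1 to BLEU4 columns, the following is a minimal sketch of cumulative sentence-level BLEU against a single reference. It is an illustration only: the helper names (`ngram_counts`, `bleu`), the whitespace tokenization, and the lack of smoothing are assumptions, and the evaluation script in this repository may compute BLEU differently (for example at corpus level, or scaled to 0-100 as in the table).

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n):
    """Cumulative BLEU up to order max_n for one candidate/reference pair, in [0, 1].

    Hypothetical helper for illustration; not the repository's scorer.
    """
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing in this sketch
        log_precision_sum += math.log(matched / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(log_precision_sum / max_n)


if __name__ == "__main__":
    answer = "put the cup on the table"
    truth = "put the cup on the shelf"
    for n in range(1, 5):
        print(f"BLEU{n}: {bleu(answer, truth, n):.4f}")
```

Higher orders require longer exact phrase matches, which is why scores typically drop from BLEU1 to BLEU4 for the same model.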

General Ability Evaluation: Comparison with LLaVA OneVision-7B, GPT-4V, and GPT-4o

| Dataset | Split | MLCD-Embodied-7B | LLaVA OneVision-7B | GPT-4V | GPT-4o |
|---|---|---|---|---|---|
| AI2D | test | 79.9 | 81.4 | 78.2 | 94.2 |
| ChartQA | test | 83.0 | 80.0 | 78.5 | 85.7 |
| DocVQA | test | 91.6 | 87.5 | 88.4 | 92.8 |
| InfoVQA | val | 73.9 | 70.7 | - | - |
| InfoVQA | test | 70.0 | 68.8 | - | - |
| MMMU | val | 47.3 | 48.8 | 56.8 | 69.1 |
| MMStar | test | 58.5 | 61.7 | 57.1 | 63.9 |
| OCRBench | - | 749.0 | 697.0 | 656.0 | 805.0 |
| RealWorldQA | test | 68.9 | 66.3 | 61.4 | 58.6 |
| SeedBench | image | 74.9 | 75.4 | 49.9 | 76.2 |
| MMBench | en-dev | 81.1 | 83.2 | 81.3 | 83.4 |
| MMBench | en-test | 80.1 | 80.8 | 75.0 | - |
| MME | test | 578/1603 | 418/1580 | 517/1409 | - |

Usage

A. Installation

git clone https://github.com/deepglint/unicom
cd unicom

# Upgrade pip and install necessary dependencies
pip install --upgrade pip
pip install -e ".[train]"

B. Inference

CUDA_VISIBLE_DEVICES=0 python infer.py --model_dir /path/to/your/model

# example:
# >> Enter 'exit' to end the conversation, 'reset' to clear the chat history.
# >> Enter image file paths (comma-separated): ./asserts/logo.png
# >> User: <image>What kind of animal is it in this picture?
# >> Assistant: The image features a stylized representation of a cat, characterized by its vibrant and abstract depiction.
# >> User: What color is this cat?
# >> Assistant: The cat in the image is primarily white with blue, orange and pink accents, creating a visually appealing and unique appearance.

C. Evaluation for Embodied Ability

Step 1

Download the raw data by following the instructions for OpenEQA and RoboVQA (validation split only).

Step 2

Convert the raw data into the format required for model evaluation:

# convert OpenEQA benchmark. Note: replace the paths with your own.
python llava/benchmark/make_openeqa_bmk.py

# convert RoboVQA benchmark. Note: replace the paths with your own.
python llava/benchmark/make_robovqa_bmk.py

Step 3

Make sure your top-level directory structure looks like this:

|--/path/to/your/benchmarks
|  |--OpenEQA
|  |  |--openeqa_scannet.parquet
|  |  |--openeqa_hm3d.parquet
|  |--RoboVQA
|     |--robovqa.parquet
|--/path/to/your/images
   |--openeqa_val
   |  |--scannet-v0
   |  |  |--002-scannet-scene0709_00
   |  |  |--xxx-scannet-scenexxxx_xx
   |  |--hm3d-v0
   |     |--000-hm3d-BFRyYbPCCPE
   |     |--xxx-hm3d-xxxxxxxxxxx
   |--robovqa_val
      |--robovqa_221911
      |--robovqa_xxxxxx
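Before launching the evaluation, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper name `find_missing` and the two constant lists are hypothetical, and the paths are copied from the tree above. It checks only the fixed files and folders, not the per-scene subdirectories.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED_BENCHMARK_FILES = [
    "OpenEQA/openeqa_scannet.parquet",
    "OpenEQA/openeqa_hm3d.parquet",
    "RoboVQA/robovqa.parquet",
]
EXPECTED_IMAGE_DIRS = [
    "openeqa_val/scannet-v0",
    "openeqa_val/hm3d-v0",
    "robovqa_val",
]


def find_missing(bmk_root, image_folder):
    """Return the expected paths that do not exist, as strings.

    Hypothetical helper for illustration; not part of the unicom repository.
    """
    missing = []
    for rel in EXPECTED_BENCHMARK_FILES:
        path = Path(bmk_root) / rel
        if not path.is_file():
            missing.append(str(path))
    for rel in EXPECTED_IMAGE_DIRS:
        path = Path(image_folder) / rel
        if not path.is_dir():
            missing.append(str(path))
    return missing


if __name__ == "__main__":
    # Placeholder paths; replace with your own.
    for path in find_missing("/path/to/your/benchmarks", "/path/to/your/images"):
        print(f"missing: {path}")
```

An empty result means every fixed file and folder from the tree was found.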

Step 4

Run the evaluation script:

# Note: replace 'YOUR_API_KEY', 'YOUR_ENDPOINT', 'bmk_root', 'image_folder' with your own.
bash scripts/eval/eval_robo.sh /path/to/your/model

D. Evaluation for General Ability

Install the evaluation tool and execute the evaluation script:

pip install lmms-eval==0.2.0
PYTHONPATH=./ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m accelerate.commands.launch \
    --main_process_port=12444 \
    --num_processes=8 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained=DeepGlint-AI/MLCD-Embodied-7B,conv_template=qwen_1_5 \
    --tasks mme \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix mlcd \
    --output_path ./eval_log/
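To run the same command for several benchmarks, the argument list can be assembled programmatically. The sketch below is an illustration, not part of the repository: `build_eval_command` is a hypothetical helper that reuses only the flags shown above, and it assumes that `--tasks` accepts a comma-separated list of task names. Only `mme` appears in the command above; any other task name must be checked against your installed lmms-eval version.

```python
import shlex
import sys


def build_eval_command(tasks, pretrained="DeepGlint-AI/MLCD-Embodied-7B",
                       num_processes=8, port=12444, output_path="./eval_log/"):
    """Build the lmms-eval launch command from the example above as an argv list.

    Hypothetical helper for illustration. Joining task names with commas is an
    assumption about the lmms-eval CLI.
    """
    return [
        sys.executable, "-m", "accelerate.commands.launch",
        f"--main_process_port={port}",
        f"--num_processes={num_processes}",
        "-m", "lmms_eval",
        "--model", "llava",
        "--model_args", f"pretrained={pretrained},conv_template=qwen_1_5",
        "--tasks", ",".join(tasks),
        "--batch_size", "1",
        "--log_samples",
        "--log_samples_suffix", "mlcd",
        "--output_path", output_path,
    ]


if __name__ == "__main__":
    # Print the command for inspection; pass the list to subprocess.run to execute it.
    print(shlex.join(build_eval_command(["mme"])))
```

When executing the list with `subprocess.run`, the `PYTHONPATH` and `CUDA_VISIBLE_DEVICES` settings from the shell command above would need to be supplied through the `env` argument.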