MLCD-Embodied 🤖

MLCD-Embodied is comparable to 4v in terms of embodied capabilities and possesses excellent general capabilities. The detailed evaluation results are shown below.

Embodied Ability Evaluation: Performance in RoboVQA and OpenEQA

		MLCD-Embodied-7B	LLaVA OneVision-7B	GPT-4V	RoboMamba
RoboVQA	BLEU1	73.16	38.12	-	54.9
	BLEU2	66.39	33.56	-	44.2
	BLEU3	60.61	31.76	-	39.5
	BLEU4	56.56	30.97	-	36.3
OpenEQA	OBJECT-STATE-RECOGNITION	71.83	-	63.2	-
	OBJECT-RECOGNITION	49.46	-	43.4	-
	FUNCTIONAL-REASONING	54.38	-	57.4	-
	SPATIAL-UNDERSTANDING	48.64	-	33.6	-
	ATTRIBUTE-RECOGNITION	67.08	-	57.2	-
	WORLD-KNOWLEDGE	53.87	-	50.7	-
	OBJECT-LOCALIZATION	43.06	-	42.0	-

General Ability Evaluation: Comparison with LLaVA OneVision-7B and GPT-4

Dataset	Split	MLCD-Embodied-7B	LLaVA OneVision-7B	GPT-4v	GPT-4o
A12D	test	79.9	81.4	78.2	94.2
ChartQA	test	83.0	80.0	78.5	85.7
DocVQA	test	91.6	87.5	88.4	92.8
InfoVQA	val	73.9	70.7	-	-
InfoVQA	test	70.0	68.8	-	-
MMMU	val	47.3	48.8	56.8	69.1
MMStar	test	58.5	61.7	57.1	63.9
OCRBench	-	749.0	697.0	656.0	805.0
RealWorldQA	test	68.9	66.3	61.4	58.6
SeedBench	image	74.9	75.4	49.9	76.2
MMbench	en-dev	81.1	83.2	81.3	83.4
MMbench	en-test	80.1	80.8	75.0	-
MME	test	578/1603	418/1580	517/1409	-

Usage

A. Installation

git clone https://github.com/deepglint/unicom
cd unicom

# Upgrade pip and install necessary dependencies
pip install --upgrade pip
pip install -e ".[train]"

B. Inference

CUDA_VISIBLE_DEVICES=0 python infer.py --model_dir /path/to/your/model

# example:
# >> Enter 'exit' to end the conversation, 'reset' to clear the chat history.
# >> Enter image file paths (comma-separated): ./asserts/logo.png
# >> User: <image>What kind of animal is it in this picture?
# >> Assistant: The image features a stylized representation of a cat, characterized by its vibrant and abstract depiction.
# >> User: What color is this cat?
# >> Assistant: The cat in the image is primarily white with blue, orange and pink accents, creating a visually appealing and unique appearance.

C. Evaluation for Embodied Ability

Step 1

Download raw data following OpenEQA and RoboVQA(val part)

Step 2

Converting raw data into the format required for model evaluation.

# convert OpenEQA benchmark. Note: replace the paths with your own.
python llava/benchmark/make_openeqa_bmk.py

# convert RoboVQA benchmark. Note: replace the paths with your own.
python llava/benchmark/make_robovqa_bmk.py

Step 3

Make sure that your top-level directory structure should look like this:

|--/path/to/your/benchmarks
|  |--OpenEQA
|  |  |--openeqa_scannet.parquet
|  |  |--openeqa_hm3d.parquet
|  |--RoboVQA
|     |--robovqa.parquet
|--/path/to/your/images
   |--openeqa_val
   |  |--scannet-v0
   |  |  |--002-scannet-scene0709_00
   |  |  |--xxx-scannet-scenexxxx_xx
   |  |--hm3d-v0
   |     |--000-hm3d-BFRyYbPCCPE
   |     |--xxx-hm3d-xxxxxxxxxxx
   |--robovqa_val
      |--robovqa_221911
      |--robovqa_xxxxxx

Step 4

Run script for evaluation

# Note: replace 'YOUR_API_KEY', 'YOUR_ENDPOINT', 'bmk_root', 'image_folder' with your own.
bash scripts/eval/eval_robo.sh /path/to/your/model

D. Evaluation for General Ability

Install the evaluation tool and execute the evaluation script:

pip install lmms-eval==0.2.0
PYTHONPATH=./ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m accelerate.commands.launch \
    --main_process_port=12444 \
    --num_processes=8 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained=DeepGlint-AI/MLCD-Embodied-7B,conv_template=qwen_1_5 \
    --tasks mme \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix mlcd \
    --output_path ./eval_log/