Polish EQ-Bench Leaderboard

This leaderboard was created as part of the open-science project SpeakLeash.org.

Polish Emotional Intelligence Benchmark for LLMs

Help us develop the Polish large language model Bielik by using the Arena.

We gratefully acknowledge the Polish high-performance computing infrastructure PLGrid (HPC Center: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2024/016951.

Model	Params	Benchmark Score	Percentage Questions Parseable	Error
TeeZee/Bielik-SOLAR-LIKE-10.7B-Instruct-v0.1	22.2	59.58	51.46	133.0 questions were parseable (min is 83%)

Model	Params	Benchmark Score	Percentage Questions Parseable	Error
mistralai/Mistral-Large-Instruct-2407	123	78.07	100.00
mistralai/Mistral-Large-Instruct-2411	123	77.29	100.00
meta-llama/Meta-Llama-3.1-405B-Instruct-FP8	405	77.23	100.00
gpt-4o-2024-08-06	nan	75.15	100.00
gpt-4-turbo-2024-04-09	nan	74.59	95.91
speakleash/Bielik-11B-v2.6-Instruct	11	73.70	99.42
deepseek-ai/DeepSeek-V3-0324	685	73.46	100.00
mistralai/Mistral-Small-Instruct-2409	22	72.85	100.00
CYFRAGOVPL/Llama-PLLuM-70B-chat	70	72.56	99.42
meta-llama/Meta-Llama-3.1-70B-Instruct	70	72.53	100.00
speakleash/Bielik-11B-v2.5-Instruct	11	72.00	99.42
Qwen/Qwen2-72B-Instruct	72	71.23	98.83
meta-llama/Meta-Llama-3-70B-Instruct	70	71.21	100.00
speakleash/Bielik-11B-v3.0-Instruct	11	71.20	100.00
gpt-4o-mini-2024-07-18	nan	71.15	100.00
Qwen/Qwen2.5-32B-Instruct	32	71.15	100.00
speakleash/Bielik-11B-v2.3-Instruct	11	70.86	100.00
meta-llama/Llama-3.3-70B-Instruct	70	70.73	97.08
mistralai/Mistral-Small-24B-Instruct-2501	24	70.52	100.00
CYFRAGOVPL/Llama-PLLuM-70B-instruct	70	69.99	100.00
alpindale/WizardLM-2-8x22B	141	69.56	100.00
Qwen/Qwen2.5-14B-Instruct	14	69.17	99.42
speakleash/Bielik-11B-v2.2-Instruct	11	69.05	100.00
Qwen/Qwen2-72B	72	68.93	98.83
Qwen/Qwen2.5-72B-Instruct	72	68.49	99.42
speakleash/Bielik-11B-v2.0-Instruct	11	68.24	100.00
Qwen/Qwen1.5-72B-Chat	72	68.03	100.00
mistralai/Mixtral-8x22B-Instruct-v0.1	141	67.63	100.00
THUDM/glm-4-9b-chat	9	61.79	100.00
mistralai/Mistral-Nemo-Instruct-2407	12	61.76	100.00
speakleash/Bielik-11B-v2.1-Instruct	11	60.07	90.64
Qwen/Qwen1.5-32B-Chat	32	59.63	98.25
openchat/openchat-3.5-0106-gemma	7	59.58	99.42
openchat/openchat-3.5-0106-gemma	7	59.58	99.42
openchat/openchat-3.5-0106-gemma	7	59.41	98.83
openchat/openchat-3.5-0106-gemma	7	59.41	98.83
microsoft/phi-4	15	59.10	91.81
Qwen/Qwen2.5-7B-Instruct	7	58.58	100.00
CohereForAI/aya-23-35B	35	58.41	100.00
nvidia/Llama-3.1-Nemotron-70B-Instruct-HF	70	58.14	77.78	133.0 questions were parseable (min is 83%)
openchat/openchat-3.5-0106-gemma	7	57.93	98.83
gpt-3.5-turbo	nan	57.70	100.00
Qwen/Qwen2-57B-A14B-Instruct	57	57.64	100.00
mistralai/Mixtral-8x7B-Instruct-v0.1	47	57.61	98.25
CohereForAI/c4ai-command-r-v01	35	56.43	100.00
microsoft/Phi-3-medium-4k-instruct	14	56.40	98.83
upstage/SOLAR-10.7B-Instruct-v1.0	11	55.21	95.91
CYFRAGOVPL/pllum-12b-nc-chat-250715	12	55.17	94.74
NousResearch/Hermes-2-Theta-Llama-3-8B	8	54.88	100.00
mlabonne/NeuralDaredevil-8B-abliterated	8	54.74	100.00
NousResearch/Hermes-2-Pro-Llama-3-8B	8	54.57	100.00
upstage/SOLAR-10.7B-Instruct-v1.0	11	54.33	94.74
utter-project/EuroLLM-9B-Instruct	9	54.11	98.83
Qwen/Qwen1.5-32B	32	54.03	99.42
Qwen/Qwen2-7B-Instruct	7	53.74	100.00
speakleash/Bielik-4.5B-v3.0-Instruct	4	53.58	95.32
Qwen/Qwen2-7B-Instruct	7	53.08	100.00
google/recurrentgemma-9b-it	9	52.82	100.00
CYFRAGOVPL/PLLuM-12B-chat	12	52.26	91.23
Qwen/Qwen1.5-72B	72	51.44	95.32
microsoft/Phi-4-mini-instruct	4	50.52	99.42
berkeley-nest/Starling-LM-7B-alpha	7	49.63	100.00
NousResearch/Nous-Hermes-2-SOLAR-10.7B	11	49.27	98.83
openchat/openchat-3.5-1210	7	49.04	100.00
lex-hue/Delexa-7b	7	48.46	98.83
Qwen/Qwen1.5-14B-Chat	14	47.96	93.57
NousResearch/Nous-Hermes-2-SOLAR-10.7B	11	47.66	98.83
CYFRAGOVPL/PLLuM-8x7B-nc-chat	47	47.29	100.00
mistralai/Mistral-7B-Instruct-v0.2	7	47.02	88.30
meta-llama/Meta-Llama-3-8B-Instruct	8	46.53	100.00
01-ai/Yi-1.5-9B-Chat	9	46.50	95.32
01-ai/Yi-1.5-34B-Chat	34	46.32	100.00
meta-llama/Meta-Llama-3-8B-Instruct	8	46.27	100.00
berkeley-nest/Starling-LM-7B-alpha	7	46.26	100.00
CYFRAGOVPL/Llama-PLLuM-8B-chat	8	46.20	90.64
meta-llama/Llama-3.2-3B-Instruct	3	46.19	99.42
mistralai/Mistral-7B-Instruct-v0.2	7	45.86	86.55
CohereForAI/aya-23-8B	8	45.43	100.00
openchat/openchat-3.5-0106	7	45.42	99.42
openchat/openchat-3.5-0106	7	45.42	99.42
openchat/openchat-3.5-0106	7	45.42	99.42
nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16	32	45.28	98.83
CYFRAGOVPL/PLLuM-8x7B-chat	47	45.22	100.00
mistralai/Mistral-7B-Instruct-v0.3	7	45.21	100.00
Remek/Kruk-7B-SP-001	7	44.44	100.00
openchat/openchat-3.5-0106	7	43.81	100.00
Nexusflow/Starling-LM-7B-beta	7	43.78	97.08
Remek/OpenChat3.5-0106-Spichlerz-Bocian	7	42.84	97.08
tiiuae/falcon-11B	11	42.41	100.00
Nexusflow/Starling-LM-7B-beta	7	41.76	92.98
CYFRAGOVPL/PLLuM-8x7B-nc-instruct	47	41.75	100.00
Remek/OpenChat3.5-0106-Spichlerz-Inst-001	7	41.60	100.00
internlm/internlm2-chat-7b-sft	7	41.38	99.42
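
The Error column flags runs whose share of parseable questions falls below the stated minimum of 83%. As a minimal sketch of how such rows could be separated from the ranked results, the Python snippet below parses a few rows copied from the table above. It assumes the data is tab-separated with the same column headers as shown here; the actual layout of the downloadable file is not confirmed. The function names `load_rows` and `split_by_parseability` are hypothetical, and treating exactly 83% as passing is also an assumption.

```python
import csv
import io

# Threshold quoted in the Error column ("min is 83%").
# Assumption: a run with exactly 83% parseable questions counts as passing.
MIN_PARSEABLE_PCT = 83.0

# A few rows copied from the leaderboard table, tab-separated.
SAMPLE = (
    "Model\tParams\tBenchmark Score\tPercentage Questions Parseable\tError\n"
    "mistralai/Mistral-Large-Instruct-2407\t123\t78.07\t100.00\t\n"
    "speakleash/Bielik-11B-v2.6-Instruct\t11\t73.70\t99.42\t\n"
    "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF\t70\t58.14\t77.78\t"
    "133.0 questions were parseable (min is 83%)\n"
)


def load_rows(text):
    """Parse tab-separated leaderboard text into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for rec in reader:
        rows.append({
            "model": rec["Model"],
            "score": float(rec["Benchmark Score"]),
            "parseable": float(rec["Percentage Questions Parseable"]),
        })
    return rows


def split_by_parseability(rows, min_pct=MIN_PARSEABLE_PCT):
    """Return (valid, failed): valid rows sorted by score, highest first."""
    valid = [r for r in rows if r["parseable"] >= min_pct]
    failed = [r for r in rows if r["parseable"] < min_pct]
    valid.sort(key=lambda r: r["score"], reverse=True)
    return valid, failed


valid, failed = split_by_parseability(load_rows(SAMPLE))
for r in valid:
    print(f'{r["model"]}: {r["score"]:.2f}')
for r in failed:
    print(f'{r["model"]}: below minimum ({r["parseable"]:.2f}% parseable)')
```

Keeping the threshold as a named constant makes it easy to adjust if the benchmark's minimum changes.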

Authors:

Automatic translation: Remigiusz Kinas
Translation proofreading and localization: Maria Filipkowska, Zuzanna Dabić
Preparing dataset: Kacper Milan
Running benchmark and leaderboard: Krzysztof Wróbel

Based on: EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models, Samuel J. Paech, 2023
