Beyond Language Barriers

Can LLMs Analyze Sentiment in French Text?

Laurence-Olivier M. Foisy

Université Laval

Camille Pelletier
Étienne Proulx
Sarah-Jane Vincent
Mickael Temporão
Yannick Dufresne

April 4, 2025

Research Journey

  • Studying media discourse on Open Source Software in French.
  • Corpus: 2,683 news articles (1995-2025).
  • Needed: Sentiment Analysis over time.
  • Traditional tool: Lexicoder Sentiment Dictionary (LSD).
  • The thought: What if we used Large Language Models (LLMs) instead?

The Challenge: LLMs & Language

Most Foundational LLMs

  • Predominantly trained on English data.
  • Performance in other languages? Less certain.

Our Research Questions

  1. Can general-purpose LLMs accurately evaluate sentiment in French texts without fine-tuning?
  2. How do open-source LLMs perform compared to closed-source?
  3. Does the language of the prompt (French vs. English) affect performance?

Why Does This Matter?

For Social Science Research:

  • Lower Barrier: Less ML expertise needed.
  • Capture Nuance: Go beyond dictionary limits (potentially).
  • Analyze Multilingually: Work directly with non-English text.
  • Process Locally: Enhance privacy with open models.

Broader Impact:

  • Advance Multilingual Tech: Reduce English bias in NLP.
  • Integrate AI Methods: Adapt new tools for research.
  • Understand AI Capabilities: Evaluate influential technology.
  • Democratize Tools: Broaden access to advanced NLP.

The Literature Gap

What We Knew:

  • Specialized models work.
  • Fine-tuning helps.
  • English prompts might aid cross-lingual tasks.

What We Didn’t Know:

  • General LLMs w/o fine-tuning?
  • Open-source vs. closed?
  • Impact of prompt language for French?
  • LLMs vs. dictionaries?

Approach: Systematic Evaluation

Data

  • French News Corpus about FOSS
  • 200 human annotated sentences
  • Manual Annotation (-1 to 1)
  • French → English translation via Google Translate

Methods

  • 11 LLMs (Open/Closed)
  • 3 Linguistic Conditions
  • The mean of 3 runs for each prompt set
  • 19,800 Prompts
  • Compared: English and French Lexicoder Dictionaries
  • Metrics: correlation, MAE, F1 (3- and 7-category)
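The three evaluation metrics can be computed directly from paired human and model scores. A minimal sketch in Python (the study's own pipeline is in R); the ±0.1 neutral band used to derive the 3 polarity categories is an assumption, not the study's documented cutoff:

```python
import math

def sentiment_metrics(human, model):
    """Compare model sentiment scores against human annotations,
    both on the -1.0 to 1.0 scale used in the study."""
    n = len(human)
    # Pearson correlation
    mh, mm = sum(human) / n, sum(model) / n
    cov = sum((h - mh) * (m - mm) for h, m in zip(human, model))
    sh = math.sqrt(sum((h - mh) ** 2 for h in human))
    sm = math.sqrt(sum((m - mm) ** 2 for m in model))
    corr = cov / (sh * sm)
    # Mean absolute error
    mae = sum(abs(h - m) for h, m in zip(human, model)) / n
    # 3-category macro F1; the +/-0.1 neutral band is an assumed cutoff
    def cat(x):
        return -1 if x < -0.1 else (1 if x > 0.1 else 0)
    f1s = []
    for c in (-1, 0, 1):
        tp = sum(1 for h, m in zip(human, model) if cat(h) == c and cat(m) == c)
        fp = sum(1 for h, m in zip(human, model) if cat(h) != c and cat(m) == c)
        fn = sum(1 for h, m in zip(human, model) if cat(h) == c and cat(m) != c)
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return corr, mae, sum(f1s) / len(f1s)
```

Correlation rewards capturing intensity; macro F1 over the three polarity bins only rewards getting the direction right, which is why the dictionary baseline stays competitive on F1.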

Before We Begin:

Warning:

  • Take the results with a grain of salt.
  • Only 200 sentences
  • Single coder

The Contenders: 11 LLMs

Closed Source:

  • Anthropic: Claude 3.5 Haiku
  • Google: Gemini 2.0 Flash
  • DeepSeek Chat
  • OpenAI: GPT-4o

Open Source (Weights):

  • Meta: Llama 3.2 (1B, 3B)
  • Google: Gemma 2 (9B)
  • Mistral: Saba (24B)
  • Alibaba: QwQ (32B)
  • Meta: Llama 3.3 (70B)
  • DeepSeek R1 Basic (671B)

The Experiment

LLM Sentiment Analysis Workflow:

  • Input data: 200 French sentences (`df$sentences`) + English translations (`df$sentences_en`).
  • Three conditions applied to each sentence:
    • Condition 1 (FR → FR): French prompt + French text (`df$sentences`).
    • Condition 2 (EN → FR): English prompt + French text (`df$sentences`).
    • Condition 3 (EN → EN): English prompt + English text (`df$sentences_en`).
  • LLM execution via `ellmer::chat_*`: each condition processed by each of the 11 models (GPT-4o, Claude 3.5, Gemini 2.0, DeepSeek, Llama 3.x, Gemma 2, QwQ, Mistral).
  • Robustness: each sentence/model/condition run 3 times, then averaged — Mean(Run1, Run2, Run3).
  • Results stored as sentiment scores in [-1.0, 1.0] (`df$model_condition`).
  • Total: 19,800 API calls (11 models × 3 conditions × 200 sentences × 3 runs).
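The workflow above can be sketched as a small loop. This is an illustrative Python stand-in (the study used R's `ellmer::chat_*` functions); `score_sentiment` is a hypothetical placeholder for the real API call, and the model list is truncated to two of the eleven:

```python
import statistics

MODELS = ["gpt-4o", "claude-3.5-haiku"]   # 11 models in the actual study
CONDITIONS = ["fr_fr", "en_fr", "en_en"]  # prompt language -> text language
N_RUNS = 3                                # robustness: average 3 runs

def score_sentiment(model, condition, sentence, run):
    """Hypothetical stand-in for the real LLM API call; the study used
    R's ellmer::chat_* functions. Returns a score in [-1.0, 1.0]."""
    return 0.5  # placeholder value

def run_experiment(sentences):
    """Score every sentence under every model x condition, averaging runs."""
    results = {}
    for model in MODELS:
        for cond in CONDITIONS:
            for i, sentence in enumerate(sentences):
                runs = [score_sentiment(model, cond, sentence, r)
                        for r in range(N_RUNS)]
                results[(model, cond, i)] = statistics.mean(runs)
    return results

# Full study: 11 models x 3 conditions x 200 sentences x 3 runs = 19,800 calls
```

Averaging three runs per prompt smooths out the sampling variance of non-deterministic model outputs before any metric is computed.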

Prompt Structure

Please analyze the sentiment of the following French text and provide a single
numerical rating according to this scale:

Sentiment Scale:
-1.0: Strong negative sentiment...
// ... (Scale definitions) ... //
 1.0: Strong positive sentiment...

Important instructions:
1. ...consider cultural and linguistic nuances...
2. ...analyze emotional tone, word choice...
3. Respond ONLY with a single numerical value...
4. Do not include ANY explanations...

Here is the text to analyze: [TEXT]

Three prompts for each language condition
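Assembling a prompt in this structure is straightforward string composition. A minimal Python sketch — the scale entries and wording here paraphrase the slide's excerpt, not the study's exact prompt text:

```python
# Abbreviated scale; the study's prompt defines intermediate values too.
SCALE = """Sentiment Scale:
-1.0: Strong negative sentiment
 0.0: Neutral sentiment
 1.0: Strong positive sentiment"""

def build_prompt(text, text_language="French"):
    """Compose a sentiment-rating prompt following the structure above."""
    return (
        f"Please analyze the sentiment of the following {text_language} text "
        "and provide a single numerical rating according to this scale:\n\n"
        f"{SCALE}\n\n"
        "Important instructions:\n"
        "1. Consider cultural and linguistic nuances.\n"
        "2. Analyze emotional tone and word choice.\n"
        "3. Respond ONLY with a single numerical value.\n"
        "4. Do not include ANY explanations.\n\n"
        f"Here is the text to analyze: {text}"
    )
```

Constraining the model to emit only a number (instructions 3 and 4) keeps the output parseable, though the limitations slide notes that formats were still inconsistent in practice.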

Finding 1

LLMs can analyze sentiment in French

Key Points:

  • Top Closed Models (DeepSeek Chat, GPT-4o, Gemini, Claude): r > 0.65
  • Best Open Models (Llama 70B, DeepSeek R1): Competitive
  • English Dictionary (LSD): Solid baseline r = 0.53

Finding 2

Prompt language choice has minimal practical impact.

Condition   Avg. Corr   Avg. F1 (3-cat)   Avg. MAE (lower is better)
FR → FR     0.537       0.539             0.260
EN → FR     0.543       0.546             0.259
EN → EN     0.520       0.557             0.252
  • Differences between prompt/text language conditions are consistently small across all metrics.
  • No single condition consistently outperforms the others:
    • EN→FR leads slightly in correlation.
    • EN→EN leads slightly in F1 and has the lowest average error (MAE).

Finding 3

LLMs vs. Dictionary

Correlation (Nuance):

  • 🥇 Top LLMs Win: Better intensity capture (r=0.71).
  • 🥈 LSD Baseline: Less nuanced (EN r=0.53).

F1 Score (3-Cat Polarity):

  • 🥇 Top LLMs Win (narrowly): F1 ≈ 0.71.
  • 🥈 EN LSD Very Competitive: F1 = 0.625!

Insight: LLMs better for intensity, but LSD holds its own for basic polarity and offers transparency.

Finding 4

Larger parameter counts perform better

Finding 5

Dictionaries are still strong

So What? Practical Implications

For Social Scientists:

  • Use LLMs: Top closed & larger open models work without fine-tuning.
  • Prompt Flexibility: French or English prompts both work; analyzing the French text directly performs slightly better on correlation.
  • Consider Trade-offs:
    • LLMs: more nuance, less transparency.
    • LSD: less nuance, more transparency.

Keep In Mind: Limitations

Data & Scope Caveats:

  • Ground Truth Reliability:
    • Single coder (No Inter-Annotator Agreement)
    • Small validation sample (200 sentences)
    • May not capture full linguistic variety
  • Generalizability Limits:
    • French language only (Not other languages)
    • Specific domain/genre (News / Open Source)

Method & Context Caveats:

  • Methodological Factors:
    • Translation quality (Google Translate) not assessed
    • Fixed prompt structure (Model-specific tuning?)
    • Inconsistent LLM output formats (Parsing needed)
  • Study Parameters:
    • LLM selection based on accessibility/cost
    • No efficiency analysis (Time/Cost)
    • Snapshot in time (Rapid LLM evolution)

Conclusion

Can general-purpose LLMs analyze sentiment in French without fine-tuning? Yes.

Key Takeaways:

  • Leading LLMs offer powerful, accessible analysis.
  • Open-source models viable (bigger = better).
  • Prompt language less critical here.
  • LSD remains strong & transparent for polarity & macro-trends.

The Bottom Line: Choose based on nuance vs. transparency. LLMs are valuable tools for non-English text.

Thank You!

Questions?

Laurence-Olivier M. Foisy

mail@mfoisy.com

Co-authors: Camille Pelletier, Étienne Proulx, Sarah-Jane Vincent, Mickael Temporão, Yannick Dufresne

Full paper and details available on GitHub