Beyond Language Barriers

Can LLMs Analyze Sentiment in French Text?

Laurence-Olivier M. Foisy

Université Laval

Camille Pelletier
Étienne Proulx
Sarah-Jane Vincent
Mickael Temporão
Yannick Dufresne

April 4, 2025

Research Journey

  • Studying media discourse on Open Source Software in French.
  • Corpus: 2,683 news articles (1995-2025).
  • Needed: Sentiment Analysis over time.
  • Traditional tool: Lexicoder Sentiment Dictionary (LSD).
  • The thought: What if we used Large Language Models (LLMs) instead?

The Challenge: LLMs & Language

Most Foundational LLMs

  • Predominantly trained on English data.
  • Performance in other languages? Less certain.

Our Research Questions

  1. Can general-purpose LLMs accurately evaluate sentiment in French texts without fine-tuning?
  2. How do open-source LLMs perform compared to closed-source?
  3. Does the language of the prompt (French vs. English) affect performance?

Why Does This Matter?

For Social Science Research:

  • Lower Barrier: Less ML expertise needed.
  • Capture Nuance: Go beyond dictionary limits (potentially).
  • Analyze Multilingually: Work directly with non-English text.
  • Process Locally: Enhance privacy with open models.

Broader Impact:

  • Advance Multilingual Tech: Reduce English bias in NLP.
  • Integrate AI Methods: Adapt new tools for research.
  • Understand AI Capabilities: Evaluate influential technology.
  • Democratize Tools: Broaden access to advanced NLP.

The Literature Gap

What We Knew:

  • Specialized models work.
  • Fine-tuning helps.
  • English prompts might aid cross-lingual tasks.

What We Didn’t Know:

  • General LLMs w/o fine-tuning?
  • Open-source vs. closed?
  • Impact of prompt language for French?
  • LLMs vs. dictionaries?

Approach: Systematic Evaluation

Data

  • French News Corpus about FOSS
  • 200 human annotated sentences
  • Manual Annotation (-1 to 1)
  • French → English translation via Google Translate

Methods

  • 11 LLMs (Open/Closed)
  • 3 Linguistic Conditions
  • The mean of 3 runs for each prompt set
  • 19,800 Prompts
  • Compared: English and French Lexicoder Dictionaries
  • Metrics: correlation, MAE, F1 (3- and 7-category)
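The three evaluation metrics can be computed directly from paired human and model scores. A minimal sketch in Python (the study's own pipeline is in R); the ±0.1 neutral band used to derive the 3 polarity categories is an assumption, not the study's documented cutoff:

```python
import math

def sentiment_metrics(human, model):
    """Compare model sentiment scores against human annotations,
    both on the -1.0 to 1.0 scale used in the study."""
    n = len(human)
    # Pearson correlation
    mh, mm = sum(human) / n, sum(model) / n
    cov = sum((h - mh) * (m - mm) for h, m in zip(human, model))
    sh = math.sqrt(sum((h - mh) ** 2 for h in human))
    sm = math.sqrt(sum((m - mm) ** 2 for m in model))
    corr = cov / (sh * sm)
    # Mean absolute error
    mae = sum(abs(h - m) for h, m in zip(human, model)) / n
    # 3-category macro F1; the +/-0.1 neutral band is an assumed cutoff
    def cat(x):
        return -1 if x < -0.1 else (1 if x > 0.1 else 0)
    f1s = []
    for c in (-1, 0, 1):
        tp = sum(1 for h, m in zip(human, model) if cat(h) == c and cat(m) == c)
        fp = sum(1 for h, m in zip(human, model) if cat(h) != c and cat(m) == c)
        fn = sum(1 for h, m in zip(human, model) if cat(h) == c and cat(m) != c)
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return corr, mae, sum(f1s) / len(f1s)
```

Correlation rewards capturing intensity; macro F1 over the three polarity bins only rewards getting the direction right, which is why the dictionary baseline stays competitive on F1.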

Before We Begin:

Warning:

  • Take the results with a grain of salt.
  • Only 200 sentences
  • Single coder

The Contenders: 11 LLMs

Closed Source:

  • Anthropic: Claude 3.5 Haiku
  • Google: Gemini 2.0 Flash
  • DeepSeek Chat
  • OpenAI: GPT-4o

Open Source (Weights):

  • Meta: Llama 3.2 (1B, 3B)
  • Google: Gemma 2 (9B)
  • Mistral: Saba (24B)
  • Alibaba: QwQ (32B)
  • Meta: Llama 3.3 (70B)
  • DeepSeek R1 Basic (671B)

The Experiment

LLM Sentiment Analysis Workflow:

  • Input data: 200 French sentences (`df$sentences`) + English translations (`df$sentences_en`).
  • Three conditions applied to each sentence:
    • Condition 1 (FR → FR): French prompt + French text (`df$sentences`).
    • Condition 2 (EN → FR): English prompt + French text (`df$sentences`).
    • Condition 3 (EN → EN): English prompt + English text (`df$sentences_en`).
  • LLM execution via `ellmer::chat_*`: each condition processed by each of the 11 models (GPT-4o, Claude 3.5, Gemini 2.0, DeepSeek, Llama 3.x, Gemma 2, QwQ, Mistral).
  • Robustness: each sentence/model/condition run 3 times, then averaged — Mean(Run1, Run2, Run3).
  • Results stored as sentiment scores in [-1.0, 1.0] (`df$model_condition`).
  • Total: 19,800 API calls (11 models × 3 conditions × 200 sentences × 3 runs).
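The workflow above can be sketched as a small loop. This is an illustrative Python stand-in (the study used R's `ellmer::chat_*` functions); `score_sentiment` is a hypothetical placeholder for the real API call, and the model list is truncated to two of the eleven:

```python
import statistics

MODELS = ["gpt-4o", "claude-3.5-haiku"]   # 11 models in the actual study
CONDITIONS = ["fr_fr", "en_fr", "en_en"]  # prompt language -> text language
N_RUNS = 3                                # robustness: average 3 runs

def score_sentiment(model, condition, sentence, run):
    """Hypothetical stand-in for the real LLM API call; the study used
    R's ellmer::chat_* functions. Returns a score in [-1.0, 1.0]."""
    return 0.5  # placeholder value

def run_experiment(sentences):
    """Score every sentence under every model x condition, averaging runs."""
    results = {}
    for model in MODELS:
        for cond in CONDITIONS:
            for i, sentence in enumerate(sentences):
                runs = [score_sentiment(model, cond, sentence, r)
                        for r in range(N_RUNS)]
                results[(model, cond, i)] = statistics.mean(runs)
    return results

# Full study: 11 models x 3 conditions x 200 sentences x 3 runs = 19,800 calls
```

Averaging three runs per prompt smooths out the sampling variance of non-deterministic model outputs before any metric is computed.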

Prompt Structure

Please analyze the sentiment of the following French text and provide a single
numerical rating according to this scale:

Sentiment Scale:
-1.0: Strong negative sentiment...
// ... (Scale definitions) ... //
 1.0: Strong positive sentiment...

Important instructions:
1. ...consider cultural and linguistic nuances...
2. ...analyze emotional tone, word choice...
3. Respond ONLY with a single numerical value...
4. Do not include ANY explanations...

Here is the text to analyze: [TEXT]

Three prompts for each language condition
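Assembling a prompt in this structure is straightforward string composition. A minimal Python sketch — the scale entries and wording here paraphrase the slide's excerpt, not the study's exact prompt text:

```python
# Abbreviated scale; the study's prompt defines intermediate values too.
SCALE = """Sentiment Scale:
-1.0: Strong negative sentiment
 0.0: Neutral sentiment
 1.0: Strong positive sentiment"""

def build_prompt(text, text_language="French"):
    """Compose a sentiment-rating prompt following the structure above."""
    return (
        f"Please analyze the sentiment of the following {text_language} text "
        "and provide a single numerical rating according to this scale:\n\n"
        f"{SCALE}\n\n"
        "Important instructions:\n"
        "1. Consider cultural and linguistic nuances.\n"
        "2. Analyze emotional tone and word choice.\n"
        "3. Respond ONLY with a single numerical value.\n"
        "4. Do not include ANY explanations.\n\n"
        f"Here is the text to analyze: {text}"
    )
```

Constraining the model to emit only a number (instructions 3 and 4) keeps the output parseable, though the limitations slide notes that formats were still inconsistent in practice.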

Finding 1

LLMs can analyze sentiment in French

Key Points:

  • Top Closed Models (DeepSeek Chat, GPT-4o, Gemini, Claude): r > 0.65
  • Best Open Models (Llama 70B, DeepSeek R1): Competitive
  • English Dictionary (LSD): Solid baseline r = 0.53

Finding 2

Prompt language choice has minimal practical impact.

Condition   Avg. Corr   Avg. F1 (3-cat)   Avg. MAE (lower is better)
FR → FR     0.537       0.539             0.260
EN → FR     0.543       0.546             0.259
EN → EN     0.520       0.557             0.252
  • Differences between prompt/text language conditions are consistently small across all metrics.
  • No single condition consistently outperforms the others:
    • EN→FR leads slightly in correlation.
    • EN→EN leads slightly in F1 and has the lowest average error (MAE).

Finding 3

LLMs vs. Dictionary

Correlation (Nuance):

  • 🥇 Top LLMs Win: Better intensity capture (r=0.71).
  • 🥈 LSD Baseline: Less nuanced (EN r=0.53).

F1 Score (3-Cat Polarity):

  • 🥇 Top LLMs Win (narrowly): F1 ≈ 0.71.
  • 🥈 EN LSD Very Competitive: F1 = 0.625!

Insight: LLMs better for intensity, but LSD holds its own for basic polarity and offers transparency.

Finding 4

Larger parameter counts perform better

Finding 5

Dictionaries are still strong

So What? Practical Implications

For Social Scientists:

  • Use LLMs: Top closed & larger open models work without fine-tuning.
  • Prompt Flexibility: French or English prompts both work; analyzing the French text directly performs slightly better on correlation.
  • Consider Trade-offs:
    • LLMs: more nuance, less transparency.
    • LSD: less nuance, more transparency.

Keep In Mind: Limitations

Data & Scope Caveats:

  • Ground Truth Reliability:
    • Single coder (No Inter-Annotator Agreement)
    • Small validation sample (200 sentences)
    • May not capture full linguistic variety
  • Generalizability Limits:
    • French language only (Not other languages)
    • Specific domain/genre (News / Open Source)

Method & Context Caveats:

  • Methodological Factors:
    • Translation quality (Google Translate) not assessed
    • Fixed prompt structure (Model-specific tuning?)
    • Inconsistent LLM output formats (Parsing needed)
  • Study Parameters:
    • LLM selection based on accessibility/cost
    • No efficiency analysis (Time/Cost)
    • Snapshot in time (Rapid LLM evolution)

Conclusion

Can general-purpose LLMs analyze sentiment in French without fine-tuning? Yes.

Key Takeaways:

  • Leading LLMs offer powerful, accessible analysis.
  • Open-source models viable (bigger = better).
  • Prompt language less critical here.
  • LSD remains strong & transparent for polarity & macro-trends.

The Bottom Line: Choose based on nuance vs. transparency. LLMs are valuable tools for non-English text.

Thank You!

Questions?

Laurence-Olivier M. Foisy

mail@mfoisy.com

Co-authors: Camille Pelletier, Étienne Proulx, Sarah-Jane Vincent, Mickael Temporão, Yannick Dufresne

Full paper and details available on GitHub