Studying media discourse on Open Source Software in French.
Corpus: 2,683 news articles (1995-2025).
Needed: Sentiment Analysis over time.
Traditional tool: Lexicoder Sentiment Dictionary (LSD).
The thought: What if we used Large Language Models (LLMs) instead?
The Challenge: LLMs & Language
Most Foundational LLMs
Predominantly trained on English data.
Performance in other languages? Less certain.
Our Research Questions
Can general-purpose LLMs accurately evaluate sentiment in French texts without fine-tuning?
How do open-source LLMs perform compared to closed-source?
Does the language of the prompt (French vs. English) affect performance?
Why Does This Matter?
For Social Science Research:
Lower Barrier: Less ML expertise needed.
Capture Nuance: Go beyond dictionary limits (potentially).
Analyze Multilingually: Work directly with non-English text.
Process Locally: Enhance privacy with open models.
Broader Impact:
Advance Multilingual Tech: Reduce English bias in NLP.
Integrate AI Methods: Adapt new tools for research.
Understand AI Capabilities: Evaluate influential technology.
Democratize Tools: Broaden access to advanced NLP.
The Literature Gap
What We Knew:
Specialized models work.
Fine-tuning helps.
English prompts might aid cross-lingual tasks.
What We Didn’t Know:
General LLMs w/o fine-tuning?
Open-source vs. closed?
Impact of prompt language for French?
LLMs vs. dictionaries?
Approach: Systematic Evaluation
Data
French News Corpus about FOSS
200 human-annotated sentences
Manual Annotation (-1 to 1)
Google Translate: French to English
Methods
11 LLMs (Open/Closed)
3 Linguistic Conditions
The mean of 3 runs for each prompt set
19,800 Prompts
Compared: English and French Lexicoder Dictionaries
Metrics: Correlation (r), MAE, F1 (3 and 7 categories)
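The metrics listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the study's actual pipeline: Pearson is assumed for "Corr", macro-averaging is assumed for F1, and the 0.25 cut-off used to turn continuous scores into three categories is a hypothetical choice. The function names and the toy data are invented for the example.

```python
import math

def mean_of_runs(runs):
    """Average several runs sentence by sentence (each run is a list of scores)."""
    return [sum(scores) / len(scores) for scores in zip(*runs)]

def pearson_r(x, y):
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mae(x, y):
    """Mean absolute error between human and model scores."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

def to_3cat(score, threshold=0.25):
    """Map a score in [-1, 1] to neg/neu/pos. The threshold is an assumption."""
    if score > threshold:
        return "pos"
    if score < -threshold:
        return "neg"
    return "neu"

def macro_f1(gold, pred, labels=("neg", "neu", "pos")):
    """Unweighted mean of per-class F1 scores (macro-averaging is an assumption)."""
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy data: five sentences, one human score each, three model runs.
human = [-1.0, -0.5, 0.0, 0.5, 1.0]
runs = [
    [-0.9, -0.4, 0.1, 0.4, 0.9],
    [-1.0, -0.6, 0.0, 0.6, 1.0],
    [-0.8, -0.5, 0.2, 0.5, 0.8],
]
model = mean_of_runs(runs)

r = pearson_r(human, model)
error = mae(human, model)
f1 = macro_f1([to_3cat(s) for s in human], [to_3cat(s) for s in model])
```

Correlation and MAE work on the continuous scores, so they reward getting intensity right. F1 only sees the binned labels, so it measures polarity.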
Before We Begin:
Warning:
Take the results with a grain of salt.
Only 200 sentences
Single coder
The Contenders: 11 LLMs
Closed Source:
Anthropic: Claude 3.5 Haiku
Google: Gemini 2.0 Flash
DeepSeek Chat
OpenAI: GPT-4o
Open Source (Weights):
Meta: Llama 3.2 (1B, 3B)
Google: Gemma 2 (9B)
Mistral: Saba (24B)
Alibaba: QwQ (32B)
Meta: Llama 3.3 (70B)
DeepSeek R1 Basic (671B)
The Experiment
Prompt Structure
Please analyze the sentiment of the following French text and provide a single numerical rating according to this scale:
Sentiment Scale:
-1.0: Strong negative sentiment...
// ... (Scale definitions) ...
1.0: Strong positive sentiment...
Important instructions:
1. ...consider cultural and linguistic nuances...
2. ...analyze emotional tone, word choice...
3. Respond ONLY with a single numerical value...
4. Do not include ANY explanations...
Here is the text to analyze: [TEXT]
Three prompts for each language condition
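One reading of the design that reproduces the 19,800 figure is a full grid of 11 models, 3 language conditions, 3 prompts per condition, and 200 sentences. The sketch below enumerates that grid. The model names are placeholders. The condition labels are read as "prompt language → text language", which is an interpretation of the slides, not something they state.

```python
from itertools import product

# Placeholder identifiers; the real model names are listed on the "Contenders" slide.
MODELS = [f"model_{i}" for i in range(1, 12)]

# Read as "prompt language -> text language" (an interpretation of the slides).
CONDITIONS = ["FR->FR", "EN->FR", "EN->EN"]

PROMPTS_PER_CONDITION = 3
N_SENTENCES = 200

def build_jobs(models, conditions, n_prompts, n_sentences):
    """Enumerate every (model, condition, prompt index, sentence index) combination."""
    return list(product(models, conditions, range(n_prompts), range(n_sentences)))

jobs = build_jobs(MODELS, CONDITIONS, PROMPTS_PER_CONDITION, N_SENTENCES)
print(len(jobs))  # 19800
```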
Finding 1
LLMs can analyze sentiment in French
Key Points:
Top Closed Models (DeepSeek Chat, GPT-4o, Gemini, Claude): r > 0.65
Best Open Models (Llama 70B, DeepSeek R1): Competitive
English Dictionary (LSD): Solid baseline r = 0.53
Finding 2
Prompt language choice has minimal practical impact
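| Condition | Avg. Corr | Avg. F1 (3-cat) | Avg. MAE (Lower is Better) |
|-----------|-----------|-----------------|----------------------------|
| FR → FR   | 0.537     | 0.539           | 0.260                      |
| EN → FR   | 0.543     | 0.546           | 0.259                      |
| EN → EN   | 0.520     | 0.557           | 0.252                      |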
Differences between prompt/text language conditions are consistently small across all metrics.
No single condition consistently outperforms others:
EN→FR leads slightly in Correlation.
EN→EN leads slightly in average F1 and has the lowest average error (MAE).
Finding 3
LLMs vs. Dictionary
Correlation (Nuance):
🥇 Top LLMs Win: Better intensity capture (r=0.71).
🥈 LSD Baseline: Less nuanced (EN r=0.53).
F1 Score (3-Cat Polarity):
🥇 Top LLMs Win (narrowly): F1 ≈ 0.71.
🥈 EN LSD Very Competitive: F1 = 0.625!
Insight: LLMs better for intensity, but LSD holds its own for basic polarity and offers transparency.
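To make the transparency point concrete, here is a toy dictionary scorer. It is only a sketch of the general idea of counting positive and negative words: the word lists are hypothetical and are not taken from the Lexicoder Sentiment Dictionary, and the real LSD workflow (pattern matching, negation handling) is not reproduced here.

```python
import re

# Hypothetical toy lexicon for illustration only; NOT the Lexicoder word lists.
TOY_POSITIVE = {"good", "great", "free", "open"}
TOY_NEGATIVE = {"bad", "risk", "costly", "broken"}

def dictionary_score(text):
    """Return (positive hits - negative hits) / token count, or 0.0 for empty text."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    pos = sum(token in TOY_POSITIVE for token in tokens)
    neg = sum(token in TOY_NEGATIVE for token in tokens)
    return (pos - neg) / len(tokens)

print(dictionary_score("Open source is great"))      # 0.5
print(dictionary_score("A costly and broken tool"))  # -0.4
```

Every score can be traced back to the exact words that triggered it, which is the transparency advantage. The flip side is that the score depends only on which listed words occur, not on how they are used in context.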
Finding 4
Larger parameter counts perform better
Finding 5
Dictionaries are still strong
So What? Practical Implications
For Social Scientists:
Use LLMs: Top closed & larger open models work without fine-tuning.
Prompt Flexibility: French or English likely fine; direct analysis slightly better.
Consider Trade-offs:
LLMs: more nuance, less transparency
LSD: less nuance, more transparency
Keep In Mind: Limitations
Data & Scope Caveats:
Ground Truth Reliability:
Single coder (No Inter-Annotator Agreement)
Small validation sample (200 sentences)
May not capture full linguistic variety
Generalizability Limits:
French language only (Not other languages)
Specific domain/genre (News / Open Source)
Method & Context Caveats:
Methodological Factors:
Translation quality (Google Translate) not assessed
Fixed prompt structure (Model-specific tuning?)
Inconsistent LLM output formats (Parsing needed)
Study Parameters:
LLM selection based on accessibility/cost
No efficiency analysis (Time/Cost)
Snapshot in time (Rapid LLM evolution)
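The "Inconsistent LLM output formats (Parsing needed)" caveat above can be illustrated with a small parser. This is a hedged sketch, not the parsing code used in the study: the function name `parse_rating` and its rules (take the first number in the reply, accept a decimal comma, reject values outside the scale) are assumptions.

```python
import re
from typing import Optional

# Matches an optionally signed integer or decimal, e.g. "-0.7", ".5", "1".
_NUMBER = re.compile(r"[-+]?\d*\.?\d+")

def parse_rating(raw: str) -> Optional[float]:
    """Extract the first number from a model reply, or None if unusable.

    Assumed rules: decimal commas and the Unicode minus sign are normalized,
    and values outside the -1.0 to 1.0 scale are rejected rather than clipped.
    """
    cleaned = raw.replace(",", ".").replace("\u2212", "-")
    match = _NUMBER.search(cleaned)
    if match is None:
        return None
    value = float(match.group())
    if not -1.0 <= value <= 1.0:
        return None
    return value

print(parse_rating("0.5"))               # 0.5
print(parse_rating("Sentiment: -0,7"))   # -0.7
print(parse_rating("I cannot answer."))  # None
```

Taking the first number is a simplification: a reply that mentions another figure before the rating would be misread, so rejected or suspicious replies are worth logging and checking by hand.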
Conclusion
Can general-purpose LLMs analyze sentiment in French without fine-tuning? Yes.
Key Takeaways:
Leading LLMs offer powerful, accessible analysis.
Open-source models viable (bigger = better).
Prompt language less critical here.
LSD remains strong & transparent for polarity & macro-trends.
The Bottom Line: Choose based on nuance vs. transparency. LLMs are valuable tools for non-English text.