
Narrative Framing for Air Pollution, Energy Transition, Animal Welfare

I identify and track a set of narrative framings across text corpora (media articles, TV news transcripts, radio programs, forums, Reddit, and other sources) on different topics using LLMs and other NLP techniques. This helps show how issues are discussed, detect trends and shifts, surface outlets and journalists to prioritize, inform advocacy, and potentially gauge intermediate impact.
This post is part of a series of technical explorations for Effective Advocacy. The goal is to devise practical tools that help advocacy organizations better inform their strategy and measure their impact. Anticipated applications include narrative framing, strategic actor mapping, and dissemination of key findings.

Why narrative framing?

Narrative framing analyses could serve multiple purposes:

Examples

Jakarta — Air pollution causes

Context: Analysis of Indonesian media coverage of air pollution in Jakarta from Jan 2020 to Oct 2025, focusing on how different causes are discussed. The corpus spans 14,469 articles in Bahasa Indonesia from major Indonesian outlets, capturing how journalists frame pollution sources, from vehicle emissions to seasonal weather patterns.

Results summary: Transport emissions dominate coverage (41% of articles), reflecting Jakarta’s heavy traffic and vehicle-related pollution discourse. Natural and meteorological factors come next at 8.5% of articles, with notable seasonal spikes during dry periods when weather conditions exacerbate pollution.

Frame share over time

Frames identified:

| Frame | Description | Keywords | Share of articles |
|---|---|---|---|
| Transport Emissions | Vehicle emissions from cars, motorcycles, buses, trucks, and road traffic | kendaraan bermotor, lalu lintas, emisi kendaraan, uji emisi | 41.1% |
| Natural Factors | Meteorological and seasonal factors affecting air quality (weather patterns, El Niño, rainfall) | cuaca, angin, musim kemarau, El Nino, curah hujan rendah | 8.5% |
| Industrial Emissions | Factory and manufacturing emissions, including smelters, steel, and cement production | pabrik, industri, smelter, industri baja, industri semen | 6.3% |
| Power Plant Emissions | Coal-fired and fossil-fuel power plant emissions | PLTU, pembangkit listrik, coal-fired power plant | 3.3% |
| Biomass Burning | Agricultural fires, forest fires, and land clearing through burning | pembakaran lahan, kebakaran hutan, pembakaran biomassa | 2.1% |
| Waste Burning | Open burning of municipal waste and landfill fires | pembakaran sampah, open burning, landfill fire | 1.9% |
| Household Emissions | Household cooking and heating using fossil fuels or biomass | pembakaran rumah tangga, kompor kayu, bahan bakar padat | 0.5% |
| Construction Dust | Construction activities, roadworks, and resuspended dust | debu konstruksi, pembangunan, road dust, pekerjaan jalan | 0.4% |

Note: Percentages represent the share of articles that discuss each frame (occurrence-based, threshold ≥0.2). Articles can discuss multiple frames.
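
To make the occurrence-based share concrete, here is a minimal sketch of how such percentages could be computed. It assumes each article already has one score per frame and that the 0.2 threshold applies to that document-level score; the function name, frame ids, and the four toy articles are hypothetical and are not taken from the actual corpus.

```python
from collections import Counter

THRESHOLD = 0.2  # from the note above; applying it to document-level scores is an assumption


def occurrence_shares(doc_scores, threshold=THRESHOLD):
    """Return, per frame, the share of documents whose score reaches the threshold."""
    counts = Counter()
    for scores in doc_scores:
        for frame, score in scores.items():
            if score >= threshold:
                counts[frame] += 1
    n_docs = len(doc_scores)
    if n_docs == 0:
        return {}
    return {frame: count / n_docs for frame, count in counts.items()}


# Hypothetical document-level scores for four articles (illustrative only).
docs = [
    {"transport": 0.7, "natural": 0.25},
    {"transport": 0.5, "natural": 0.3, "industrial": 0.1},
    {"natural": 0.6},
    {"transport": 0.05},
]
shares = occurrence_shares(docs)
total_share = sum(shares.values())
```

Because one article can clear the threshold for several frames, the shares are not expected to sum to 100%. In this toy example they add up to more than 1.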

Philippines — Renewable energy

Brazil — Animal welfare

Method overview

The pipeline follows a hybrid LLM-to-classifier approach: we start with flexible LLM exploration to discover and annotate narrative frames, then scale up with a fine-tuned transformer classifier. This balances domain adaptability (frames tailored to each question and context) with computational efficiency (fast inference over large corpora).

flowchart LR
    subgraph Collection["1. Data Collection & Preparation"]
        direction TB
        subgraph CollectionSub[ ]
        direction TB
        A["Content discovery<br/>(media, TV, radio, forums, etc.)"] 
        A2["Scrape & extract text<br/>(using Scrapy)"]
        B["Chunk text<br/>(using SpaCy language model)"]
        A --> A2 --> B
        end
    end
    
    subgraph Discovery["2. Frame Induction & Annotation"]
        direction TB
        subgraph DiscoverySub[ ]
        direction TB
        C["LLM: Induce frames<br/>(with or without user guidance)"]
        D["LLM: Label samples<br/>(multi-label distributions)"]
        C --> D
        end
    end
    
    

    subgraph Classification["3. Scalable Classification"]
        direction TB
        subgraph ClassificationSub[ ]
        direction TB
        E["Train transformer classifier<br/>(fine-tune on LLM labels)"]
        F["Classify all chunks<br/>(fast inference)"]
        E --> F
        end
    end
    
    subgraph Analysis["4. Aggregation & Reporting"]
        direction TB
        subgraph AnalysisSub[ ]
        direction TB
        G["Aggregate to document level<br/>(length-weighted attention)"]
        H["Results analysis<br/>(e.g. time series & outlets breakdowns)"]
        I["Generate reports<br/>(interactive HTML + static plots)"]
        G --> H --> I
        end
    end
    
    Collection --> Discovery
    Discovery --> Classification
    Classification --> Analysis
    
    classDef nodeBox fill:#ffffff33,stroke:#333,stroke-width:1px
    classDef somePaddingClass padding-bottom:5em
    classDef transparent fill:#ffffff00,stroke-width:0
    
    Collection:::somePaddingClass
    CollectionSub:::transparent
    Discovery:::discoveryStyle
    Discovery:::somePaddingClass
    DiscoverySub:::transparent
    Classification:::somePaddingClass
    ClassificationSub:::transparent
    Analysis:::analysisStyle
    Analysis:::somePaddingClass
    AnalysisSub:::transparent

    
    class A,A2,B,C,D,E,F,G,H,I nodeBox
    style Collection fill:#e1f5ff,stroke:#0277bd,stroke-width:2px
    style Discovery fill:#fff3e0,stroke:#ef6c00,stroke-width:2px
    style Classification fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    style Analysis fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px
    

Content discovery (search/filters): We start by defining the slice of content we care about—whether from media articles, TV news transcripts, radio programs, forums, Reddit, or other sources—in a way that is both broad enough to catch variation and precise enough to be actionable. For media analysis, using Media Cloud collections lets us anchor each run in a country and time window, and then layer topical filters (for instance, city names or issue cues) to focus coverage. Similar approaches work for other platforms: TV news and radio transcripts, forum posts, Reddit threads, or other text corpora can be collected through their respective APIs or scraping tools. The intent is to bias toward recall at this stage: we would rather include a few borderline documents and filter them downstream than miss legitimate phrasing that differs from our initial keywords. Every run is captured in a small YAML file so the choices are explicit and replicable.
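
As an illustration of what a run definition might carry, the sketch below mirrors such a config as a Python dataclass with a recall-biased filter. All field names, the exact day boundaries, and the search terms are assumptions made for the example; the real runs are described in YAML and queried through Media Cloud, neither of which is reproduced here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical run definition: country, time window, and topical cues."""
    country: str
    start: date
    end: date
    place_terms: tuple
    topic_terms: tuple

    def matches(self, text, published):
        """Recall-biased filter: any place term AND any topic term, inside the window."""
        lowered = text.lower()
        in_window = self.start <= published <= self.end
        has_place = any(term in lowered for term in self.place_terms)
        has_topic = any(term in lowered for term in self.topic_terms)
        return in_window and has_place and has_topic


# Illustrative values only; the day boundaries and terms are assumed.
config = RunConfig(
    country="ID",
    start=date(2020, 1, 1),
    end=date(2025, 10, 31),
    place_terms=("jakarta",),
    topic_terms=("polusi udara", "kualitas udara"),
)

on_topic = config.matches("Kualitas udara Jakarta memburuk pekan ini", date(2023, 8, 15))
off_topic = config.matches("Harga beras naik di Jakarta", date(2023, 8, 15))
```

Matching on any one of several broad cues keeps the filter loose, which fits the stated preference for recall at this stage.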

Scrape and extract: To reason about narratives we need full passages, not just headlines or snippets. We fetch pages and extract the main text, then remove boilerplate and navigation tails that otherwise drown the signal (things like widgets, “follow us” blocks, or stock tickers). The trimming rules live in config so we can adapt them by outlet or country. This step trades a little engineering effort for cleaner inputs and more stable downstream classification.
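
The trimming step can be sketched as a set of regular expressions grouped by outlet. The rule table, the outlet key, and the patterns below are hypothetical stand-ins for whatever the real config contains.

```python
import re

# Hypothetical trimming rules; in the pipeline these would live in config per outlet or country.
TRIM_RULES = {
    "default": [r"^follow us\b.*$", r"^share this article\b.*$"],
    "example-outlet.id": [r"^baca juga\s*:.*$"],
}


def trim_boilerplate(text, outlet="default", rules=TRIM_RULES):
    """Drop blank lines and lines matching the default or outlet-specific patterns."""
    patterns = list(rules.get("default", []))
    if outlet != "default":
        patterns += rules.get(outlet, [])
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if any(c.search(stripped) for c in compiled):
            continue
        kept.append(stripped)
    return "\n".join(kept)


raw = (
    "Polusi udara Jakarta memburuk.\n"
    "Baca juga: Berita lain\n"
    "Follow us on Twitter\n"
    "\n"
    "Kualitas udara tidak sehat."
)
clean = trim_boilerplate(raw, outlet="example-outlet.id")
```

Keeping the patterns in data rather than code is what lets the rules be adapted per outlet without touching the extraction logic.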

Frame induction (LLM): We ask an LLM to propose a compact set of categories tailored to the question and context (e.g., causes of air pollution in Jakarta) by feeding it a random sample of passages (200 passages in the examples above) in several consecutive batches, followed by a consolidation call. Users can inject guidance, e.g. to include or exclude certain frames. After a quick manual comparison of several models, based on visual inspection of the frames they produced, I selected OpenAI GPT‑4.1 for this step. The resulting schema (names, short definitions, examples, keywords) is passed along to the annotation step.
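
The batch-then-consolidate flow could look roughly like the sketch below. `call_llm` is a hypothetical callable that takes a prompt and returns a list of frame names; the prompts, the batch size of 50, and the stub model are assumptions, and a real schema would also carry definitions, examples, and keywords.

```python
import random


def induce_frames(passages, call_llm, sample_size=200, batch_size=50, guidance="", seed=0):
    """Sample passages, collect frame proposals batch by batch, then consolidate."""
    rng = random.Random(seed)
    sample = rng.sample(passages, min(sample_size, len(passages)))
    proposals = []
    for start in range(0, len(sample), batch_size):
        batch = sample[start:start + batch_size]
        prompt = (
            "Propose narrative frames for the passages below.\n"
            f"Guidance: {guidance or 'none'}\n"
            f"Frames proposed so far: {proposals}\n\n"
            + "\n---\n".join(batch)
        )
        proposals.extend(call_llm(prompt))
    # Final call: merge duplicates and overlaps into one compact schema.
    consolidation_prompt = (
        "Merge overlapping frames into a compact schema:\n" + "\n".join(proposals)
    )
    return call_llm(consolidation_prompt)


def fake_llm(prompt):
    """Stand-in for a real API call; returns canned frame names."""
    if prompt.startswith("Merge"):
        return sorted(set(prompt.splitlines()[1:]))
    return ["Transport Emissions", "Natural Factors"]


passages = [f"passage {i}" for i in range(500)]
schema = induce_frames(passages, fake_llm)
```

Passing earlier proposals into each batch prompt is one way to keep later batches from reinventing the same frames under new names. Whether the actual pipeline does this is not stated.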

Frame application to samples (LLM): We then use another LLM as a probabilistic annotator on a sample of passages (typically 2,000 passages in the examples above). Each passage gets a distribution over frames (not just a single label) plus a brief rationale. We typically use a smaller GPT‑4 variant (e.g., gpt-4.1-mini) for this step to balance cost and quality, since we need to label thousands of examples. This does two things: it reveals ambiguous cases that keyword-based approaches would mis-label, and it gives us enough labeled data to train a supervised model.
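
A small sketch of how an annotator response might be turned into a training label follows. It assumes the LLM returns JSON with a `distribution` object and a `rationale` string, and that scores are renormalized to sum to 1; the key names, frame ids, and the normalization choice are all assumptions.

```python
import json

FRAMES = ["transport", "natural", "industrial"]  # hypothetical frame ids


def parse_annotation(raw_json, frames=FRAMES):
    """Keep only known frames, clip negatives, and renormalize to a distribution."""
    data = json.loads(raw_json)
    scores = data.get("distribution", {})
    cleaned = {f: max(0.0, float(scores.get(f, 0.0))) for f in frames}
    total = sum(cleaned.values())
    if total > 0:
        cleaned = {f: v / total for f, v in cleaned.items()}
    # A passage with no frame at all keeps an all-zero vector.
    return {"distribution": cleaned, "rationale": str(data.get("rationale", ""))}


raw = (
    '{"distribution": {"transport": 0.6, "natural": 0.2, "weather": 0.5}, '
    '"rationale": "Mentions traffic and the dry season."}'
)
label = parse_annotation(raw)
```

Dropping unknown keys (here `weather`) guards against the annotator drifting away from the induced schema.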

Supervised classifier (transformers): We then fine‑tune a multi‑label transformer classifier on those LLM‑labeled passages using Hugging Face transformers. We start with a pre-trained language model (e.g., indobenchmark/indobert-base-p1 for Bahasa Indonesia, distilbert-base-uncased for English) and adapt it to our frame classification task: the encoder layers learn to recognize frame-relevant patterns, while a new classification head outputs probability scores for each frame using sigmoid activation. This gives us cheap, fast inference over tens of thousands of chunks while freezing the labeling policy defined by the schema.
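
To show why a sigmoid head suits multi-label output, the sketch below works through the arithmetic in plain Python rather than with the transformers library. The logits are made up, and the binary cross-entropy loss is a common choice for such heads, not a confirmed detail of this pipeline.

```python
import math


def sigmoid(x):
    """Map a logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def multilabel_probs(logits):
    """One independent probability per frame; values need not sum to 1."""
    return [sigmoid(z) for z in logits]


def bce_loss(logits, targets):
    """Mean binary cross-entropy across frames; targets may be soft values in [0, 1]."""
    eps = 1e-12  # avoids log(0)
    probs = multilabel_probs(logits)
    losses = [
        -(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
        for p, t in zip(probs, targets)
    ]
    return sum(losses) / len(losses)


# Made-up logits for three frames on one chunk.
logits = [2.0, 0.0, -3.0]
probs = multilabel_probs(logits)
good_fit = bce_loss(logits, [1.0, 0.0, 0.0])
bad_fit = bce_loss(logits, [0.0, 0.0, 1.0])
```

Unlike a softmax, each frame is scored on its own, so a chunk can score high on two frames at once or low on all of them. In Hugging Face transformers, setting `problem_type="multi_label_classification"` on a sequence classification model selects this kind of loss.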

Classify the corpus: We classify content at the chunk level (typically sentences or short spans) to avoid burying weaker frames in long documents. Light keyword gating and regex excludes from earlier steps help keep us on topic without reintroducing brittle rules. Results are cached per document to support iterative runs and easy re‑aggregation.
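
Per-document caching could be sketched as one JSON file per document id. The file layout, the hashing of the id, and the `classify_chunk` callable are hypothetical, and a real cache would also need invalidating when the model or the chunking changes.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def classify_with_cache(doc_id, chunks, classify_chunk, cache_dir):
    """Return chunk-level results for a document, reusing a cached file if present."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(doc_id.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    results = [
        {"text": chunk, "n_chars": len(chunk), "scores": classify_chunk(chunk)}
        for chunk in chunks
    ]
    path.write_text(json.dumps(results, ensure_ascii=False), encoding="utf-8")
    return results


calls = []


def fake_classifier(chunk):
    """Stand-in for the fine-tuned model; records each call."""
    calls.append(chunk)
    return {"transport": 0.9 if "kendaraan" in chunk else 0.1}


chunks = ["emisi kendaraan", "cuaca"]
with tempfile.TemporaryDirectory() as tmp:
    first = classify_with_cache("https://example.org/a", chunks, fake_classifier, tmp)
    second = classify_with_cache("https://example.org/a", chunks, fake_classifier, tmp)
```

Storing chunk lengths next to the scores means the aggregation step can be rerun with different weighting without calling the model again.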

Aggregate and report: Finally, we aggregate chunk‑level predictions to document‑level profiles and summaries over time. A length‑weighted aggregator estimates how much attention each frame receives within a document (article, post, thread, etc.); an occurrence view answers a different question—what share of documents mention a frame at all.
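
One plausible reading of the length-weighted aggregator is a weighted mean of chunk probabilities, with weights proportional to chunk length. Measuring length in characters and the toy numbers below are assumptions.

```python
def aggregate_document(chunks):
    """chunks: list of (n_chars, {frame: prob}); returns a length-weighted mean per frame."""
    total = sum(n_chars for n_chars, _ in chunks)
    if total == 0:
        return {}
    profile = {}
    for n_chars, scores in chunks:
        weight = n_chars / total
        for frame, prob in scores.items():
            profile[frame] = profile.get(frame, 0.0) + prob * weight
    return profile


# Toy document: a long chunk about traffic and a short one about weather.
chunks = [
    (300, {"transport": 0.9, "natural": 0.1}),
    (100, {"transport": 0.1, "natural": 0.7}),
]
profile = aggregate_document(chunks)
```

In this example the transport score is about 0.7 and the natural score about 0.25: the short weather chunk still registers, but in proportion to the space it takes up in the document.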

Why not simply use keywords?

Keyword-based approaches have significant limitations for narrative analysis:

Our approach uses LLMs to capture semantic meaning, then scales it with a classifier—combining the flexibility of language understanding with the efficiency needed for large-scale analysis.


Get in touch

I am interested in hearing from others working on similar problems or exploring how these tools could be applied in new contexts or developed further. Whether you have ideas for improvements, questions about the approach, or want to collaborate on applications, I’d love to hear from you. Reach out to me.