RulingIQ Data Methodology: How We Score Judicial Behavior

Judicial analytics is only as useful as the underlying data is reliable. At RulingIQ, we believe attorneys deserve to understand exactly how our scores and profiles are constructed — not just the outputs, but the methodology behind them. This post provides a detailed and honest account of how we build our judicial intelligence data.

Primary Data Sources

RulingIQ builds its judicial profiles from three primary data sources:

1. CourtListener (Free Law Project)

CourtListener is the largest freely accessible database of federal court opinions in the United States, maintained by the non-profit Free Law Project. Our pipeline ingests CourtListener's full opinion database, which covers federal district and appellate courts with particular depth from 2005 onward. We use both the opinion corpus (full text of published decisions) and the RECAP archive (docket entries crowdsourced from PACER users).

Coverage notes: CourtListener's opinion coverage is excellent for the most active federal districts (N.D. Tex., S.D.N.Y., C.D. Cal., N.D. Cal.) and sparser for smaller or lower-activity divisions. Our profiles display coverage quality indicators for each judge so users can assess how representative the data is.

2. Administrative Office of U.S. Courts Statistical Reports

The AO publishes annual judicial caseload statistics, including per-judge data on pending cases, time to disposition, and trial rates. These reports are the authoritative source for docket efficiency metrics. We incorporate AO data for all judges covered by the statistical reports, which include all active Article III district judges.

3. Federal Judicial Center Biographical Directory

The FJC's Biographical Directory of Article III Federal Judges provides authoritative data on appointment history, prior career, law school, and clerkship history. This is the source for the appointment-context section of every judge profile.

Opinion Classification: Motion Type and Outcome

Extracting motion grant rates from judicial opinions and docket entries requires two classification steps: identifying what type of motion the opinion addresses, and classifying the outcome as favorable to the movant, unfavorable, or mixed.

Motion Type Classification

We trained a text classifier on a labeled dataset of 15,000 federal court opinions, with human-reviewed labels for motion type across 12 categories including 12(b)(6), 12(b)(1), Rule 56 summary judgment (plaintiff-filed and defendant-filed separately), Rule 23 class certification, Daubert/Rule 702, and Rule 50 judgment as a matter of law.

The classifier achieves precision and recall above 91% on the held-out test set for all high-frequency motion types. For lower-frequency categories (Daubert, Rule 50), we publish confidence intervals rather than point estimates when sample sizes are below 30.
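To make the precision and recall figures concrete, here is a minimal sketch of how per-category metrics can be computed from a held-out test set. The function name and the short label strings are hypothetical and chosen for illustration; this is not the production evaluation code.

```python
from collections import Counter

def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} from parallel lists of labels.

    A metric is None when its denominator is zero (for example, a label
    that the model never predicted has no defined precision).
    """
    true_positive = Counter()
    predicted_count = Counter(y_pred)
    actual_count = Counter(y_true)
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            true_positive[truth] += 1

    metrics = {}
    for label in set(y_true) | set(y_pred):
        precision = (true_positive[label] / predicted_count[label]
                     if predicted_count[label] else None)
        recall = (true_positive[label] / actual_count[label]
                  if actual_count[label] else None)
        metrics[label] = (precision, recall)
    return metrics

# Toy held-out set with hypothetical motion-type labels.
human = ["12b6", "12b6", "rule56", "rule56", "daubert"]
model = ["12b6", "rule56", "rule56", "rule56", "daubert"]
for label, (p, r) in sorted(per_class_precision_recall(human, model).items()):
    print(label, p, r)
```

Reporting each category separately matters because a strong aggregate figure can hide weak performance on a rare motion type.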

Outcome Classification

Outcome classification — granted, denied, granted-in-part — is more difficult than motion type classification because federal judges write outcomes in widely varying language. "Plaintiff's motion is GRANTED," "The court finds that defendant has not demonstrated the absence of a genuine dispute," and "For the foregoing reasons, the motion is denied without prejudice" all require different parsing.

We use a combination of rule-based extraction (for standardized language in orders) and a fine-tuned language model for the longer-tail formulations. Outcome classification accuracy on the test set is 94% for binary grant/deny and 88% for three-way grant/deny/partial.
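The two-stage idea can be sketched as follows: try a small set of rules for standardized order language first, and hand anything the rules cannot resolve to a second-stage model. The regular expressions and the `fallback` callable are hypothetical stand-ins; a real rule set would be far larger, and the fine-tuned model is not shown.

```python
import re

# Hypothetical patterns for standardized order language.
_PARTIAL = re.compile(r"\bgranted\s+in\s+part\b|\bdenied\s+in\s+part\b", re.IGNORECASE)
_GRANTED = re.compile(r"\bmotion\b[^.]*\bis\s+(?:hereby\s+)?granted\b", re.IGNORECASE)
_DENIED = re.compile(r"\bmotion\b[^.]*\bis\s+(?:hereby\s+)?denied\b", re.IGNORECASE)

def classify_outcome(text, fallback=None):
    """Classify an order as 'granted', 'denied', or 'granted-in-part'.

    Rules run first. If they do not give a single clear answer, the text is
    passed to `fallback` (a stand-in for a learned model). With no fallback,
    the function returns 'unresolved'.
    """
    if _PARTIAL.search(text):
        return "granted-in-part"
    granted = bool(_GRANTED.search(text))
    denied = bool(_DENIED.search(text))
    if granted and not denied:
        return "granted"
    if denied and not granted:
        return "denied"
    if fallback is not None:
        return fallback(text)
    return "unresolved"

print(classify_outcome("Plaintiff's motion is GRANTED."))
print(classify_outcome("For the foregoing reasons, the motion is denied without prejudice."))
print(classify_outcome("The court finds that defendant has not demonstrated "
                       "the absence of a genuine dispute."))
```

The third example contains no standardized disposition language, so the rules leave it unresolved. That is exactly the kind of longer-tail formulation the second stage exists to handle.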

Where a minute order in the RECAP docket has no associated opinion text, we classify from the docket entry text alone. This lower-quality signal is flagged in the data and down-weighted in metric calculations.
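One simple way a flagged, lower-quality classification can count for less in a rate is a weighted average. The sketch below assumes a weight of 0.5 for docket-only rulings; that value is purely illustrative and is not the weight used in our metrics.

```python
def weighted_grant_rate(rulings, docket_only_weight=0.5):
    """Weighted share of granted motions.

    `rulings` is an iterable of (granted, docket_only) boolean pairs.
    Rulings classified from docket text alone contribute
    `docket_only_weight` (an assumed value) instead of 1.0.
    Returns None when there are no rulings.
    """
    numerator = 0.0
    denominator = 0.0
    for granted, docket_only in rulings:
        weight = docket_only_weight if docket_only else 1.0
        numerator += weight * (1.0 if granted else 0.0)
        denominator += weight
    return numerator / denominator if denominator else None

sample = [
    (True, False),   # granted, classified from full opinion text
    (False, False),  # denied, classified from full opinion text
    (True, True),    # granted, classified from docket entry only
    (True, True),    # granted, classified from docket entry only
]
print(weighted_grant_rate(sample))
```

In this toy sample the unweighted grant rate would be 3 of 4. Halving the weight of the two docket-only rulings pulls the rate toward what the full-text rulings show.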

Baseline Construction

All motion grant rates and bias scores are computed relative to a baseline. We construct baselines at the district level using a rolling 24-month window of all classified rulings within the district. This means each judge's metrics are compared to their current district peers — a judge who moves to senior status or takes a new assignment will have their baseline updated accordingly in the next quarterly refresh.
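A rolling-window baseline of this kind can be sketched in a few lines. The record layout (`district`, `date`, `granted`) and the function names are hypothetical, and the window boundary rule (exclusive start, inclusive end) is an assumption made for the example.

```python
from datetime import date

def window_start(as_of, months=24):
    """Date `months` before `as_of`. The day is clamped to 28 so the result
    is always a valid calendar date (a simplification for this sketch)."""
    total_months = as_of.year * 12 + (as_of.month - 1) - months
    year, month_index = divmod(total_months, 12)
    return date(year, month_index + 1, min(as_of.day, 28))

def district_baseline(rulings, district, as_of, months=24):
    """Grant rate across all classified rulings in `district` that fall
    inside the rolling window ending at `as_of`. Returns None if empty."""
    start = window_start(as_of, months)
    in_window = [
        r for r in rulings
        if r["district"] == district and start < r["date"] <= as_of
    ]
    if not in_window:
        return None
    return sum(1 for r in in_window if r["granted"]) / len(in_window)

rulings = [
    {"district": "S.D.N.Y.", "date": date(2023, 6, 1), "granted": True},
    {"district": "S.D.N.Y.", "date": date(2022, 2, 1), "granted": False},
    {"district": "S.D.N.Y.", "date": date(2021, 12, 1), "granted": True},  # too old
    {"district": "N.D. Cal.", "date": date(2023, 1, 1), "granted": True},  # other district
]
print(district_baseline(rulings, "S.D.N.Y.", as_of=date(2024, 1, 5)))
```

Because the window moves forward at each refresh, older rulings drop out of the baseline automatically rather than lingering in a lifetime average.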

Case type normalization: raw grant rates are normalized for case type mix because the composition of cases before a judge affects raw rates. A judge who handles 60% complex securities cases will have different raw 12(b)(6) rates than one with 80% employment discrimination cases, even if they apply identical legal standards. We control for case type using a regression adjustment on the eight broadest AO case category codes.
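To show why case mix matters, the sketch below uses a simple observed-minus-expected adjustment. This is a deliberately simpler stand-in for the regression adjustment described above, not the same technique: it asks what rate the district would expect given a judge's case mix and reports the judge's rate relative to that expectation. All rates, category names, and function names are hypothetical.

```python
def expected_rate(case_mix, district_rates):
    """Grant rate the district would expect for a docket with this case mix.

    `case_mix` maps category -> share of the judge's cases (shares sum to 1).
    `district_rates` maps category -> district-wide grant rate.
    """
    return sum(share * district_rates[category]
               for category, share in case_mix.items())

def adjusted_rate(raw_rate, case_mix, district_rates, district_overall):
    """Judge's raw rate minus the case-mix expectation, re-centered on the
    district's overall rate so the result stays on a familiar scale."""
    return raw_rate - expected_rate(case_mix, district_rates) + district_overall

# Hypothetical district-wide 12(b)(6) grant rates by case category.
district_rates = {"securities": 0.60, "employment": 0.40}
district_overall = 0.50

# Two judges with different dockets and different raw rates.
judge_a = adjusted_rate(0.55, {"securities": 0.6, "employment": 0.4},
                        district_rates, district_overall)
judge_b = adjusted_rate(0.47, {"securities": 0.2, "employment": 0.8},
                        district_rates, district_overall)
print(round(judge_a, 4), round(judge_b, 4))
```

The two judges' raw rates differ by eight points, yet after accounting for case mix they land at the same adjusted rate. In this toy example the entire gap comes from what was on their dockets.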

The Bias Score: Construction and Interpretation

The RulingIQ Bias Score is a normalized index measuring the direction and magnitude of a judge's deviation from their district baseline on plaintiff-versus-defendant-favorable outcomes across all classified dispositive rulings.

Construction steps:

1. Classify each dispositive ruling as plaintiff-favorable (PF), defendant-favorable (DF), or neutral (N)
2. Compute the judge's PF rate across all dispositive rulings in the scoring window
3. Compute the district baseline PF rate using the same case-type normalization
4. Compute the deviation: judge PF rate minus baseline PF rate
5. Normalize this deviation to a -100 to +100 scale, where +100 would be a hypothetical judge who rules plaintiff-favorable on every dispositive motion and -100 would be the reverse. Practical scores cluster between -35 and +35; scores outside -50/+50 are rare and typically reflect thin data.
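The construction steps can be sketched as below. Two details are assumptions made for the example: neutral rulings stay in the denominator of the PF rate, and the deviation is scaled by the room available on each side of the baseline, which is one normalization that sends an always-plaintiff judge to +100 and an always-defendant judge to -100. Function names are hypothetical, and case-type normalization is omitted for brevity.

```python
def pf_rate(labels):
    """Share of rulings labeled 'PF'. Rulings labeled 'DF' or 'N' count
    in the denominator (an assumption of this sketch)."""
    if not labels:
        raise ValueError("no rulings in the scoring window")
    return sum(1 for label in labels if label == "PF") / len(labels)

def bias_score(judge_pf_rate, baseline_pf_rate):
    """Map the deviation from baseline onto a -100 to +100 scale.

    Positive deviations are divided by the distance from the baseline to 1.0,
    negative deviations by the distance from the baseline to 0.0, so a PF
    rate of 1.0 scores +100 and a PF rate of 0.0 scores -100.
    """
    deviation = judge_pf_rate - baseline_pf_rate
    room = (1.0 - baseline_pf_rate) if deviation >= 0 else baseline_pf_rate
    if room == 0:
        return 0.0  # only reachable when the deviation is itself zero
    return 100.0 * deviation / room

judge_labels = ["PF", "DF", "N", "PF"]
judge_rate = pf_rate(judge_labels)
print(judge_rate, bias_score(judge_rate, baseline_pf_rate=0.40))
```

A side effect of this particular scaling is that the same raw deviation can produce different scores on the plaintiff and defendant sides when the baseline is far from 50%. Other normalizations are possible.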

A positive bias score means the judge is more plaintiff-favorable than the district average. A negative score means more defendant-favorable. A score near zero means the judge tracks the district baseline closely.

We publish the score with a 95% confidence interval. Judges with fewer than 50 qualifying dispositive rulings typically have wide confidence intervals; their point estimates are displayed with a visual indicator that the data is preliminary. We do not publish point estimates without confidence intervals, because a bare number without uncertainty information is more likely to mislead than inform.
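To illustrate why thin data produces wide intervals, the sketch below computes a 95% Wilson score interval for a judge's PF rate. The Wilson interval is one common choice for proportions with small samples; it is used here as an assumption and is not necessarily the interval behind the published scores. The function names are hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

def is_preliminary(n, threshold=50):
    """Flag profiles with fewer than `threshold` qualifying rulings."""
    return n < threshold

# Same observed PF rate (50%), very different amounts of evidence.
for pf_count, total in [(20, 40), (200, 400)]:
    low, high = wilson_interval(pf_count, total)
    print(total, round(low, 3), round(high, 3), is_preliminary(total))
```

Both judges in the example have the same observed rate, but the interval for the judge with 40 rulings is much wider than for the judge with 400. A bare point estimate would hide that difference.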

Update Cadence and Staleness

RulingIQ profiles are refreshed quarterly. The full pipeline — data ingestion, classification, baseline computation, and score update — runs in the first week of each calendar quarter. Users can see the data vintage (last updated date) on every profile.

We do not publish real-time updates after individual rulings because single-ruling updates produce noisy, potentially misleading signals. A judge who issues five plaintiff-favorable rulings in one week has not necessarily changed their underlying tendencies; the cluster may simply reflect a busy week in one case category. Quarterly aggregation smooths this volatility.

For judges who were recently reassigned to a new court, took senior status, or returned from a significant leave, we flag the profile with a staleness warning and display trend data that distinguishes pre- and post-change periods.

What We Do Not Claim

Honesty about limitations is central to methodological transparency. RulingIQ data does not predict individual case outcomes. It provides base rates and deviation scores that inform probabilistic thinking about judicial behavior — not deterministic predictions about what will happen in your case.

Our data does not capture:

Minute orders not available in RECAP (significant coverage gap in districts with low PACER crowdsourcing activity)
Unpublished orders that are never entered in electronic dockets
Courtroom behavior, demeanor, and preferences not reflected in written decisions
Changes in a judge's behavior that have not yet accumulated in the data window (a judge who had a major legal epiphany six months ago will not show that in quarterly aggregate data for another 2-3 quarters)

Peer Review and Validation

Our classification models are validated against held-out test sets with human-reviewed labels. We conduct quarterly audits of a random sample of classified rulings to check for model drift. When audits detect classification accuracy degradation, we retrain the affected model before the next data refresh.
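A drift check of this kind can be as simple as comparing accuracy on the audited sample with the accuracy measured at validation time. In the sketch below, the 3-point tolerance and the function names are hypothetical; the actual audit criteria are not specified here.

```python
def audit_accuracy(sample):
    """Fraction of audited rulings where the model label matches the
    human reviewer's label. `sample` is a list of (model, human) pairs."""
    if not sample:
        raise ValueError("audit sample is empty")
    return sum(1 for model, human in sample if model == human) / len(sample)

def needs_retraining(sample, reference_accuracy, tolerance=0.03):
    """True when audited accuracy falls more than `tolerance` below the
    reference accuracy from the held-out test set (tolerance is assumed)."""
    return audit_accuracy(sample) < reference_accuracy - tolerance

# Toy audit: 100 rulings, 88 of which the model labeled correctly.
audit = [("granted", "granted")] * 88 + [("granted", "denied")] * 12
print(audit_accuracy(audit), needs_retraining(audit, reference_accuracy=0.94))
```

With small audit samples, some shortfall is expected from sampling noise alone, so a practical check would also account for sample size before triggering a retrain.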

We have shared our methodology with legal academics studying judicial behavior and welcome feedback on analytical approaches. The goal is judicial intelligence that is genuinely useful — and genuine usefulness requires honest representation of what the data does and does not show.

Conclusion

RulingIQ's judicial profiles are built on transparent, documented methodology using the best available public data sources. We are committed to publishing not just the scores but the confidence intervals, coverage quality indicators, and limitations that allow attorneys to use the data responsibly.

If you have methodological questions or want to discuss specific aspects of how a judge's profile is constructed, reach out through our contact page. We believe attorneys who understand the methodology are better positioned to use it effectively — and more likely to catch genuine errors when they occur.