Why a hedge fund needed a custom pipeline
The SEC publishes all company filings on EDGAR. The data is free. The official API at data.sec.gov serves structured financial data well. So why would a quant hedge fund pay for a custom pipeline?
Because their research questions required data the official API does not structure.
The fund's quantitative models used both structured financial metrics (revenue, operating income, margins) and unstructured signals extracted from the narrative sections of filings. Specifically:
- Risk Factor evolution. How risk disclosure language changes between consecutive 10-K filings for the same company. New risk factors appearing, old ones being removed, or language shifting from "we may face" to "we are currently experiencing."
- MD&A sentiment and specificity. Whether Management's Discussion and Analysis is vague ("results were impacted by market conditions") or specific ("Azure revenue grew 31% driven by AI workload migration"). Specificity correlates with management confidence, according to the fund's research.
- Legal proceedings tracking. New litigation disclosures, resolution of existing cases, and changes in estimated liabilities across reporting periods.
None of these are available as structured data through the official API. The API covers XBRL-tagged financial line items. The narrative sections are prose that requires parsing.
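To make the Risk Factor evolution signal concrete, here is a minimal sketch of comparing the risk paragraphs from two consecutive 10-Ks. It uses Python's standard difflib. The function name, the 0.6 similarity cutoff, and the sample sentences are illustrative assumptions, not the fund's actual method.

```python
import difflib


def diff_risk_factors(previous, current, cutoff=0.6):
    """Compare risk-factor paragraphs from two consecutive 10-K filings.

    `previous` and `current` are lists of paragraph strings. The cutoff is an
    assumed similarity threshold, not a tuned value.
    """
    added, changed = [], []
    matched_previous = set()
    for item in current:
        if item in previous:
            matched_previous.add(item)  # carried over unchanged
            continue
        # Look for the most similar paragraph in the prior filing.
        close = difflib.get_close_matches(item, previous, n=1, cutoff=cutoff)
        if close:
            matched_previous.add(close[0])
            changed.append((close[0], item))  # same risk, reworded
        else:
            added.append(item)  # nothing similar last year: new risk
    removed = [p for p in previous if p not in matched_previous]
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical sample paragraphs.
last_year = [
    "We may face supply chain disruptions that could affect our results.",
    "Our business depends on key personnel.",
]
this_year = [
    "We are currently experiencing supply chain disruptions that could affect our results.",
    "Cybersecurity incidents could harm our reputation.",
]
report = diff_risk_factors(last_year, this_year)
```

A reworded paragraph lands in "changed" as a (before, after) pair, which is where a shift from "we may face" to "we are currently experiencing" would show up.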
What the pipeline does
Filing detection. The pipeline monitors the SEC's RSS feed continuously and detects new filings within 60 seconds of publication. It filters to a configurable watchlist of 1,200 tickers covering the fund's investable universe.
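The watchlist filter can be sketched as follows. It assumes an Atom feed whose entry titles look like "FORM - Company Name (CIK) (Role)" and a watchlist keyed by zero-padded CIK rather than ticker. The sample feed, CIKs, tickers, and function name are all hypothetical, and the polling loop is left out.

```python
import re
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# Assumed title layout: "8-K - Example Holdings Inc (0000000001) (Filer)"
TITLE_RE = re.compile(
    r"^(?P<form>.+?) - (?P<name>.+) \((?P<cik>\d{10})\) \((?P<role>[^)]+)\)$"
)

# Hypothetical feed content used in place of a live request.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>8-K - Example Holdings Inc (0000000001) (Filer)</title>
    <updated>2024-01-30T16:05:12-05:00</updated>
  </entry>
  <entry>
    <title>10-Q - Unwatched Corp (0000000002) (Filer)</title>
    <updated>2024-01-30T16:05:40-05:00</updated>
  </entry>
  <entry>
    <title>4 - Example Holdings Inc (0000000001) (Issuer)</title>
    <updated>2024-01-30T16:06:02-05:00</updated>
  </entry>
</feed>"""


def new_watchlist_filings(feed_xml, watchlist, forms=("8-K", "10-K", "10-Q")):
    """Return feed entries whose CIK is on the watchlist and whose form is wanted.

    `watchlist` maps zero-padded CIK strings to tickers (an assumed layout).
    """
    root = ET.fromstring(feed_xml)
    hits = []
    for entry in root.findall("atom:entry", ATOM):
        title = entry.findtext("atom:title", default="", namespaces=ATOM).strip()
        match = TITLE_RE.match(title)
        if match is None:
            continue  # unexpected title layout: skip rather than guess
        if match["cik"] in watchlist and match["form"] in forms:
            hits.append({
                "ticker": watchlist[match["cik"]],
                "form": match["form"],
                "updated": entry.findtext("atom:updated", namespaces=ATOM),
            })
    return hits


hits = new_watchlist_filings(SAMPLE_FEED, {"0000000001": "EXMP"})
```

Filtering on CIK rather than company name avoids mismatches when a filer's name is formatted differently from one filing to the next.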
XBRL parsing. For structured financial data, the pipeline parses both inline XBRL (iXBRL) and traditional XBRL formats. It maps US-GAAP and IFRS taxonomy elements to a unified schema with 180+ standardized financial metrics. Custom extension taxonomies (where a filer reports a metric using a non-standard tag) are resolved against the base taxonomy at parse time, with ambiguous mappings flagged for analyst review.
Narrative extraction. For 10-K filings, the pipeline extracts the Risk Factors, MD&A, Legal Proceedings, and Critical Accounting Policies sections as structured text with headings preserved. Risk Factors are further parsed into individual risk items and classified by category (cybersecurity, regulatory, competitive, macroeconomic, litigation, etc.).
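A rough sketch of the section-splitting step, assuming the filing has already been converted to plain text and that sections begin with headings such as "Item 1A." at the start of a line. The regular expression, the section keys, and the keep-the-longest rule for skipping the table of contents are illustrative assumptions. Real filings vary far more than this handles.

```python
import re

# Matches headings like "Item 1A." or "ITEM 7:" at the start of a line.
ITEM_RE = re.compile(r"^\s*item\s+(\d{1,2}[abc]?)\s*[.:]", re.IGNORECASE | re.MULTILINE)

# 10-K item numbers for the sections of interest (hypothetical output keys).
WANTED = {"1A": "risk_factors", "3": "legal_proceedings", "7": "mdna"}


def extract_sections(text):
    """Slice a plain-text 10-K into the wanted sections.

    Each section runs from its item heading to the next item heading. The
    heading's own title text is kept as the first line of the body.
    """
    matches = list(ITEM_RE.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        key = WANTED.get(match.group(1).upper())
        if key is None:
            continue
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end].strip()
        # The table of contents repeats the headings with almost no body,
        # so keep the longest candidate seen for each section.
        if len(body) > len(sections.get(key, "")):
            sections[key] = body
    return sections
```

Splitting Risk Factors into individual items and classifying them would be a further pass over the "risk_factors" text, which this sketch does not attempt.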
Normalization. The output is normalized across filers so that metrics are directly comparable. Different fiscal year ends, restated figures, segment-level vs consolidated data, and unit/scale factor differences are all resolved automatically.
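Two of these normalizations, scale factors and fiscal-period alignment, can be illustrated with a short sketch. The Fact structure and the 15-day grace window for 52/53-week fiscal years are assumptions made for the example, not the pipeline's actual rules.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from decimal import Decimal


@dataclass(frozen=True)
class Fact:
    """One reported value (hypothetical structure for this sketch)."""
    metric: str
    value: Decimal      # the number as printed in the filing
    scale: int          # power of ten, e.g. 6 when reported "in millions"
    period_end: date


def to_base_units(fact):
    """Apply the scale factor so every filer's value is in whole units."""
    return fact.value * (Decimal(10) ** fact.scale)


def calendar_quarter(period_end, grace_days=15):
    """Map a fiscal period end to a calendar quarter label.

    Assumed convention: shift the date back by a grace window so a 52/53-week
    year ending in the first days of January still counts as Q4.
    """
    shifted = period_end - timedelta(days=grace_days)
    return f"{shifted.year}Q{(shifted.month - 1) // 3 + 1}"


# Two hypothetical filers reporting the same metric in different scales.
in_millions = Fact("revenue", Decimal("394.3"), 6, date(2023, 9, 30))
in_thousands = Fact("revenue", Decimal("394300"), 3, date(2024, 1, 3))
```

Using Decimal rather than float keeps scaled values exact, which matters once figures from different filers are compared or summed.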
Delivery. Parsed data is pushed to the fund's PostgreSQL database via webhook. A parallel Parquet archive is written for backtesting pipelines. Average end-to-end latency from SEC publication to structured data delivery: 3.2 minutes.
The numbers
- 1,200 tickers on the watchlist
- 2,000+ filings processed per day across all form types
- 180+ normalized financial metrics per filing
- 3.2-minute average latency from publication to delivery
- 99.4% parse accuracy across all metrics (verified against a manually audited sample)
- 60-second detection window for new filings
What XBRL does not cover
This is the core of why the pipeline exists. XBRL (eXtensible Business Reporting Language) handles structured financial data well: revenue, operating income, total assets, EPS, and the other standard line items are all tagged.
But XBRL tagging does not turn the narrative sections of a 10-K into structured data:
- Management's Discussion and Analysis (MD&A) is prose. The numbers it references may be tagged, but the narrative context is not.
- Risk Factors are prose. The SEC requires companies to disclose material risks, but the format is unstructured text, not tagged data.
- Legal Proceedings are prose describing pending and resolved litigation.
- Critical Accounting Policies describe the judgments management made in preparing the financials.
- Footnotes provide context for the tagged numbers (revenue recognition policies, segment definitions, lease obligations). Where they are tagged, it is largely as blocks of text, so the policies and definitions inside them still have to be read as prose.
This is not a temporary gap. Even where tagging reaches into text, it marks a block of prose rather than structuring what the prose says, so the structured-vs-prose divide is unlikely to close soon. For any research question that depends on the narrative, custom parsing is the only option.
What went wrong
The extension taxonomy problem
About 15% of XBRL filings use custom extension elements for at least one metric the fund's model depends on. For example, a company reports "CloudServiceRevenue" instead of the standard "Revenue" tag because its revenue disclosure breaks out cloud separately.
The initial version of the pipeline treated extensions as unmappable and flagged them for manual review. At 2,000+ filings per day with 15% using extensions, the manual review queue was unsustainable.
The fix was building a taxonomy resolution layer that maps extensions to the nearest base taxonomy element using the extension's documentation, parent hierarchy, and calculation relationships. This automated the resolution for about 90% of extensions. The remaining 10% are genuinely ambiguous and still get flagged, but the queue dropped from hundreds per day to about 20.
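The parent-hierarchy part of that resolution can be sketched as a walk up the filer's own concept tree until a base-taxonomy element with a known mapping is reached. This simplified version ignores the documentation and calculation signals. The mapping table, the "acme:" concept names, and the function name are hypothetical.

```python
# Base-taxonomy elements mapped to unified metric names (illustrative subset).
BASE_TO_METRIC = {
    "us-gaap:Revenues": "revenue",
    "us-gaap:OperatingIncomeLoss": "operating_income",
}


def resolve_extension(concept, parents, base_to_metric=BASE_TO_METRIC, max_depth=10):
    """Walk from an extension concept up to a mapped base element.

    `parents` maps each concept to the list of its parents in the filer's
    hierarchy. Returns the unified metric name, or None when the concept is
    ambiguous or unmapped and should be flagged for analyst review.
    """
    seen = set()
    current = concept
    for _ in range(max_depth):
        if current in base_to_metric:
            return base_to_metric[current]
        if current in seen:
            return None  # cycle in the hierarchy
        seen.add(current)
        candidates = parents.get(current, [])
        if len(candidates) != 1:
            return None  # no parent, or several: ambiguous
        current = candidates[0]
    return None


# Hypothetical filer hierarchy.
parents = {
    "acme:CloudServiceRevenue": ["acme:ServiceRevenue"],
    "acme:ServiceRevenue": ["us-gaap:Revenues"],
    "acme:AdjustedOperatingResult": ["us-gaap:OperatingIncomeLoss", "us-gaap:Revenues"],
}
```

Returning None instead of guessing when a concept has several parents is what keeps the ambiguous cases in the review queue rather than silently mis-mapped.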
The iXBRL rendering problem
Inline XBRL embeds financial data tags directly in the HTML filing document. Parsing iXBRL correctly requires rendering the HTML and then extracting the embedded XBRL tags from the rendered DOM. Some filings have iXBRL tags nested inside complex HTML structures (tables within tables, merged cells, conditional formatting) that cause the parser to misalign tag boundaries.
The initial accuracy rate for iXBRL parsing was about 96%. After building a DOM-aware parser that handles the most common nesting patterns, accuracy improved to 99.4%. The remaining 0.6% of errors are concentrated in a handful of filers with unusually complex HTML structures, and those filings are routed to a fallback parser with stricter but slower extraction logic.
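As a simplified illustration of pulling tagged values out of an inline XBRL document, the sketch below walks the parsed tree for ix:nonFraction elements at any nesting depth and applies their scale and sign attributes. It assumes well-formed XHTML and comma-grouped numbers, and it ignores the format attribute and continuation elements that real filings use. The sample document is made up.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

IX_NS = "http://www.xbrl.org/2013/inlineXBRL"

# Hypothetical fragment with facts nested inside a table within a table.
SAMPLE_IXBRL = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ix="http://www.xbrl.org/2013/inlineXBRL">
<body><table><tr><td><table><tr>
<td><ix:nonFraction name="us-gaap:Revenues" contextRef="FY2023"
    unitRef="USD" scale="6" decimals="-6">1,234</ix:nonFraction></td>
<td>(<ix:nonFraction name="us-gaap:NetIncomeLoss" contextRef="FY2023"
    unitRef="USD" scale="6" decimals="-6" sign="-">56</ix:nonFraction>)</td>
</tr></table></td></tr></table></body></html>"""


def extract_facts(xhtml):
    """Return numeric facts found in ix:nonFraction elements."""
    root = ET.fromstring(xhtml)
    facts = []
    # iter() visits descendants at any depth, so table nesting does not matter.
    for element in root.iter(f"{{{IX_NS}}}nonFraction"):
        text = "".join(element.itertext()).strip().replace(",", "")
        try:
            value = Decimal(text)
        except ArithmeticError:
            continue  # dash placeholders or unsupported number formats
        value *= Decimal(10) ** int(element.get("scale", "0"))
        if element.get("sign") == "-":
            value = -value  # the sign lives in the attribute, not the text
        facts.append({
            "name": element.get("name"),
            "context": element.get("contextRef"),
            "value": value,
        })
    return facts


facts = extract_facts(SAMPLE_IXBRL)
```

Reading the value from the tag's own text and attributes, rather than from its position in the table, is one way to avoid depending on how the surrounding cells are laid out.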
The latency spike during earnings season
Filings flood into EDGAR during earnings weeks. The pipeline processes them in watchlist priority order (the fund's top holdings first), but during peak periods filings can arrive faster than the parser can clear them, so the queue grows.
The fix was splitting the watchlist into explicit priority tiers: Tier 1 (top 100 holdings) always processes first, Tier 2 (next 400) processes second, and Tier 3 (remaining 700) fills in during off-peak. This keeps delivery for the fund's most important positions under 5 minutes even during earnings week peaks.
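The tiering can be sketched with a heap keyed on (tier, arrival order), so that a lower tier number always wins and filings within a tier stay first-in, first-out. The class name, tickers, and filing identifiers are hypothetical, and the real scheduler presumably does more than this.

```python
import heapq
import itertools


class TieredFilingQueue:
    """Strict-priority queue: Tier 1 before Tier 2 before Tier 3."""

    def __init__(self, tiers, default_tier=3):
        self._tiers = tiers              # ticker -> tier number
        self._default_tier = default_tier
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker keeps FIFO within a tier

    def push(self, ticker, filing_id):
        tier = self._tiers.get(ticker, self._default_tier)
        heapq.heappush(self._heap, (tier, next(self._arrival), ticker, filing_id))

    def pop(self):
        _tier, _order, ticker, filing_id = heapq.heappop(self._heap)
        return ticker, filing_id

    def __len__(self):
        return len(self._heap)


# Hypothetical tickers and filing identifiers.
queue = TieredFilingQueue({"AAA": 1, "BBB": 2, "CCC": 3})
queue.push("CCC", "filing-1")
queue.push("AAA", "filing-2")
queue.push("BBB", "filing-3")
queue.push("AAA", "filing-4")
order = [queue.pop() for _ in range(len(queue))]
```

Strict priority means Tier 3 filings can wait as long as higher tiers keep arriving, which matches the "fills in during off-peak" behavior described above.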
What I learned
The parsing is 20% of the work. The normalization is 80%. Getting the raw data out of a filing is straightforward. Making it consistent and comparable across 1,200 filers with different fiscal years, different segment structures, different extension taxonomies, and different reporting quirks is where the real engineering lives.
Latency matters less than you think, except when it matters a lot. For most filings, 10 minutes vs 3 minutes makes no difference. For 8-K filings that contain material events (earnings surprises, management changes, acquisition announcements), those 7 minutes can move prices. The pipeline is optimized for the 8-K latency case, which benefits all other filing types as well.
The official API is good. Use it. For structured financial data, the SEC API is free and reliable. The custom pipeline does not replace it for those use cases. It complements it for the narrative data and real-time delivery that the API does not cover.
The SEC EDGAR data service page has the full pricing, schema, and delivery details.