How I Built a LinkedIn Enrichment Pipeline for 70,000 Profiles

The brief

A B2B sales team came to me with a CRM full of leads and a problem. They had 70,000 records with names and company names. That was it. No job titles, no seniority, no tenure, no location, no skills. Their outreach was generic because the data was generic.

They had tried browser extensions. Two reps lost their accounts in the first week. They had tried an enrichment platform. The match rate on their target audience (mid-market SaaS companies, specific titles) was around 55%. They needed something better and they needed it without risking their Sales Navigator seats.

What I built

The pipeline had four main components:

Input processing. The client exported their CRM records as a CSV with name and company name columns. Some records had company domains, which improved match accuracy. The pipeline cleaned the input: normalized company name variations, handled non-English characters, split combined name fields, and flagged obviously incomplete records before attempting any matching.
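A minimal sketch of that cleaning step, assuming each CSV row arrives as a dict with name, company, and optional domain keys. The suffix list, function names, and output field names are illustrative assumptions, not the production code.

```python
import re
import unicodedata

# Hypothetical list of legal suffixes to strip; extend it for your own data.
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp", "co", "gmbh", "plc"}


def fold_text(value: str) -> str:
    """Lowercase, strip accents, and collapse punctuation to single spaces."""
    decomposed = unicodedata.normalize("NFKD", value)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[\W_]+", " ", without_marks.lower()).strip()


def normalize_company(name: str) -> str:
    """Drop trailing legal suffixes so 'Acme, Inc.' and 'ACME' compare equal."""
    tokens = fold_text(name).split()
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def split_name(full_name: str) -> tuple:
    """Split a combined name field into (first, last); handles 'Last, First'."""
    if "," in full_name:
        last, first = (part.strip() for part in full_name.split(",", 1))
    else:
        parts = full_name.split()
        first = parts[0] if parts else ""
        last = " ".join(parts[1:])
    return first, last


def clean_record(raw: dict) -> dict:
    """Return a cleaned record plus a flag for rows too incomplete to match."""
    first, last = split_name(raw.get("name") or "")
    company = normalize_company(raw.get("company") or "")
    return {
        "first_name": first,
        "last_name": last,
        "company_key": company,
        "domain": (raw.get("domain") or "").strip().lower(),
        "incomplete": not (first and last and company),
    }
```

Accent folding only helps with Latin-script variants. Names in other scripts pass through unchanged and still need their own transliteration handling.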

Profile extraction. Each input record was matched to a LinkedIn profile using a combination of name, company, and domain signals. The extraction ran on residential infrastructure entirely separate from the client's accounts. No browser extensions, no Sales Navigator API, no account risk for the client.

The extraction returned: full name, current title, current company, seniority classification, tenure in current role, location, experience count, skills, and a match confidence score from 0 to 100.

CRM write-back. Instead of delivering a CSV for the client to manually import, the pipeline wrote enriched fields directly back into their HubSpot instance via the CRM API. Each enriched record was matched to the existing contact by email or name+company key, and new fields were written in place. Duplicates were flagged for manual review rather than overwritten.
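The matching logic behind the write-back can be sketched without any HubSpot calls. This is an illustration under assumptions: contacts and enriched records are plain dicts, the key format is invented for the example, and the actual API requests are left out.

```python
from collections import defaultdict


def contact_keys(record: dict) -> list:
    """Return lookup keys in priority order: email first, then name+company."""
    keys = []
    email = (record.get("email") or "").strip().lower()
    if email:
        keys.append(f"email:{email}")
    name = (record.get("name") or "").strip().lower()
    company = (record.get("company") or "").strip().lower()
    if name and company:
        keys.append(f"nc:{name}|{company}")
    return keys


def plan_write_back(existing: list, enriched: list):
    """Split enriched records into safe updates and ones needing manual review."""
    index = defaultdict(set)
    for contact in existing:
        for key in contact_keys(contact):
            index[key].add(contact["id"])

    updates, review = [], []
    for record in enriched:
        matches = set()
        for key in contact_keys(record):
            matches = index.get(key, set())
            if matches:
                break  # stop at the highest-priority key that hits
        if len(matches) == 1:
            updates.append({"id": next(iter(matches)), "fields": record["fields"]})
        else:
            # Zero hits, or several contacts share the key: leave it for a human.
            review.append(record)
    return updates, review
```

Only records that resolve to exactly one existing contact end up in the update list. Everything else goes to review, which mirrors the flag-rather-than-overwrite rule.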

Confidence scoring. Every enriched record included a confidence score. Above 85: high confidence match, safe to use for outreach. 60 to 85: probable match, recommended for manual spot-check before outreach. Below 60: low confidence, flagged but not written to CRM without client approval.
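Those tiers translate into a small gate. The sketch below assumes a score of exactly 85 falls in the probable tier and exactly 60 does too, since the wording above leaves the boundaries open. The function and tier names are made up for the example.

```python
def classify_confidence(score: float) -> str:
    """Map a 0-100 match score to one of the three tiers described above."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score > 85:
        return "high"      # safe to use for outreach
    if score >= 60:
        return "probable"  # spot-check before outreach
    return "low"           # flagged, held back from the CRM


def write_back_allowed(score: float, client_approved: bool = False) -> bool:
    """Low-confidence records are written only with explicit client approval."""
    return classify_confidence(score) != "low" or client_approved
```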

The numbers

70,000 profiles processed over 3 weeks
87% match rate to the client's CRM records (name + company input)
93% match rate on records where company domain was also provided
0 account bans across the entire engagement
3.2x reply rate improvement on outreach using enriched vs non-enriched contacts

The 13% that did not match fell into three buckets: very common names with insufficient disambiguation signals (about 5%), private or deactivated profiles (about 4%), and profiles on non-English LinkedIn regional sites where name transliteration added ambiguity (about 4%).

What went wrong

The dedup problem

About 6,000 of the 70,000 input records were duplicates. Same person, slightly different name spelling, different company name variation. The CRM had accumulated these over years of list imports from multiple sources.

The initial run created 6,000 duplicate enriched records in HubSpot. Not because the matching was wrong, but because the input had 6,000 people listed twice. The fix was adding a dedup layer to the pipeline that normalized name+company pairs and flagged probable duplicates before extraction.

Lesson: always dedup the input before enrichment, not after. It is cheaper, faster, and avoids polluting the CRM with duplicate enriched records that then need manual cleanup.
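One way such a dedup layer could look, as a sketch rather than the layer I shipped: group records by a folded company name, then compare names within each group using the standard library's difflib. The 0.9 similarity threshold is an assumed starting point that would need tuning against real data.

```python
import re
import unicodedata
from collections import defaultdict
from difflib import SequenceMatcher


def fold(value: str) -> str:
    """Lowercase, strip accents, and collapse punctuation to single spaces."""
    decomposed = unicodedata.normalize("NFKD", value)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[\W_]+", " ", stripped.lower()).strip()


def find_probable_duplicates(records: list, threshold: float = 0.9) -> list:
    """Return (i, j) index pairs whose names look alike within one company."""
    by_company = defaultdict(list)
    for idx, rec in enumerate(records):
        by_company[fold(rec["company"])].append((idx, fold(rec["name"])))

    pairs = []
    # Comparing only inside a company group keeps the pairwise work small.
    for members in by_company.values():
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                (i, name_i), (j, name_j) = members[a], members[b]
                if SequenceMatcher(None, name_i, name_j).ratio() >= threshold:
                    pairs.append((i, j))
    return pairs
```

Grouping on the folded company name only catches case and punctuation differences. Variants such as a missing legal suffix need a stronger company normalizer before the grouping step.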

The title mismatch window

LinkedIn profiles only change when their owners edit them, and people do not always do that right after changing jobs. Between the time someone starts a new role and the time they update their profile, there is a window where the extracted title is wrong.

For the 70,000-record batch, I estimated this affected about 3 to 5% of records based on spot-checks against company directories. The mitigation was adding a "last profile update" signal to the confidence scoring. Profiles that had not been updated in 12+ months got a lower confidence score.
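A sketch of how a staleness signal could feed into the score. The 12-month cutoff comes from the text above. The 15-point penalty and the function names are assumptions for illustration, not the values used in the engagement.

```python
from datetime import date
from typing import Optional


def months_between(earlier: date, later: date) -> int:
    """Whole months elapsed between two dates."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1  # the final month has not fully elapsed yet
    return months


def adjust_for_staleness(
    score: float,
    last_updated: Optional[date],
    today: date,
    stale_after_months: int = 12,
    penalty: float = 15.0,  # hypothetical penalty size
) -> float:
    """Lower the confidence score for profiles not updated in 12+ months."""
    if last_updated is None:
        return score  # no signal available, leave the score alone
    if months_between(last_updated, today) >= stale_after_months:
        return max(0.0, score - penalty)
    return score
```

With these assumed values, a stale profile scored at 90 drops to 75, which moves it from the high-confidence tier into the spot-check tier.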

For the ongoing enrichment phase (after the initial batch), the pipeline re-checks high-value contacts quarterly and flags title changes.

The rate pacing balance

The initial extraction rate was too conservative. At the pace I started with, the 70,000 records would have taken 6 weeks to process. Too slow for the client's campaign timeline.

Increasing the pace risked detection. The balance point was processing about 4,000 to 5,000 profiles per day, spread across multiple sessions with realistic timing patterns. At that pace, the full batch completed in about 3 weeks with zero detection events.
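The arithmetic behind those figures is easy to check. The sketch assumes extraction runs every calendar day at a steady rate, which is a simplification; idle days or a slower ramp-up would account for the gap between the result and the three weeks the run actually took.

```python
import math

TOTAL_RECORDS = 70_000


def days_to_complete(total: int, per_day: int) -> int:
    """Days needed at a steady daily rate, rounded up to whole days."""
    return math.ceil(total / per_day)


def implied_daily_rate(total: int, weeks: int) -> int:
    """Daily rate implied by finishing the whole batch in a number of weeks."""
    return round(total / (weeks * 7))


fastest = days_to_complete(TOTAL_RECORDS, 5_000)   # 14 days
slowest = days_to_complete(TOTAL_RECORDS, 4_000)   # 18 days
initial = implied_daily_rate(TOTAL_RECORDS, 6)     # about 1,667 per day

print(f"4,000-5,000/day: {fastest} to {slowest} days")
print(f"6-week pace: about {initial} profiles/day")
```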

The pacing is not a fixed number. It adjusts based on signals that I monitor during the run. If anything suggests the infrastructure is drawing attention, the pace drops automatically.
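A throttle with that behavior can be sketched as a simple controller that cuts the daily budget sharply on any warning and recovers slowly on quiet days. Everything here is illustrative: I am not describing which signals are monitored, and the starting rate, floor, ceiling, backoff factor, and recovery step are assumed values.

```python
from dataclasses import dataclass


@dataclass
class PaceController:
    """Adjusts a daily processing budget from a count of warning signals."""

    per_day: int = 4000    # assumed starting budget
    floor: int = 500       # never drop below this
    ceiling: int = 5000    # never exceed this
    backoff: float = 0.5   # multiply the budget by this after a warning
    step: int = 250        # add this back after a quiet day

    def record_day(self, warning_signals: int) -> int:
        """Update and return the budget for the next day."""
        if warning_signals > 0:
            self.per_day = max(self.floor, int(self.per_day * self.backoff))
        else:
            self.per_day = min(self.ceiling, self.per_day + self.step)
        return self.per_day
```

The asymmetry is deliberate: halving on a warning and adding a small step on a quiet day means the pace backs off quickly and climbs back cautiously.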

What I would do differently

Input quality validation upfront. I should have run a dedup and data quality pass on the 70,000 records before starting extraction. It would have added 2 days to the timeline but saved a week of cleanup on the CRM side.

Match-rate expectations in the contract. The client expected a 90%+ match rate. The actual rate was 87%, which they were happy with, but the gap prompted a brief conversation that could have been avoided by stating the expected match rate explicitly in the scope document.

Incremental delivery. Instead of processing all 70,000 and delivering at the end, I should have delivered in batches of 5,000 to 10,000 so the sales team could start outreach on the first batch while the rest was still processing. This is now my default approach for any engagement over 5,000 records.
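Batching like this needs very little code. A minimal sketch, with a hypothetical helper name, that slices any iterable of records into fixed-size delivery batches:

```python
from itertools import islice


def in_batches(records, size: int):
    """Yield consecutive lists of at most `size` records."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Example: 70,000 records in batches of 10,000 gives 7 deliveries.
batch_sizes = [len(batch) for batch in in_batches(range(70_000), 10_000)]
```

Each batch can then go through extraction, scoring, and write-back on its own, so the first delivery does not wait on the last record.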

The ongoing phase

After the initial batch, the client moved to the ongoing enrichment tier. New leads entering HubSpot are automatically processed: the pipeline connects via webhook, enriches new contacts within 24 hours, and writes the fields back in place.

Quarterly re-enrichment catches title and company changes on existing contacts. The most common change is a title upgrade (IC to manager, manager to director), which directly affects outreach priority.
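The quarterly comparison amounts to diffing two snapshots of the same contact. In this sketch the seniority ladder, its labels, and the field names are assumptions made for the example, not the pipeline's actual classification.

```python
# Hypothetical seniority ladder, lowest to highest.
SENIORITY_RANK = {"ic": 0, "manager": 1, "director": 2, "vp": 3, "c_level": 4}


def detect_changes(previous: dict, current: dict) -> dict:
    """Compare two enrichment snapshots and report what moved."""
    changed = {}
    for field in ("title", "company"):
        if previous.get(field) != current.get(field):
            changed[field] = (previous.get(field), current.get(field))

    old_rank = SENIORITY_RANK.get(previous.get("seniority"))
    new_rank = SENIORITY_RANK.get(current.get("seniority"))
    # Only call it a promotion when both levels are known and the rank rose.
    promoted = old_rank is not None and new_rank is not None and new_rank > old_rank

    return {"changed": changed, "promoted": promoted}
```

A contact with promoted set to True is a natural candidate for a bump in outreach priority.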

The pipeline has been running for several months. Maintenance is light on the LinkedIn side (the infrastructure handles platform changes) and mostly involves adjusting the CRM field mapping when the client adds new properties to their HubSpot schema.

When this approach fits

Managed LinkedIn enrichment works best when:

You have a large existing lead list that needs enrichment before outreach
Your target audience is niche enough that standard enrichment platforms underperform on match rate
You need CRM write-back, not a CSV download
You cannot risk using your team's LinkedIn accounts for extraction
You want ongoing enrichment as new leads enter your pipeline

The LinkedIn data service page has the full pricing, output schema, and delivery format options. For the self-serve approach, the ScrapeBase API at scrapebase.io has LinkedIn endpoints for profile data, articles, and posts.