Building a Privacy-First LinkedIn Analytics Platform

Developing a tool that turns LinkedIn™ data export into a clear, visual overview of professional relationships, helping users see who they’re connected with, identify meaningful clusters, prioritize important contacts, clean up outdated connections, and discover new people worth engaging with.
App
Visualization
NLP
Python
Author

Aleksei Prishchepo

Published

December 27, 2025

Misaligned Objective Function

LinkedIn’s recommendation system might be an engineering masterpiece, but its objective function is fundamentally misaligned with mine. From the engineering point of view, every system optimizes for a specific target variable. LinkedIn’s algorithms solve for engagement metrics like time on site, scroll depth, and ad impressions. They prioritize content that triggers a reaction, often at the expense of substance.

In contrast, I try to solve for meaningful connection. My goal is to foster genuine professional alignment, identify mentorship opportunities, and strengthen ties within specific industry clusters.

When two optimization problems have different loss functions, they inevitably yield different results. Relying solely on the platform’s feed means delegating your professional network’s growth to an algorithm designed to keep you addicted, not necessarily to help you succeed. I decided I wanted to choose who to interact with deliberately, moving from a passive consumer of a feed to an active architect of my network.

To achieve this, I started building a tool that takes the LinkedIn™ data export and transforms it into an intelligence layer. It turns that raw CSV data into a clear, visual overview of professional relationships, helping users:

  • See who they’re connected with through interactive visualizations.
  • Identify meaningful professional clusters using unsupervised learning.
  • Prioritize important contacts who may have slipped through the cracks.
  • Clean up outdated connections that add noise to the signal.
  • Discover new people worth engaging with based on geographic and professional proximity.

Explore the application: My Professional Network

This article details how I architected this solution using Python, Shiny, and a local inference stack to bring observability and agency back to professional networking.

Microservices Approach

Building a local-first application that handles data munging, heavy UI rendering, and asynchronous secondary data fetching requires a robust architecture. I opted for a microservices-based approach orchestrated via Docker Compose. This ensures that ML inference and data scraping don’t starve the web server of resources.

High-Level System Design

The architecture is built on five pillars:

  1. Orchestration: Nginx serves as a reverse proxy and load balancer.

  2. UI Layer: Shiny for Python handles the reactive frontend.

  3. Database: PostgreSQL stores connection metadata, parsed profiles, and transaction logs.

  4. Asynchronous Workers: Celery + Redis handle long-running tasks like profile scraping and geocoding.

  5. Local Inference Service: A dedicated Text Embeddings Inference (TEI) service runs locally to provide high-performance vector representation of professional data.

Application architecture

Why Shiny for Python?

Choosing the right frontend framework was critical. Traditional SPAs (Single Page Applications) often feel disconnected from the data science lifecycle. Shiny for Python bridges this gap.

Note: Reactive Programming

I can define dependencies between UI filters (like geography or industry clusters) and the underlying data without manually managing state in JavaScript.

Note: Native Integration

Since the logic stays in Python, I can directly call into Pandas, Scikit-learn, and ML services without building REST APIs for every filter action.

Note: Enterprise-Grade UI

By applying curated themes and vanilla CSS, I was able to create a refined aesthetic without sacrificing the performance of a lightweight website.

Decoupling with Celery and Redis

One of the requirements for this project was “No-Hang UI”. Parsing thousands of profiles or geocoding hundreds of locations can take minutes or even hours. If these were synchronous calls, the application would be unusable.

I implemented a robust Celery task group (chord/group pattern). When a user uploads their ZIP, the system immediately recognizes which profiles are missing enriched data. It spins up a chord of tasks:

  • Trigger: Dispatch batches of URLs to the scraper.

  • Poll: Periodically check for result readiness.

  • Finalize: Once all batches are parsed, update the database and notify the UI via a reactive signal.

This separation of concerns ensures that the web server remains responsive to user interactions while the background workers handle the heavy I/O and processing.

Technical Highlights

Semantic Understanding of Job Titles

The core value of this application lies in its ability to turn various forms of job titles into meaningful groupings.

Tip: Local ML Inference with Vector Embeddings

The very first implementation used TF/IDF vectorization combined with KMeans clustering. While this provided a basic grouping, it failed to capture the semantic nuances. For example, “Data Scientist” and “ML Engineer” would be placed in separate clusters despite their close relationship.

To achieve semantic understanding, I moved toward Vector Embeddings. Instead of treating words as discrete tokens, I represent each job title as a 768-dimensional vector in a continuous semantic space.
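To make “closeness in semantic space” concrete: similarity between two embedding vectors is conventionally measured with cosine similarity, which is what lets “Data Scientist” and “ML Engineer” land near each other. A minimal stdlib implementation:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # zero vectors (e.g., the fallback embedding) match nothing
    return dot / (norm_a * norm_b)
```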

Instead of relying on costly external APIs (like OpenAI), I hosted a local Text Embeddings Inference (TEI) service using Hugging Face’s huggingface/text-embeddings-inference container. This provides:

  • Privacy: No professional data ever leaves the local environment.

  • Zero Latency/Cost: High-speed inference without per-token billing.

  • Semantic Accuracy: Using multilingual-mpnet-base-v2, the system handles professional jargon across multiple languages.

To maintain performance, I implemented a multi-layered caching strategy:

import logging
import requests
from functools import lru_cache

EMBEDDING_DIM = 768

@lru_cache(maxsize=1024)
def get_embedding(text: str) -> list[float]:
    """Fetch an embedding from the local TEI service, caching results in-process."""
    if not text.strip():
        return [0.0] * EMBEDDING_DIM
    try:
        response = requests.post(
            TEXT_EMBEDDINGS_URL,
            json={"inputs": [text]},
            timeout=10,
        )
        response.raise_for_status()
        return response.json()[0]
    except Exception:
        # Fall back to a zero vector so downstream clustering keeps working
        logging.exception("Embedding request failed for %r", text)
        return [0.0] * EMBEDDING_DIM

This caching ensures that common titles (e.g., “Founder”, “Engineer”) only hit the inference service once per session.

Unsupervised Clustering Pipeline

Once we have vectors for every connection, the task is to group them. I built a pipeline that combines Latent Semantic Analysis (LSA) for dimensionality reduction and KMeans for clustering.

The pipeline looks like this:

  • Preprocessing: Normalize job titles and resolve common abbreviations (e.g., “Sr.” to “Senior”).

  • LSA (TruncatedSVD + Normalizer): Reduces the 768 dimensions to a denser representation (96 components), focusing on the most significant semantic variance and reducing computational overhead for the clustering step.

  • KMeans Clustering: Groups the denser vectors into \(N\) clusters.
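The steps above can be sketched with scikit-learn; the component count matches the 96 mentioned earlier, while the random seed and `n_init` are illustrative choices, not the project’s actual settings:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

def cluster_embeddings(embeddings: np.ndarray, n_clusters: int) -> np.ndarray:
    """LSA (TruncatedSVD + Normalizer) followed by KMeans; returns a label per row."""
    # Reduce the 768-dim embeddings to a denser 96-dim representation
    lsa = make_pipeline(
        TruncatedSVD(n_components=96, random_state=42),
        Normalizer(copy=False),
    )
    reduced = lsa.fit_transform(embeddings)
    # Group the reduced vectors into n_clusters clusters
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit_predict(reduced)
```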

A key challenge was dynamic cluster naming. An unsupervised model only gives you cluster IDs (e.g., Cluster #4), which are useless for a user. I implemented a heuristic that automatically names each cluster by identifying the most frequently occurring job title closest to the cluster’s centroid:

# Identifying the representative name for each cluster
# Count how often each job title occurs within its cluster
df["PosFreq"] = df.groupby(["Cluster", "Position"])["Position"].transform("count")
# The per-cluster maximum frequency identifies the dominant title
df["MaxFreq"] = df.groupby("Cluster")["PosFreq"].transform("max")
# Keep one dominant title per cluster to use as its name
positions = df.query("PosFreq == MaxFreq").drop_duplicates("Cluster")
# The resulting Cluster Name provides context like "Engineering Manager" or "Product Design"
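As a self-contained illustration of this naming heuristic, here is the frequency-based step run end to end on toy data (the titles and cluster IDs are made up for the example):

```python
import pandas as pd

# Toy data: two clusters of job titles (hypothetical example)
df = pd.DataFrame({
    "Cluster": [0, 0, 0, 1, 1],
    "Position": ["Engineering Manager", "Engineering Manager", "Tech Lead",
                 "Product Designer", "Product Designer"],
})

# Frequency of each title within its cluster
df["PosFreq"] = df.groupby(["Cluster", "Position"])["Position"].transform("count")

# The most frequent title in each cluster becomes the cluster name
names = (
    df.sort_values("PosFreq", ascending=False)
      .drop_duplicates("Cluster")
      .set_index("Cluster")["Position"]
)
```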

Geocoding at Scale

Finally, to power the geographical groupings and visualizations, I integrated a geocoding service. To keep the system efficient, I implemented a persistence-layer cache. Instead of geocoding every connection’s location string individually, I maintain a locations table. The Celery worker only hits the Google Maps Geocoding API for location strings that haven’t been resolved before, significantly reducing API usage and improving processing speed for subsequent data uploads.
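The cache-first lookup can be sketched as follows, with sqlite3 standing in for the PostgreSQL `locations` table and `geocode_fn` as a hypothetical wrapper around the Google Maps Geocoding API:

```python
import sqlite3

def geocode_cached(conn: sqlite3.Connection, location: str, geocode_fn):
    """Return (lat, lng) for a location string, calling geocode_fn only on a cache miss."""
    row = conn.execute(
        "SELECT lat, lng FROM locations WHERE name = ?", (location,)
    ).fetchone()
    if row is not None:
        return row  # cache hit: no API call
    lat, lng = geocode_fn(location)
    with conn:  # persist the result so later uploads reuse it
        conn.execute(
            "INSERT INTO locations (name, lat, lng) VALUES (?, ?, ?)",
            (location, lat, lng),
        )
    return (lat, lng)
```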

Geographic locations of connections are displayed as grouped sections in a treemap and also shown on a world map

System Robustness

An engineer’s work is defined not just by the “happy path”, but by how the system handles state, persistence, and deployment.

Data Integrity

I utilized PostgreSQL to manage the application’s state. This allows for complex relation management (e.g., linking a Connection to their Geolocation and their Clusters). To handle the “Generative Credits” system, I implemented atomic transactions to ensure that balance deductions only occur after a successful profile parse, preventing data inconsistency.
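The parse-then-deduct invariant can be sketched with a single transaction that rolls back entirely on failure; sqlite3 stands in for PostgreSQL here, and the table shapes and `parse_fn` are assumptions for the example:

```python
import sqlite3

def record_parse(conn: sqlite3.Connection, user_id: int, url: str, parse_fn) -> bool:
    """Deduct one credit only if parse_fn succeeds; roll everything back otherwise."""
    try:
        with conn:  # one atomic transaction: commit on success, rollback on error
            data = parse_fn(url)  # raises on a failed parse, before any write commits
            conn.execute("INSERT INTO profiles (url, data) VALUES (?, ?)", (url, data))
            conn.execute("UPDATE users SET credits = credits - 1 WHERE id = ?", (user_id,))
        return True
    except Exception:
        return False  # balance and profiles table are untouched
```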

Local Observability

Running a complex stack locally can be a “black box” challenge. By utilizing Docker Compose, I centralized log management and service health checks. If the Text Embeddings Inference (TEI) service or the Redis broker fails, the Nginx load balancer and Shiny app provide immediate feedback rather than obscure Python stack traces.

Future Roadmap

RAG (Retrieval-Augmented Generation): By indexing the parsed profiles into a vector database (like Chroma or Faiss), I could implement a private LLM chat interface to ask questions like: “Who in my network has experience with Kubernetes and has been active in recent months?”

Conclusion

The My Professional Network project is more than just a dashboard; it’s a prototype for data agency. By moving the analytics layer from a centralized platform to a local, user-controlled environment, we reclaim the ability to navigate our professional lives with intent.

By combining modern frontend reactivity (Shiny), asynchronous infrastructure (Celery/Redis), and local ML inference (TEI), I’ve built a system that respects privacy while delivering the kind of “superpowers” usually reserved for big-tech internal tools.

In the end, your network is your most valuable professional asset — it’s time you had the tools to actually manage it.

See Also

Below are some of my other posts related to building applications, natural language processing, and data visualization:
