Building my AI University Semantic Search App

Rahul Bhattacharya Sep 27, 2025 05:16 AM


Synopsis: In this project I develop a semantic search tool for exploring universities using natural language queries. By interpreting meaning rather than keywords, it connects descriptive inputs to structured data, enabling flexible discovery and demonstrating how search can become more intuitive and user-friendly.

The idea for this project came from a search problem. I often wanted to explore universities by describing them in plain language. I imagined typing phrases like "technical university in Germany, old" or "medical college in Japan established before 1900". That thought pushed me toward semantic search. Dataset used here.


I decided to create an application that would accept natural language queries and map them to real universities. Instead of fixed keyword matching, the system uses embeddings. This allowed me to capture meaning rather than surface tokens. The process was not only educational but also showed me how embeddings can make structured data more accessible. This is the story of how I built, packaged, and deployed the system.

Repository Overview

The repository contains several important files:

app.py — the main Streamlit application code.
requirements.txt — the dependency specification for deployment.
README.md — the guide that documents the pipeline.
data/universities_embedded.parquet — metadata of universities including names, countries, and inception years.
data/universities_embedded.npz — precomputed embeddings created in Colab.

Each of these files is critical. The application cannot function without the embeddings and metadata. The app file orchestrates the pipeline while the requirements lock dependencies to consistent versions.

Requirements File

The requirements file ensures the runtime has all necessary packages. Below is the content:

streamlit==1.37.1
pandas==2.2.2
pyarrow==16.1.0
numpy==1.26.4
scikit-learn==1.4.2
sentence-transformers==2.7.0

Each dependency plays a direct role.

Streamlit creates the user interface.
Pandas handles structured metadata stored in parquet.
PyArrow is required to read parquet efficiently.
NumPy manages the embeddings in arrays.
Scikit‑learn provides cosine similarity functions.
Sentence‑transformers loads the model that encodes text queries, the same model used to create the stored embeddings.

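Without pinned versions, the app may behave differently on Streamlit Cloud than it did locally. Pinning exact versions of these direct dependencies greatly reduces that risk, although their transitive dependencies can still vary.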
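One way to confirm that an environment matches the pins is to compare them with what is installed. The sketch below is illustrative only and is not part of the repository: the helper names parse_pins and find_mismatches are hypothetical, and it relies solely on the standard library's importlib.metadata.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the requirements.txt shown above.
REQUIREMENTS = """\
streamlit==1.37.1
pandas==2.2.2
pyarrow==16.1.0
numpy==1.26.4
scikit-learn==1.4.2
sentence-transformers==2.7.0
"""


def parse_pins(requirements_text):
    """Return {package: pinned_version} for lines of the form name==version."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, pinned = line.split("==", 1)
            pins[name.strip().lower()] = pinned.strip()
    return pins


def find_mismatches(pins):
    """Return {package: (pinned, installed_or_None)} where the environment differs."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


if __name__ == "__main__":
    for name, (pinned, installed) in find_mismatches(parse_pins(REQUIREMENTS)).items():
        print(f"{name}: pinned {pinned}, installed {installed}")
```

Running this locally before pushing prints one line per package whose installed version differs from the pin, and nothing when the environment matches.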

README File

The README explains the flow. It shows that data is produced in Colab and uploaded here. It clarifies that two files, the parquet and npz, must exist in the data/ folder. The document also gives deployment steps. For example, it instructs how to connect the repository to Streamlit Cloud and specify app.py as the entry file. Finally, it lists example queries that demonstrate the natural language search capability.

This README is short but crucial. It provides instructions for others to reproduce results. Without it, the context of how embeddings were generated would be missing. That gap would stop new users from understanding why those artifacts exist.

Application Code

The application is defined entirely inside app.py. It contains about 176 lines. I will go block by block and explain the purpose and design decisions.

Import Block

import os
import numpy as np
import pandas as pd
import streamlit as st

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

This block loads core libraries. os checks for file existence. numpy manages numeric arrays including embeddings. pandas reads parquet files and handles metadata. streamlit powers the interface. The external libraries are SentenceTransformer to load the embedding model and cosine_similarity from scikit‑learn to compute similarity between query vectors and university vectors. Every later step depends on these imports.

Page Configuration

st.set_page_config(page_title="University Semantic Search", layout="wide")
st.title("University Semantic Search")
st.caption(
    "Type natural language: e.g., 'technical university in Germany, old' or "
    "'medical college in Japan established before 1900'."
)

This section configures the page layout. The title provides immediate context for the user. The caption demonstrates usage by giving realistic queries. The wide layout ensures the results table has enough horizontal space. Without these settings the interface would look cramped. Streamlit provides these helpers to make apps professional.

Resource Loading Function

@st.cache_resource
def load_resources():
    meta_path = "data/universities_embedded.parquet"
    vec_path = "data/universities_embedded.npz"

    if not os.path.exists(meta_path) or not os.path.exists(vec_path):
        st.error("Missing data artifacts. Make sure both files exist: "
                 "`data/universities_embedded.parquet` and `data/universities_embedded.npz`.")
        st.stop()

    df_meta = pd.read_parquet(meta_path)
    npz = np.load(vec_path)
    emb = npz["emb"].astype("float32")

    if "row_id" not in df_meta.columns:
        df_meta = df_meta.copy()
        df_meta["row_id"] = np.arange(len(df_meta))

    if len(df_meta) != emb.shape[0]:
        st.error(f"Row count mismatch: parquet has {len(df_meta)} rows but embeddings have {emb.shape[0]} vectors.")
        st.stop()

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    if "inception" in df_meta.columns:
        df_meta["inception_num"] = pd.to_numeric(df_meta["inception"], errors="coerce")
    else:
        df_meta["inception_num"] = np.nan

    valid_years = df_meta["inception_num"].dropna().astype(int)
    has_year = valid_years.size > 0
    if has_year:
        year_min = int(valid_years.min())
        year_max = int(valid_years.max())
    else:
        year_min, year_max = None, None

    return df_meta, emb, model, has_year, year_min, year_max

This is the most important helper in the app. It is decorated with @st.cache_resource so results are cached between runs. That avoids reloading data repeatedly. The function loads metadata and embeddings from the two files. It then validates that the number of rows matches the number of vectors. This prevents silent misalignment errors.

A row_id column is created if missing. This ensures that every university has a stable index. The model is then loaded using the same transformer used during embedding creation. This guarantees consistent vector spaces. Finally, the inception year column is cleaned into numeric form, and min and max years are calculated if present. This supports optional filtering later. Without this function, the entire search pipeline cannot run.

Query Encoding Function

def encode_query(model, text):
    vec = model.encode([text], convert_to_numpy=True, normalize_embeddings=True)
    return vec

This helper converts a string into an embedding vector. It takes the loaded model and applies it to the query. The embedding is normalized which simplifies cosine similarity later. This function encapsulates encoding so the main logic does not repeat code. It represents the bridge from natural language into numeric form.

Search Function

def search(df_meta, emb, model, query, topk=10, year_range=None):
    qvec = encode_query(model, query)
    sims = cosine_similarity(qvec, emb)[0]
    df_meta = df_meta.copy()
    df_meta["score"] = sims

    if year_range is not None:
        lo, hi = year_range
        df_meta = df_meta[
            (df_meta["inception_num"].isna()) |
            ((df_meta["inception_num"] >= lo) & (df_meta["inception_num"] <= hi))
        ]

    df_top = df_meta.sort_values("score", ascending=False).head(topk)
    return df_top

This function performs the actual semantic search. It encodes the query, computes similarity between the query vector and all embeddings, and attaches the score to the metadata. If a year range is provided, results are restricted to universities founded within that range, while rows with no known inception year are kept rather than dropped. The final step sorts by similarity score and keeps the top results. This function ties together encoding, similarity, and filtering.

Sidebar Inputs

df_meta, emb, model, has_year, year_min, year_max = load_resources()

with st.sidebar:
    query = st.text_input("Search query")
    topk = st.slider("Top K results", min_value=5, max_value=50, value=10)
    year_range = None
    if has_year:
        year_range = st.slider("Year range", min_value=year_min, max_value=year_max,
                               value=(year_min, year_max))

Here the application builds the sidebar. It first loads resources. Then it creates a text input for the query. A slider controls how many results to show. If inception years exist, another slider lets the user restrict the range. This block collects all user input needed for search. Streamlit automatically re‑runs the script when inputs change.

Main Execution

if query:
    df_top = search(df_meta, emb, model, query, topk=topk, year_range=year_range)
    st.dataframe(df_top[["label", "country", "inception", "score"]])
else:
    st.info("Enter a query to start searching.")

This conditional checks if a query has been entered. If yes, it calls the search function and displays the resulting dataframe with relevant columns. The score column shows similarity strength. If no query is given, an info message guides the user. This conditional ensures the app does nothing until input is provided, which avoids wasted computation.

Explanation of Data Artifacts

Two files must be generated externally.

universities_embedded.parquet holds metadata with names, countries, and inception years.
universities_embedded.npz stores the embedding array with vectors for each university.

These are produced in Colab using the same transformer model. Once generated, they are uploaded into the data/ folder. The Streamlit app only consumes them; it does not rebuild them. This separation keeps deployment lightweight.
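The Colab notebook is not shown in this post, so the following is only a sketch of how the npz artifact could be written and read back. The key name emb is taken from app.py; the helper names are hypothetical, and random vectors stand in for the output of model.encode (all-MiniLM-L6-v2 produces 384-dimensional vectors).

```python
import os
import tempfile

import numpy as np


def save_embeddings(vectors, path):
    """Store a 2-D float32 array under the key "emb", the key app.py reads."""
    emb = np.asarray(vectors, dtype="float32")
    if emb.ndim != 2:
        raise ValueError("expected a 2-D array of shape (n_rows, dim)")
    np.savez_compressed(path, emb=emb)
    return emb.shape


def load_embeddings(path):
    """Read the array back the same way load_resources does."""
    with np.load(path) as npz:
        return npz["emb"].astype("float32")


# Stand-in for model.encode(texts): 5 rows of 384-dimensional vectors.
demo = np.random.default_rng(0).normal(size=(5, 384))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "universities_embedded.npz")
    shape = save_embeddings(demo, path)
    restored = load_embeddings(path)

print(shape, restored.dtype)
```

Writing the parquet metadata from the same ordered table, in the same notebook cell, is the simplest way to keep row i of the metadata matched to vector i.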

Deployment Flow

Deployment was done on Streamlit Cloud. After pushing the repository to GitHub, I connected it to Streamlit, pointed to app.py, and set dependencies from requirements.txt. Because the data files were under 100 MB, they could be stored in GitHub directly. The caching mechanism in load_resources ensured fast reloads during usage. After deployment, the app was ready to accept queries immediately.

Lessons Learned

Building this taught me several lessons. Semantic search becomes very practical when combined with embeddings and structured data. Streamlit provides a straightforward way to expose machine learning models to end users. The biggest challenge was ensuring data alignment between parquet and npz. Adding validation checks saved time during debugging. Another lesson was that explicit version pinning in requirements.txt is essential for stability.

Deep Dive into Each Function and Conditional

Why Cache Resources?

Caching resources in Streamlit is not only about speed but also about user experience. Each time a widget changes, Streamlit reruns the script. If resources were reloaded on every rerun, the app would become sluggish. The decorator @st.cache_resource ensures that heavy data loading and model initialization happen only once. This makes the application responsive even when embeddings are large.
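The effect can be illustrated outside Streamlit. The sketch below uses functools.lru_cache as a stand-in. It is an analogy for the behavior described above, not how st.cache_resource is implemented, and the loader name and its return value are hypothetical.

```python
from functools import lru_cache

load_calls = 0  # counts how often the expensive body actually runs


@lru_cache(maxsize=None)
def load_resources_once():
    """Pretend to load data and a model; the body runs only on the first call."""
    global load_calls
    load_calls += 1
    return {"rows": 3}


# Simulate five script reruns triggered by widget changes.
for _ in range(5):
    resources = load_resources_once()

print(load_calls)
```

Every call after the first returns the same cached object, which is why widget changes do not trigger another load.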

Validating Input Files

The function load_resources checks if both the parquet and npz files exist. This conditional prevents cryptic runtime errors. If a file is missing, the user sees a clear error message. The program then stops execution using st.stop(). This pattern is safer than letting the code fail later with a stack trace. It improves reliability of the interface.
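A small variation on this check is to report exactly which file is missing. The helper below is a hypothetical sketch using only the standard library; in the app its result would feed st.error in place of the fixed message.

```python
import os
import tempfile


def missing_artifacts(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]


with tempfile.TemporaryDirectory() as tmp:
    meta_path = os.path.join(tmp, "universities_embedded.parquet")
    vec_path = os.path.join(tmp, "universities_embedded.npz")

    # Create only the metadata file so that the embeddings are "missing".
    with open(meta_path, "wb") as fh:
        fh.write(b"")

    missing = missing_artifacts([meta_path, vec_path])
    missing_names = [os.path.basename(p) for p in missing]

print(missing_names)
```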

Creating Row IDs

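Another conditional checks if row_id exists. Embeddings are linked to metadata purely by position: row i of the parquet file is assumed to describe vector i of the npz array. If the parquet metadata does not carry an explicit row_id, a new one is created with np.arange, which records that position so it survives later sorting and filtering. The column does not verify alignment by itself. If the two files were written in different orders, similarity results would still point to the wrong university, so both artifacts must be exported from the same ordered table.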

Matching Metadata and Embeddings

The function also verifies that the number of metadata rows matches the number of embedding vectors. If not, it displays an error with st.error and halts with st.stop(). This ensures that every row has a corresponding vector, although a count check alone cannot detect rows stored in a different order. Data alignment is one of the most common pitfalls in machine learning systems. Adding this explicit check makes the project more robust.
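A length check catches missing or extra rows. A stricter variant, sketched below under the assumption that the same ids are saved alongside both artifacts (the app as shown does not do this), also catches rows that were reordered. The function name is hypothetical.

```python
import numpy as np


def check_alignment(meta_row_ids, emb_row_ids):
    """Compare two id sequences position by position; return (ok, message)."""
    meta_row_ids = np.asarray(meta_row_ids)
    emb_row_ids = np.asarray(emb_row_ids)
    if meta_row_ids.shape != emb_row_ids.shape:
        return False, (
            f"length mismatch: {meta_row_ids.shape[0]} metadata rows "
            f"vs {emb_row_ids.shape[0]} vectors"
        )
    bad = np.flatnonzero(meta_row_ids != emb_row_ids)
    if bad.size:
        return False, f"first mismatch at position {bad[0]}"
    return True, "aligned"


print(check_alignment([0, 1, 2], [0, 1, 2]))
print(check_alignment([0, 1, 2], [0, 2, 1]))
```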

Handling Inception Years

Not all records have inception years. The code converts the column into numeric values using pd.to_numeric with errors="coerce". Any invalid entries become NaN. This design allows later filtering without crashing. The conditional then calculates min and max only if valid years exist. This extra step supports the year slider in the interface.
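The coercion step can be tried in isolation. The sample values below are made up for illustration, and year_bounds is a hypothetical helper that mirrors the min and max logic in load_resources.

```python
import pandas as pd


def year_bounds(raw_years):
    """Coerce values to numbers, drop the invalid ones, return (min, max) or (None, None)."""
    years = pd.to_numeric(pd.Series(raw_years, dtype="object"), errors="coerce").dropna()
    if years.empty:
        return None, None
    return int(years.min()), int(years.max())


# Strings, a missing value, free text and a plain integer, mixed together.
sample = ["1386", "unknown", None, "1877", 1810]

print(year_bounds(sample))
print(year_bounds(["n/a"]))
```

Entries that cannot be parsed are ignored when computing the bounds, and a column with no usable years yields (None, None), which is the case where the app hides the year slider.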

Understanding the Search Pipeline

Encoding Queries

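The helper encode_query is minimal but powerful. It encapsulates embedding logic so other functions remain clean. The query embedding is normalized to unit length. When the stored embeddings are unit length as well, cosine similarity is equivalent to a plain dot product. In this app the search function calls scikit‑learn's cosine_similarity, which normalizes both inputs itself, so the scores are correct whether or not the stored vectors were normalized.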
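The relationship between cosine similarity and the dot product is easy to check numerically. This sketch uses two small made-up vectors and hypothetical helper names; it shows that the two quantities agree only when both vectors have unit length.

```python
import numpy as np


def l2_normalize(x):
    """Scale a vector (or each row of a matrix) to unit length."""
    x = np.asarray(x, dtype="float64")
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)  # guard against division by zero


def cosine(a, b):
    """Cosine similarity computed from its definition."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


q = np.array([3.0, 4.0, 0.0])  # length 5
d = np.array([1.0, 2.0, 2.0])  # length 3

cos_raw = cosine(q, d)                                      # 11 / (5 * 3)
dot_unit = float(np.dot(l2_normalize(q), l2_normalize(d)))  # both normalized
dot_half = float(np.dot(l2_normalize(q), d))                # only the query normalized

print(cos_raw, dot_unit, dot_half)
```

With only the query normalized, the dot product is scaled by the length of the stored vector (3 here), which would distort the ranking if stored vectors had different lengths.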

Cosine Similarity

The search function uses cosine_similarity to compare query vectors with all embeddings. The result is a score for each university. Attaching the score back to the dataframe integrates numeric results with metadata. Sorting by score produces the ranking.
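Sorting the whole dataframe is simple and fast enough at this scale. For much larger collections, one alternative is to select the top k scores without a full sort. The sketch below uses np.argpartition; it is not what app.py does, and the function name is hypothetical.

```python
import numpy as np


def top_k_indices(scores, k):
    """Return the indices of the k highest scores, best first."""
    scores = np.asarray(scores, dtype="float64")
    k = min(k, scores.size)
    if k <= 0:
        return []
    # argpartition places the k largest scores in the first k slots, unordered.
    candidates = np.argpartition(-scores, k - 1)[:k]
    # Only those k candidates need a real sort.
    ordered = candidates[np.argsort(-scores[candidates])]
    return ordered.tolist()


print(top_k_indices([0.1, 0.9, 0.5, 0.7], k=2))
```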

Year Range Filtering

The optional year_range argument is a key feature. It allows users to focus on universities within a historical window. The conditional uses logical operators to keep rows with inception years inside the range or NaN. This inclusive design ensures that institutions without years are not unfairly discarded.
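The inclusive behavior can be made explicit with a flag. In the sketch below, keep_missing is a hypothetical parameter that the app does not have, and the sample rows are invented; with keep_missing=True the function matches the filter in search.

```python
import numpy as np
import pandas as pd


def filter_by_year(df, lo, hi, keep_missing=True, column="inception_num"):
    """Keep rows whose year lies in [lo, hi]; optionally keep rows with no year."""
    in_range = df[column].between(lo, hi)  # inclusive on both ends, NaN gives False
    if keep_missing:
        return df[in_range | df[column].isna()]
    return df[in_range]


df = pd.DataFrame(
    {
        "label": ["A", "B", "C"],
        "inception_num": [1386.0, np.nan, 1950.0],
    }
)

kept = filter_by_year(df, 1300, 1900)["label"].tolist()
strict = filter_by_year(df, 1300, 1900, keep_missing=False)["label"].tolist()

print(kept, strict)
```

The trade-off is that a narrow range can still return institutions of unknown age, so exposing the flag as a checkbox would let users decide.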

Streamlit Sidebar Explained

Streamlit automatically rebuilds the page when sidebar inputs change. The sidebar groups controls neatly. The text_input widget lets users type queries. The slider for top K results is intuitive for adjusting output length. Another slider for year range uses min and max calculated earlier. This dynamic binding makes the app interactive with minimal code.

Displaying Results

The main conditional displays results only when a query exists. This prevents confusion when the app first loads. The table shows label, country, inception, and score. Presenting only these columns keeps the interface clean. The choice to include score is deliberate; it reveals ranking strength. This transparency helps users trust the output.

Future Extensions

Several extensions are possible. The metadata could be enriched with student population, research output, or global ranking. The embeddings could be fine‑tuned for better performance on educational queries. The interface could allow faceted search by continent or type. These features can be layered on the current pipeline without major redesign.

Reflections

Looking back, the most striking aspect is how little code it takes to deliver powerful search. Fewer than 200 lines integrate deep learning, efficient similarity, and a user‑friendly interface. The project demonstrates how modern NLP tools reduce barriers. Users with working Python skills can now create semantic systems that once required large teams.

Thoughts

The University Semantic Search project began from a simple frustration but evolved into a showcase of applied machine learning. It emphasizes the importance of validation, caching, and modular functions. The design choices highlight practical engineering: explicit checks, clear messages, and reproducible data pipelines. This balance of theory and practice makes the project a valuable template for future work.

Deployment Notes

Deploying on Streamlit Cloud is straightforward but a few points are worth noting. The platform automatically installs dependencies from requirements.txt. The embedding model itself is still fetched at runtime: SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2") downloads the weights the first time it is called if they are not already cached. What the app avoids is encoding every university at startup, which is why the embeddings were precomputed in Colab. This separation reduces cold start time. Streamlit Cloud has memory limits, so keeping the app lightweight matters. By uploading only essential artifacts, I avoided hitting those limits.

Another deployment detail is caching behavior. The first run after deployment can take a while because it loads the parquet and npz files and initializes the model. Later reruns skip that work thanks to caching and only need to encode the query and compute similarities. Users rarely notice delays, which improves perception of performance. Sharing the app link allows anyone to experiment without local setup. This accessibility was a primary motivation.

Lessons

This project shows that semantic search can be applied beyond universities. The same pattern works for products, books, or research papers. The critical step is building embeddings for items and storing metadata. Once that is done, the rest of the pipeline is identical. Streamlit makes it easy to reuse the interface for new domains. Understanding this generality was a big lesson.
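To make the domain independence concrete, here is a toy sketch in which the encoder is a pluggable function. The bag-of-words encoder is a hypothetical stand-in for a real embedding model and captures no meaning beyond shared words. The product names are invented, and none of these helpers exist in the repository.

```python
import numpy as np


def make_toy_encoder(corpus):
    """Build a unit-length bag-of-words encoder over the corpus vocabulary."""
    tokens = sorted({tok for text in corpus for tok in text.lower().split()})
    vocab = {tok: i for i, tok in enumerate(tokens)}

    def encode(text):
        vec = np.zeros(len(vocab), dtype="float32")
        for tok in text.lower().split():
            if tok in vocab:
                vec[vocab[tok]] += 1.0
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 0 else vec

    return encode


def semantic_search(query, items, encode, k=3):
    """Rank items by dot product between unit-length query and item vectors."""
    item_vecs = np.stack([encode(text) for text in items])
    scores = item_vecs @ encode(query)
    order = np.argsort(-scores)[:k]
    return [(items[i], float(scores[i])) for i in order]


items = ["red running shoes", "blue denim jacket", "stainless steel kettle"]
encode = make_toy_encoder(items)
results = semantic_search("red shoes", items, encode, k=1)

print(results)
```

Swapping encode for a function that wraps a sentence-transformers model, and items for book titles or paper abstracts, leaves semantic_search unchanged.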

Another lesson was about balancing transparency and abstraction. I chose to show similarity scores to users, but I hid the raw vectors. The interface reveals enough for trust but avoids overwhelming users. This principle can guide other applications too. Always decide which internal details should surface and which should remain hidden.

Remarks

The journey from idea to deployment taught me how modern tools shorten development cycles. Ten years ago, building such a system would have required custom servers and complex front ends. Now, a single Python file and a few data artifacts are enough. This democratization of technology is powerful. It opens doors for students, researchers, and hobbyists to explore semantic methods. I built this for a personal reason, but the lessons apply widely. That is the real success of the project.

Conclusion

The University Semantic Search app demonstrates how semantic embeddings can power intuitive queries. By breaking down the pipeline into data preparation, embedding, caching, searching, and displaying, the system becomes reliable and reproducible. The modular design makes it easy to extend, such as adding new filters or expanding metadata. This project began as a personal need but grew into a demonstration of how modern NLP can transform structured search.
