
Cuban Address Manzanification System


Overview

This system automates the process of matching Cuban street addresses to their corresponding manzanas (housing blocks). The system is designed to process batches of addresses from Excel files and return enriched data with manzana assignments and confidence scores.

Use Case Flow

  1. User uploads Excel file with addresses
  2. Backend processes addresses through manzanification engine
  3. System returns Excel with added manzana column
  4. User reviews/corrects results in separate validation tool

What is Manzanification?

Manzanification is the process of geocoding Cuban addresses to their administrative housing block units called manzanas.

A manzana is a city block or housing unit used in Cuban urban planning and administration. Each manzana has a unique identifier (e.g., 781A, 817, 676).

Cuban Address Structure

Cuban addresses typically follow these patterns:

| Pattern | Example | Components |
| --- | --- | --- |
| Street + Number + Cross Streets | CALLE 39 # 2012 / 20 Y 22 | Main street, house number, intersecting streets |
| Avenue + Number | AVE 48 # 6116 / 61 Y 63 | Avenue, house number, cross streets |
| Building + Apartment | EDF 44 APT 5 JS | Building number, apartment, settlement code |
| Street + Cross Streets (no number) | CALLE 61 / 20 Y 22 | Main street, cross streets only |

The cross streets (e.g., "20 Y 22") define the block boundaries, making spatial matching feasible.
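As a rough illustration of how the patterns above break down into components, here is a minimal regex sketch. The helper name `split_address` and the exact patterns are assumptions made for this example; real data will likely need more variants.

```python
import re

# Hypothetical patterns for illustration; not an exhaustive grammar.
STREET_RE = re.compile(r"\b(CALLE|AVE)\s*(\d+)", re.IGNORECASE)
NUMBER_RE = re.compile(r"#\s*(\d+)")
CROSS_RE = re.compile(r"(?:/|ENTRE)\s*(\d+)\s*Y\s*(\d+)", re.IGNORECASE)


def split_address(address):
    """Split an address string into street, house number and cross streets."""
    street = STREET_RE.search(address)
    number = NUMBER_RE.search(address)
    cross = CROSS_RE.search(address)
    return {
        "street": f"{street.group(1).upper()} {street.group(2)}" if street else None,
        "number": number.group(1) if number else None,
        "cross": (cross.group(1), cross.group(2)) if cross else None,
    }


examples = [
    "CALLE 39 # 2012 / 20 Y 22",
    "AVE 48 # 6116 / 61 Y 63",
    "EDF 44 APT 5 JS",
    "CALLE 61 / 20 Y 22",
]
for example in examples:
    print(example, "->", split_address(example))
```

Building-style addresses such as `EDF 44 APT 5 JS` yield no street or cross streets here, which is why they cannot be matched spatially.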


Algorithm Research & Selection

We evaluated four spatial matching algorithms for this project:

Algorithm 1: Point-in-Polygon

Approach: Generate a point from the address coordinates and check which manzana polygon contains it.

Process:

  1. Find main street geometry
  2. Find intersection point of cross streets
  3. Use house number to offset point along the street
  4. Check which manzana polygon contains this point

Pros:

  • Simple and fast
  • Low computational overhead

Cons:

  • Requires accurate point generation
  • Assumes house numbers are evenly distributed
  • Fails if point falls on boundary or in gaps
  • No handling of ambiguous cases

Verdict: Too brittle for real-world Cuban address data
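
The point-in-polygon idea can be sketched with Shapely on toy geometry. The coordinates, block names, and the helper `locate_by_point` are all assumptions for illustration; the street is horizontal so the lateral offset can be applied directly along the y axis.

```python
from shapely.geometry import LineString, Point, box

# Toy data in meters: a street along y = 0 with one block on each side.
street = LineString([(0, 0), (100, 0)])
blocks = {
    "north": box(0, 5, 100, 80),
    "south": box(0, -80, 100, -5),
}


def locate_by_point(street, fraction, lateral_offset, blocks):
    """Return the id of the block containing the estimated point, or None."""
    # Position along the street, e.g. estimated from the house number.
    on_street = street.interpolate(fraction, normalized=True)
    # Shift off the centerline (valid only because this street is horizontal).
    candidate = Point(on_street.x, on_street.y + lateral_offset)
    for block_id, polygon in blocks.items():
        if polygon.contains(candidate):
            return block_id
    return None


print(locate_by_point(street, 0.25, 20, blocks))  # point lands inside a block
print(locate_by_point(street, 0.25, 2, blocks))   # point lands in the street gap
```

The second call shows the brittleness noted above: a point that falls in the gap between the street centerline and the block boundary matches nothing.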


Algorithm 2: Street Segment + Buffer (SELECTED)

Approach: Find the street segment between cross streets, buffer it, and intersect with manzanas.

Process:

  1. Get main street LineString (e.g., "C-39")
  2. Find intersections with cross streets ("C-20" and "C-22")
  3. Extract street segment between these two points
  4. Determine which side using house number logic
  5. Buffer the segment appropriately
  6. Find manzana(s) that intersect this buffer
  7. Return manzana with highest overlap area

Pros:

  • Uses actual street geometry, not just points
  • Cross streets provide precise block boundaries
  • Buffer handles boundary inaccuracies in shapefiles
  • Works even if house numbers aren't perfectly sequential
  • Returns confidence scores based on overlap
  • Graceful degradation when cross streets missing

Cons:

  • Requires calibration of buffer distance
  • Assumes street geometries are relatively accurate
  • Side determination logic may need adjustment

Verdict: ✅ SELECTED - Best balance of accuracy, robustness, and maintainability
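
Steps 4 to 7 can be tried on toy geometry before touching real shapefiles. This is a sketch under assumptions: the coordinates are invented, `best_block` is a hypothetical helper, and using the winner's share of total overlap as a score is one possible choice rather than the scoring used by the implementation below. It relies on Shapely's `single_sided=True` buffer, where a positive distance buffers the left-hand side of the line direction and a negative distance the right-hand side.

```python
from shapely.geometry import LineString, box

# Segment of the main street between two cross streets, heading in +x.
segment = LineString([(0, 0), (100, 0)])
blocks = {"781A": box(0, 5, 100, 80), "817": box(0, -80, 100, -5)}


def best_block(segment, blocks, distance, side="both"):
    """Return (block id, share of total overlap) for the buffered segment."""
    if side == "both":
        area = segment.buffer(distance)
    else:
        signed = distance if side == "left" else -distance
        area = segment.buffer(signed, single_sided=True)

    overlaps = {bid: poly.intersection(area).area for bid, poly in blocks.items()}
    overlaps = {bid: a for bid, a in overlaps.items() if a > 0}
    if not overlaps:
        return None, 0.0

    best = max(overlaps, key=overlaps.get)
    return best, overlaps[best] / sum(overlaps.values())


print(best_block(segment, blocks, 50, "left"))
print(best_block(segment, blocks, 50, "right"))
print(best_block(segment, blocks, 50, "both"))
```

With a two-sided buffer both blocks overlap about equally, so the share drops to roughly one half; that is the ambiguity the side logic is meant to resolve.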


Algorithm 3: Nearest Neighbor with Constraints

Approach: Find manzanas near the intersection, rank by distance and constraints.

Process:

  1. Find intersection of main street with cross streets
  2. Get all manzanas within X meters of intersection
  3. Filter manzanas that actually touch the main street
  4. Use house number range to pick the right one
  5. Return closest match

Pros:

  • Handles messy/incomplete data well
  • Doesn't require perfect street geometries
  • Good for quick prototyping

Cons:

  • Requires manual tuning of distance threshold
  • May return wrong manzana if threshold too large
  • Less precise than segment-based approach
  • Doesn't use full geometric information

Verdict: Good fallback option, but less precise than Algorithm 2
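
A minimal sketch of steps 1 to 3 and 5, again on invented geometry. The helper `nearest_block` and both thresholds are assumptions; step 4 (house number ranges) is left out because it needs per-block numbering data.

```python
from shapely.geometry import LineString, Point, box

main_street = LineString([(0, 0), (300, 0)])
intersection = Point(100, 0)
blocks = {
    "A": box(105, 5, 195, 80),     # fronts the street, close to the intersection
    "B": box(205, 5, 295, 80),     # fronts the street, further away
    "C": box(105, 100, 195, 180),  # does not front the main street
}


def nearest_block(intersection, main_street, blocks, max_distance, street_tolerance):
    """Return the closest block within range that also fronts the main street."""
    candidates = []
    for block_id, polygon in blocks.items():
        distance = polygon.distance(intersection)
        if distance > max_distance:
            continue  # outside the search radius
        if polygon.distance(main_street) > street_tolerance:
            continue  # too far from the main street to "touch" it
        candidates.append((distance, block_id))
    return min(candidates)[1] if candidates else None


print(nearest_block(intersection, main_street, blocks, 150, 10))
print(nearest_block(intersection, main_street, blocks, 5, 10))
```

A small tolerance is used instead of a strict `touches` test because block boundaries rarely coincide exactly with street centerlines.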


Algorithm 4: Graph-Based Network Analysis

Approach: Model streets as a network graph and navigate to the address location.

Process:

  1. Build street network graph from shapefiles
  2. Find intersection node (cross streets + main street)
  3. Navigate along main street for estimated distance
  4. Find manzana at that graph position

Pros:

  • Most accurate for complex urban layouts
  • Handles one-way streets, connectivity issues
  • Good for routing applications

Cons:

  • Significant implementation complexity
  • Higher computational overhead
  • Requires clean, well-connected street network
  • Overkill for static address matching

Verdict: Too complex for current requirements; revisit if routing needed
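
To make the idea concrete, here is a deliberately tiny sketch using plain dictionaries instead of a graph library. The node names, edge lengths, and manzana assignments are invented, and `walk` is a hypothetical helper that only follows a single chain of edges.

```python
# Hypothetical toy network: nodes are intersections along C-39; each edge
# stores its length in meters and the manzana fronting that stretch.
edges = {
    ("C-39/C-20", "C-39/C-22"): {"length": 100.0, "manzana": "781A"},
    ("C-39/C-22", "C-39/C-24"): {"length": 120.0, "manzana": "817"},
}


def walk(edges, start, distance):
    """Follow edges from `start` until `distance` meters have been covered."""
    adjacency = {a: (b, data) for (a, b), data in edges.items()}
    node, remaining = start, distance
    while node in adjacency:
        next_node, data = adjacency[node]
        if remaining <= data["length"]:
            return data["manzana"]
        remaining -= data["length"]
        node = next_node
    return None  # walked off the end of the known network


print(walk(edges, "C-39/C-20", 40))
print(walk(edges, "C-39/C-20", 150))
print(walk(edges, "C-39/C-20", 500))
```

A real version would need branching, both travel directions, and a topologically clean network, which is where most of the implementation cost noted above comes from.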


Installation

Required Python Libraries

pip install geopandas pandas shapely openpyxl

Library Descriptions

| Library | Purpose | Documentation |
| --- | --- | --- |
| geopandas | Reading shapefiles, spatial operations | docs |
| pandas | Excel I/O, data manipulation | docs |
| shapely | Geometric operations (buffers, intersections) | docs |
| openpyxl | Excel file reading/writing | docs |

Additional Dependencies

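geopandas installs most of what it needs, although the exact set depends on the geopandas version:

  • fiona - Shapefile I/O (recent geopandas releases default to pyogrio instead)
  • pyproj - Coordinate system transformations
  • rtree - Spatial indexing for performance (optional; not needed when Shapely 2.x is installed)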

Implementation

Sample Code

Below is the complete implementation of the Street Segment + Buffer algorithm:

```python
import re

import geopandas as gpd
import pandas as pd
from shapely.geometry import LineString


class ManzanaMatcher:
    def __init__(self, streets_shapefile, manzanas_shapefile):
        """
        Initialize the matcher with shapefiles.

        Args:
            streets_shapefile: Path to streets shapefile
            manzanas_shapefile: Path to manzanas shapefile
        """
        self.streets = gpd.read_file(streets_shapefile)
        self.manzanas = gpd.read_file(manzanas_shapefile)

        # Ensure same CRS. Buffer distances are expressed in CRS units, so
        # both layers should use a projected CRS measured in meters.
        if self.streets.crs != self.manzanas.crs:
            self.streets = self.streets.to_crs(self.manzanas.crs)

    def parse_address(self, address):
        """
        Parse Cuban address into components.

        Returns dict with:
        - main_street: e.g., "C-39" or "A-46"
        - cross_street_1: e.g., "20"
        - cross_street_2: e.g., "22"
        - house_number: e.g., "2406"
        - building: e.g., "44"
        - apartment: e.g., "5"
        """
        result = {
            'main_street': None,
            'cross_street_1': None,
            'cross_street_2': None,
            'house_number': None,
            'building': None,
            'apartment': None
        }

        # Empty Excel cells arrive as NaN (a float), not as a string
        if not isinstance(address, str):
            return result

        # Extract main street (CALLE XX or AVE XX); \s* allows "CALLE 39"
        street_match = re.search(r'\b(CALLE|AVE|C-|A-)\s*(\d+)', address, re.IGNORECASE)
        if street_match:
            prefix = 'C-' if street_match.group(1).upper().startswith('C') else 'A-'
            result['main_street'] = f"{prefix}{street_match.group(2)}"

        # Extract house number (# XXXX); the '#' is required so that street
        # or building numbers are not mistaken for house numbers
        number_match = re.search(r'#\s*(\d+)', address)
        if number_match:
            result['house_number'] = number_match.group(1)

        # Extract cross streets (/ XX Y XX or ENTRE XX Y XX)
        cross_match = re.search(r'(?:/|ENTRE)\s*(\d+)\s*[YX]\s*(\d+)', address, re.IGNORECASE)
        if cross_match:
            result['cross_street_1'] = cross_match.group(1)
            result['cross_street_2'] = cross_match.group(2)

        # Extract building (EDF XX or EDIFICIO XX)
        building_match = re.search(r'(?:EDF|EDIFICIO)\s*(\d+)', address, re.IGNORECASE)
        if building_match:
            result['building'] = building_match.group(1)

        # Extract apartment
        apt_match = re.search(r'APT\s*(\d+)', address, re.IGNORECASE)
        if apt_match:
            result['apartment'] = apt_match.group(1)

        return result

    def find_street(self, street_name):
        """Find street geometry by name."""
        # Adjust column name based on your shapefile
        # Common names: 'name', 'NAME', 'street_nam', 'CALLE'
        street_col = 'name'  # CHANGE THIS to match your shapefile

        # Word boundaries keep "C-2" from also matching "C-20" or "C-22"
        pattern = rf'\b{re.escape(street_name)}\b'
        matches = self.streets[self.streets[street_col].str.contains(
            pattern, case=False, na=False, regex=True)]

        if len(matches) == 0:
            return None
        # Only the first matching feature is used; a street stored as several
        # features would need to be merged beforehand
        return matches.iloc[0].geometry

    def find_intersection(self, main_street_geom, cross_street_1, cross_street_2):
        """
        Find the segment of main street between two cross streets.

        Returns:
            Straight LineString between the two intersection points, or None
        """
        # Find cross street geometries
        cross_geom_1 = self.find_street(f"C-{cross_street_1}")
        cross_geom_2 = self.find_street(f"C-{cross_street_2}")

        # Try with A- prefix
        if cross_geom_1 is None:
            cross_geom_1 = self.find_street(f"A-{cross_street_1}")
        if cross_geom_2 is None:
            cross_geom_2 = self.find_street(f"A-{cross_street_2}")

        if cross_geom_1 is None or cross_geom_2 is None:
            return None

        # Find intersection points
        int_1 = main_street_geom.intersection(cross_geom_1)
        int_2 = main_street_geom.intersection(cross_geom_2)

        if int_1.is_empty or int_2.is_empty:
            return None

        # Reduce each intersection to a single point (it may be a MultiPoint
        # or, where lines overlap, a LineString)
        points = []
        for geom in (int_1, int_2):
            if hasattr(geom, 'geoms'):
                geom = list(geom.geoms)[0]
            if geom.geom_type != 'Point':
                geom = geom.representative_point()
            points.append((geom.x, geom.y))

        if points[0] == points[1]:
            return None

        # Create segment between intersection points
        return LineString(points)

    def determine_side(self, house_number):
        """
        Determine which side of street based on house number.

        Returns: 'left' or 'right' or 'both'
        """
        if house_number is None:
            return 'both'

        # Even/odd logic (common in many cities)
        # Adjust this based on Cuban numbering conventions
        num = int(house_number)
        if num % 2 == 0:
            return 'right'
        else:
            return 'left'

    def find_manzana(self, address):
        """
        Main function to find manzana for an address.

        Returns:
            dict with 'manzana' code, 'confidence' score and 'method'
        """
        # Parse address
        parsed = self.parse_address(address)

        # Handle building-only addresses (like EDF 44 APT 5 JS)
        if parsed['building'] and not parsed['main_street']:
            # For building-based addresses, you might need a separate lookup
            # or manual mapping. For now, return no match (confidence 0.0).
            return {'manzana': None, 'confidence': 0.0, 'method': 'building_only'}

        if not parsed['main_street']:
            return {'manzana': None, 'confidence': 0.0, 'method': 'no_street'}

        # Find main street
        main_street_geom = self.find_street(parsed['main_street'])
        if main_street_geom is None:
            return {'manzana': None, 'confidence': 0.0, 'method': 'street_not_found'}

        # Find street segment between cross streets
        if parsed['cross_street_1'] and parsed['cross_street_2']:
            segment = self.find_intersection(
                main_street_geom,
                parsed['cross_street_1'],
                parsed['cross_street_2']
            )
            if segment is None:
                # Fallback: use entire street
                segment = main_street_geom
                confidence_penalty = 0.3
            else:
                confidence_penalty = 0.0
        else:
            # No cross streets, use entire main street
            segment = main_street_geom
            confidence_penalty = 0.5

        # Determine side
        side = self.determine_side(parsed['house_number'])

        # Buffer the segment (adjust buffer distance based on your data)
        buffer_distance = 50  # meters, adjust as needed

        # Left/right are relative to the segment direction, which runs from
        # cross_street_1 to cross_street_2 (or along the stored geometry)
        if side == 'both':
            buffered = segment.buffer(buffer_distance)
        elif side == 'left':
            # Positive distance with single_sided=True buffers the left side
            buffered = segment.buffer(buffer_distance, single_sided=True)
        else:  # right
            # Negative distance with single_sided=True buffers the right side
            buffered = segment.buffer(-buffer_distance, single_sided=True)

        # Find manzanas that intersect the buffer
        # Adjust column name based on your shapefile
        manzana_col = 'manzana'  # CHANGE THIS to match your shapefile

        intersecting = self.manzanas[self.manzanas.intersects(buffered)]

        if len(intersecting) == 0:
            return {'manzana': None, 'confidence': 0.0, 'method': 'no_intersection'}

        # Calculate overlap area for each candidate
        intersecting = intersecting.copy()
        intersecting['overlap'] = intersecting.geometry.apply(
            lambda geom: geom.intersection(buffered).area
        )

        # Get the manzana with highest overlap
        best_match = intersecting.loc[intersecting['overlap'].idxmax()]
        manzana_code = best_match[manzana_col]

        # Calculate confidence based on:
        # - Whether cross streets were found (penalty set above)
        # - Whether there was more than one candidate
        confidence = 1.0 - confidence_penalty
        if len(intersecting) > 1:
            confidence *= 0.8

        return {
            'manzana': manzana_code,
            'confidence': round(confidence, 2),
            'method': 'spatial_match',
            'candidates': len(intersecting)
        }

    def process_excel(self, input_path, output_path, address_column='DIRECCION'):
        """
        Process an Excel file with addresses and add manzana column.

        Args:
            input_path: Path to input Excel file
            output_path: Path to output Excel file
            address_column: Name of column containing addresses
        """
        # Read Excel
        df = pd.read_excel(input_path)

        # Process each address
        results = []
        for address in df[address_column]:
            result = self.find_manzana(address)
            results.append(result)

        # Add results to dataframe
        df['MANZANA_PREDICTED'] = [r['manzana'] for r in results]
        df['CONFIDENCE'] = [r['confidence'] for r in results]
        df['MATCH_METHOD'] = [r['method'] for r in results]

        # Save to Excel
        df.to_excel(output_path, index=False)

        return df
```


Usage

Basic Usage

```python
from manzana_matcher import ManzanaMatcher

# Initialize matcher with your shapefiles
matcher = ManzanaMatcher(
    streets_shapefile='data/streets.shp',
    manzanas_shapefile='data/manzanas.shp'
)

# Match a single address
address = "CALLE 39 # 2012 / 20 Y 22"
result = matcher.find_manzana(address)

print(f"Manzana: {result['manzana']}")
print(f"Confidence: {result['confidence']}")
print(f"Method: {result['method']}")
```

Batch Processing

```python
# Process entire Excel file
output_df = matcher.process_excel(
    input_path='input/addresses.xlsx',
    output_path='output/addresses_with_manzanas.xlsx',
    address_column='DIRECCION'
)

# View statistics
print(f"Total addresses: {len(output_df)}")
print(f"Matched: {output_df['MANZANA_PREDICTED'].notna().sum()}")
print(f"Average confidence: {output_df['CONFIDENCE'].mean():.2f}")
```
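
Since results are reviewed in a separate validation tool, it can help to pull out the rows most likely to need correction. The sketch below uses a stand-in DataFrame with made-up values in place of the one returned by `process_excel()`; the 0.7 threshold is an assumption to tune against validated samples.

```python
import pandas as pd

# Stand-in for the DataFrame returned by process_excel(); values are made up.
output_df = pd.DataFrame({
    "DIRECCION": ["CALLE 39 # 2012 / 20 Y 22", "CALLE 61", "EDF 44 APT 5 JS"],
    "MANZANA_PREDICTED": ["781A", "676", None],
    "CONFIDENCE": [1.0, 0.4, 0.0],
    "MATCH_METHOD": ["spatial_match", "spatial_match", "building_only"],
})

REVIEW_THRESHOLD = 0.7  # assumed cut-off, not a value from the project

# Rows below the threshold are candidates for manual review.
needs_review = output_df[output_df["CONFIDENCE"] < REVIEW_THRESHOLD]
print(needs_review[["DIRECCION", "MATCH_METHOD", "CONFIDENCE"]])

# Breakdown by match method shows where failures concentrate.
print(output_df["MATCH_METHOD"].value_counts())
```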

API Integration

```python
import tempfile

from flask import Flask, request, send_file
from manzana_matcher import ManzanaMatcher

app = Flask(__name__)
matcher = ManzanaMatcher('data/streets.shp', 'data/manzanas.shp')


@app.route('/api/manzanify', methods=['POST'])
def manzanify():
    # Get uploaded file
    file = request.files['excel']

    # Save to temp location; the handle is closed before processing so the
    # upload is fully written. delete=False means cleanup is left to the caller.
    with tempfile.NamedTemporaryFile(delete=False, suffix='.xlsx') as tmp_in:
        file.save(tmp_in)

    # Reserve a path for the output file
    with tempfile.NamedTemporaryFile(delete=False, suffix='.xlsx') as tmp_out:
        pass

    # Process
    matcher.process_excel(tmp_in.name, tmp_out.name)
    return send_file(tmp_out.name, as_attachment=True)
```


Configuration

Critical Configuration Points

Before running the system, you must configure these settings in the code:

1. Shapefile Column Names

```python
# In find_street() method
street_col = 'name'  # Change to your streets column name
# Common alternatives: 'NAME', 'street_nam', 'CALLE', 'nom_calle'

# In find_manzana() method
manzana_col = 'manzana'  # Change to your manzanas column name
# Common alternatives: 'MANZANA', 'codigo', 'id', 'block_id'
```

How to find your column names:

```python
import geopandas as gpd

# Check streets columns
streets = gpd.read_file('streets.shp')
print("Streets columns:", streets.columns.tolist())

# Check manzanas columns
manzanas = gpd.read_file('manzanas.shp')
print("Manzanas columns:", manzanas.columns.tolist())
```

2. Buffer Distance

```python
# In find_manzana() method
buffer_distance = 50  # meters
```

How to calibrate:

  • Measure typical manzana width in your shapefiles
  • Start with 50m and adjust based on results
  • Too small: misses correct manzanas
  • Too large: includes too many candidates
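
One way to get a starting value is to derive it from the block sizes themselves. The sketch below is an assumption-laden heuristic, not the project's calibration method: `suggest_buffer` is a hypothetical helper that approximates block width as the square root of the polygon area (exact only for square blocks) and takes a fraction of the median. It only makes sense when the geometries are in a projected CRS measured in meters.

```python
import math
from statistics import median

from shapely.geometry import box


def suggest_buffer(polygons, fraction=0.5):
    """Suggest a buffer distance as a fraction of the median block width."""
    widths = [math.sqrt(p.area) for p in polygons if not p.is_empty]
    return fraction * median(widths)


# Toy blocks of 100 m, 80 m and 120 m per side.
blocks = [box(0, 0, 100, 100), box(0, 0, 80, 80), box(0, 0, 120, 120)]
print(suggest_buffer(blocks))
```

With real data the same function could be called on the manzanas geometry column, for example `suggest_buffer(manzanas.geometry)`, and the result compared against match quality on a validated sample.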

3. House Number Side Logic

```python
# In determine_side() method
def determine_side(self, house_number):
    if house_number is None:
        return 'both'

    num = int(house_number)
    if num % 2 == 0:
        return 'right'  # Even numbers on right
    else:
        return 'left'   # Odd numbers on left
```

Cuban numbering conventions:

  • Verify if even/odd rule applies in Cuba
  • Some cities use sequential numbering
  • May vary by neighborhood
  • Consider disabling side logic initially (always return 'both')
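
Before enabling side logic, the even/odd rule can be checked against a handful of addresses whose side is already known. The sketch below assumes such validated samples exist; the sample values and the helper `parity_agreement` are invented for illustration.

```python
def parity_agreement(samples):
    """Share of samples where even numbers are on the right and odd on the left."""
    hits = 0
    for house_number, observed_side in samples:
        predicted = "right" if int(house_number) % 2 == 0 else "left"
        hits += predicted == observed_side
    return hits / len(samples) if samples else 0.0


# Made-up (house_number, observed_side) pairs standing in for validated data.
samples = [("2012", "right"), ("2015", "left"), ("2406", "left"), ("6116", "right")]
rate = parity_agreement(samples)
print(f"Even/odd rule agrees with {rate:.0%} of samples")
```

An agreement rate near 50% would suggest the rule carries no information for that area, in which case returning 'both' is the safer default.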

Limitations & Future Work

Current Limitations

  1. Building-Only Addresses
    • Addresses like EDF 44 APT 5 JS cannot be spatially matched
    • Require separate lookup table or manual mapping
    • Currently return None with a confidence of 0.0
  2. Missing Cross Streets
    • Addresses without cross streets use entire street geometry
    • Lower confidence scores
    • May match to wrong manzana if street is long
  3. House Numbering Assumptions
    • Even/odd side logic may not apply in all Cuban cities
    • Sequential numbering not well-handled
    • May need city-specific calibration
  4. Shapefile Quality Dependency
    • Requires accurate street geometries
    • Streets must actually intersect in the data
    • Manzana boundaries must align with streets
  5. Performance
    • Not optimized for real-time processing
    • Large batches (10,000+ addresses) may be slow
    • No spatial indexing implemented
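
For the performance limitation, a spatial index can narrow the candidate manzanas before any overlap areas are computed. This is a minimal sketch assuming Shapely 2.x, where `STRtree.query` accepts a `predicate` argument and returns integer positions into the input list; the polygons are toy data.

```python
from shapely.geometry import LineString, box
from shapely.strtree import STRtree

# Toy manzana polygons; the third one is far from the street.
polygons = [
    box(0, 5, 100, 80),
    box(0, -80, 100, -5),
    box(1000, 1000, 1100, 1100),
]
tree = STRtree(polygons)  # built once, reused for every address

search_area = LineString([(0, 0), (100, 0)]).buffer(50)

# Positions of polygons that intersect the search area (Shapely 2.x behavior).
hits = tree.query(search_area, predicate="intersects")
matches = sorted(int(i) for i in hits)
print(matches)
```

With geopandas, the equivalent lookup is available through the GeoDataFrame's `sindex.query(geometry, predicate="intersects")`, which avoids building a separate tree.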