Cuban Address Manzanification System
Table of Contents
- Overview
- What is Manzanification?
- Algorithm Research & Selection
- Installation
- Implementation
- Usage
- Configuration
- Limitations & Future Work
Overview
This system automates the process of matching Cuban street addresses to their corresponding manzanas (housing blocks). The system is designed to process batches of addresses from Excel files and return enriched data with manzana assignments and confidence scores.
Use Case Flow
- User uploads Excel file with addresses
- Backend processes addresses through manzanification engine
- System returns Excel with added manzana column
- User reviews/corrects results in separate validation tool
What is Manzanification?
Manzanification is the process of geocoding Cuban addresses to their administrative housing block units called manzanas.
A manzana is a city block or housing unit used in Cuban urban planning and administration. Each manzana has a unique identifier (e.g., 781A, 817, 676).
Cuban Address Structure
Cuban addresses typically follow these patterns:
| Pattern | Example | Components |
|---|---|---|
| Street + Number + Cross Streets | CALLE 39 # 2012 / 20 Y 22 | Main street, house number, intersecting streets |
| Avenue + Number + Cross Streets | AVE 48 # 6116 / 61 Y 63 | Avenue, house number, intersecting streets |
| Building + Apartment | EDF 44 APT 5 JS | Building number, apartment, settlement code |
| Street + Cross Streets (no number) | CALLE 61 / 20 Y 22 | Main street, cross streets only |
The cross streets (e.g., "20 Y 22") define the block boundaries, making spatial matching feasible.
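As a minimal sketch of how the cross-street pair can be pulled out of the patterns above, the snippet below uses a regular expression that accepts either `/` or `ENTRE` before the pair. The names `CROSS_RE` and `extract_cross_streets` are hypothetical, and the pattern assumes numeric street names only.

```python
import re

# Matches "/ 20 Y 22" or "ENTRE 20 Y 22" (assumption: numeric street names only).
CROSS_RE = re.compile(r'(?:/|\bENTRE\b)\s*(\d+)\s*Y\s*(\d+)', re.IGNORECASE)


def extract_cross_streets(address):
    """Return (cross_1, cross_2) as strings, or None if no pair is present."""
    match = CROSS_RE.search(address)
    if match is None:
        return None
    return match.group(1), match.group(2)


for sample in ("CALLE 39 # 2012 / 20 Y 22", "EDF 44 APT 5 JS"):
    print(sample, "->", extract_cross_streets(sample))
```

Addresses without a pair, such as the building pattern, return `None` and need a different matching path.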
Algorithm Research & Selection
We evaluated four spatial matching algorithms for this project:
Algorithm 1: Point-in-Polygon
Approach: Estimate a point for the address from the street geometry and house number, then check which manzana polygon contains it.
Process:
- Find main street geometry
- Find intersection point of cross streets
- Use house number to offset point along the street
- Check which manzana polygon contains this point
Pros:
- Simple and fast
- Low computational overhead
Cons:
- Requires accurate point generation
- Assumes house numbers are evenly distributed
- Fails if point falls on boundary or in gaps
- No handling of ambiguous cases
Verdict: Too brittle for real-world Cuban address data
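For illustration, a rough sketch of the point-in-polygon idea using Shapely. The function `locate_by_point`, its parameters, and the toy geometries are assumptions made for this example. It also assumes a projected CRS with meter units and that the offset along the street has already been estimated from the house number.

```python
from shapely.geometry import LineString, Point, Polygon


def locate_by_point(street, offset_m, side_shift_m, manzanas):
    """Return the id of the manzana containing the estimated address point.

    street: LineString of the main street, starting at the first cross street.
    offset_m: estimated distance along the street (from the house number).
    side_shift_m: shift off the centerline (positive = left of travel).
    manzanas: dict mapping manzana id -> Polygon.
    """
    on_street = street.interpolate(offset_m)
    # Local direction of travel, used to step sideways off the centerline.
    ahead = street.interpolate(min(offset_m + 1.0, street.length))
    behind = street.interpolate(max(offset_m - 1.0, 0.0))
    dx, dy = ahead.x - behind.x, ahead.y - behind.y
    norm = (dx * dx + dy * dy) ** 0.5
    point = Point(on_street.x - dy / norm * side_shift_m,
                  on_street.y + dx / norm * side_shift_m)
    for manzana_id, polygon in manzanas.items():
        if polygon.contains(point):
            return manzana_id
    return None  # the point fell on a boundary or in a gap


# Toy data: a street along y = 0 with one block on each side.
street = LineString([(0, 0), (100, 0)])
manzanas = {
    "781A": Polygon([(0, 5), (100, 5), (100, 80), (0, 80)]),
    "817": Polygon([(0, -5), (100, -5), (100, -80), (0, -80)]),
}
print(locate_by_point(street, 40, 20, manzanas))
print(locate_by_point(street, 40, 2, manzanas))
```

The second call lands in the gap between the centerline and the block edge and returns `None`, which is the brittleness noted in the cons.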
Algorithm 2: Street Segment + Buffer (SELECTED)
Approach: Find the street segment between cross streets, buffer it, and intersect with manzanas.
Process:
- Get main street LineString (e.g., "C-39")
- Find intersections with cross streets ("C-20" and "C-22")
- Extract street segment between these two points
- Determine which side using house number logic
- Buffer the segment appropriately
- Find manzana(s) that intersect this buffer
- Return manzana with highest overlap area
Pros:
- Uses actual street geometry, not just points
- Cross streets provide precise block boundaries
- Buffer handles boundary inaccuracies in shapefiles
- Works even if house numbers aren't perfectly sequential
- Supports confidence scores (the sample code derives them from cross-street availability and candidate count)
- Graceful degradation when cross streets missing
Cons:
- Requires calibration of buffer distance
- Assumes street geometries are relatively accurate
- Side determination logic may need adjustment
Verdict: ✅ SELECTED - Best balance of accuracy, robustness, and maintainability
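The buffer-and-rank step at the core of this approach can be sketched in a few lines of Shapely. The helper `rank_by_overlap` and the toy boxes are hypothetical, and coordinates are assumed to be in a projected CRS with meter units.

```python
from shapely.geometry import LineString, box


def rank_by_overlap(segment, manzanas, buffer_distance):
    """Return (manzana_id, overlap_area) pairs, largest overlap first."""
    buffered = segment.buffer(buffer_distance)
    overlaps = []
    for manzana_id, polygon in manzanas.items():
        area = polygon.intersection(buffered).area
        if area > 0:
            overlaps.append((manzana_id, area))
    return sorted(overlaps, key=lambda pair: pair[1], reverse=True)


# Toy data: the main-street segment between two cross streets.
segment = LineString([(0, 0), (100, 0)])
manzanas = {
    "676": box(0, 5, 100, 90),      # directly north of the segment
    "817": box(120, 5, 220, 90),    # next block east, barely reached
    "781A": box(0, 200, 100, 290),  # too far away to overlap
}
ranked = rank_by_overlap(segment, manzanas, buffer_distance=30)
print([manzana_id for manzana_id, _ in ranked])
```

Ranking by overlap area rather than taking the first intersecting polygon is what keeps a block that is only clipped by the buffer's end cap from winning.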
Algorithm 3: Nearest Neighbor with Constraints
Approach: Find manzanas near the intersection, rank by distance and constraints.
Process:
- Find intersection of main street with cross streets
- Get all manzanas within X meters of intersection
- Filter manzanas that actually touch the main street
- Use house number range to pick the right one
- Return closest match
Pros:
- Handles messy/incomplete data well
- Doesn't require perfect street geometries
- Good for quick prototyping
Cons:
- Requires manual tuning of distance threshold
- May return wrong manzana if threshold too large
- Less precise than segment-based approach
- Doesn't use full geometric information
Verdict: Good fallback option, but less precise than Algorithm 2
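A simplified sketch of this fallback, using only the standard library. It approximates each manzana by its centroid and assumes a precomputed `touches_main_street` flag. Both are assumptions made to keep the example short, and `nearest_candidates` is a hypothetical name.

```python
import math


def nearest_candidates(intersection_xy, manzanas, max_distance_m):
    """Rank manzanas near an intersection, closest first.

    manzanas: list of dicts with hypothetical keys 'id', 'centroid'
    (x, y in meters) and 'touches_main_street' (bool).
    """
    ix, iy = intersection_xy
    ranked = []
    for manzana in manzanas:
        cx, cy = manzana["centroid"]
        distance = math.hypot(cx - ix, cy - iy)
        # Keep only nearby blocks that border the main street.
        if distance <= max_distance_m and manzana["touches_main_street"]:
            ranked.append((distance, manzana["id"]))
    return [manzana_id for _, manzana_id in sorted(ranked)]


manzanas = [
    {"id": "817", "centroid": (50, 40), "touches_main_street": True},
    {"id": "676", "centroid": (30, -40), "touches_main_street": True},
    {"id": "781A", "centroid": (20, 10), "touches_main_street": False},
]
print(nearest_candidates((0, 0), manzanas, max_distance_m=100))
```

Lowering `max_distance_m` to 60 drops 817 from the result, which shows how sensitive the outcome is to the threshold.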
Algorithm 4: Graph-Based Network Analysis
Approach: Model streets as a network graph and navigate to the address location.
Process:
- Build street network graph from shapefiles
- Find intersection node (cross streets + main street)
- Navigate along main street for estimated distance
- Find manzana at that graph position
Pros:
- Most accurate for complex urban layouts
- Handles one-way streets, connectivity issues
- Good for routing applications
Cons:
- Significant implementation complexity
- Higher computational overhead
- Requires clean, well-connected street network
- Overkill for static address matching
Verdict: Too complex for current requirements; revisit if routing needed
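To make the navigation step concrete, here is a toy sketch that walks a single street represented as a chain of edges. The `edges` structure, the node labels, and `walk_street` are hypothetical. A real implementation would build the graph from the shapefiles and handle branching.

```python
def walk_street(edges, start_node, distance_m):
    """Follow a chain of street edges and return the edge that is reached.

    edges: dict mapping node -> (next_node, length_m, manzana_ids),
    a hypothetical adjacency structure for one street.
    """
    node, remaining = start_node, distance_m
    while node in edges:
        next_node, length_m, manzana_ids = edges[node]
        if remaining <= length_m:
            return {"from": node, "to": next_node, "manzanas": manzana_ids}
        remaining -= length_m
        node = next_node
    return None  # walked past the end of the street


# Toy network for Calle 39: intersections with Calles 20, 22 and 24.
calle_39 = {
    "39x20": ("39x22", 110.0, ("781A", "817")),
    "39x22": ("39x24", 105.0, ("676",)),
}
print(walk_street(calle_39, "39x20", 60))
print(walk_street(calle_39, "39x20", 150))
```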
Installation
Required Python Libraries
pip install geopandas pandas shapely openpyxl
Library Descriptions
| Library | Purpose |
|---|---|
| geopandas | Reading shapefiles, spatial operations |
| pandas | Excel I/O, data manipulation |
| shapely | Geometric operations (buffers, intersections) |
| openpyxl | Excel file reading/writing |
Additional Dependencies
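These are typically installed automatically with geopandas, although the exact set depends on the geopandas version (recent releases default to pyogrio for file I/O and no longer need rtree):
- fiona - Shapefile I/O
- pyproj - Coordinate system transformations
- rtree - Spatial indexing for performance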
Implementation
Sample Code
Below is the complete implementation of the Street Segment + Buffer algorithm:
```python
import re

import geopandas as gpd
import pandas as pd
from shapely.geometry import LineString


class ManzanaMatcher:
    def __init__(self, streets_shapefile, manzanas_shapefile):
        """
        Initialize the matcher with shapefiles.

        Args:
            streets_shapefile: Path to streets shapefile
            manzanas_shapefile: Path to manzanas shapefile
        """
        self.streets = gpd.read_file(streets_shapefile)
        self.manzanas = gpd.read_file(manzanas_shapefile)

        # Ensure same CRS
        if self.streets.crs != self.manzanas.crs:
            self.streets = self.streets.to_crs(self.manzanas.crs)

    def parse_address(self, address):
        """
        Parse Cuban address into components.

        Returns dict with:
            - main_street: e.g., "C-39" or "A-46"
            - cross_street_1: e.g., "20"
            - cross_street_2: e.g., "22"
            - house_number: e.g., "2406"
            - building: e.g., "44"
            - apartment: e.g., "5"
        """
        result = {
            'main_street': None,
            'cross_street_1': None,
            'cross_street_2': None,
            'house_number': None,
            'building': None,
            'apartment': None
        }

        # Extract main street (CALLE XX or AVE XX); \s* allows "CALLE 39"
        street_match = re.search(r'(CALLE|AVE|C-|A-)\s*(\d+)', address, re.IGNORECASE)
        if street_match:
            is_calle = street_match.group(1).upper().startswith('C')
            prefix = 'C-' if is_calle else 'A-'
            result['main_street'] = f"{prefix}{street_match.group(2)}"

        # Extract house number (# XXXX); the '#' is required so that
        # street names such as "CALLE 100" are not mistaken for it
        number_match = re.search(r'#\s*(\d+)', address)
        if number_match:
            result['house_number'] = number_match.group(1)

        # Extract cross streets (/ XX Y XX or ENTRE XX Y XX)
        cross_match = re.search(r'(?:/|ENTRE)\s*(\d+)\s*[YX]\s*(\d+)', address, re.IGNORECASE)
        if cross_match:
            result['cross_street_1'] = cross_match.group(1)
            result['cross_street_2'] = cross_match.group(2)

        # Extract building (EDF XX or EDIFICIO XX)
        building_match = re.search(r'(?:EDF|EDIFICIO)\s*(\d+)', address, re.IGNORECASE)
        if building_match:
            result['building'] = building_match.group(1)

        # Extract apartment
        apt_match = re.search(r'APT\s*(\d+)', address, re.IGNORECASE)
        if apt_match:
            result['apartment'] = apt_match.group(1)

        return result

    def find_street(self, street_name):
        """Find street geometry by name."""
        # Adjust column name based on your shapefile
        # Common names: 'name', 'NAME', 'street_nam', 'CALLE'
        street_col = 'name'  # CHANGE THIS to match your shapefile

        # Word boundaries keep "C-2" from also matching "C-20"
        pattern = rf'\b{re.escape(street_name)}\b'
        matches = self.streets[self.streets[street_col].str.contains(
            pattern, case=False, na=False, regex=True)]

        if len(matches) == 0:
            return None

        # Only the first matching feature is used
        return matches.iloc[0].geometry

    def find_intersection(self, main_street_geom, cross_street_1, cross_street_2):
        """
        Find the segment of main street between two cross streets.

        Returns:
            LineString segment between the two cross streets
        """
        # Find cross street geometries
        cross_geom_1 = self.find_street(f"C-{cross_street_1}")
        cross_geom_2 = self.find_street(f"C-{cross_street_2}")

        # Try with A- prefix
        if cross_geom_1 is None:
            cross_geom_1 = self.find_street(f"A-{cross_street_1}")
        if cross_geom_2 is None:
            cross_geom_2 = self.find_street(f"A-{cross_street_2}")

        if cross_geom_1 is None or cross_geom_2 is None:
            return None

        # Find intersection points
        int_point_1 = main_street_geom.intersection(cross_geom_1)
        int_point_2 = main_street_geom.intersection(cross_geom_2)

        if int_point_1.is_empty or int_point_2.is_empty:
            return None

        # Get the actual point (in case intersection returns multiple geometries)
        if hasattr(int_point_1, 'geoms'):
            int_point_1 = list(int_point_1.geoms)[0]
        if hasattr(int_point_2, 'geoms'):
            int_point_2 = list(int_point_2.geoms)[0]

        # Create a straight segment between the intersection points;
        # centroid reduces any non-point result to a single coordinate
        segment = LineString([
            int_point_1.centroid.coords[0],
            int_point_2.centroid.coords[0],
        ])
        return segment

    def determine_side(self, house_number):
        """
        Determine which side of street based on house number.

        Returns: 'left' or 'right' or 'both'
        (relative to the direction of the street segment)
        """
        if house_number is None:
            return 'both'

        # Even/odd logic (common in many cities)
        # Adjust this based on Cuban numbering conventions
        num = int(house_number)
        if num % 2 == 0:
            return 'right'
        else:
            return 'left'

    def find_manzana(self, address):
        """
        Main function to find manzana for an address.

        Returns:
            dict with 'manzana' code and 'confidence' score
        """
        # Empty Excel cells arrive as NaN rather than strings
        if not isinstance(address, str):
            return {'manzana': None, 'confidence': 0.0, 'method': 'no_street'}

        # Parse address
        parsed = self.parse_address(address)

        # Handle building-only addresses (like EDF 44 APT 5 JS)
        if parsed['building'] and not parsed['main_street']:
            # For building-based addresses, you might need a separate lookup
            # or manual mapping. For now, return no match with zero confidence.
            return {'manzana': None, 'confidence': 0.0, 'method': 'building_only'}

        if not parsed['main_street']:
            return {'manzana': None, 'confidence': 0.0, 'method': 'no_street'}

        # Find main street
        main_street_geom = self.find_street(parsed['main_street'])
        if main_street_geom is None:
            return {'manzana': None, 'confidence': 0.0, 'method': 'street_not_found'}

        # Find street segment between cross streets
        if parsed['cross_street_1'] and parsed['cross_street_2']:
            segment = self.find_intersection(
                main_street_geom,
                parsed['cross_street_1'],
                parsed['cross_street_2']
            )
            if segment is None:
                # Fallback: use entire street
                segment = main_street_geom
                confidence_penalty = 0.3
            else:
                confidence_penalty = 0.0
        else:
            # No cross streets, use entire main street
            segment = main_street_geom
            confidence_penalty = 0.5

        # Determine side
        side = self.determine_side(parsed['house_number'])

        # Buffer the segment (adjust buffer distance based on your data)
        buffer_distance = 50  # meters (assumes a projected CRS), adjust as needed

        if side == 'both':
            buffered = segment.buffer(buffer_distance)
        elif side == 'right':
            # Buffer only to the right (negative distance, single-sided)
            buffered = segment.buffer(-buffer_distance, single_sided=True)
        else:  # left
            # Buffer only to the left (positive distance, single-sided)
            buffered = segment.buffer(buffer_distance, single_sided=True)

        # Find manzanas that intersect the buffer
        # Adjust column name based on your shapefile
        manzana_col = 'manzana'  # CHANGE THIS to match your shapefile

        intersecting = self.manzanas[self.manzanas.intersects(buffered)]

        if len(intersecting) == 0:
            return {'manzana': None, 'confidence': 0.0, 'method': 'no_intersection'}

        # Calculate overlap area for each candidate
        intersecting = intersecting.copy()
        intersecting['overlap'] = intersecting.geometry.apply(
            lambda geom: geom.intersection(buffered).area
        )

        # Get the manzana with highest overlap
        best_match = intersecting.loc[intersecting['overlap'].idxmax()]
        manzana_code = best_match[manzana_col]

        # Calculate confidence based on:
        # - Whether cross streets were found
        # - How many candidates there were
        # (the overlap ratio is not factored in yet)
        confidence = 1.0 - confidence_penalty
        if len(intersecting) > 1:
            confidence *= 0.8

        return {
            'manzana': manzana_code,
            'confidence': round(confidence, 2),
            'method': 'spatial_match',
            'candidates': len(intersecting)
        }

    def process_excel(self, input_path, output_path, address_column='DIRECCION'):
        """
        Process an Excel file with addresses and add manzana column.

        Args:
            input_path: Path to input Excel file
            output_path: Path to output Excel file
            address_column: Name of column containing addresses
        """
        # Read Excel
        df = pd.read_excel(input_path)

        # Process each address
        results = []
        for address in df[address_column]:
            result = self.find_manzana(address)
            results.append(result)

        # Add results to dataframe
        df['MANZANA_PREDICTED'] = [r['manzana'] for r in results]
        df['CONFIDENCE'] = [r['confidence'] for r in results]
        df['MATCH_METHOD'] = [r['method'] for r in results]

        # Save to Excel
        df.to_excel(output_path, index=False)

        return df
```
Usage
Basic Usage
```python
from manzana_matcher import ManzanaMatcher

# Initialize matcher with your shapefiles
matcher = ManzanaMatcher(
    streets_shapefile='data/streets.shp',
    manzanas_shapefile='data/manzanas.shp'
)

# Match a single address
address = "CALLE 39 # 2012 / 20 Y 22"
result = matcher.find_manzana(address)

print(f"Manzana: {result['manzana']}")
print(f"Confidence: {result['confidence']}")
print(f"Method: {result['method']}")
```
Batch Processing
```python
# Process entire Excel file
output_df = matcher.process_excel(
    input_path='input/addresses.xlsx',
    output_path='output/addresses_with_manzanas.xlsx',
    address_column='DIRECCION'
)

# View statistics
print(f"Total addresses: {len(output_df)}")
print(f"Matched: {output_df['MANZANA_PREDICTED'].notna().sum()}")
print(f"Average confidence: {output_df['CONFIDENCE'].mean():.2f}")
```
API Integration
```python
import tempfile

from flask import Flask, request, send_file
from manzana_matcher import ManzanaMatcher

app = Flask(__name__)
matcher = ManzanaMatcher('data/streets.shp', 'data/manzanas.shp')


@app.route('/api/manzanify', methods=['POST'])
def manzanify():
    # Get uploaded file
    file = request.files['excel']

    # Save to temp location (written through the open handle)
    with tempfile.NamedTemporaryFile(delete=False, suffix='.xlsx') as tmp_in:
        file.save(tmp_in)

    # Reserve an output path, then process once the handle is closed
    with tempfile.NamedTemporaryFile(delete=False, suffix='.xlsx') as tmp_out:
        output_path = tmp_out.name
    matcher.process_excel(tmp_in.name, output_path)

    return send_file(output_path, as_attachment=True)
```
Configuration
Critical Configuration Points
Before running the system, you must configure these settings in the code:
1. Shapefile Column Names
```python
# In find_street() method
street_col = 'name'  # Change to your streets column name
# Common alternatives: 'NAME', 'street_nam', 'CALLE', 'nom_calle'

# In find_manzana() method
manzana_col = 'manzana'  # Change to your manzanas column name
# Common alternatives: 'MANZANA', 'codigo', 'id', 'block_id'
```
How to find your column names:
```python
import geopandas as gpd

# Check streets columns
streets = gpd.read_file('streets.shp')
print("Streets columns:", streets.columns.tolist())

# Check manzanas columns
manzanas = gpd.read_file('manzanas.shp')
print("Manzanas columns:", manzanas.columns.tolist())
```
2. Buffer Distance
```python
# In find_manzana() method
buffer_distance = 50  # meters
```
How to calibrate:
- Measure typical manzana width in your shapefiles
- Start with 50m and adjust based on results
- Too small: misses correct manzanas
- Too large: includes too many candidates
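One way to explore this trade-off is to sweep a few distances and see which manzanas each buffer reaches. The sketch below uses Shapely on toy boxes. The helper `candidates_per_distance` and the geometries are hypothetical, and meter units in a projected CRS are assumed.

```python
from shapely.geometry import LineString, box


def candidates_per_distance(segment, manzanas, distances):
    """Map each buffer distance to the ids of manzanas the buffer intersects."""
    result = {}
    for distance in distances:
        buffered = segment.buffer(distance)
        result[distance] = sorted(
            manzana_id for manzana_id, polygon in manzanas.items()
            if polygon.intersects(buffered)
        )
    return result


segment = LineString([(0, 0), (100, 0)])
manzanas = {
    "676": box(0, 15, 100, 95),     # block north of the street
    "817": box(0, -95, 100, -15),   # block south of the street
    "781A": box(0, 110, 100, 190),  # one block further north
}
sweep = candidates_per_distance(segment, manzanas, [10, 50, 120])
for distance, ids in sweep.items():
    print(distance, ids)
```

In this toy layout 10 m reaches nothing, 50 m reaches the two adjacent blocks, and 120 m also pulls in a block that does not border the street.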
3. House Number Side Logic
```python
# In determine_side() method
def determine_side(self, house_number):
    if house_number is None:
        return 'both'

    num = int(house_number)
    if num % 2 == 0:
        return 'right'  # Even numbers on right
    else:
        return 'left'  # Odd numbers on left
```
Cuban numbering conventions:
- Verify if even/odd rule applies in Cuba
- Some cities use sequential numbering
- May vary by neighborhood
- Consider disabling side logic initially (always return 'both')
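If side logic is enabled, the buffer itself has to be restricted to one side. A minimal sketch with Shapely's `single_sided` option follows, where a positive distance buffers the left-hand side and a negative distance the right-hand side, relative to the direction in which the segment is digitized. The helper `side_buffer` is hypothetical, and which side corresponds to even or odd numbers remains an assumption to verify against local data.

```python
from shapely.geometry import LineString, Point


def side_buffer(segment, distance, side):
    """Buffer a street segment on the left, the right, or both sides."""
    if side == "left":
        return segment.buffer(distance, single_sided=True)
    if side == "right":
        return segment.buffer(-distance, single_sided=True)
    return segment.buffer(distance)


segment = LineString([(0, 0), (100, 0)])  # digitized west to east
left = side_buffer(segment, 50, "left")    # covers the north side
right = side_buffer(segment, 50, "right")  # covers the south side

print(left.contains(Point(50, 10)), left.contains(Point(50, -10)))
print(right.contains(Point(50, -10)), right.contains(Point(50, 10)))
```

Because left and right depend on digitization direction, reversing a street's coordinates in the shapefile flips the result. This is one more reason to start with `'both'`.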
Limitations & Future Work
Current Limitations
- Building-Only Addresses
  - Addresses like `EDF 44 APT 5 JS` cannot be spatially matched
  - Require separate lookup table or manual mapping
  - Currently return `None` with a confidence of 0.0
- Missing Cross Streets
  - Addresses without cross streets use entire street geometry
  - Lower confidence scores
  - May match to wrong manzana if street is long
- House Numbering Assumptions
  - Even/odd side logic may not apply in all Cuban cities
  - Sequential numbering not well-handled
  - May need city-specific calibration
- Shapefile Quality Dependency
  - Requires accurate street geometries
  - Streets must actually intersect in the data
  - Manzana boundaries must align with streets
- Performance
  - Not optimized for real-time processing
  - Large batches (10,000+ addresses) may be slow
  - No spatial indexing implemented