[{"content":"","date":"11 May 2026","externalUrl":null,"permalink":"/tags/databases/","section":"Tags","summary":"","title":"Databases","type":"tags"},{"content":"","date":"11 May 2026","externalUrl":null,"permalink":"/tags/python/","section":"Tags","summary":"","title":"Python","type":"tags"},{"content":"","date":"11 May 2026","externalUrl":null,"permalink":"/tags/sqlalchemy/","section":"Tags","summary":"","title":"Sqlalchemy","type":"tags"},{"content":" If you\u0026rsquo;ve ever enabled SQL echo in a SQLAlchemy app and watched in horror as a simple for loop fires off 200 queries, you\u0026rsquo;ve met the N+1 problem. This post is about understanding why that happens and choosing between the two most common fixes: selectinload and joinedload. Spoiler: one of them hides a mathematical time bomb.\nRelationship Basics # SQLAlchemy\u0026rsquo;s relationship() construct links two mapped classes together. For example, a User that has many Address objects:\nfrom sqlalchemy.orm import Mapped, relationship class User(Base): __tablename__ = \u0026#34;user_account\u0026#34; id: Mapped[int] = mapped_column(primary_key=True) name: Mapped[str] addresses: Mapped[list[\u0026#34;Address\u0026#34;]] = relationship(back_populates=\u0026#34;user\u0026#34;) class Address(Base): __tablename__ = \u0026#34;address\u0026#34; id: Mapped[int] = mapped_column(primary_key=True) user_id: Mapped[int] = mapped_column(ForeignKey(\u0026#34;user_account.id\u0026#34;)) email_address: Mapped[str] user: Mapped[\u0026#34;User\u0026#34;] = relationship(back_populates=\u0026#34;addresses\u0026#34;) By default, SQLAlchemy uses lazy loading: accessing user.addresses fires a SELECT the first time you touch it. This is convenient in development but becomes a problem in production when you\u0026rsquo;re iterating over a list of users:\n# This looks innocent... users = session.execute(select(User)).scalars().all() for user in users: print(user.addresses) # ...but each iteration fires a new SELECT! 
With 100 users, that\u0026rsquo;s 101 queries - 1 to load the users, then 1 per user to load their addresses. This is the N+1 problem, and it scales beautifully in the wrong direction.\nSQLAlchemy ships with loader strategies to tackle this. The two heavyweights are selectinload and joinedload.\nselectinload - One Query Per Relationship # selectinload solves N+1 by loading each relationship in a separate query, but crucially it loads the entire collection in one shot using a WHERE ... IN (...) clause:\nfrom sqlalchemy.orm import selectinload stmt = ( select(User) .options(selectinload(User.addresses)) # load all addresses in a second query .order_by(User.id) ) users = session.execute(stmt).scalars().all() The SQL that fires:\n-- Query 1: load all users SELECT user_account.id, user_account.name FROM user_account ORDER BY user_account.id; -- Query 2: load all addresses for those users SELECT address.user_id, address.id, address.email_address FROM address WHERE address.user_id IN (1, 2, 3, 4, 5, 6); Two queries. Done. The IN (...) clause bundles all the parent IDs into a single round-trip.\nYou can chain multiple relationships too - typically one additional query per relationship (though large result sets may be split into multiple batched queries):\nstmt = ( select(User) .options( selectinload(User.addresses), selectinload(User.orders), selectinload(User.preferences), ) ) That\u0026rsquo;s 4 queries total, regardless of how many users you\u0026rsquo;re loading. 
Predictable, honest, easy to reason about.\njoinedload - One Query With JOINs # joinedload takes a different approach: it augments the original query with a JOIN so that related objects are returned in the same result set:\nfrom sqlalchemy.orm import joinedload stmt = ( select(Address) .options(joinedload(Address.user, innerjoin=True)) # join user onto each address row .order_by(Address.id) ) addresses = session.execute(stmt).scalars().all() The SQL:\nSELECT address.id, address.email_address, address.user_id, user_account_1.id AS id_1, user_account_1.name FROM address JOIN user_account AS user_account_1 ON user_account_1.id = address.user_id ORDER BY address.id; One query. SQLAlchemy aliases the joined table internally (note user_account_1) so it doesn\u0026rsquo;t interfere with any filtering you\u0026rsquo;ve already applied to the main query.\nThis works brilliantly for many-to-one relationships - loading the \u0026ldquo;owner\u0026rdquo; of an object, like the parent User for each Address. Each address row adds a few extra columns from the user table. Clean and efficient.\nThe trouble starts when you flip it around.\nThe Cartesian Product Problem # For one-to-many relationships - loading a \u0026ldquo;list\u0026rdquo; of children, like all addresses belonging to a user - joinedload has a serious sting in the tail. Consider loading users and their addresses via a join:\nSELECT user_account.*, address.* FROM user_account LEFT OUTER JOIN address ON user_account.id = address.user_id; If a user has 10 addresses, that user\u0026rsquo;s data appears 10 times in the result set. SQLAlchemy deduplicates the User objects once you call .unique() on the result - so session.execute(stmt).unique().scalars().all() returns N parent objects, not N × M rows. But deduplication happens in Python, not the database: the full inflated result set still travels over the wire, is held in memory, and must be processed by the ORM. 
The result looks right; the cost is invisible.\nSQLAlchemy 1.x vs 2.0: deduplication behaviour changed In SQLAlchemy 1.x, the legacy Query API applied uniquing automatically whenever joinedload targeted a collection. In 2.0-style code, session.execute(stmt) no longer does this implicitly: you must call .unique() on the result before consuming it, and SQLAlchemy raises an InvalidRequestError if you forget. Either way, the ORM does that work in Python after receiving the full inflated result set. The cost is hidden, not eliminated.\nNow add a second one-to-many join - say, orders:\nSELECT user_account.*, address.*, orders.* FROM user_account LEFT OUTER JOIN address ON user_account.id = address.user_id LEFT OUTER JOIN orders ON user_account.id = orders.user_id; A user with 10 addresses and 10 orders now appears 100 times in the result set. Each join multiplies the row count. This is the Cartesian product problem, and the blowup compounds multiplicatively with every additional one-to-many join.\nBack of Napkin Maths # Let\u0026rsquo;s put some numbers to this to understand exactly how badly things can blow up.\nAssume:\nN = number of parent rows (e.g., users) R = number of one-to-many relationships M = average number of related items per parent per relationship selectinload # Total queries: 1 + R\nTotal rows fetched across all queries:\n$$\\text{rows} = N + R \\cdot N \\cdot M = N(1 + R \\cdot M)$$This is linear in both R and M - adding more relationships or more items per relationship grows costs predictably.\njoinedload # Total queries: 1\nTotal rows in the result set:\n$$\\text{rows} = N \\cdot (M_1 \\times M_2 \\times \\cdots \\times M_R)$$where \\(M_i\\) is the average collection size for relationship \\(i\\). In the common case where every relationship averages \\(M\\) rows this simplifies to \\(N \\cdot M^R\\) - growing multiplicatively with each additional collection join (effectively exponential in \\(R\\)). 
Each one-to-many join multiplies the row count by that relationship\u0026rsquo;s average size.\nA Concrete Comparison # Let\u0026rsquo;s say you have 100 users, 3 one-to-many relationships, and 10 items per relationship on average:\nStrategy Queries Rows Fetched selectinload 4 100 + (3 × 100 × 10) = 3,100 joinedload 1 100 × 10³ = 100,000 joinedload fetches 32× more data for the same result, and it only gets worse as M or R grows. Crank M to 20 and the joinedload result hits 800,000 rows against 6,100 for selectinload - a 131× blowup.\nThe single-query advantage of joinedload is completely wiped out by the data explosion. Network round-trips are cheap; transferring 800,000 rows is not.\nWhen joinedload Does Win # joinedload shines for many-to-one relationships - loading the \u0026ldquo;owner\u0026rdquo; of each object, a single parent per child. There\u0026rsquo;s no multiplication - you\u0026rsquo;re just adding a fixed number of columns per row. If you\u0026rsquo;re loading 1,000 addresses and want the associated user on each, joinedload(Address.user) is the perfect tool: one query, minimal extra data, done.\nThe rule of thumb: loading an \u0026ldquo;owner\u0026rdquo; (many-to-one) → joinedload, loading a \u0026ldquo;list\u0026rdquo; (one-to-many) → selectinload. The exception: if a collection is known to be very small or tightly bounded (say, \u0026lt;2–3 items), or if you need to filter or order by a joined column, joinedload can still be reasonable for one-to-many.\nConclusion # Lazy loading fires a query every time you access an unloaded relationship - fine in isolation, catastrophic in a loop. selectinload adds one query per relationship using WHERE ... IN (...). Total data scales linearly. The go-to choice for one-to-many collections. joinedload adds a JOIN to the main query. Great for many-to-one (single parent per child), but causes Cartesian product explosions on one-to-many joins - data grows as the product of each relationship\u0026rsquo;s size. 
Deduplication is handled by the ORM in Python, so the result count looks correct, but the memory and transfer cost is real. The maths doesn\u0026rsquo;t lie: with 3 relationships of 10 items each, joinedload fetches 32× more rows than selectinload for the same result. When in doubt, reach for selectinload. If you\u0026rsquo;re ever tempted to use joinedload on a one-to-many, do the napkin maths first - and maybe keep the napkin nearby when you show your DBA the query plan.\nResources # SQLAlchemy Relationship Loading Techniques Select IN Loading Joined Eager Loading The Zen of Joined Eager Loading ","date":"11 May 2026","externalUrl":null,"permalink":"/blog/sqlalchemy-selectinload-vs-joinedload/","section":"Blog","summary":"","title":"SQLAlchemy Relationships: selectinload vs joinedload","type":"blog"},{"content":"","date":"11 May 2026","externalUrl":null,"permalink":"/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"},{"content":"","date":"14 March 2026","externalUrl":null,"permalink":"/tags/code/","section":"Tags","summary":"","title":"Code","type":"tags"},{"content":"Server-Sent Events (SSE) enable a one-way, real-time HTTP stream from server to client. Unlike WebSockets, SSE is unidirectional and leverages standard HTTP, making it ideal for live updates, notifications, and progress tracking. Browsers natively support SSE via the EventSource API.\nFastAPI’s Built-in SSE Support # FastAPI simplifies SSE implementation with its fastapi.sse module:\nfrom fastapi.sse import EventSourceResponse, ServerSentEvent EventSourceResponse: Use this as the response_class on your route. It automatically handles:\nProper event-stream framing Content-Type: text/event-stream headers Keep-alive pings every 15 seconds Cache-Control and X-Accel-Buffering headers ServerSentEvent: Wraps your data. Supports fields like data, event, id, retry, and comment.\nThe Route as a Generator # FastAPI allows your route to be an async generator. 
This pattern is clean and efficient:\n@app.get(\u0026#34;/rt\u0026#34;, response_class=EventSourceResponse) async def road_trip(request: Request) -\u0026gt; AsyncIterable[ServerSentEvent]: \u0026#34;\u0026#34;\u0026#34;Says \u0026#39;Are we there yet?\u0026#39; event every 3 seconds for 4 hours then says \u0026#39;Yes!\u0026#39;.\u0026#34;\u0026#34;\u0026#34; total_time = 0 while True: if await request.is_disconnected(): break yield ServerSentEvent(data=\u0026#34;Are we there yet?\u0026#34;) await asyncio.sleep(3) total_time += 3 if total_time \u0026gt;= 4 * 60 * 60: # change this to try it out. yield ServerSentEvent(data=\u0026#34;Yes!\u0026#34;) await asyncio.sleep(1) yield ServerSentEvent(data=\u0026#34;really?\u0026#34;) await asyncio.sleep(1) yield ServerSentEvent(data=\u0026#34;NO!\u0026#34;) break Key Point: FastAPI automatically handles the EventSourceResponse when you use response_class. No need to manually instantiate it.\nBroadcasting to Multiple Clients # A single asyncio.Queue won’t work for multiple clients-only one consumer receives each item. Instead, use a set of per-client queues:\n_subscribers: set[asyncio.Queue[str]] = set() This will allow multiple clients to receive events simultaneously, so long as the publisher pushes to the relevant queue. To determine which queue to push to, the publisher needs to know which client is interested in which event - this could be solved with mapping of queues on registration (by using a dict of queues for example). For this example we will stick to a globally broadcasted queue.\nHow It Works # Client Connection: Each client registers its own queue. 
Broadcasting: A /push endpoint fans out messages to all queues: @app.post(\u0026#34;/push\u0026#34;) async def push_event(request: Request, msg: Annotated[str, Query(...)]) -\u0026gt; None: ts = datetime.now(UTC).isoformat() ip = request.client.host if request.client else \u0026#34;unknown\u0026#34; port = request.client.port if request.client else -1 payload = f\u0026#34;[{ts}] [{ip}:{port}] {msg}\u0026#34; for q in _subscribers: await q.put(payload) SSE Handler: The handler adds/removes itself from the set: queue: asyncio.Queue[str] = asyncio.Queue() _subscribers.add(queue) try: payload = await queue.get() # Blocks until a push arrives yield ServerSentEvent(data=payload) # servers can close the queue by `return`ing some data finally: _subscribers.discard(queue) Note: queue.get() blocks naturally-no polling or sleeping is needed between events.\nDisconnect Detection # Always check for disconnections before blocking on queue.get():\nif await request.is_disconnected(): break Why? Without this, a disconnected client’s queue grows indefinitely, and cleanup only happens when the next push arrives. The try/finally block ensures the queue is always removed from _subscribers.\nCommon Pitfalls # Yielding a Response Object: Avoid yield EventSourceResponse(...). Instead, use the response_class + generator pattern. Sharing a Single Queue: Each queue.get() consumes the item-only one client receives it. Use per-client queues. Unreferenced Tasks: Store asyncio.create_task references to prevent garbage collection mid-execution. Use a set and add_done_callback for cleanup. 
Full Example # from fastapi import FastAPI, Request from fastapi.sse import EventSourceResponse, ServerSentEvent from typing import AsyncIterable from datetime import datetime, UTC from fastapi import Query import asyncio import logging logger = logging.getLogger(__name__) _subscribers: set[asyncio.Queue[str]] = set() app = FastAPI() @app.post(\u0026#34;/push\u0026#34;) async def push_event(request: Request, msg: str = Query(...)) -\u0026gt; None: ts = datetime.now(UTC).isoformat() ip = request.client.host if request.client else \u0026#34;unknown\u0026#34; port = request.client.port if request.client else -1 payload = f\u0026#34;[{ts}] [{ip}:{port}] {msg}\u0026#34; logger.info(f\u0026#34;push_event: {payload}\u0026#34;) for q in _subscribers: await q.put(payload) @app.get(\u0026#34;/sse\u0026#34;, response_class=EventSourceResponse) async def sse_stream(request: Request) -\u0026gt; AsyncIterable[ServerSentEvent]: logger.info(\u0026#34;sse_stream: connected\u0026#34;) queue: asyncio.Queue[str] = asyncio.Queue() _subscribers.add(queue) try: while True: if await request.is_disconnected(): break payload = await queue.get() logger.info(f\u0026#34;sse_stream: {payload}\u0026#34;) yield ServerSentEvent(data=payload) finally: _subscribers.discard(queue) logger.info(\u0026#34;sse_stream: disconnected\u0026#34;) Ready to implement SSE in your FastAPI project? This approach ensures scalability, clean disconnects, and efficient broadcasting. Try changing the source of events pushing to the queue, or try building a selective event filtering mechanism for each queue. 
Below is an example of a subscriber-based SSE approach:\nimport asyncio import uuid from abc import abstractmethod from collections import defaultdict from collections.abc import AsyncGenerator, AsyncIterable from typing import Annotated from fastapi import Body, FastAPI, Request from fastapi.sse import EventSourceResponse, ServerSentEvent from pydantic import BaseModel, Field class Notification(BaseModel): id: uuid.UUID | None = Field(default_factory=uuid.uuid4) data_1: str value: float class EventService[K, T]: def __init__(self): self._lock = asyncio.Lock() @abstractmethod async def pub(self, topic: K, data: T): ... @abstractmethod def sub(self, topic: K) -\u0026gt; AsyncGenerator[T]: ... class MemoryEventService[K, T](EventService[K, T]): def __init__(self): self._topics: dict[K, set[asyncio.Queue[T]]] = defaultdict(set) super().__init__() async def pub(self, topic: K, data: T): qs = list(self._topics[topic]) await asyncio.gather(*[q.put(data) for q in qs]) async def sub(self, topic: K) -\u0026gt; AsyncGenerator[T]: q = asyncio.Queue[T]() self._topics[topic].add(q) try: while True: data = await q.get() yield data finally: async with self._lock: self._topics[topic].remove(q) NOTIF_SERVICE = MemoryEventService[str, Notification]() app = FastAPI() @app.post(\u0026#34;/send_data/{topic}\u0026#34;) async def send_data(topic: str, notification: Annotated[Notification, Body()]): await NOTIF_SERVICE.pub(topic, notification) return notification @app.get(\u0026#34;/topics\u0026#34;) async def get_sse_topics() -\u0026gt; list[str]: return list(NOTIF_SERVICE._topics.keys()) @app.get(\u0026#34;/sse/{topic}\u0026#34;, response_class=EventSourceResponse) async def get_room_notifications( request: Request, topic: str, ) -\u0026gt; AsyncIterable[ServerSentEvent]: \u0026#34;\u0026#34;\u0026#34; A SSE Endpoint which listens to generation job events for a room and notifies on any changes. 
\u0026#34;\u0026#34;\u0026#34; # Validate room access BEFORE returning the EventSourceResponse q = NOTIF_SERVICE.sub(topic) async for event in q: if await request.is_disconnected(): break yield ServerSentEvent(data=event) if __name__ == \u0026#34;__main__\u0026#34;: import uvicorn uvicorn.run(app, host=\u0026#34;0.0.0.0\u0026#34;, port=8100) ","date":"14 March 2026","externalUrl":null,"permalink":"/blog/sse-server-sent-events-in-fastapi/","section":"Blog","summary":"","title":"Implementing Server-Sent Events (SSE) in FastAPI","type":"blog"},{"content":"Optimising Python code without profiling is like navigating a maze blindfolded - you might get lucky, but you’ll probably waste time.\nProfiling is the process of measuring how your code performs, whether it\u0026rsquo;s tracking execution time or memory usage. Without it, you’re guessing where the bottlenecks are, and guesses are often wrong. In this post, we’ll explore how to profile Python code for both time and memory usage, interpret the results, and use that data to make your code faster and more efficient.\nWhy Profile Python Code? # Profiling helps you answer critical questions about your code’s performance:\nIs your function slow because it’s doing too much work, or because it’s calling an inefficient library? Is your code using too much memory, and if so, where is that memory being allocated? Are there hidden inefficiencies in your algorithms or data structures? By profiling your code, you can focus your optimisation efforts where they matter most-saving time and frustration.\nBuilt-in Python Profiling Tools # Python provides two powerful built-in tools for profiling:\ncProfile: Measures execution time and function call statistics. tracemalloc: Tracks memory allocations and identifies memory leaks. 
Let’s combine these tools into a reusable context manager that profiles both time and memory in a single run.\nThe profile_code Context Manager # Here’s a context manager that profiles execution time and memory usage:\nimport cProfile import io import linecache import pstats import tracemalloc from contextlib import contextmanager from textwrap import dedent from typing import Literal @contextmanager def profile_code(include: tuple[Literal[\u0026#34;time\u0026#34;, \u0026#34;memory\u0026#34;], ...] = (\u0026#34;time\u0026#34;, \u0026#34;memory\u0026#34;)): \u0026#34;\u0026#34;\u0026#34; Profile execution time using cProfile and tracemalloc. Args: include: A tuple of strings specifying what to profile (\u0026#34;time\u0026#34;, \u0026#34;memory\u0026#34;, or both). \u0026#34;\u0026#34;\u0026#34; print(\u0026#34;=\u0026#34; * 60) print(f\u0026#34;{\u0026#39; \u0026amp; \u0026#39;.join(include).upper()} PROFILING\u0026#34;) print(\u0026#34;-\u0026#34; * 60) # Create profiler profiler = cProfile.Profile() # Start profiling if \u0026#34;memory\u0026#34; in include: tracemalloc.start() if \u0026#34;time\u0026#34; in include: profiler.enable() yield # Stop profiling and print results if \u0026#34;time\u0026#34; in include: profiler.disable() # Get execution time statistics string_io = io.StringIO() stats = pstats.Stats(profiler, stream=string_io) _ = stats.strip_dirs() _ = stats.sort_stats(\u0026#34;cumtime\u0026#34;) _ = stats.print_stats(20) print(dedent(string_io.getvalue()).strip()) if \u0026#34;memory\u0026#34; in include: # Get memory statistics current, peak = tracemalloc.get_traced_memory() print(f\u0026#34;Current memory usage: {current / 1024 / 1024:.2f} MB\u0026#34;) print(f\u0026#34;Peak memory usage: {peak / 1024 / 1024:.2f} MB\\n\u0026#34;) # Get top memory allocations snapshot = tracemalloc.take_snapshot() tracemalloc.stop() snapshot = snapshot.filter_traces( ( tracemalloc.Filter(False, \u0026#34;\u0026lt;frozen importlib._bootstrap\u0026gt;\u0026#34;), 
tracemalloc.Filter(False, \u0026#34;\u0026lt;unknown\u0026gt;\u0026#34;), ) ) top_stats = snapshot.statistics(\u0026#34;lineno\u0026#34;) print(\u0026#34;Top 10 memory-consuming lines:\u0026#34;) for index, stat in enumerate(top_stats[:10], 1): frame = stat.traceback[0] print( f\u0026#34;#{index}: {frame.filename}:{frame.lineno}: {stat.size / 1024:.1f} KiB\u0026#34; ) line = linecache.getline(frame.filename, frame.lineno).strip() if line: print(f\u0026#34; {line}\u0026#34;) other = top_stats[10:] if other: size = sum(stat.size for stat in other) print(f\u0026#34;{len(other)} other lines: {size / 1024:.1f} KiB\u0026#34;) total = sum(stat.size for stat in top_stats) print(f\u0026#34;Total allocated size: {total / 1024:.1f} KiB\u0026#34;) How It Works # The profile_code context manager works as follows:\nStart Profiling: When you enter the context manager, it initialises cProfile for time profiling and tracemalloc for memory profiling, based on the include parameter. Execute Your Code: The code inside the with block runs while being profiled. Stop Profiling and Print Results: When you exit the context manager, it stops profiling and prints: Execution time statistics (top 20 functions by total time). Memory usage statistics (current and peak memory usage, top 10 memory-consuming lines). 
Example Usage # Let’s use the profile_code context manager to profile a slow function:\nimport time def slow_function(duration): \u0026#34;\u0026#34;\u0026#34;A function that simulates a time-consuming task.\u0026#34;\u0026#34;\u0026#34; print(f\u0026#34;Running slow_function for {duration} seconds...\u0026#34;) time.sleep(duration) print(\u0026#34;slow_function finished.\u0026#34;) def fast_function(): \u0026#34;\u0026#34;\u0026#34;A function that performs a quick task.\u0026#34;\u0026#34;\u0026#34; print(\u0026#34;Running fast_function...\u0026#34;) total = 0 for i in range(10000): total += i print(\u0026#34;fast_function finished.\u0026#34;) def process_data(): \u0026#34;\u0026#34;\u0026#34;A function that calls other functions.\u0026#34;\u0026#34;\u0026#34; print(\u0026#34;Starting data processing...\u0026#34;) slow_function(2) for _ in range(3): fast_function() print(\u0026#34;Data processing finished.\u0026#34;) if __name__ == \u0026#39;__main__\u0026#39;: with profile_code(): process_data() This will output:\nThe top 20 functions by cumulative execution time. The current and peak memory usage. The top 10 lines where memory is allocated. Profiling Results # ============================================================ TIME \u0026amp; MEMORY PROFILING ------------------------------------------------------------ Starting data processing... Running slow_function for 2 seconds... slow_function finished. Running fast_function... fast_function finished. Running fast_function... fast_function finished. Running fast_function... fast_function finished. Data processing finished. 
20 function calls in 2.038 seconds Ordered by: cumulative time ncalls tottime percall cumtime percall filename:lineno(function) 1 0.000 0.000 2.038 2.038 test.py:103(process_data) 1 0.000 0.000 2.005 2.005 test.py:87(slow_function) 1 2.005 2.005 2.005 2.005 {built-in method time.sleep} 3 0.032 0.011 0.033 0.011 test.py:94(fast_function) 10 0.000 0.000 0.000 0.000 {built-in method builtins.print} 1 0.000 0.000 0.000 0.000 contextlib.py:145(__exit__) 1 0.000 0.000 0.000 0.000 {built-in method builtins.next} 1 0.000 0.000 0.000 0.000 test.py:12(profile_code) 1 0.000 0.000 0.000 0.000 {method \u0026#39;disable\u0026#39; of \u0026#39;_lsprof.Profiler\u0026#39; objects} Current memory usage: 0.02 MB Peak memory usage: 0.02 MB Top 10 memory-consuming lines: #1: /opt/homebrew/Cellar/python@3.13/3.13.11/Frameworks/Python.framework/Versions/3.13/lib/python3.13/pstats.py:230: 2.1 KiB fragment = fragment[:-1] #2: /opt/homebrew/Cellar/python@3.13/3.13.11/Frameworks/Python.framework/Versions/3.13/lib/python3.13/pstats.py:229: 1.5 KiB dict[fragment] = tup #3: /Users/toby/dev/projects/tobydevlin.com-3.0/test.py:32: 1.4 KiB profiler.enable() #4: /opt/homebrew/Cellar/python@3.13/3.13.11/Frameworks/Python.framework/Versions/3.13/lib/python3.13/pstats.py:264: 1.1 KiB stats_list.append((cc, nc, tt, ct) + func + #5: /opt/homebrew/Cellar/python@3.13/3.13.11/Frameworks/Python.framework/Versions/3.13/lib/python3.13/pstats.py:547: 1.1 KiB return os.path.basename(filename), line, name #6: /opt/homebrew/Cellar/python@3.13/3.13.11/Frameworks/Python.framework/Versions/3.13/lib/python3.13/pstats.py:289: 1.1 KiB newcallers[func_strip_path(func2)] = caller #7: /Users/toby/dev/projects/tobydevlin.com-3.0/test.py:46: 1.1 KiB print(dedent(string_io.getvalue()).strip()) #8: /opt/homebrew/Cellar/python@3.13/3.13.11/Frameworks/Python.framework/Versions/3.13/lib/python3.13/pstats.py:296: 1.0 KiB newstats[newfunc] = (cc, nc, tt, ct, newcallers) #9: /Users/toby/dev/projects/tobydevlin.com-3.0/test.py:38: 
0.9 KiB profiler.disable() #10: /opt/homebrew/Cellar/python@3.13/3.13.11/Frameworks/Python.framework/Versions/3.13/lib/python3.13/cProfile.py:59: 0.8 KiB entries = self.getstats() 80 other lines: 9.2 KiB Total allocated size: 21.4 KiB The most important section is the table sorted by cumtime (cumulative time). This column shows the total time spent in a function, including all the functions it calls. It\u0026rsquo;s the best indicator of where your program is spending the most time overall.\nTop Bottleneck: Look at the first few lines. You can see that process_data is at the top, but the real workhorse of time consumption is slow_function, which directly calls time.sleep. The cumtime of 2.005 seconds for slow_function is almost entirely spent in the time.sleep call.\nFunction Calls: The ncalls column tells you how many times a function was called. fast_function was called 3 times, but its total time is negligible compared to slow_function.\nThe memory usage for this script is very low (0.02 MB). The \u0026ldquo;Top 10 memory-consuming lines\u0026rdquo; are mostly showing memory used by the profiler itself (pstats.py, cProfile.py), not the actual code. For this particular run, memory is not a concern.\nKey Takeaways # In this post, we explored how to profile Python code for time and memory usage. Here’s what you should remember:\nProfiling is essential: It helps you identify bottlenecks in your code so you can optimise the right parts. Use cProfile for time profiling: It measures execution time and function call statistics. Use tracemalloc for memory profiling: It tracks memory allocations and identifies memory leaks. The profile_code context manager: Combines both tools into a reusable utility for profiling time and memory in a single run. Always profile before optimising: Don’t guess where the bottlenecks are-let the data guide you. Try It Yourself! 
# Now that you know how to profile your Python code, it’s time to put it into practice:\nProfile your own code: Use the profile_code context manager above to identify bottlenecks in your projects. Experiment with optimisations: Try different approaches (e.g., list comprehensions, generator expressions) and measure the impact. Share your results: Let me know in the comments what you discovered-did profiling reveal any surprises? Happy profiling!\nResources # Python cProfile Documentation Python tracemalloc Documentation PEP 8: Style Guide for Python Code ","date":"2 February 2026","externalUrl":null,"permalink":"/blog/python-profiling-for-performance/","section":"Blog","summary":"","title":"How to Profile and Speed Up Your Python Code","type":"blog"},{"content":"","date":"2 February 2026","externalUrl":null,"permalink":"/tags/optimisation/","section":"Tags","summary":"","title":"Optimisation","type":"tags"},{"content":"","date":"2 February 2026","externalUrl":null,"permalink":"/tags/performance/","section":"Tags","summary":"","title":"Performance","type":"tags"},{"content":"","date":"2 February 2026","externalUrl":null,"permalink":"/series/python-standard-library/","section":"Series","summary":"","title":"Python Standard Library","type":"series"},{"content":"","date":"2 February 2026","externalUrl":null,"permalink":"/series/","section":"Series","summary":"","title":"Series","type":"series"},{"content":"","date":"11 January 2026","externalUrl":null,"permalink":"/tags/cloud/","section":"Tags","summary":"","title":"Cloud","type":"tags"},{"content":"TLDR: Copy and paste the code in the appendix.\nEver wanted to know how to automate directions for long road trips or work out how long it would take to bike from all your property viewings to the station? Well, Google Sheets has a built-in tool that lets you write code to help with that! 
(Seriously, Google, this would be mad powerful if it was built in.) Using Apps Script you can write functions that call the Maps API and return the distance, time and directions between two points.\nGoogle Sheets is surprisingly flexible, but it doesn\u0026rsquo;t have built-in \u0026ldquo;Map\u0026rdquo; formulas. However, by using Apps Script, we can create our own custom functions that talk directly to the Google Maps API. Whether you\u0026rsquo;re planning a massive road trip or calculating commute times for property viewings, this script will do the heavy lifting for you. The thing is this requires a bit of code\u0026hellip;\nSet up Apps Script # To get started, you don\u0026rsquo;t need to install anything.\nOpen your Google Sheet. Navigate to Extensions -\u0026gt; Apps Script. Delete any code in the editor and prepare to paste our solution. Set up base functions # The code below exposes the getDirections function, which is the main tool to use. We start by building a caching layer. Why? Because Google has daily quotas on API calls. If you have 500 rows and your sheet recalculates every time you edit a cell, you’ll hit those limits in minutes. Caching stores the result for 6 hours, making your sheet lightning-fast and quota-friendly.\n/** * GOOGLE MAPS CUSTOM FUNCTIONS FOR SHEETS by Toby Devlin * Includes Caching, Direction API logic, and Custom Formulas. */ // ============================================================= // SECTION 1: CACHING UTILITIES // ============================================================= // Handles storing and retrieving Maps data to avoid hitting API limits. 
const getCache = (key) =\u0026gt; { console.log(\u0026#34;cache lookup key: \u0026#34; + key); // https://developers.google.com/apps-script/reference/cache/cache?hl=en#get(String) const data = CacheService.getDocumentCache().get(key); if (data !== null) { console.log(\u0026#34;cache hit\u0026#34;); const value = JSON.parse(data); return value; } else { console.log(\u0026#34;cache miss\u0026#34;); return null; } }; const setCache = (key, value) =\u0026gt; { const data = JSON.stringify(value); console.log(\u0026#34;cache put key: \u0026#34; + key + \u0026#34; data: \u0026#34; + data); // 6 hours const expirationInSeconds = 6 * 60 * 60; // https://developers.google.com/apps-script/reference/cache/cache?hl=en#put(String,String,Integer) CacheService.getDocumentCache().put(key, data, expirationInSeconds); }; // ============================================================= // SECTION 2: CORE API LOGIC // ============================================================= // The engine that communicates with the Google Maps Direction Service. const getDirections = (origin, destination, mode = \u0026#34;driving\u0026#34;) =\u0026gt; { const key = [origin, destination, mode].join(\u0026#34;:\u0026#34;); // Is result in the internal cache? const cache_val = getCache(key); if (cache_val !== null) { return cache_val; } // docs: https://developers.google.com/apps-script/reference/maps/direction-finder const { routes: [data] = [] } = Maps.newDirectionFinder() .setOrigin(origin) .setDestination(destination) .setMode(mode) .getDirections(); if (!data) { throw new Error(\u0026#34;No route found!\u0026#34;); } setCache(key, data); return data; }; After saving this we have the hot-loop function to call in every cell. 
The cache keeps things fast when re-calling the function with the same keys, and the API returns some rather hefty data - we will use this to get the distance and duration of the route in the derived transforms below.\nTransforms for Usability # Now that the engine is running, we need to expose it to the spreadsheet. We create \u0026ldquo;Custom Functions\u0026rdquo; that Google Sheets recognises.\n// ============================================================= // SECTION 3: PRIMARY CUSTOM FUNCTIONS // ============================================================= // Generic functions used directly in Google Sheets. /** * Calculate the distance between two locations. * @param {\u0026#34;Sydney, AUS\u0026#34;} origin The starting address or location. * @param {\u0026#34;Melbourne, AUS\u0026#34;} destination The ending address or location. * @param {\u0026#34;driving\u0026#34;} mode [OPTIONAL] The travel mode: \u0026#39;driving\u0026#39; (default), \u0026#39;walking\u0026#39;, \u0026#39;bicycling\u0026#39;, or \u0026#39;transit\u0026#39;. * @return {number} The distance in meters. * @customfunction */ const GOOGLEMAPS_DISTANCE = (origin, destination, mode = \u0026#34;driving\u0026#34;) =\u0026gt; { if (!(origin \u0026amp;\u0026amp; destination)) return; if (origin === destination) return 0; const data = getDirections(origin, destination, mode); const { legs: [{ distance: { value: distance_m } } = {}] = [] } = data; return distance_m; }; /** * Calculates the travel time between two locations as a formatted string. * * @param {\u0026#34;San Francisco, CA\u0026#34;} origin The starting address or location. * @param {\u0026#34;Los Angeles, CA\u0026#34;} destination The ending address or location. * @param {\u0026#34;driving\u0026#34;} mode [OPTIONAL] The travel mode: \u0026#39;driving\u0026#39; (default), \u0026#39;walking\u0026#39;, \u0026#39;bicycling\u0026#39;, or \u0026#39;transit\u0026#39;. 
* @return {string} The travel time (e.g., \u0026#34;1 hour 20 mins\u0026#34;). * @customfunction */ const GOOGLEMAPS_DURATION = (origin, destination, mode = \u0026#34;driving\u0026#34;) =\u0026gt; { if (!(origin \u0026amp;\u0026amp; destination)) return; if (origin === destination) return \u0026#34;0 hours 0 Mins\u0026#34;; const data = getDirections(origin, destination, mode); const { legs: [{ duration: { text: time } } = {}] = [] } = data; return time; }; /** * Calculates the travel time between two locations in raw seconds. * Useful for mathematical calculations or sorting. * * @param {\u0026#34;London, UK\u0026#34;} origin The starting address or location. * @param {\u0026#34;Paris, France\u0026#34;} destination The ending address or location. * @param {\u0026#34;driving\u0026#34;} mode [OPTIONAL] The travel mode: \u0026#39;driving\u0026#39; (default), \u0026#39;walking\u0026#39;, \u0026#39;bicycling\u0026#39;, or \u0026#39;transit\u0026#39;. * @return {number} The travel duration in total seconds. * @customfunction */ const GOOGLEMAPS_DURATION_SECONDS = (origin, destination, mode = \u0026#34;driving\u0026#34;) =\u0026gt; { if (!(origin \u0026amp;\u0026amp; destination)) return; if (origin === destination) return 0; const data = getDirections(origin, destination, mode); const { legs: [{ duration: { value: total_sec } } = {}] = [] } = data; return total_sec; }; After saving this code you\u0026rsquo;ll have these functions in the spreadsheet, and can use them like any built-in function, e.g. =GOOGLEMAPS_DURATION_SECONDS(...) in a cell. There are some much more useful partials we can build on top of this, however.\nPartials for verbosity # While =GOOGLEMAPS_DISTANCE(A1, B1, \u0026ldquo;walking\u0026rdquo;) works perfectly, it\u0026rsquo;s a bit of a mouthful. We can write \u0026ldquo;wrapper\u0026rdquo; functions to make our spreadsheet formulas much more descriptive. 
This section introduces partial functions which handle the mode and units for you.\n// ============================================================= // SECTION 4: MODE-SPECIFIC SHORTCUTS (DISTANCE) // ============================================================= /** * Calculates driving distance in meters. * @param {\u0026#34;Chicago, IL\u0026#34;} origin Starting point (e.g., Route 66 start). * @param {\u0026#34;Santa Monica, CA\u0026#34;} destination Ending point (e.g., Route 66 end). * @return {number} Distance in meters. * @customfunction */ const DRIVING_DISTANCE = (origin, destination) =\u0026gt; GOOGLEMAPS_DISTANCE(origin, destination, \u0026#34;driving\u0026#34;); /** * Calculates walking distance in meters. * @param {\u0026#34;The Louvre, Paris\u0026#34;} origin Starting point. * @param {\u0026#34;Eiffel Tower, Paris\u0026#34;} destination Ending point. * @return {number} Distance in meters. * @customfunction */ const WALKING_DISTANCE = (origin, destination) =\u0026gt; GOOGLEMAPS_DISTANCE(origin, destination, \u0026#34;walking\u0026#34;); ... Now you can use the following functions in your spreadsheet:\nBy bridging Google Sheets and Maps via Apps Script, we’ve transformed a static table into a dynamic logistics tool. We’ve also implemented caching to ensure our project remains performant and stays within Google\u0026rsquo;s free usage tiers.\nTry pasting the code below into your next project for full coverage! If you run into any \u0026ldquo;No route found\u0026rdquo; errors, double-check that your addresses are formatted correctly for Google Maps.\nAppendix # TLDR: Copy this code into your Apps Script project to get started ASAP.\n/** * GOOGLE MAPS CUSTOM FUNCTIONS FOR SHEETS by Toby Devlin * https://tobydevlin.com/blog/google-maps-in-google-sheets/ * Includes Caching, Direction API logic, and Custom Formulas. 
*/ // ============================================================= // SECTION 1: CACHING UTILITIES // ============================================================= // Handles storing and retrieving Maps data to avoid hitting API limits. const getCache = (key) =\u0026gt; { console.log(\u0026#34;cache lookup key: \u0026#34; + key); const data = CacheService.getDocumentCache().get(key); if (data !== null) { console.log(\u0026#34;cache hit\u0026#34;); const value = JSON.parse(data); return value; } else { console.log(\u0026#34;cache miss\u0026#34;); return null; } }; const setCache = (key, value) =\u0026gt; { const data = JSON.stringify(value); console.log(\u0026#34;cache put key: \u0026#34; + key + \u0026#34; data: \u0026#34; + data); // Store the results for 6 hours (21600 seconds, the maximum // expiration CacheService allows). const expirationInSeconds = 6 * 60 * 60; CacheService.getDocumentCache().put(key, data, expirationInSeconds); }; // ============================================================= // SECTION 2: CORE API LOGIC // ============================================================= // The engine that communicates with the Google Maps Direction Service. const getDirections = (origin, destination, mode = \u0026#39;driving\u0026#39;) =\u0026gt; { const key = [origin, destination, mode].join(\u0026#34;:\u0026#34;); // Is result in the internal cache? 
const cache_val = getCache(key); if (cache_val !== null) { return cache_val; } // docs: https://developers.google.com/apps-script/reference/maps/direction-finder const { routes: [data] = [] } = Maps.newDirectionFinder() .setOrigin(origin) .setDestination(destination) .setMode(mode) .getDirections(); if (!data) { throw new Error(\u0026#39;No route found!\u0026#39;); } setCache(key, data); return data; }; // ============================================================= // SECTION 3: PRIMARY CUSTOM FUNCTIONS // ============================================================= // Generic functions used directly in Google Sheets. /** * Calculate the distance between two locations. * @param {\u0026#34;New York, NY\u0026#34;} origin The starting address or location. * @param {\u0026#34;Boston, MA\u0026#34;} destination The ending address or location. * @param {\u0026#34;driving\u0026#34;} mode [OPTIONAL] The travel mode: \u0026#39;driving\u0026#39; (default), \u0026#39;walking\u0026#39;, \u0026#39;bicycling\u0026#39;, or \u0026#39;transit\u0026#39;. * @return {number} The distance in meters. * @customfunction */ const GOOGLEMAPS_DISTANCE = (origin, destination, mode = \u0026#39;driving\u0026#39;) =\u0026gt; { if (!(origin \u0026amp;\u0026amp; destination)) return; if (origin === destination) return 0; const data = getDirections(origin, destination, mode); const { legs: [{ distance: { value: distance_m } } = {}] = [] } = data; return distance_m; }; /** * Calculates the travel time between two locations as a formatted string. * * @param {\u0026#34;San Francisco, CA\u0026#34;} origin The starting address or location. * @param {\u0026#34;Los Angeles, CA\u0026#34;} destination The ending address or location. * @param {\u0026#34;driving\u0026#34;} mode [OPTIONAL] The travel mode: \u0026#39;driving\u0026#39; (default), \u0026#39;walking\u0026#39;, \u0026#39;bicycling\u0026#39;, or \u0026#39;transit\u0026#39;. 
* @return {string} The travel time (e.g., \u0026#34;1 hour 20 mins\u0026#34;). * @customfunction */ const GOOGLEMAPS_DURATION = (origin, destination, mode = \u0026#34;driving\u0026#34;) =\u0026gt; { if (!(origin \u0026amp;\u0026amp; destination)) return; if (origin === destination) return \u0026#34;0 hours 0 Mins\u0026#34;; const data = getDirections(origin, destination, mode); const { legs: [{ duration: { text: time } } = {}] = [] } = data; return time; }; /** * Calculates the travel time between two locations in raw seconds. * Useful for mathematical calculations or sorting. * * @param {\u0026#34;London, UK\u0026#34;} origin The starting address or location. * @param {\u0026#34;Paris, France\u0026#34;} destination The ending address or location. * @param {\u0026#34;driving\u0026#34;} mode [OPTIONAL] The travel mode: \u0026#39;driving\u0026#39; (default), \u0026#39;walking\u0026#39;, \u0026#39;bicycling\u0026#39;, or \u0026#39;transit\u0026#39;. * @return {number} The travel duration in total seconds. * @customfunction */ const GOOGLEMAPS_DURATION_SECONDS = (origin, destination, mode = \u0026#34;driving\u0026#34;) =\u0026gt; { if (!(origin \u0026amp;\u0026amp; destination)) return; if (origin === destination) return 0; const data = getDirections(origin, destination, mode); const { legs: [{ duration: { value: total_sec } } = {}] = [] } = data; return total_sec; }; // ============================================================= // SECTION 4: MODE-SPECIFIC SHORTCUTS (DISTANCE) // ============================================================= /** * Calculates driving distance in meters. * @param {\u0026#34;Chicago, IL\u0026#34;} origin Starting point (e.g., Route 66 start). * @param {\u0026#34;Santa Monica, CA\u0026#34;} destination Ending point (e.g., Route 66 end). * @return {number} Distance in meters. 
* @customfunction */ const DRIVING_DISTANCE = (origin, destination) =\u0026gt; GOOGLEMAPS_DISTANCE(origin, destination, \u0026#34;driving\u0026#34;); /** * Calculates walking distance in meters. * @param {\u0026#34;The Louvre, Paris\u0026#34;} origin Starting point. * @param {\u0026#34;Eiffel Tower, Paris\u0026#34;} destination Ending point. * @return {number} Distance in meters. * @customfunction */ const WALKING_DISTANCE = (origin, destination) =\u0026gt; GOOGLEMAPS_DISTANCE(origin, destination, \u0026#34;walking\u0026#34;); /** * Calculates bicycling distance in meters. * @param {\u0026#34;Vondelpark, Amsterdam\u0026#34;} origin Starting point. * @param {\u0026#34;Utrecht Central Station\u0026#34;} destination Ending point. * @return {number} Distance in meters. * @customfunction */ const BICYCLING_DISTANCE = (origin, destination) =\u0026gt; GOOGLEMAPS_DISTANCE(origin, destination, \u0026#34;bicycling\u0026#34;); /** * Calculates transit distance in meters. * @param {\u0026#34;Times Square, NY\u0026#34;} origin Starting point. * @param {\u0026#34;Yankee Stadium, Bronx\u0026#34;} destination Ending point. * @return {number} Distance in meters. * @customfunction */ const TRANSIT_DISTANCE = (origin, destination) =\u0026gt; GOOGLEMAPS_DISTANCE(origin, destination, \u0026#34;transit\u0026#34;); // ============================================================= // SECTION 5: MODE-SPECIFIC SHORTCUTS (DURATION STRING) // ============================================================= /** * Calculates driving time (e.g., \u0026#34;3 days 4 hours\u0026#34;). * @param {\u0026#34;Seattle, WA\u0026#34;} origin Starting point. * @param {\u0026#34;Miami, FL\u0026#34;} destination Ending point. * @return {string} Formatted duration. * @customfunction */ const DRIVING_DURATION = (origin, destination) =\u0026gt; GOOGLEMAPS_DURATION(origin, destination, \u0026#34;driving\u0026#34;); /** * Calculates walking time (e.g., \u0026#34;45 mins\u0026#34;). 
* @param {\u0026#34;Grand Canyon South Rim\u0026#34;} origin Starting point. * @param {\u0026#34;Grand Canyon North Rim\u0026#34;} destination Ending point. * @return {string} Formatted duration. * @customfunction */ const WALKING_DURATION = (origin, destination) =\u0026gt; GOOGLEMAPS_DURATION(origin, destination, \u0026#34;walking\u0026#34;); /** * Calculates bicycling time (e.g., \u0026#34;15 mins\u0026#34;). * @param {\u0026#34;Brooklyn Bridge\u0026#34;} origin Starting point. * @param {\u0026#34;Central Park, NY\u0026#34;} destination Ending point. * @return {string} Formatted duration. * @customfunction */ const BICYCLING_DURATION = (origin, destination) =\u0026gt; GOOGLEMAPS_DURATION(origin, destination, \u0026#34;bicycling\u0026#34;); /** * Calculates transit time (e.g., \u0026#34;1 hour 10 mins\u0026#34;). * @param {\u0026#34;Heathrow Airport\u0026#34;} origin Starting point. * @param {\u0026#34;Buckingham Palace\u0026#34;} destination Ending point. * @return {string} Formatted duration. * @customfunction */ const TRANSIT_DURATION = (origin, destination) =\u0026gt; GOOGLEMAPS_DURATION(origin, destination, \u0026#34;transit\u0026#34;); // ============================================================= // SECTION 6: MODE-SPECIFIC SHORTCUTS (DURATION SECONDS) // ============================================================= /** * Calculates driving time in total seconds. * @param {\u0026#34;Las Vegas Strip\u0026#34;} origin Starting point. * @param {\u0026#34;Area 51, NV\u0026#34;} destination Ending point. * @return {number} Duration in seconds. * @customfunction */ const DRIVING_DURATION_SECONDS = (origin, destination) =\u0026gt; GOOGLEMAPS_DURATION_SECONDS(origin, destination, \u0026#34;driving\u0026#34;); /** * Calculates walking time in total seconds. * @param {\u0026#34;The Great Pyramids of Giza\u0026#34;} origin Starting point. * @param {\u0026#34;The Great Sphinx\u0026#34;} destination Ending point. * @return {number} Duration in seconds. 
* @customfunction */ const WALKING_DURATION_SECONDS = (origin, destination) =\u0026gt; GOOGLEMAPS_DURATION_SECONDS(origin, destination, \u0026#34;walking\u0026#34;); /** * Calculates bicycling time in total seconds. * @param {\u0026#34;Copenhagen, Denmark\u0026#34;} origin Starting point. * @param {\u0026#34;Malmö, Sweden\u0026#34;} destination Ending point. * @return {number} Duration in seconds. * @customfunction */ const BICYCLING_DURATION_SECONDS = (origin, destination) =\u0026gt; GOOGLEMAPS_DURATION_SECONDS(origin, destination, \u0026#34;bicycling\u0026#34;); /** * Calculates transit time in total seconds. * @param {\u0026#34;Sydney Opera House\u0026#34;} origin Starting point. * @param {\u0026#34;Bondi Beach\u0026#34;} destination Ending point. * @return {number} Duration in seconds. * @customfunction */ const TRANSIT_DURATION_SECONDS = (origin, destination) =\u0026gt; GOOGLEMAPS_DURATION_SECONDS(origin, destination, \u0026#34;transit\u0026#34;); ","date":"11 January 2026","externalUrl":null,"permalink":"/blog/google-maps-in-google-sheets/","section":"Blog","summary":"","title":"How to use Maps in Google Sheets","type":"blog"},{"content":"","date":"2 January 2026","externalUrl":null,"permalink":"/tags/aws/","section":"Tags","summary":"","title":"Aws","type":"tags"},{"content":"","date":"2 January 2026","externalUrl":null,"permalink":"/tags/devops/","section":"Tags","summary":"","title":"Devops","type":"tags"},{"content":"Need a complete inventory of all AWS resources in your account? AWS doesn’t provide a built-in way to view all resources across all regions at once, but this script makes it easy.\nHow to Run the Script # 1. Save the Script # Copy the script below and save it as list_aws_resources.py:\n2. Install Dependencies # Install Boto3 manually with pip install boto3, probably in a virtualenv.\nOr use the script’s built-in dependency installer, if you\u0026rsquo;re using pdm, uv or pipx:\nuv run list_aws_resources.py 3. 
Run the Script # Basic Usage # List all resources in your default AWS profile:\nuv run list_aws_resources.py Using a Specific AWS Profile # Specify a profile with the --profile flag:\nuv run list_aws_resources.py --profile my-profile-name Example Output # The script prints the ARN of every resource, organized by region:\nUsing profile: \u0026#39;default\u0026#39; Using account id: \u0026#39;1234567890\u0026#39; Using role/user: \u0026#39;arn:aws:sts::1234567890:assumed-role/AWSReservedSSO_AdministratorAccess_abcdefg12345/admin\u0026#39; ---------- af-south-1 Could not connect to region with error: An error occurred (UnrecognizedClientException) when calling the GetResources operation: The security token included in the request is invalid eu-central-1 [] eu-west-1 [\u0026#39;arn:aws:acm:eu-west-1:1234567890:certificate/eoks8513-f930-4e56-8021-8famla432957\u0026#39;, \u0026#39;arn:aws:dynamodb:eu-west-1:1234567890:table/users\u0026#39;, \u0026#39;arn:aws:dynamodb:eu-west-1:1234567890:table/application-data\u0026#39;, \u0026#39;arn:aws:dynamodb:eu-west-1:1234567890:table/example-table\u0026#39;, ... ] ... What the Script Does # Authenticates with AWS using your default or specified profile. Fetches all AWS regions available to your account. Queries the Resource Groups Tagging API in each region to list all resources. Prints the ARN of each resource to the console. Key Notes # Only lists resources supported by the Resource Groups Tagging API. Some services (like IAM) may not appear. See AWS’s documentation for details. Skips regions you don’t have access to. Does not modify or delete any resources. Common Use Cases # Auditing Your Account - Identify unused or orphaned resources (e.g., old S3 buckets, unused EC2 instances). Debugging Issues - Locate resources when you know the name but not the region. Migrating Resources - Verify all resources in a region/account before or after migration. 
Troubleshooting # \u0026quot;User is not authorized to perform: tag:GetResources\u0026quot; - Attach the ResourceGroupsTaggingAPIReadOnlyAccess policy to your IAM user/role. \u0026quot;Could not connect to region\u0026quot; - Your account may not have access to that region. The script will skip it and continue. No Resources Found - Double-check your AWS profile and permissions for the user/role. Verify the Resource Groups Tagging API supports the services you’re using. Next Steps # Export results to a file: python list_aws_resources.py \u0026gt; aws_resources.txt Filter results by resource type (e.g., only S3 buckets). Automate audits by scheduling the script (e.g., using cron or AWS Lambda). Conclusion # This script is a quick and easy way to list all AWS resources in your account. Use it for audits, debugging, or migrations-no complex setup required!\nKey Takeaways # Uses the Resource Groups Tagging API to list resources across all regions. Run it with one command. Attach the ResourceGroupsTaggingAPIReadOnlyAccess policy if you hit permissions errors. Give it a try and see what you\u0026rsquo;re missing! ☁️\nNotes # This script makes use of the Python inline metadata format, which I have written another post about below.\nPython Inline Script Metadata Format 271 words\u0026middot;2 mins Python Code ","date":"2 January 2026","externalUrl":null,"permalink":"/blog/list-all-aws-resources/","section":"Blog","summary":"","title":"How to List All AWS Resources in an Account","type":"blog"},{"content":"Let\u0026rsquo;s face it: the default shell experience is\u0026hellip; fine. But we\u0026rsquo;re not here for fine, we\u0026rsquo;re here for ✨Pretty✨. 
We\u0026rsquo;re talking about a shell that\u0026rsquo;s as functional as you are (probably) and makes you feel like a proper movie hacker every time you open a terminal.\nHere\u0026rsquo;s what we\u0026rsquo;re gonna do: transform your terminal from \u0026ldquo;meh\u0026rdquo; to \u0026ldquo;yeah!\u0026rdquo; with some choice tools that\u0026rsquo;ll make your command line experience smooth, beautiful, and genuinely productive.\nWorks on my Machine™️ # This guide works with Linux-like platforms, including:\nWindows WSL (Windows Subsystem for Linux) Debian-based distros (Ubuntu, Pop!_OS, Mint, etc.) macOS Most of these tools are cross-platform, so you can maintain a consistent experience across all your machines. Managing your dotfiles cleanly is left as an exercise to the reader\u0026hellip;\n1. Fish 🐟 # Fish (Friendly Interactive SHell) comes with sensible defaults, autosuggestions out of the box, and syntax highlighting that actually helps you catch errors before you hit enter. No plugins needed-it just works.\nInstallation guide\nOnce installed, make it your default shell (remember to read the docs first):\nchsh -s $(which fish) Note: Fish syntax is slightly different from bash. Commands like export VAR=value become set -x VAR value. Don\u0026rsquo;t worry, the improved autocomplete makes this transition painless and many tools, such as direnv, will still work with the shell script syntax.\n2. Starship ⭐ # Starship is a minimal, blazing-fast, and infinitely customizable prompt that works with any shell. It shows you context-aware information like git status, language versions, and directory paths-all beautifully formatted.\nInstallation guide\nAdd to your Fish config (~/.config/fish/config.fish):\nstarship init fish | source Starship uses a config file at ~/.config/starship.toml. Check out the configuration docs to customize everything-colors, symbols, what information shows up and when.\n3. 
Atuin 🔍 # Atuin replaces your shell history with a SQLite database that\u0026rsquo;s searchable, syncable across machines, and context-aware. Never lose a command again.\nInstallation guide\nAdd to ~/.config/fish/config.fish:\natuin init fish | source Press Ctrl+R or hit up and start typing. Atuin will fuzzy-search through your entire command history with a beautiful TUI interface. You can filter by directory, exit code, time range, and sessions.\nWant history that follows you everywhere? Register an account and enable sync-your commands sync across all your machines.\n4. Terminal UIs # Terminal User Interfaces (TUIs) give you powerful, keyboard-driven tools without leaving the command line. Here are two essential ones that\u0026rsquo;ll transform how you work.\nLazyGit \u0026amp; LazyDocker 🌿🐳 # Jesse Duffield created both of these tools with the same philosophy: powerful terminal UIs that are keyboard-driven and intuitive. If you like one, you\u0026rsquo;ll love the other.\nLazyGit is a terminal UI for git commands that makes complex git workflows feel effortless.\nLazyGit installation guide\nNavigate to any git repo and run:\nlazygit You\u0026rsquo;ll see a gorgeous interface showing files, branches, commits, and stash. Use arrow keys to navigate, press ? for help. Press P to push, p to pull, c to commit - git operations that normally take 5 commands become a single keystroke.\n⚠️ Warning: These will make you realize how much time you\u0026rsquo;ve been wasting with raw git commands. Side effects may include: refusing to use git from the command line ever again, judging coworkers who still type git add ., and finally understanding what a rebase actually does.\nLazyDocker brings the same experience to Docker. 
Manage containers, images, volumes, and networks without memorising a thousand Docker commands.\nLazyDocker installation guide\nRun it with:\nlazydocker Navigate containers, view logs in real-time, inspect volumes, and prune images-all from a clean keyboard-driven interface.\nWarning: These will make you realise how much time you\u0026rsquo;ve been wasting with raw git and Docker commands.\nbtop 📊 # btop is like top and htop had a baby, and that baby went to design school. It\u0026rsquo;s beautiful, resource-friendly, and packed with information.\nInstallation guide\nRun it with:\nbtop You\u0026rsquo;ll see CPU usage per core with gorgeous graphs, memory usage, disk I/O, network traffic, and a sortable process list. Press Esc to quit, or just leave it running in a tmux pane because it looks cool.\nPro Tip: Leave btop running in a terminal when your manager walks by. Instant +50 to perceived productivity.\nYou\u0026rsquo;re Basically a 10x Developer Now # Hey presto, you\u0026rsquo;ve got yourself an l337 h4x0r terminal! 🎉 Hopefully, this will reduce your cognitive load - speeding up your workflow and making the terminal a place you actually want to be.\nTweak configs, explore additional plugins, and make this setup your own. The beauty of the command line is that it\u0026rsquo;s infinitely customizable - these tools just give you a solid foundation to build on.\n","date":"2 December 2025","externalUrl":null,"permalink":"/blog/l337-h4x0r-terminal/","section":"Blog","summary":"","title":"Making a L337 H4x0R Terminal","type":"blog"},{"content":"","date":"2 December 2025","externalUrl":null,"permalink":"/tags/terminal/","section":"Tags","summary":"","title":"Terminal","type":"tags"},{"content":"Modern databases seem like magic-until you realize they\u0026rsquo;re just organized files and clever processes. Let\u0026rsquo;s cut through the mystique by building a key-value database in bash. 
Spoiler: it\u0026rsquo;s shockingly simple\u0026hellip; but impractical.\nHere\u0026rsquo;s What We\u0026rsquo;re Gonna Do # We\u0026rsquo;ll create four bash functions that handle CRUD operations (Create, Read, Update, Delete). No frameworks, no dependencies-just pure shell magic.\n# Initialize our \u0026#34;database\u0026#34; mkdir -p ~/.database Here\u0026rsquo;s the real magic:\n# Create/update a key-value pair dbset() { echo \u0026#34;$2\u0026#34; \u0026gt; ~/.database/\u0026#34;$1\u0026#34;; } # Read a value by key dbget() { cat ~/.database/\u0026#34;$1\u0026#34;; } # Delete a key dbdelete() { rm ~/.database/\u0026#34;$1\u0026#34;; } # Search for keys by pattern dbscan() { ls -l ~/.database | grep \u0026#34;$1\u0026#34;; } Boom. You\u0026rsquo;ve just deployed a database. Let\u0026rsquo;s break it down.\n1. How It Works # File-Based Storage # Every \u0026ldquo;key\u0026rdquo; is a file in ~/.database, and its \u0026ldquo;value\u0026rdquo; is the file\u0026rsquo;s content. Writing to a key? Just echo text into a file. Reading? Cat the file.\nAtomic Operations # Each command runs as a single filesystem operation-no half-written data.\n2. ACID Compliance: The Delusional Developer\u0026rsquo;s Checklist # Let\u0026rsquo;s pretend we\u0026rsquo;re enterprise-grade! Here\u0026rsquo;s how our bash DB technically checks ACID boxes:\nAtomicity\nKinda. Each dbset or dbdelete is a single filesystem operation. Your whole value writes or fails-no partial updates. Consistency\nSure, if you squint. Files either exist with valid content or don\u0026rsquo;t. No constraints? No problem! Isolation\nSingle-player mode. The OS handles file locks, but concurrency? Let’s not talk about concurrency. Durability\nAs durable as your disk. If your SSD survives, so does your data. Warning: This is a joke. Real databases handle ACID with transactions, locks, and recovery mechanisms. Don\u0026rsquo;t use this for anything serious.\n3. 
Bonus Round: Hooks and Permissions # Trigger Happy # Want to run code when a key changes? Use inotify-tools to watch the directory:\n# Watch for file changes in real-time inotifywait -m ~/.database | while read event; do echo \u0026#34;Database updated! Event: $event\u0026#34; done Security Features # Permissions? Just use UNIX file permissions. Want to restrict access to a key?\nchmod 600 ~/.database/my_secret_key # Only you can read/write 4. Why This Is Terrifying # No backups No transactions No scalability No error handling (What if two processes write at once?) But hey-it\u0026rsquo;s 4 lines of code.\nConclusion: A Glorious Toy # Congratulations! You\u0026rsquo;ve built a \u0026ldquo;database\u0026rdquo; that\u0026rsquo;s both awe-inspiring and horrifying. While this isn\u0026rsquo;t replacing PostgreSQL anytime soon, it demystifies core database concepts:\nData is just organized bytes \u0026ldquo;Queries\u0026rdquo; are operations on those bytes Permissions and hooks are OS-level features Go forth and experiment - then use a real database for your next project.\nFinal Warning: If you deploy this, you\u0026rsquo;ll anger the ops gods. You\u0026rsquo;ve been warned.\n","date":"23 April 2025","externalUrl":null,"permalink":"/blog/database-engineers-hate-this-one-weird-trick/","section":"Blog","summary":"","title":"Database engineers hate this one weird trick","type":"blog"},{"content":"","date":"23 April 2025","externalUrl":null,"permalink":"/tags/fun/","section":"Tags","summary":"","title":"Fun","type":"tags"},{"content":"Ever wished you could define dependencies right inside your quick Python scripts without wrestling with complex configuration files? Hey presto, Python\u0026rsquo;s got a slick feature that gets you one step closer to that 10x engineer.\nWhat\u0026rsquo;s the Magic? # Python now supports inline metadata that tools like uv can read and manage in virtual environments. 
It\u0026rsquo;s modern, scalable, and ridiculously simple to use.\nThe Inline Metadata Format # Check out this super-clean example that defines Python version requirements and dependencies metadata right in the script file itself:\n# /// script # requires-python = \u0026#34;\u0026lt;3.11\u0026#34; # dependencies = [ # \u0026#34;requests\u0026lt;3.0.2\u0026#34;, # ] # /// import platform import sys import requests def get_cute_doggo(): res = requests.get(\u0026#34;https://dog.ceo/api/breed/frise/bichon/images/random\u0026#34;) res.raise_for_status() return res.json().get(\u0026#34;message\u0026#34;) print(get_cute_doggo()) print(sys.version_info) Note: This isn\u0026rsquo;t just syntactic sugar - it\u0026rsquo;s a powerful way to make your scripts totally self-contained!\nHow It Works in Action # Here\u0026rsquo;s what happens when you run the script with uv:\n[25-01-24 08:12:56] toby@aws_tmp_aunx17s/:~/dev λ: uv run ./my_script.py Reading inline script metadata from `my_script.py` {\u0026#34;message\u0026#34;:\u0026#34;https:\\/\\/images.dog.ceo\\/breeds\\/frise-bichon\\/5.webp\u0026#34;,\u0026#34;status\u0026#34;:\u0026#34;success\u0026#34;} https://images.dog.ceo/breeds/frise-bichon/5.webp sys.version_info(major=3, minor=10, micro=15, releaselevel=\u0026#39;final\u0026#39;, serial=0) Breaking it Down:\nrequires-python = \u0026quot;\u0026lt;3.11\u0026quot; ensures the script runs on Python 3.10 or earlier. dependencies list specifies exactly which packages (and versions) you need uv automatically creates a virtual environment with these exact requirements Why This Rocks # Imagine never again dealing with separate requirements.txt files or complex pyproject.toml configurations for simple things. With inline metadata, your dependencies travel right alongside your code - clean, simple, and straightforward.\nTry changing the example dependencies or versions, see what happens!\nHey presto! 
You\u0026rsquo;re now equipped with one of Python\u0026rsquo;s coolest features. Go forth and simplify your dev_scripts folder! 🐍✨\n","date":"24 January 2025","externalUrl":null,"permalink":"/blog/python-inline-script-metadata-format/","section":"Blog","summary":"","title":"Python Inline Script Metadata Format","type":"blog"},{"content":"","date":"20 January 2025","externalUrl":null,"permalink":"/tags/stdlib/","section":"Tags","summary":"","title":"Stdlib","type":"tags"},{"content":"Welcome to the second part of my advanced Python series, where we explore the hidden gems within Python\u0026rsquo;s standard library. In this post, we\u0026rsquo;ll delve into the deque data structure, a versatile tool that can supercharge your algorithms and applications.\nWhat is a Deque? # A deque, short for \u0026ldquo;double-ended queue,\u0026rdquo; is a data structure that allows efficient insertion and removal of elements from both ends. This flexibility makes it a powerful tool for various algorithms and scenarios, offering a generalization of both stacks and queues.\nDeque as a Stack # Stacks are simple enough to implement with a list, but using a deque is more efficient. 
As per the docs:\nThough list objects support similar operations, they are optimized for fast fixed-length operations and incur O(n) memory movement costs for pop(0) and insert(0, v) operations which change both the size and position of the underlying data representation.\nHence using a deque will incur an O(1) cost for popping and appending at either end and should result in faster code.\nfrom collections import deque stack = deque() # push uses .append() stack.append(1) stack.append(2) stack.append(3) stack.append(4) stack.append(5) # peek uses index at the \u0026#34;top\u0026#34; print(stack[-1]) # Output: 5 # pop uses .pop() - duh print(stack.pop()) # Output: 5 print(stack.pop()) # Output: 4 # elements are removed from the end print(stack) # Output: deque([1, 2, 3]) Deque as a Queue # Again, using a deque is more efficient than using a list in this approach, and it has some nifty methods that make popping from the start of the queue O(1) as well.\nfrom collections import deque queue = deque() # append uses append queue.append(1) queue.append(2) queue.append(3) queue.append(4) queue.append(5) # peek uses index at the \u0026#34;start\u0026#34; print(queue[0]) # Output: 1 # pop uses popleft print(queue.popleft()) # Output: 1 print(queue.popleft()) # Output: 2 # elements are removed from the start print(queue) # Output: deque([3, 4, 5]) Tailing and Iterables # One common application of deques is in tailing iterables. By utilizing a deque, you can efficiently record the tail of an iterable, adding elements to the back and removing them from the front as needed. This is particularly useful when you want to keep a fixed-size window of the most recent elements.\nfrom collections import deque def tail(iter, n=10): return deque(iter, n) my_iter = range(4, 56, 2) print(tail(my_iter, 3)) # Output: deque([50, 52, 54], maxlen=3) In the above example, the tail function takes an iterable and a window size n as arguments. 
It returns a deque containing the last n elements of the iterable.\nUndo Operations and History Management # Deques are also invaluable for implementing undo operations in software applications. By using a deque as a stack, you can store a history of actions, allowing users to revert to previous states by removing actions from the back of the deque.\nclass History: def __init__(self): self.undo_stack = deque() def perform_action(self, action): self.undo_stack.append(action) def undo(self): \u0026#34;\u0026#34;\u0026#34;Undoes the last action.\u0026#34;\u0026#34;\u0026#34; if self.undo_stack: action = self.undo_stack.pop() print(f\u0026#34;Undone: {action}\u0026#34;) else: print(\u0026#34;No action to undo.\u0026#34;) history = History() history.perform_action(\u0026#34;Open File\u0026#34;) history.perform_action(\u0026#34;Edit File\u0026#34;) history.undo() # Output: Undone: Edit File history.perform_action(\u0026#34;Save File\u0026#34;) history.undo() # Output: Undone: Save File In this example, the History class maintains an undo stack using a deque. The perform_action method adds actions to the stack, while the undo method removes the last action, effectively undoing it.\n\u0026ldquo;Play Next\u0026rdquo; Music Queue # A creative application of deques is in managing a \u0026ldquo;play next\u0026rdquo; music queue. In this scenario, songs are added to the back of the deque to be played in order. 
However, a \u0026ldquo;play next\u0026rdquo; feature allows users to add songs to the front of the deque, ensuring they are played immediately after the current song.\nfrom collections import deque class MusicPlayer: def __init__(self): # per-instance state, so separate players do not share one queue self.play_queue: deque[str] = deque() self.playing: str | None = None def add(self, song: str): self.play_queue.append(song) def add_next(self, song: str): self.play_queue.appendleft(song) def play_next_song(self): if self.play_queue: self.playing = self.play_queue.popleft() else: self.playing = None def __str__(self) -\u0026gt; str: return f\u0026#34;Now playing: {self.playing}, Queue: {self.play_queue}\u0026#34; player = MusicPlayer() # Add songs to the queue player.add(\u0026#34;Song 1\u0026#34;) player.add(\u0026#34;Song 2\u0026#34;) player.add(\u0026#34;Song 3\u0026#34;) print(player) # Output: Now playing: None, Queue: deque([\u0026#39;Song 1\u0026#39;, \u0026#39;Song 2\u0026#39;, \u0026#39;Song 3\u0026#39;]) player.play_next_song() print(player) # Output: Now playing: Song 1, Queue: deque([\u0026#39;Song 2\u0026#39;, \u0026#39;Song 3\u0026#39;]) player.add_next(\u0026#34;Song 4\u0026#34;) print(player) # Output: Now playing: Song 1, Queue: deque([\u0026#39;Song 4\u0026#39;, \u0026#39;Song 2\u0026#39;, \u0026#39;Song 3\u0026#39;]) player.play_next_song() print(player) # Output: Now playing: Song 4, Queue: deque([\u0026#39;Song 2\u0026#39;, \u0026#39;Song 3\u0026#39;]) In this example, the MusicPlayer class uses a deque to manage the music queue. The add method adds songs to the back of the deque, while the add_next method adds songs to the front, ensuring they are played next. The play_next_song method advances the queue, updating the currently playing song.\nConclusion # Deques are a powerful and versatile data structure in Python, offering efficient solutions for a wide range of problems. 
By understanding and leveraging their capabilities, you can write more elegant and optimized code, especially in scenarios where you need to manage data from both ends of a collection.\nStay tuned for the final part of our advanced Python series, where we\u0026rsquo;ll explore another exciting module and its applications!\nHappy coding!\n","date":"20 January 2025","externalUrl":null,"permalink":"/blog/python-supercharge-lists-deque/","section":"Blog","summary":"","title":"Supercharge lists in Python: deque","type":"blog"},{"content":"Welcome to the first part of my advanced Python series! In this blog post, we\u0026rsquo;ll dive into the bisect module, a powerful tool in Python\u0026rsquo;s standard library that can supercharge your list manipulation skills.\nWhat is Bisection? # Bisection is an algorithm that employs a simple yet effective strategy: \u0026ldquo;cut the thing in half until you find the right point.\u0026rdquo; While it may sound basic, its applications are far-reaching and can greatly enhance your coding toolkit, especially when dealing with sorted lists.\nSome practical uses of bisection include:\nEfficiently Sorting Lists: Keeping a list sorted while adding new items. Binary Search: A classic algorithm for finding elements in a sorted list. Merging Sorted Items: Combining multiple sorted lists into one. These foundational algorithms can be further leveraged in more complex, real-world scenarios, such as the \u0026ldquo;Merging Sorted Lists\u0026rdquo; problem on LeetCode.\nThe bisect Module in Action # The bisect module in Python provides a range of functions to implement bisection algorithms efficiently. Let\u0026rsquo;s explore one of its key functions, bisect.insort(), through a practical example.\nimport bisect # Test cases for merging sorted lists test_cases = [ ([[1, 4, 5], [1, 3, 4], [2, 6]], [1, 1, 2, 3, 4, 4, 5, 6]), ([[3, 4, 5, 6, 7], [0, 1, 2], [8, 9]], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), # ... 
(more test cases) ] def mergeKLists(lists: list[list[int]]) -\u0026gt; list[int]: s = [] for _list in lists: for _el in _list: bisect.insort(s, _el) return s # Run tests for _in, _out in test_cases: assert mergeKLists(_in) == _out In the above code, we define a mergeKLists function that takes a list of sorted lists (lists) and returns a single merged, sorted list. The bisect.insort() function is used to efficiently insert elements into the s list while maintaining its sorted order.\nExploring bisect.bisect() # One of the most powerful functions in the bisect module is bisect.bisect(). Let\u0026rsquo;s take a closer look at its documentation:\nReturn the index where to insert item x in list a, assuming a is sorted.The return value i is such that all e in a[:i] have e \u0026lt;= x, and all e in a[i:] have e \u0026gt; x. So if x already appears in the list, a.insert(i, x) will insert just after the rightmost x already there.Optional args lo (default 0) and hi (default len(a)) bound the slice of a to be searched.A custom key function can be supplied to customize the sort order.\nThis function is particularly useful in divide-and-conquer algorithms, allowing you to efficiently find the correct position to insert an element into a sorted list.\nConclusion # The bisect module is a hidden gem in Python\u0026rsquo;s standard library, offering a range of powerful tools for list manipulation. 
By leveraging bisection algorithms, you can write more efficient and elegant code, especially when working with sorted data.\nStay tuned for the final part of our advanced Python series, where we\u0026rsquo;ll explore another exciting module and its applications!\nHappy coding!\n","date":"17 January 2025","externalUrl":null,"permalink":"/blog/python-supercharge-lists-bisect/","section":"Blog","summary":"","title":"Supercharge lists in Python: bisect","type":"blog"},{"content":"Today, we\u0026rsquo;re diving into one of the most underrated power tools in your tech arsenal: the SSH configuration file. Whether you\u0026rsquo;re juggling multiple servers, tired of typing the same long commands, or just curious about what this file can do, you\u0026rsquo;re in the right place. By the end of this post, you\u0026rsquo;ll be wielding SSH like a pro!\nWhat is SSH, and Why the .ssh/config File? # SSH (Secure Shell) is the gateway to managing remote servers securely. Normally, connecting to a remote server means typing out a lengthy command like this:\nssh user@192.168.1.10 -p 2222 -i /path/to/key Tedious, right? That\u0026rsquo;s where the .ssh/config file comes to the rescue. It lets you define reusable configurations for your SSH connections, turning the command above into something as simple as:\nssh myserver Sounds cool? Let\u0026rsquo;s configure it!\nSetting Up Your .ssh/config File # Here\u0026rsquo;s what we\u0026rsquo;re gonna do:\nLocate or create the .ssh/config file. Define custom connection profiles. Test and tweak as needed. Step 1: Locate or Create the File # The .ssh/config file lives in your home directory. If it doesn\u0026rsquo;t exist yet, no worries-we\u0026rsquo;ll create it.\ncd ~/.ssh nano config Note: Make sure the .ssh directory and the config file have proper permissions:\nchmod 700 ~/.ssh chmod 600 ~/.ssh/config Step 2: Define Connection Profiles # Let\u0026rsquo;s make your life easier by adding custom profiles. 
Here\u0026rsquo;s a simple example:\nHost myserver HostName 192.168.1.10 User user Port 2222 IdentityFile ~/.ssh/id_rsa What\u0026rsquo;s Happening Here?\nHost: A nickname for your connection. Use it when running the ssh command. HostName: The server\u0026rsquo;s address (IP or domain). User: Your login username. Port: The SSH port (default is 22). IdentityFile: Path to your SSH private key. See the ssh man page for the key definitions.\nStep 3: Test Your Connection # Save the file, then test your shiny new setup:\nssh myserver Hey presto! If everything is configured correctly, you\u0026rsquo;ll connect without typing the long-winded command.\nWarning: If you\u0026rsquo;re having trouble, double-check the file permissions and paths.\nAdvanced .ssh/config Features # Feeling adventurous? Here are some pro-level tricks:\n1. Wildcard Hosts # Got multiple servers with similar patterns? Use wildcards:\nHost server-* User admin IdentityFile ~/.ssh/admin_key Now, ssh server-1 or ssh server-2 will automatically apply this config.\n2. ProxyJump (Jump Hosts) # Need to connect through an intermediate server? Use ProxyJump:\nHost internal-server HostName 10.0.0.5 User user ProxyJump gateway-server This connects to internal-server via gateway-server. No manual tunneling needed!\n3. Example: Using a Jump Host for a Remote Database # This example configures network-local-db as a jump host to access a database on another network:\nHost network-local-db HostName 192.168.0.11 User user IdentityFile ~/.ssh/db_key LocalForward 3306 10.66.4.22:3306 What\u0026rsquo;s Happening Here?\nLocalForward: Forwards traffic from localhost:3306 to 10.66.4.22:3306 on the jump host, making the private database accessible locally. To test it, simply run:\nssh network-local-db Connect to a database on localhost:3306 and the data will be proxied through to the database in the private network.\n4. 
Example: Running a Local Command After Login # Here\u0026rsquo;s how you can run a command to update a login log file after connecting:\nHost log-updater HostName 192.168.0.12 User user PermitLocalCommand yes LocalCommand echo \u0026#34;Login on $(date)\u0026#34; \u0026gt;\u0026gt; ~/ssh_login.log What\u0026rsquo;s Happening Here?\nPermitLocalCommand: Enables the use of LocalCommand. LocalCommand: Appends the login time and date to a local log file (~/ssh_login.log). To test it, run:\nssh log-updater After connecting, check the contents of ~/ssh_login.log to verify the update.\nConclusion # Congratulations, you\u0026rsquo;ve just unlocked the magic of the .ssh/config file! No more repetitive typing, no more juggling keys and ports. Whether you\u0026rsquo;re managing a single server or an entire fleet, this file streamlines your workflow and saves you precious time.\nSo, go ahead and start experimenting. Add your favourite servers, try out wildcards, or dive into advanced features like ProxyJump. The sky\u0026rsquo;s the limit!\n","date":"7 January 2025","externalUrl":null,"permalink":"/blog/mastering-ssh-with-the-config-file/","section":"Blog","summary":"","title":"Mastering ssh With The Config File","type":"blog"},{"content":"","date":"7 January 2025","externalUrl":null,"permalink":"/tags/ssh/","section":"Tags","summary":"","title":"Ssh","type":"tags"},{"content":" Using MagicMock Side Effect for Custom Mock Behaviors in Python # When writing unit tests in Python, you often need to mock objects or functions to isolate the code you\u0026rsquo;re testing. The unittest.mock module provides the MagicMock class, which allows you to create mock objects with flexible behaviors. One of the most powerful features of MagicMock is its side_effect attribute. 
This lets you define custom return values or behaviors that change with each call to the mocked method, offering greater control over how the mock behaves across multiple invocations.\nFor example, imagine you\u0026rsquo;re testing a function that calls an external API. If you want the mock to return different responses on subsequent calls, you can use the side_effect feature. Setting side_effect to a list, like [None, (1, 2, 3, 4)], will make the mock return None on the first call, and a tuple (1, 2, 3, 4) on the second call. This is particularly useful when you need to simulate varying results or test different scenarios within a single test case.\nfrom unittest.mock import MagicMock # Create the mock object mock_obj = MagicMock() # Set side_effect to return different values on each call mock_obj.execute.side_effect = [None, (1, 2, 3, 4)] # Test the mock behavior print(mock_obj.execute()) # Output: None (first call) print(mock_obj.execute()) # Output: (1, 2, 3, 4) (second call) In this example, the mock\u0026rsquo;s execute() method returns None on the first call and a tuple (1, 2, 3, 4) on the second. Using a list for side_effect allows you to control the return value on a call-by-call basis.\nfrom unittest.mock import MagicMock # Create the mock object mock_obj = MagicMock() # Set side_effect to a lambda function mock_obj.execute.side_effect = lambda x: x * 2 # Test the mock with custom logic print(mock_obj.execute(5)) # Output: 10 (5 * 2) print(mock_obj.execute(7)) # Output: 14 (7 * 2) Here, side_effect is assigned a lambda function, which doubles the input value each time execute() is called. 
This provides dynamic behavior where the return value is based on the arguments passed.\nfrom unittest.mock import MagicMock # Create the mock object mock_obj = MagicMock() # Set side_effect to raise an exception mock_obj.execute.side_effect = ValueError(\u0026#34;An error occurred\u0026#34;) # Test the mock behavior with exception raising try: mock_obj.execute() except ValueError as e: print(f\u0026#34;Exception raised: {e}\u0026#34;) # Output: Exception raised: An error occurred In this example, we use side_effect to raise an exception when execute() is called. This is useful for testing error handling in your code when interacting with mocked methods or external dependencies.\nside_effect can be used with a callable (such as a function or lambda), which allows for even more dynamic behaviors. This makes it easy to simulate complex logic, like raising exceptions on certain calls, or returning values based on input arguments. The flexibility of side_effect ensures that your tests are as close to real-world conditions as possible, even when interacting with mocked or simulated components. Whether you\u0026rsquo;re working with simple mocks or need more intricate behavior, MagicMock\u0026rsquo;s side_effect is an essential tool in any tester\u0026rsquo;s toolkit.\n","date":"3 January 2025","externalUrl":null,"permalink":"/blog/magicmock-and-side_effect-tricks/","section":"Blog","summary":"","title":"MagicMock and side_effect tricks","type":"blog"},{"content":"","date":"3 January 2025","externalUrl":null,"permalink":"/tags/testing/","section":"Tags","summary":"","title":"Testing","type":"tags"},{"content":"","date":"6 January 2024","externalUrl":null,"permalink":"/tags/1brc/","section":"Tags","summary":"","title":"1brc","type":"tags"},{"content":"This blog post is a copy of 1brc or The Billion Row Challenge - The goal, as per Gunnar Morlings 1brc is stated as:\nThe text file contains temperature values for a range of weather stations. 
Each row is one measurement in the format \u0026lt;string: station name\u0026gt;;\u0026lt;double: measurement\u0026gt;, with the measurement value having exactly one fractional digit. The following shows ten rows as an example:\nHamburg;12.0 Bulawayo;8.9 Palembang;38.8 St. John\u0026#39;s;15.2 Cracow;12.6 Bridgetown;26.9 Istanbul;6.2 Roseau;34.4 Conakry;31.2 Istanbul;23.0 The task is to write a Java program which reads the file, calculates the min, mean, and max temperature value per weather station, and emits the results on stdout like this (i.e. sorted alphabetically by station name, and the result values per station in the format \u0026lt;min\u0026gt;/\u0026lt;mean\u0026gt;/\u0026lt;max\u0026gt;, rounded to one fractional digit):\n{Abha=-23.0/18.0/59.2, Abidjan=-16.2/26.0/67.3, Abéché=-10.0/29.4/69.0, Accra=-10.1/26.4/66.4, Addis Ababa=-23.7/16.0/67.0, Adelaide=-27.8/17.3/58.5, ...} Obviously, this will be done in Python in this case! As of writing, the leaderboard is looking quite fast, with a result of 00:12.063/00:59.430/04:13.449.\nResults are determined by running the program on a Hetzner Cloud CCX33 instance (8 dedicated vCPU, 32 GB RAM)\nI\u0026rsquo;m running everything locally on my M1 Mac with a M1 Max chip. According to my brief research into the chips the Hetzner cloud runs on, I\u0026rsquo;m going to assume my times will be faster by up to a factor of 2. That means that to get anywhere near performant I\u0026rsquo;ll be aiming for ~2 mins, and even then I\u0026rsquo;ll be clinging on to the slowest times of the original 1brc.\nThese are going to be some hard numbers to beat, especially in Python - but if we can get under our target that would be amazing. 
My approach will take a few stages, iteratively trying to get better with different optimisation approaches.\n0 - ASAP With Polars # It\u0026rsquo;s a nice place to start, a tool I\u0026rsquo;m familiar with and will likely be relatively fast thanks to its low-level implementations of operations using the Arrow in-memory storage.\nimport logging from pathlib import Path import polars as pl from brc.util import DATA_DIR, timit_context def main(file_path: Path): lf = pl.scan_csv( file_path, has_header=False, separator=\u0026#34;;\u0026#34;, with_column_names=lambda _: [\u0026#34;city\u0026#34;, \u0026#34;temp\u0026#34;], schema={\u0026#34;city\u0026#34;: pl.Categorical, \u0026#34;temp\u0026#34;: pl.Float64} ) q = ( lf.group_by(\u0026#34;city\u0026#34;) .agg( pl.min(\u0026#34;temp\u0026#34;).alias(\u0026#34;min_temp\u0026#34;), pl.mean(\u0026#34;temp\u0026#34;).round(1).alias(\u0026#34;mean_temp\u0026#34;), pl.max(\u0026#34;temp\u0026#34;).alias(\u0026#34;max_temp\u0026#34;), ) .select( pl.col(\u0026#34;city\u0026#34;), pl.concat_str([ pl.col(\u0026#34;min_temp\u0026#34;), pl.col(\u0026#34;mean_temp\u0026#34;), pl.col(\u0026#34;max_temp\u0026#34;) ], separator=\u0026#34;/\u0026#34;).alias(\u0026#34;_res_str\u0026#34;) ) .sort(\u0026#34;city\u0026#34;) .select( pl.concat_str([ pl.col(\u0026#34;city\u0026#34;), pl.col(\u0026#34;_res_str\u0026#34;) ], separator=\u0026#34;=\u0026#34;).alias(\u0026#34;res_str\u0026#34;) ) ) res = q.collect() print(\u0026#34;, \u0026#34;.join(res[\u0026#34;res_str\u0026#34;])) if __name__ == \u0026#39;__main__\u0026#39;: logging.basicConfig(level=logging.INFO) path = DATA_DIR / \u0026#34;data_1_000_000_000.txt\u0026#34; with timit_context(1_000_000_000): main(path) In ~30 lines of code, we can do all the parsing, maths and string concat operation to create the result. On my machine, this took just over 1 minute*. 
Being able to put this together so quickly using industry-standard Python libs is pretty sweet. That, and the fact I\u0026rsquo;m now able to move on to other work rather than mess around with multiprocessing and parsing algos, means the total cost of this solution is very cheap.\nINFO:brc.util:benchmark \u0026gt;\u0026gt;\u0026gt; Starting timer INFO:brc.util:benchmark \u0026lt;\u0026lt;\u0026lt; Elapsed time: 0:01:00.263331 INFO:brc.util:benchmark \u0026lt;\u0026lt;\u0026lt; with 1,000,000,000 rows =\u0026gt; 16,593,838.763 rows/s. HOWEVER\nLiterally rule 2 mentions (rule 1 is about using Java… rules are meant to be broken right?)\nNo external library dependencies may be used\nSo let\u0026rsquo;s do it with no libs\u0026hellip;\n1 - Make a Python Script to Read Then Calculate # The file I\u0026rsquo;m working with is ~13Gb, and my laptop has 64Gb, so I can do this all in memory. I\u0026rsquo;ll probably end up not doing it all in memory as the target machine has less than 8Gb but I\u0026rsquo;ll cross that bridge when I get there.\nHere is the code:\nimport csv import logging from dataclasses import dataclass from pathlib import Path from brc.util import DATA_DIR, timit_context def main(file_path: Path): @dataclass class CollectionStruct(): city: bytes min_temp: float max_temp: float sum_temps: float count: int def mean(self): return self.sum_temps / self.count def __repr__(self): return f\u0026#34;{self.city.decode(\u0026#39;utf-8\u0026#39;)}={self.min_temp}/{self.mean():,.1f}/{self.max_temp}\u0026#34; temps: dict[bytes, CollectionStruct] = {} with open(file_path, \u0026#39;rb\u0026#39;) as f: for line in f: # parse line data city, temp = line.split(b\u0026#34;;\u0026#34;) temp = float(temp) # grab existing data collect = temps.get(city, CollectionStruct(city, temp, temp, 0, 0)) # add line to the collection settings collect.count += 1 collect.sum_temps += temp collect.min_temp = min(collect.min_temp, temp) collect.max_temp = max(collect.max_temp, temp) # 
update the dictionary temps[city] = collect # ordering here is slightly quicker than the dataclass ordering print(\u0026#34; \u0026#34;.join((str(s) for s in sorted(temps.values(), key=lambda c: c.city)))) if __name__ == \u0026#34;__main__\u0026#34;: logging.basicConfig(level=logging.INFO) # This solution will not be fast, 100mm records is enough for a sample at the moment. n = 100_000_000 path = DATA_DIR / f\u0026#34;data_{n:_}.txt\u0026#34; with timit_context(n, \u0026#34;basic\u0026#34;): main(path) with timit_context(n, \u0026#34;basic_profile\u0026#34;, profile=True): main(path) We use data classes and a non-parallelized approach, giving us a result for 100mm records. This first step is really to gauge how fast native Python is vs polars and estimate how long 1bn records would take. the results are below:\nINFO:brc.util:basic \u0026gt;\u0026gt;\u0026gt; Starting timer Abha=-24.7/18.0/64.2 Abidjan=-21.7/26.0/75.3 Abéché=-18.3/29.4/71.0 Accra=-17.0/26.4/75.2 Addis Ababa=-32.3/16.0/62.3 Adelaide=-24.6/17.3/67.1 Aden=-15.0/29.1/79.9 Ahvaz=-16.6/25.4/69.0 Albuquerque=-30.7/14.0/57.3 Alexandra=-29.9/11.0/59.4 Alexandria=-26.4/20.0/68.2 Algiers=-28.0/18.2/75.1 Alice Springs=-24.9/21.0/64.4 Almaty=-35.0/10.0/59.4 Amsterdam=-36.3/10.2/55.2 Anadyr=-53.0/-6.9/38.5 Anchorage=-41.4/2.8/48.7 Andorra la Vella=-34.0/9.8/52.5 Ankara=-33.9/12.0/57.9 Antananarivo=-27.4/17.9/63.5 Antsiranana=-27.8/25.2/71.0 Arkhangelsk=-46.6/1.3/47.2 Ashgabat=-24.9/17.1/58.7 Asmara=-29.9/15.6/59.1 Assab=-14.3/30.5/77.0 Astana=-41.0/3.5/48.5 Athens=-27.0/19.2/60.9 Atlanta=-28.9/17.0/64.1 Auckland=-34.3/15.2/60.9 Austin=-27.8/20.7/64.5 Baghdad=-19.7/22.8/71.7 Baguio=-32.6/19.5/62.6 Baku=-31.6/15.1/56.7 Baltimore=-33.6/13.1/59.0 Bamako=-17.8/27.8/72.4 Bangkok=-15.4/28.6/77.0 Bangui=-20.0/26.0/68.1 Banjul=-25.3/26.0/69.6 Barcelona=-26.9/18.2/60.9 Bata=-21.1/25.1/72.3 Batumi=-28.4/14.0/58.0 Beijing=-33.8/12.9/57.1 Beirut=-23.1/20.9/71.3 Belgrade=-36.3/12.5/57.4 Belize City=-17.6/26.7/72.0 
Benghazi=-25.8/19.9/65.7 Bergen=-36.0/7.7/52.9 Berlin=-33.8/10.3/56.0 Bilbao=-30.1/14.7/62.1 Birao=-18.1/26.5/68.8 Bishkek=-35.1/11.2/55.8 Bissau=-18.1/27.0/67.8 Blantyre=-24.9/22.2/67.6 Bloemfontein=-28.6/15.6/58.9 Boise=-31.2/11.4/55.5 Bordeaux=-28.1/14.2/62.1 Bosaso=-13.0/30.0/80.5 Boston=-34.1/10.9/58.4 Bouaké=-20.0/26.0/71.8 Bratislava=-32.7/10.5/56.9 Brazzaville=-22.7/25.0/68.6 Bridgetown=-26.5/27.0/75.3 Brisbane=-24.2/21.4/65.7 Brussels=-36.0/10.5/56.7 Bucharest=-33.6/10.8/57.6 Budapest=-32.0/11.3/54.4 Bujumbura=-20.4/23.8/66.7 Bulawayo=-23.3/18.9/62.5 Burnie=-28.4/13.1/55.7 Busan=-31.2/15.0/59.8 Cabo San Lucas=-18.8/23.9/69.9 Cairns=-21.4/25.0/68.7 Cairo=-26.3/21.4/65.1 Calgary=-43.2/4.4/51.9 Canberra=-31.3/13.1/53.5 Cape Town=-26.2/16.2/61.6 Changsha=-26.7/17.4/61.8 Charlotte=-31.8/16.1/62.2 Chiang Mai=-27.9/25.8/73.8 Chicago=-34.0/9.8/57.3 Chihuahua=-22.9/18.6/66.5 Chittagong=-18.0/25.9/69.4 Chișinău=-33.2/10.2/53.7 Chongqing=-28.9/18.6/64.7 Christchurch=-32.5/12.2/55.1 City of San Marino=-33.0/11.8/54.9 Colombo=-14.7/27.4/70.1 Columbus=-33.4/11.7/58.3 Conakry=-22.4/26.4/74.4 Copenhagen=-39.0/9.1/53.6 Cotonou=-19.5/27.2/73.9 Cracow=-33.1/9.3/56.8 Da Lat=-28.3/17.9/62.7 Da Nang=-17.3/25.8/68.8 Dakar=-19.6/24.0/70.2 Dallas=-32.3/19.0/64.6 Damascus=-31.3/17.0/60.5 Dampier=-16.7/26.4/72.9 Dar es Salaam=-17.6/25.8/69.0 Darwin=-17.3/27.6/75.4 Denpasar=-20.9/23.7/71.2 Denver=-34.0/10.4/52.7 Detroit=-36.5/10.0/54.7 Dhaka=-18.7/25.9/75.9 Dikson=-54.8/-11.1/35.7 Dili=-19.4/26.6/73.0 Djibouti=-15.8/30.0/74.0 Dodoma=-25.4/22.7/71.5 Dolisie=-20.8/24.0/78.6 Douala=-18.2/26.7/70.5 Dubai=-17.4/26.9/71.4 Dublin=-37.7/9.8/55.1 Dunedin=-35.6/11.1/59.0 Durban=-24.3/20.6/65.0 Dushanbe=-29.2/14.7/58.0 Edinburgh=-42.8/9.3/53.5 Edmonton=-37.4/4.2/47.2 El Paso=-27.3/18.1/59.9 Entebbe=-24.0/21.0/67.6 Erbil=-25.5/19.5/67.4 Erzurum=-40.0/5.1/48.5 Fairbanks=-47.3/-2.3/46.9 Fianarantsoa=-26.0/17.9/68.7 Flores, Petén=-17.9/26.4/72.2 Frankfurt=-32.5/10.6/56.5 Fresno=-28.5/17.9/62.4 
Fukuoka=-28.5/17.0/63.1 Gaborone=-28.0/21.0/62.9 Gabès=-25.3/19.5/62.2 Gagnoa=-17.9/26.0/69.4 Gangtok=-28.0/15.2/60.9 Garissa=-14.1/29.3/77.3 Garoua=-17.2/28.3/73.2 George Town=-16.4/27.9/74.8 Ghanzi=-27.5/21.4/64.8 Gjoa Haven=-58.1/-14.4/27.9 Guadalajara=-23.8/20.9/63.1 Guangzhou=-24.8/22.4/69.7 Guatemala City=-21.5/20.4/68.1 Halifax=-35.7/7.5/53.2 Hamburg=-35.4/9.7/54.8 Hamilton=-27.2/13.8/58.1 Hanga Roa=-26.0/20.5/64.3 Hanoi=-21.3/23.6/68.5 Harare=-26.5/18.4/62.1 Harbin=-39.7/5.0/51.1 Hargeisa=-23.7/21.7/64.8 Hat Yai=-21.3/27.0/72.8 Havana=-19.0/25.2/75.2 Helsinki=-36.4/5.9/52.7 Heraklion=-25.4/18.9/61.8 Hiroshima=-27.8/16.3/61.3 Ho Chi Minh City=-18.2/27.4/72.1 Hobart=-33.6/12.7/59.0 Hong Kong=-35.6/23.3/68.1 Honiara=-19.2/26.5/68.9 Honolulu=-18.1/25.4/73.0 Houston=-33.4/20.8/62.3 Ifrane=-37.2/11.4/55.5 Indianapolis=-29.2/11.8/57.1 Iqaluit=-53.8/-9.3/35.9 Irkutsk=-46.5/1.0/44.6 Istanbul=-31.4/13.9/57.1 Jacksonville=-30.8/20.3/64.4 Jakarta=-20.7/26.7/75.4 Jayapura=-16.5/27.0/73.3 Jerusalem=-28.8/18.3/61.3 Johannesburg=-34.8/15.5/65.7 Jos=-28.8/22.8/67.9 Juba=-15.9/27.8/71.0 Kabul=-36.0/12.1/55.8 Kampala=-27.2/20.0/60.9 Kandi=-18.0/27.7/71.1 Kankan=-21.1/26.5/69.7 Kano=-17.0/26.4/77.7 Kansas City=-34.4/12.5/56.0 Karachi=-20.0/26.1/72.6 Karonga=-20.6/24.4/70.9 Kathmandu=-24.6/18.3/61.9 Khartoum=-18.2/29.9/75.5 Kingston=-19.3/27.4/69.8 Kinshasa=-22.8/25.3/72.3 Kolkata=-20.8/26.7/72.2 Kuala Lumpur=-20.9/27.3/74.6 Kumasi=-17.5/26.0/70.2 Kunming=-32.8/15.7/58.4 Kuopio=-39.3/3.4/47.6 Kuwait City=-21.4/25.7/69.9 Kyiv=-39.3/8.4/53.3 Kyoto=-28.0/15.8/60.4 La Ceiba=-23.4/26.2/73.7 La Paz=-21.0/23.7/70.2 Lagos=-19.1/26.8/71.2 Lahore=-18.5/24.3/66.0 Lake Havasu City=-27.3/23.7/69.6 Lake Tekapo=-33.8/8.7/49.9 Las Palmas de Gran Canaria=-22.6/21.2/70.3 Las Vegas=-21.9/20.3/65.4 Launceston=-33.1/13.1/57.9 Lhasa=-39.3/7.6/50.8 Libreville=-28.0/25.9/69.4 Lisbon=-28.0/17.5/60.4 Livingstone=-25.6/21.8/66.4 Ljubljana=-37.7/10.9/54.7 Lodwar=-16.8/29.3/80.2 Lomé=-18.6/26.9/68.5 
London=-40.4/11.4/55.8 Los Angeles=-29.5/18.6/61.0 Louisville=-32.2/13.9/59.1 Luanda=-16.9/25.8/72.2 Lubumbashi=-27.2/20.8/66.9 Lusaka=-24.0/19.9/65.5 Luxembourg City=-36.7/9.3/51.7 Lviv=-34.3/7.8/57.1 Lyon=-37.4/12.5/58.5 Madrid=-29.9/15.0/59.5 Mahajanga=-22.2/26.3/70.4 Makassar=-17.1/26.7/72.3 Makurdi=-22.2/26.0/73.5 Malabo=-18.8/26.3/71.3 Malé=-13.9/28.0/75.5 Managua=-14.7/27.3/73.1 Manama=-18.8/26.5/72.8 Mandalay=-18.7/28.0/74.7 Mango=-15.7/28.1/80.4 Manila=-16.2/28.4/79.0 Maputo=-26.1/22.8/71.1 Marrakesh=-25.2/19.6/62.5 Marseille=-28.0/15.8/63.9 Maun=-23.1/22.4/68.5 Medan=-20.7/26.5/75.0 Mek\u0026#39;ele=-22.9/22.7/64.3 Melbourne=-32.6/15.1/60.9 Memphis=-24.0/17.2/60.4 Mexicali=-20.2/23.1/70.7 Mexico City=-26.7/17.5/61.6 Miami=-18.2/24.9/71.1 Milan=-34.5/12.9/58.2 Milwaukee=-35.3/8.9/55.5 Minneapolis=-40.2/7.8/51.1 Minsk=-37.7/6.7/50.7 Mogadishu=-26.3/27.1/72.3 Mombasa=-19.6/26.3/69.7 Monaco=-29.6/16.4/62.9 Moncton=-40.1/6.1/47.4 Monterrey=-24.0/22.3/72.7 Montreal=-40.8/6.8/53.3 Moscow=-36.8/5.8/51.0 Mumbai=-14.3/27.1/71.7 Murmansk=-44.2/0.6/51.2 Muscat=-17.8/28.0/73.1 Mzuzu=-34.9/17.7/61.8 N\u0026#39;Djamena=-13.8/28.3/72.5 Naha=-21.1/23.1/71.9 Nairobi=-28.7/17.8/62.3 Nakhon Ratchasima=-31.1/27.3/69.4 Napier=-31.7/14.6/58.4 Napoli=-30.2/15.9/64.8 Nashville=-30.3/15.4/58.6 Nassau=-18.1/24.6/66.3 Ndola=-24.0/20.3/70.7 New Delhi=-18.9/25.0/68.7 New Orleans=-23.3/20.7/71.7 New York City=-29.1/12.9/55.5 Ngaoundéré=-21.7/22.0/65.4 Niamey=-14.6/29.3/75.3 Nicosia=-23.9/19.7/63.2 Niigata=-36.6/13.9/61.5 Nouadhibou=-25.4/21.3/66.4 Nouakchott=-16.6/25.7/73.6 Novosibirsk=-44.5/1.7/45.8 Nuuk=-50.4/-1.4/44.2 Odesa=-33.7/10.7/56.9 Odienné=-18.0/26.0/70.1 Oklahoma City=-33.1/15.9/57.5 Omaha=-34.4/10.6/54.4 Oranjestad=-15.9/28.1/74.9 Oslo=-36.5/5.7/53.2 Ottawa=-42.3/6.6/53.2 Ouagadougou=-20.4/28.3/72.6 Ouahigouya=-17.6/28.6/81.3 Ouarzazate=-23.8/18.9/62.6 Oulu=-41.3/2.7/49.8 Palembang=-17.9/27.4/69.5 Palermo=-30.1/18.5/64.7 Palm Springs=-17.5/24.5/71.0 Palmerston 
North=-34.5/13.2/57.7 Panama City=-17.0/28.0/75.4 Parakou=-23.1/26.8/70.4 Paris=-30.4/12.3/58.0 Perth=-27.3/18.7/68.1 Petropavlovsk-Kamchatsky=-46.8/1.9/55.8 Philadelphia=-42.3/13.2/60.0 Phnom Penh=-15.1/28.3/75.1 Phoenix=-19.4/23.9/68.4 Pittsburgh=-36.9/10.8/54.9 Podgorica=-27.8/15.3/59.0 Pointe-Noire=-17.5/26.1/71.9 Pontianak=-18.5/27.7/73.4 Port Moresby=-21.2/26.9/71.1 Port Sudan=-16.9/28.4/76.2 Port Vila=-20.3/24.3/78.5 Port-Gentil=-17.9/26.0/72.1 Portland (OR)=-32.4/12.4/58.2 Porto=-28.6/15.7/66.9 Prague=-39.9/8.4/54.1 Praia=-24.9/24.5/67.9 Pretoria=-25.7/18.2/65.2 Pyongyang=-37.2/10.8/58.5 Rabat=-29.0/17.2/68.8 Rangpur=-19.4/24.4/71.2 Reggane=-20.3/28.3/74.2 Reykjavík=-45.8/4.3/48.5 Riga=-40.5/6.2/49.2 Riyadh=-21.0/26.0/74.9 Rome=-28.8/15.2/60.5 Roseau=-27.9/26.2/74.2 Rostov-on-Don=-34.1/9.9/56.6 Sacramento=-29.2/16.3/71.8 Saint Petersburg=-41.4/5.9/55.0 Saint-Pierre=-36.7/5.7/56.6 Salt Lake City=-33.3/11.6/57.8 San Antonio=-22.4/20.8/65.3 San Diego=-29.5/17.8/64.8 San Francisco=-32.1/14.6/57.1 San Jose=-28.6/16.4/59.2 San José=-20.3/22.6/72.2 San Juan=-20.3/27.2/74.4 San Salvador=-24.7/23.1/75.7 Sana\u0026#39;a=-24.3/20.0/64.5 Santo Domingo=-18.1/25.9/72.0 Sapporo=-37.9/8.9/54.9 Sarajevo=-35.5/10.1/52.3 Saskatoon=-39.3/3.3/49.4 Seattle=-35.7/11.3/56.7 Seoul=-30.8/12.5/60.5 Seville=-25.6/19.2/62.9 Shanghai=-28.9/16.7/63.3 Singapore=-24.4/27.0/75.9 Skopje=-32.6/12.4/58.4 Sochi=-31.0/14.2/58.3 Sofia=-36.0/10.6/55.2 Sokoto=-14.3/28.0/77.0 Split=-29.6/16.1/59.9 St. John\u0026#39;s=-39.7/5.0/52.4 St. 
Louis=-33.0/13.9/60.5 Stockholm=-37.2/6.6/50.5 Surabaya=-19.1/27.1/73.8 Suva=-17.0/25.6/73.7 Suwałki=-36.5/7.2/49.3 Sydney=-27.2/17.7/61.4 Ségou=-18.4/28.0/71.0 Tabora=-23.3/23.0/68.3 Tabriz=-32.4/12.6/55.6 Taipei=-26.8/23.0/70.1 Tallinn=-39.9/6.4/57.4 Tamale=-18.1/27.9/77.0 Tamanrasset=-22.2/21.7/66.9 Tampa=-21.9/22.9/69.5 Tashkent=-40.1/14.8/60.7 Tauranga=-28.8/14.8/57.8 Tbilisi=-37.6/12.9/60.5 Tegucigalpa=-19.9/21.7/69.5 Tehran=-30.4/17.0/62.9 Tel Aviv=-25.1/20.0/64.7 Thessaloniki=-29.2/16.0/58.7 Thiès=-23.2/24.0/68.1 Tijuana=-36.8/17.8/60.9 Timbuktu=-17.6/28.0/79.4 Tirana=-30.9/15.2/59.6 Toamasina=-19.1/23.4/71.5 Tokyo=-33.1/15.4/61.8 Toliara=-17.9/24.1/70.8 Toluca=-30.2/12.4/54.3 Toronto=-37.3/9.4/57.0 Tripoli=-24.3/20.0/67.7 Tromsø=-45.8/2.9/49.3 Tucson=-26.5/20.9/63.6 Tunis=-27.3/18.4/69.2 Ulaanbaatar=-43.0/-0.4/51.9 Upington=-28.7/20.4/65.1 Vaduz=-32.3/10.1/58.5 Valencia=-25.2/18.3/61.6 Valletta=-24.7/18.8/60.1 Vancouver=-34.8/10.4/58.6 Veracruz=-19.0/25.4/74.8 Vienna=-31.7/10.4/51.9 Vientiane=-16.8/25.9/74.6 Villahermosa=-15.2/27.1/73.7 Vilnius=-35.7/6.0/54.1 Virginia Beach=-28.7/15.8/61.2 Vladivostok=-38.6/4.9/50.3 Warsaw=-36.9/8.5/56.9 Washington, D.C.=-36.2/14.6/61.2 Wau=-18.9/27.8/72.2 Wellington=-35.2/12.9/61.3 Whitehorse=-45.5/-0.1/44.4 Wichita=-31.0/13.9/67.7 Willemstad=-15.2/28.0/73.8 Winnipeg=-40.7/3.0/52.7 Wrocław=-33.6/9.6/57.4 Xi\u0026#39;an=-27.7/14.1/57.1 Yakutsk=-56.0/-8.8/39.7 Yangon=-20.2/27.5/76.9 Yaoundé=-19.8/23.8/68.6 Yellowknife=-46.6/-4.3/41.0 Yerevan=-36.8/12.4/54.3 Yinchuan=-36.8/9.0/53.0 Zagreb=-33.1/10.7/55.5 Zanzibar City=-22.2/26.0/66.9 Zürich=-34.7/9.3/53.0 Ürümqi=-37.8/7.4/49.4 İzmir=-26.5/17.9/61.3 INFO:brc.util:basic \u0026lt;\u0026lt;\u0026lt; Elapsed time: 0:01:13.951427 INFO:brc.util:basic \u0026lt;\u0026lt;\u0026lt; with 100,000,000 rows =\u0026gt; 1,352,238.944 rows/s. brc complete est 0:12:19.514273 ! 
Note: the values differ because the data was generated at different lengths.\nProfiling this code with cProfile shows the weak points; the result is logged to the screen using the Python builtins. Profiling the code does make it slower, so the longer time below is down to the attached profiler.\n500002507 function calls (500002504 primitive calls) in 161.663 seconds Ordered by: cumulative time List reduced from 108 to 5 due to restriction \u0026lt;5\u0026gt; ncalls tottime percall cumtime percall filename:lineno(function) 1 96.971 96.971 161.663 161.663 /Users/toby.devlin/dev/projects/1brc/src/brc/challenge_basic.py:9(main) 100000000 15.139 0.000 15.139 0.000 {built-in method builtins.max} 100000000 15.119 0.000 15.119 0.000 {method \u0026#39;split\u0026#39; of \u0026#39;bytes\u0026#39; objects} 100000000 15.030 0.000 15.030 0.000 {built-in method builtins.min} 100000013 13.081 0.000 13.081 0.000 {method \u0026#39;get\u0026#39; of \u0026#39;dict\u0026#39; objects} Apart from the body of main itself, the time is spent on the min and max calls for the temp values, then splitting the file lines as if they were CSV rows, and then the dict lookups for the collection objects. Fundamentally we will need to change the processing approach, as these are all Python builtins and are considered as fast as they can get. So let\u0026rsquo;s try spreading these operations to other processes.\n2 - Parallelize The Slow Bits # If we leverage the multiprocessing and threading modules, we can pass data to separate processes so that these min(), max() and split() calls are run by several processes at once. In part 1 the profile results showed the code is still CPU bound, not IO bound; this means threading isn\u0026rsquo;t needed (yet) and we should opt for more processes. Multiprocessing is more heavyweight: launching a new process costs more than launching a thread. However, once the process is up it\u0026rsquo;s relatively fast and not held back by the GIL. 
(It\u0026rsquo;s also my personal preference in Python to start with coroutines, then processes, and to approach multiprocessing from a 1:1 core:process design, then thread these processes if needed.)\nBelow is the code which allows the aggregation of the read results to be parallelized across several processes. It essentially batches lines read from the file (250,000 at a time in the code below) and hands each batch to a worker to complete the aggregation. The largest part of refactoring this code is moving data back and forth between processes.\nimport logging from collections import defaultdict from dataclasses import dataclass from functools import lru_cache from itertools import batched, chain from multiprocessing import Pool from pathlib import Path from brc.util import DATA_DIR, timit_context @dataclass class CollectionStruct(): city: str min_temp: float max_temp: float sum_temps: float count: int def __init__(self, city: str = \u0026#39;\u0026#39;, init_value: float = 0): self.city = city self.min_temp = init_value self.max_temp = init_value self.sum_temps = init_value self.count = 1 def mean(self): return self.sum_temps / self.count def __add__(self, other: \u0026#34;CollectionStruct\u0026#34;): # this is more of a merge function # the city line is to allow for the default dict default factory. 
there\u0026#39;s probably a performance hit here self.city = other.city self.count += other.count self.sum_temps += other.sum_temps self.min_temp = min(self.min_temp, other.min_temp) self.max_temp = max(self.max_temp, other.max_temp) return self def __repr__(self): return f\u0026#34;{self.city}={self.min_temp}/{self.mean():.1f}/{self.max_temp}\u0026#34; @lru_cache(None) def parse_city(city: bytes) -\u0026gt; str: return city.decode(\u0026#34;utf-8\u0026#34;) @lru_cache def parse_temp(temp: bytes) -\u0026gt; float: return float(temp) def do_parse(line: bytes): city, temp = line.split(b\u0026#34;;\u0026#34;) temp = float(temp) city = parse_city(city) return CollectionStruct(city, temp) def process_lines(*lines: bytes): \u0026#34;\u0026#34;\u0026#34; Takes a number of lines from the file and aggregates them to a single results dict \u0026#34;\u0026#34;\u0026#34; totals = defaultdict(CollectionStruct) for line in (do_parse(l) for l in lines): totals[line.city] += line return totals def main(file_path: Path): totals = defaultdict(CollectionStruct) # read in file and batch lines in groups to a process to aggregate with open(file_path, \u0026#39;rb\u0026#39;) as f: # 8 to match the change target machine cores. 
with Pool(8) as pool: # this is slightly faster than pool.map(process_lines, f, n) thanks to serializing more lines to each # process at once (the slow bit is moving bin objects to python res = pool.starmap(process_lines, batched(f, 250_000)) # group all results (approx n_proc x m_distinct_stations elements) for c in chain.from_iterable((r.values() for r in res)): totals[c.city] += c print(\u0026#34; \u0026#34;.join((str(s) for s in sorted(totals.values(), key=lambda c: c.city)))) if __name__ == \u0026#34;__main__\u0026#34;: logging.basicConfig(level=logging.INFO) n = 100_000_000 path = DATA_DIR / f\u0026#34;data_{n:_}.txt\u0026#34; with timit_context(n, \u0026#34;parallel\u0026#34;): main(path) with timit_context(n, \u0026#34;parallel_profile\u0026#34;, profile=True): main(path) Then we look at the results and profile of this run:\nINFO:brc.util:parallel \u0026gt;\u0026gt;\u0026gt; Starting timer Abha=-24.7/18.0/64.2 Abidjan=-21.7/26.0/75.3 Abéché=-18.3/29.4/71.0 Accra=-17.0/26.3/75.2 Addis Ababa=-32.3/15.9/62.3 Adelaide=-24.6/17.3/67.1 Aden=-15.0/29.1/79.9 Ahvaz=-16.6/25.4/69.0 Albuquerque=-30.7/14.0/57.3 Alexandra=-29.9/11.0/59.4 Alexandria=-26.4/20.0/68.2 Algiers=-28.0/18.2/75.1 Alice Springs=-24.9/21.0/64.4 Almaty=-35.0/10.0/59.4 Amsterdam=-36.3/10.2/55.2 Anadyr=-53.0/-6.9/38.5 Anchorage=-41.4/2.8/48.7 Andorra la Vella=-34.0/9.8/52.5 Ankara=-33.9/12.0/57.9 Antananarivo=-27.4/17.8/63.5 Antsiranana=-27.8/25.2/71.0 Arkhangelsk=-46.6/1.3/47.2 Ashgabat=-24.9/17.1/58.7 Asmara=-29.9/15.6/59.1 Assab=-14.3/30.5/77.0 Astana=-41.0/3.5/48.5 Athens=-27.0/19.2/60.9 Atlanta=-28.9/17.0/64.1 Auckland=-34.3/15.2/60.9 Austin=-27.8/20.7/64.5 Baghdad=-19.7/22.7/71.7 Baguio=-32.6/19.5/62.6 Baku=-31.6/15.1/56.7 Baltimore=-33.6/13.1/59.0 Bamako=-17.8/27.8/72.4 Bangkok=-15.4/28.6/77.0 Bangui=-20.0/26.0/68.1 Banjul=-25.3/26.0/69.6 Barcelona=-26.9/18.1/60.9 Bata=-21.1/25.0/72.3 Batumi=-28.4/14.0/58.0 Beijing=-33.8/12.9/57.1 Beirut=-23.1/20.9/71.3 Belgrade=-36.3/12.5/57.4 Belize 
City=-17.6/26.7/72.0 Benghazi=-25.8/19.9/65.7 Bergen=-36.0/7.6/52.9 Berlin=-33.8/10.3/56.0 Bilbao=-30.1/14.7/62.1 Birao=-18.1/26.5/68.8 Bishkek=-35.1/11.2/55.8 Bissau=-18.1/27.0/67.8 Blantyre=-24.9/22.2/67.6 Bloemfontein=-28.6/15.6/58.9 Boise=-31.2/11.3/55.5 Bordeaux=-28.1/14.2/62.1 Bosaso=-13.0/29.9/80.5 Boston=-34.1/10.9/58.4 Bouaké=-20.0/25.9/71.8 Bratislava=-32.7/10.5/56.9 Brazzaville=-22.7/24.9/68.6 Bridgetown=-26.5/26.9/75.3 Brisbane=-24.2/21.4/65.7 Brussels=-36.0/10.5/56.7 Bucharest=-33.6/10.8/57.6 Budapest=-32.0/11.2/54.4 Bujumbura=-20.4/23.8/66.7 Bulawayo=-23.3/18.8/62.5 Burnie=-28.4/13.1/55.7 Busan=-31.2/15.0/59.8 Cabo San Lucas=-18.8/23.9/69.9 Cairns=-21.4/25.0/68.7 Cairo=-26.3/21.3/65.1 Calgary=-43.2/4.4/51.9 Canberra=-31.3/13.1/53.5 Cape Town=-26.2/16.2/61.6 Changsha=-26.7/17.4/61.8 Charlotte=-31.8/16.1/62.2 Chiang Mai=-27.9/25.7/73.8 Chicago=-34.0/9.8/57.3 Chihuahua=-22.9/18.6/66.5 Chittagong=-18.0/25.9/69.4 Chișinău=-33.2/10.2/53.7 Chongqing=-28.9/18.6/64.7 Christchurch=-32.5/12.2/55.1 City of San Marino=-33.0/11.8/54.9 Colombo=-14.7/27.3/70.1 Columbus=-33.4/11.7/58.3 Conakry=-22.4/26.3/74.4 Copenhagen=-39.0/9.1/53.6 Cotonou=-19.5/27.1/73.9 Cracow=-33.1/9.3/56.8 Da Lat=-28.3/17.9/62.7 Da Nang=-17.3/25.7/68.8 Dakar=-19.6/24.0/70.2 Dallas=-32.3/19.0/64.6 Damascus=-31.3/16.9/60.5 Dampier=-16.7/26.3/72.9 Dar es Salaam=-17.6/25.7/69.0 Darwin=-17.3/27.5/75.4 Denpasar=-20.9/23.7/71.2 Denver=-34.0/10.4/52.7 Detroit=-36.5/10.0/54.7 Dhaka=-18.7/25.9/75.9 Dikson=-54.8/-11.1/35.7 Dili=-19.4/26.6/73.0 Djibouti=-15.8/29.9/74.0 Dodoma=-25.4/22.7/71.5 Dolisie=-20.8/24.0/78.6 Douala=-18.2/26.7/70.5 Dubai=-17.4/26.8/71.4 Dublin=-37.7/9.8/55.1 Dunedin=-35.6/11.1/59.0 Durban=-24.3/20.6/65.0 Dushanbe=-29.2/14.7/58.0 Edinburgh=-42.8/9.3/53.5 Edmonton=-37.4/4.2/47.2 El Paso=-27.3/18.0/59.9 Entebbe=-24.0/21.0/67.6 Erbil=-25.5/19.4/67.4 Erzurum=-40.0/5.1/48.5 Fairbanks=-47.3/-2.3/46.9 Fianarantsoa=-26.0/17.9/68.7 Flores, Petén=-17.9/26.3/72.2 Frankfurt=-32.5/10.6/56.5 
Fresno=-28.5/17.8/62.4 Fukuoka=-28.5/17.0/63.1 Gaborone=-28.0/21.0/62.9 Gabès=-25.3/19.4/62.2 Gagnoa=-17.9/25.9/69.4 Gangtok=-28.0/15.2/60.9 Garissa=-14.1/29.3/77.3 Garoua=-17.2/28.3/73.2 George Town=-16.4/27.8/74.8 Ghanzi=-27.5/21.4/64.8 Gjoa Haven=-58.1/-14.4/27.9 Guadalajara=-23.8/20.9/63.1 Guangzhou=-24.8/22.3/69.7 Guatemala City=-21.5/20.3/68.1 Halifax=-35.7/7.5/53.2 Hamburg=-35.4/9.7/54.8 Hamilton=-27.2/13.8/58.1 Hanga Roa=-26.0/20.5/64.3 Hanoi=-21.3/23.6/68.5 Harare=-26.5/18.4/62.1 Harbin=-39.7/5.0/51.1 Hargeisa=-23.7/21.7/64.8 Hat Yai=-21.3/26.9/72.8 Havana=-19.0/25.2/75.2 Helsinki=-36.4/5.9/52.7 Heraklion=-25.4/18.9/61.8 Hiroshima=-27.8/16.3/61.3 Ho Chi Minh City=-18.2/27.3/72.1 Hobart=-33.6/12.7/59.0 Hong Kong=-35.6/23.3/68.1 Honiara=-19.2/26.4/68.9 Honolulu=-18.1/25.4/73.0 Houston=-33.4/20.8/62.3 Ifrane=-37.2/11.4/55.5 Indianapolis=-29.2/11.8/57.1 Iqaluit=-53.8/-9.3/35.9 Irkutsk=-46.5/1.0/44.6 Istanbul=-31.4/13.9/57.1 Jacksonville=-30.8/20.2/64.4 Jakarta=-20.7/26.6/75.4 Jayapura=-16.5/27.0/73.3 Jerusalem=-28.8/18.3/61.3 Johannesburg=-34.8/15.4/65.7 Jos=-28.8/22.8/67.9 Juba=-15.9/27.8/71.0 Kabul=-36.0/12.1/55.8 Kampala=-27.2/19.9/60.9 Kandi=-18.0/27.7/71.1 Kankan=-21.1/26.5/69.7 Kano=-17.0/26.4/77.7 Kansas City=-34.4/12.5/56.0 Karachi=-20.0/26.0/72.6 Karonga=-20.6/24.3/70.9 Kathmandu=-24.6/18.3/61.9 Khartoum=-18.2/29.9/75.5 Kingston=-19.3/27.3/69.8 Kinshasa=-22.8/25.3/72.3 Kolkata=-20.8/26.7/72.2 Kuala Lumpur=-20.9/27.3/74.6 Kumasi=-17.5/25.9/70.2 Kunming=-32.8/15.7/58.4 Kuopio=-39.3/3.4/47.6 Kuwait City=-21.4/25.6/69.9 Kyiv=-39.3/8.4/53.3 Kyoto=-28.0/15.8/60.4 La Ceiba=-23.4/26.2/73.7 La Paz=-21.0/23.6/70.2 Lagos=-19.1/26.8/71.2 Lahore=-18.5/24.3/66.0 Lake Havasu City=-27.3/23.7/69.6 Lake Tekapo=-33.8/8.7/49.9 Las Palmas de Gran Canaria=-22.6/21.2/70.3 Las Vegas=-21.9/20.3/65.4 Launceston=-33.1/13.1/57.9 Lhasa=-39.3/7.6/50.8 Libreville=-28.0/25.9/69.4 Lisbon=-28.0/17.5/60.4 Livingstone=-25.6/21.8/66.4 Ljubljana=-37.7/10.9/54.7 Lodwar=-16.8/29.2/80.2 
Lomé=-18.6/26.9/68.5 London=-40.4/11.3/55.8 Los Angeles=-29.5/18.6/61.0 Louisville=-32.2/13.9/59.1 Luanda=-16.9/25.8/72.2 Lubumbashi=-27.2/20.7/66.9 Lusaka=-24.0/19.8/65.5 Luxembourg City=-36.7/9.3/51.7 Lviv=-34.3/7.8/57.1 Lyon=-37.4/12.5/58.5 Madrid=-29.9/15.0/59.5 Mahajanga=-22.2/26.2/70.4 Makassar=-17.1/26.6/72.3 Makurdi=-22.2/26.0/73.5 Malabo=-18.8/26.2/71.3 Malé=-13.9/28.0/75.5 Managua=-14.7/27.2/73.1 Manama=-18.8/26.4/72.8 Mandalay=-18.7/28.0/74.7 Mango=-15.7/28.1/80.4 Manila=-16.2/28.3/79.0 Maputo=-26.1/22.8/71.1 Marrakesh=-25.2/19.5/62.5 Marseille=-28.0/15.8/63.9 Maun=-23.1/22.4/68.5 Medan=-20.7/26.5/75.0 Mek\u0026#39;ele=-22.9/22.7/64.3 Melbourne=-32.6/15.1/60.9 Memphis=-24.0/17.2/60.4 Mexicali=-20.2/23.1/70.7 Mexico City=-26.7/17.5/61.6 Miami=-18.2/24.9/71.1 Milan=-34.5/12.9/58.2 Milwaukee=-35.3/8.9/55.5 Minneapolis=-40.2/7.8/51.1 Minsk=-37.7/6.7/50.7 Mogadishu=-26.3/27.0/72.3 Mombasa=-19.6/26.3/69.7 Monaco=-29.6/16.3/62.9 Moncton=-40.1/6.1/47.4 Monterrey=-24.0/22.2/72.7 Montreal=-40.8/6.8/53.3 Moscow=-36.8/5.8/51.0 Mumbai=-14.3/27.1/71.7 Murmansk=-44.2/0.6/51.2 Muscat=-17.8/27.9/73.1 Mzuzu=-34.9/17.6/61.8 N\u0026#39;Djamena=-13.8/28.3/72.5 Naha=-21.1/23.0/71.9 Nairobi=-28.7/17.8/62.3 Nakhon Ratchasima=-31.1/27.2/69.4 Napier=-31.7/14.6/58.4 Napoli=-30.2/15.9/64.8 Nashville=-30.3/15.4/58.6 Nassau=-18.1/24.5/66.3 Ndola=-24.0/20.3/70.7 New Delhi=-18.9/25.0/68.7 New Orleans=-23.3/20.7/71.7 New York City=-29.1/12.9/55.5 Ngaoundéré=-21.7/22.0/65.4 Niamey=-14.6/29.2/75.3 Nicosia=-23.9/19.7/63.2 Niigata=-36.6/13.9/61.5 Nouadhibou=-25.4/21.3/66.4 Nouakchott=-16.6/25.6/73.6 Novosibirsk=-44.5/1.7/45.8 Nuuk=-50.4/-1.4/44.2 Odesa=-33.7/10.7/56.9 Odienné=-18.0/26.0/70.1 Oklahoma City=-33.1/15.8/57.5 Omaha=-34.4/10.6/54.4 Oranjestad=-15.9/28.1/74.9 Oslo=-36.5/5.7/53.2 Ottawa=-42.3/6.6/53.2 Ouagadougou=-20.4/28.3/72.6 Ouahigouya=-17.6/28.6/81.3 Ouarzazate=-23.8/18.9/62.6 Oulu=-41.3/2.7/49.8 Palembang=-17.9/27.3/69.5 Palermo=-30.1/18.5/64.7 Palm Springs=-17.5/24.4/71.0 
Palmerston North=-34.5/13.1/57.7 Panama City=-17.0/28.0/75.4 Parakou=-23.1/26.8/70.4 Paris=-30.4/12.3/58.0 Perth=-27.3/18.7/68.1 Petropavlovsk-Kamchatsky=-46.8/1.9/55.8 Philadelphia=-42.3/13.2/60.0 Phnom Penh=-15.1/28.3/75.1 Phoenix=-19.4/23.8/68.4 Pittsburgh=-36.9/10.8/54.9 Podgorica=-27.8/15.3/59.0 Pointe-Noire=-17.5/26.0/71.9 Pontianak=-18.5/27.7/73.4 Port Moresby=-21.2/26.8/71.1 Port Sudan=-16.9/28.4/76.2 Port Vila=-20.3/24.3/78.5 Port-Gentil=-17.9/26.0/72.1 Portland (OR)=-32.4/12.3/58.2 Porto=-28.6/15.7/66.9 Prague=-39.9/8.4/54.1 Praia=-24.9/24.4/67.9 Pretoria=-25.7/18.2/65.2 Pyongyang=-37.2/10.7/58.5 Rabat=-29.0/17.2/68.8 Rangpur=-19.4/24.4/71.2 Reggane=-20.3/28.2/74.2 Reykjavík=-45.8/4.3/48.5 Riga=-40.5/6.2/49.2 Riyadh=-21.0/25.9/74.9 Rome=-28.8/15.2/60.5 Roseau=-27.9/26.2/74.2 Rostov-on-Don=-34.1/9.9/56.6 Sacramento=-29.2/16.2/71.8 Saint Petersburg=-41.4/5.9/55.0 Saint-Pierre=-36.7/5.7/56.6 Salt Lake City=-33.3/11.6/57.8 San Antonio=-22.4/20.8/65.3 San Diego=-29.5/17.8/64.8 San Francisco=-32.1/14.6/57.1 San Jose=-28.6/16.4/59.2 San José=-20.3/22.5/72.2 San Juan=-20.3/27.2/74.4 San Salvador=-24.7/23.0/75.7 Sana\u0026#39;a=-24.3/20.0/64.5 Santo Domingo=-18.1/25.9/72.0 Sapporo=-37.9/8.9/54.9 Sarajevo=-35.5/10.1/52.3 Saskatoon=-39.3/3.3/49.4 Seattle=-35.7/11.3/56.7 Seoul=-30.8/12.5/60.5 Seville=-25.6/19.2/62.9 Shanghai=-28.9/16.7/63.3 Singapore=-24.4/26.9/75.9 Skopje=-32.6/12.4/58.4 Sochi=-31.0/14.2/58.3 Sofia=-36.0/10.6/55.2 Sokoto=-14.3/28.0/77.0 Split=-29.6/16.1/59.9 St. John\u0026#39;s=-39.7/5.0/52.4 St. 
Louis=-33.0/13.9/60.5 Stockholm=-37.2/6.6/50.5 Surabaya=-19.1/27.0/73.8 Suva=-17.0/25.6/73.7 Suwałki=-36.5/7.2/49.3 Sydney=-27.2/17.7/61.4 Ségou=-18.4/28.0/71.0 Tabora=-23.3/22.9/68.3 Tabriz=-32.4/12.6/55.6 Taipei=-26.8/23.0/70.1 Tallinn=-39.9/6.4/57.4 Tamale=-18.1/27.8/77.0 Tamanrasset=-22.2/21.7/66.9 Tampa=-21.9/22.8/69.5 Tashkent=-40.1/14.8/60.7 Tauranga=-28.8/14.8/57.8 Tbilisi=-37.6/12.9/60.5 Tegucigalpa=-19.9/21.7/69.5 Tehran=-30.4/17.0/62.9 Tel Aviv=-25.1/20.0/64.7 Thessaloniki=-29.2/16.0/58.7 Thiès=-23.2/23.9/68.1 Tijuana=-36.8/17.8/60.9 Timbuktu=-17.6/28.0/79.4 Tirana=-30.9/15.2/59.6 Toamasina=-19.1/23.4/71.5 Tokyo=-33.1/15.4/61.8 Toliara=-17.9/24.1/70.8 Toluca=-30.2/12.4/54.3 Toronto=-37.3/9.4/57.0 Tripoli=-24.3/20.0/67.7 Tromsø=-45.8/2.9/49.3 Tucson=-26.5/20.9/63.6 Tunis=-27.3/18.4/69.2 Ulaanbaatar=-43.0/-0.4/51.9 Upington=-28.7/20.4/65.1 Vaduz=-32.3/10.1/58.5 Valencia=-25.2/18.3/61.6 Valletta=-24.7/18.8/60.1 Vancouver=-34.8/10.4/58.6 Veracruz=-19.0/25.3/74.8 Vienna=-31.7/10.4/51.9 Vientiane=-16.8/25.8/74.6 Villahermosa=-15.2/27.1/73.7 Vilnius=-35.7/6.0/54.1 Virginia Beach=-28.7/15.8/61.2 Vladivostok=-38.6/4.9/50.3 Warsaw=-36.9/8.5/56.9 Washington, D.C.=-36.2/14.6/61.2 Wau=-18.9/27.8/72.2 Wellington=-35.2/12.9/61.3 Whitehorse=-45.5/-0.1/44.4 Wichita=-31.0/13.8/67.7 Willemstad=-15.2/28.0/73.8 Winnipeg=-40.7/3.0/52.7 Wrocław=-33.6/9.6/57.4 Xi\u0026#39;an=-27.7/14.1/57.1 Yakutsk=-56.0/-8.8/39.7 Yangon=-20.2/27.5/76.9 Yaoundé=-19.8/23.8/68.6 Yellowknife=-46.6/-4.3/41.0 Yerevan=-36.8/12.4/54.3 Yinchuan=-36.8/9.0/53.0 Zagreb=-33.1/10.7/55.5 Zanzibar City=-22.2/25.9/66.9 Zürich=-34.7/9.3/53.0 Ürümqi=-37.8/7.4/49.4 İzmir=-26.5/17.9/61.3 INFO:brc.util:parallel \u0026lt;\u0026lt;\u0026lt; Elapsed time: 0:00:27.071418 INFO:brc.util:parallel \u0026lt;\u0026lt;\u0026lt; with 100,000,000 rows =\u0026gt; 3,693,932.807 rows/s. 
brc complete est 0:04:30.714183 Around 30 seconds isn\u0026rsquo;t bad for 100mm records in Python. It\u0026rsquo;s not amazing, but it more than halved the time of the non-multiprocess solution (74 seconds down to 27). It is, however, less of an improvement than I would have liked: only ~2.7x the rows processed per second from 8 worker processes is well short of what that many workers should deliver. It also still leaves us with a long run time for the full billion.\n553270 function calls (553255 primitive calls) in 29.698 seconds Ordered by: cumulative time List reduced from 280 to 5 due to restriction \u0026lt;5\u0026gt; ncalls tottime percall cumtime percall filename:lineno(function) 249 0.001 0.000 60.440 0.243 /Users/toby.devlin/.pyenv/versions/3.12.0/lib/python3.12/multiprocessing/pool.py:500(_wait_for_updates) 39 0.029 0.001 35.980 0.923 /Users/toby.devlin/.pyenv/versions/3.12.0/lib/python3.12/multiprocessing/connection.py:201(send) 44 0.000 0.000 31.395 0.714 /Users/toby.devlin/.pyenv/versions/3.12.0/lib/python3.12/multiprocessing/connection.py:389(_send_bytes) 75 0.000 0.000 31.394 0.419 /Users/toby.devlin/.pyenv/versions/3.12.0/lib/python3.12/multiprocessing/connection.py:364(_send) 107 2.057 0.019 31.394 0.293 {built-in method posix.write} Looking at the profile, we can see most of the time has migrated to waiting for updates and sending data to and from each child process. This approach is taking a hacksaw to the problem and brute-forcing the code to run in more places. It\u0026rsquo;s not a bad approach, but now we would have to optimise the heebie-jeebies out of Python multiprocessing, which would be rather technical. One problem with this approach is that we haven\u0026rsquo;t fine-tuned the single process first, so we are just running an inefficient process in more places.\nAnother approach within the multiprocessing module would be to try some of Python\u0026rsquo;s shared state tools, such as Value, Array or, probably best in this case, a Manager. 
(We could also look into using a Queue, but this isn\u0026rsquo;t really the right kind of problem for one.) In the end, this would likely still have the same problem as before: Python objects are expensive to send and receive across processes.\n3 - Change the Data Pipeline # As part 2 showed, we\u0026rsquo;re still wasting a lot of time passing data back and forth across Python processes, and we hadn\u0026rsquo;t really thought about the underlying problem. This can be improved by partitioning the problem: each process reads its own part of the file at the same time, does the processing for its chunk, then returns the result. This lets us focus on optimising a single partition flow and then distribute that to multiple cores. It also reduces the data we send back and forth between the processes.\nimport logging import os from collections import defaultdict from functools import reduce from itertools import chain, pairwise from multiprocessing import Pool from pathlib import Path from brc.util import DATA_DIR, timit_context def do_parse(line: bytes): # this is basically O(n) every time. # todo: no improvement possible for parsing in python? city, temp = line.split(b\u0026#34;;\u0026#34;, maxsplit=1) temp = float(temp) return city, temp result_tuple = tuple[float, float, float, int] def collect_part(file_path: Path, start: int, end: int) -\u0026gt; dict[bytes, result_tuple]: data = defaultdict(list) part_data = read_file_part(file_path, start, end) for part in part_data: city, temp = do_parse(part) # todo: .append() is still slow, change this list approach somehow? data[city].append(temp) return { city: (min(items), max(items), sum(items), len(items)) for city, items in data.items() } def merge_result(one: dict[bytes, result_tuple], two: dict[bytes, result_tuple]): # todo: can this be optimised further? 
for k, two_v in two.items(): # if the key is in both dicts, do the partial merge if k in one: one_v = one[k] one[k] = ( min(one_v[0], two_v[0]), max(one_v[1], two_v[1]), one_v[2] + two_v[2], one_v[3] + two_v[3] ) else: # if the key is only in two, copy its value across one[k] = two_v return one def read_file_part(file_path: Path, start: int, end: int) -\u0026gt; list[bytes]: \u0026#34;\u0026#34;\u0026#34; Reads in the file\u0026#39;s bytes from a start position to the end position \u0026#34;\u0026#34;\u0026#34; with open(file_path, \u0026#39;rb\u0026#39;) as f: f.seek(start) # we provide the offset from our data return f.readlines(end - start) def find_next_newline(file_path: Path, start: int) -\u0026gt; int: \u0026#34;\u0026#34;\u0026#34; Finds the next newline in the file after the position given. \u0026#34;\u0026#34;\u0026#34; # special case if we\u0026#39;re at the start of the file if start == 0: return 0 offset = start + 1 with open(file_path, \u0026#39;rb\u0026#39;) as f: # move to start place f.seek(start) # read byte by byte until a newline is found while f.read(1) != b\u0026#39;\\n\u0026#39;: offset += 1 return offset def main(file_path: Path): file_size = os.path.getsize(file_path) n_splits = 5000 split_size = file_size // n_splits partitions = (find_next_newline(file_path, n * split_size) for n in range(n_splits)) # extend partitions with the end of the file. 
partitions = chain(partitions, (file_size,)) # single-process version for profiling # results = [] # for i, j in pairwise(partitions): # results.append(collect_part(file_path, i, j - 1)) # multiprocessing version distributing the loop/generator with Pool(8) as pool: partitions_ = ((file_path, i, j - 1) for i, j in pairwise(partitions)) results = pool.starmap(collect_part, partitions_, 10) parts = reduce(merge_result, results) sorted_items = sorted(parts.items()) print(\u0026#34; \u0026#34;.join(f\u0026#34;{k.decode(\u0026#34;utf-8\u0026#34;)}={p[0]}/{p[2] / p[3]:.1f}/{p[1]}\u0026#34; for k, p in sorted_items)) if __name__ == \u0026#34;__main__\u0026#34;: logging.basicConfig(level=logging.INFO) n = 1_000_000_000 path = DATA_DIR / f\u0026#34;data_{n:_}.txt\u0026#34; with timit_context(n, f\u0026#34;parallel\u0026#34;): main(path) The single-process version produces the following result for 100mm rows:\nINFO:brc.util:parallel2 \u0026gt;\u0026gt;\u0026gt; Starting timer Abha=-24.7/18.0/64.2 Abidjan=-21.7/26.0/75.3 Abéché=-18.3/29.4/71.0 Accra=-17.0/26.4/75.2 Addis Ababa=-32.3/16.0/62.3 Adelaide=-24.6/17.3/67.1 Aden=-15.0/29.1/79.9 Ahvaz=-16.6/25.4/69.0 Albuquerque=-30.7/14.0/57.3 Alexandra=-29.9/11.0/59.4 Alexandria=-26.4/20.0/68.2 Algiers=-28.0/18.2/75.1 Alice Springs=-24.9/21.0/64.4 Almaty=-35.0/10.0/59.4 Amsterdam=-36.3/10.2/55.2 Anadyr=-53.0/-6.9/38.5 Anchorage=-41.4/2.8/48.7 Andorra la Vella=-34.0/9.8/52.5 Ankara=-33.9/12.0/57.9 Antananarivo=-27.4/17.9/63.5 Antsiranana=-27.8/25.2/71.0 Arkhangelsk=-46.6/1.3/47.2 Ashgabat=-24.9/17.1/58.7 Asmara=-29.9/15.6/59.1 Assab=-14.3/30.5/77.0 Astana=-41.0/3.5/48.5 Athens=-27.0/19.2/60.9 Atlanta=-28.9/17.0/64.1 Auckland=-34.3/15.2/60.9 Austin=-27.8/20.7/64.5 Baghdad=-19.7/22.8/71.7 Baguio=-32.6/19.5/62.6 Baku=-31.6/15.1/56.7 Baltimore=-33.6/13.1/59.0 Bamako=-17.8/27.8/72.4 Bangkok=-15.4/28.6/77.0 Bangui=-20.0/26.0/68.1 Banjul=-25.3/26.0/69.6 Barcelona=-26.9/18.2/60.9 Bata=-21.1/25.1/72.3 Batumi=-28.4/14.0/58.0 
Beijing=-33.8/12.9/57.1 Beirut=-23.1/20.9/71.3 Belgrade=-36.3/12.5/57.4 Belize City=-17.6/26.7/72.0 Benghazi=-25.8/19.9/65.7 Bergen=-36.0/7.7/52.9 Berlin=-33.8/10.3/56.0 Bilbao=-30.1/14.7/62.1 Birao=-18.1/26.5/68.8 Bishkek=-35.1/11.2/55.8 Bissau=-18.1/27.0/67.8 Blantyre=-24.9/22.2/67.6 Bloemfontein=-28.6/15.6/58.9 Boise=-31.2/11.4/55.5 Bordeaux=-28.1/14.2/62.1 Bosaso=-13.0/30.0/80.5 Boston=-34.1/10.9/58.4 Bouaké=-20.0/26.0/71.8 Bratislava=-32.7/10.5/56.9 Brazzaville=-22.7/25.0/68.6 Bridgetown=-26.5/27.0/75.3 Brisbane=-24.2/21.4/65.7 Brussels=-36.0/10.5/56.7 Bucharest=-33.6/10.8/57.6 Budapest=-32.0/11.3/54.4 Bujumbura=-20.4/23.8/66.7 Bulawayo=-23.3/18.9/62.5 Burnie=-28.4/13.1/55.7 Busan=-31.2/15.0/59.8 Cabo San Lucas=-18.8/23.9/69.9 Cairns=-21.4/25.0/68.7 Cairo=-26.3/21.4/65.1 Calgary=-43.2/4.4/51.9 Canberra=-31.3/13.1/53.5 Cape Town=-26.2/16.2/61.6 Changsha=-26.7/17.4/61.8 Charlotte=-31.8/16.1/62.2 Chiang Mai=-27.9/25.8/73.8 Chicago=-34.0/9.8/57.3 Chihuahua=-22.9/18.6/66.5 Chittagong=-18.0/25.9/69.4 Chișinău=-33.2/10.2/53.7 Chongqing=-28.9/18.6/64.7 Christchurch=-32.5/12.2/55.1 City of San Marino=-33.0/11.8/54.9 Colombo=-14.7/27.4/70.1 Columbus=-33.4/11.7/58.3 Conakry=-22.4/26.4/74.4 Copenhagen=-39.0/9.1/53.6 Cotonou=-19.5/27.2/73.9 Cracow=-33.1/9.3/56.8 Da Lat=-28.3/17.9/62.7 Da Nang=-17.3/25.8/68.8 Dakar=-19.6/24.0/70.2 Dallas=-32.3/19.0/64.6 Damascus=-31.3/17.0/60.5 Dampier=-16.7/26.4/72.9 Dar es Salaam=-17.6/25.8/69.0 Darwin=-17.3/27.6/75.4 Denpasar=-20.9/23.7/71.2 Denver=-34.0/10.4/52.7 Detroit=-36.5/10.0/54.7 Dhaka=-18.7/25.9/75.9 Dikson=-54.8/-11.1/35.7 Dili=-19.4/26.6/73.0 Djibouti=-15.8/30.0/74.0 Dodoma=-25.4/22.7/71.5 Dolisie=-20.8/24.0/78.6 Douala=-18.2/26.7/70.5 Dubai=-17.4/26.9/71.4 Dublin=-37.7/9.8/55.1 Dunedin=-35.6/11.1/59.0 Durban=-24.3/20.6/65.0 Dushanbe=-29.2/14.7/58.0 Edinburgh=-42.8/9.3/53.5 Edmonton=-37.4/4.2/47.2 El Paso=-27.3/18.1/59.9 Entebbe=-24.0/21.0/67.6 Erbil=-25.5/19.5/67.4 Erzurum=-40.0/5.1/48.5 Fairbanks=-47.3/-2.3/46.9 
Fianarantsoa=-26.0/17.9/68.7 Flores, Petén=-17.9/26.4/72.2 Frankfurt=-32.5/10.6/56.5 Fresno=-28.5/17.9/62.4 Fukuoka=-28.5/17.0/63.1 Gaborone=-28.0/21.0/62.9 Gabès=-25.3/19.5/62.2 Gagnoa=-17.9/26.0/69.4 Gangtok=-28.0/15.2/60.9 Garissa=-14.1/29.3/77.3 Garoua=-17.2/28.3/73.2 George Town=-16.4/27.9/74.8 Ghanzi=-27.5/21.4/64.8 Gjoa Haven=-58.1/-14.4/27.9 Guadalajara=-23.8/20.9/63.1 Guangzhou=-24.8/22.4/69.7 Guatemala City=-21.5/20.4/68.1 Halifax=-35.7/7.5/53.2 Hamburg=-35.4/9.7/54.8 Hamilton=-27.2/13.8/58.1 Hanga Roa=-26.0/20.5/64.3 Hanoi=-21.3/23.6/68.5 Harare=-26.5/18.4/62.1 Harbin=-39.7/5.0/51.1 Hargeisa=-23.7/21.7/64.8 Hat Yai=-21.3/27.0/72.8 Havana=-19.0/25.2/75.2 Helsinki=-36.4/5.9/52.7 Heraklion=-25.4/18.9/61.8 Hiroshima=-27.8/16.3/61.3 Ho Chi Minh City=-18.2/27.4/72.1 Hobart=-33.6/12.7/59.0 Hong Kong=-35.6/23.3/68.1 Honiara=-19.2/26.5/68.9 Honolulu=-18.1/25.4/73.0 Houston=-33.4/20.8/62.3 Ifrane=-37.2/11.4/55.5 Indianapolis=-29.2/11.8/57.1 Iqaluit=-53.8/-9.3/35.9 Irkutsk=-46.5/1.0/44.6 Istanbul=-31.4/13.9/57.1 Jacksonville=-30.8/20.3/64.4 Jakarta=-20.7/26.7/75.4 Jayapura=-16.5/27.0/73.3 Jerusalem=-28.8/18.3/61.3 Johannesburg=-34.8/15.5/65.7 Jos=-28.8/22.8/67.9 Juba=-15.9/27.8/71.0 Kabul=-36.0/12.1/55.8 Kampala=-27.2/20.0/60.9 Kandi=-18.0/27.7/71.1 Kankan=-21.1/26.5/69.7 Kano=-17.0/26.4/77.7 Kansas City=-34.4/12.5/56.0 Karachi=-20.0/26.1/72.6 Karonga=-20.6/24.4/70.9 Kathmandu=-24.6/18.3/61.9 Khartoum=-18.2/29.9/75.5 Kingston=-19.3/27.4/69.8 Kinshasa=-22.8/25.3/72.3 Kolkata=-20.8/26.7/72.2 Kuala Lumpur=-20.9/27.3/74.6 Kumasi=-17.5/26.0/70.2 Kunming=-32.8/15.7/58.4 Kuopio=-39.3/3.4/47.6 Kuwait City=-21.4/25.7/69.9 Kyiv=-39.3/8.4/53.3 Kyoto=-28.0/15.8/60.4 La Ceiba=-23.4/26.2/73.7 La Paz=-21.0/23.7/70.2 Lagos=-19.1/26.8/71.2 Lahore=-18.5/24.3/66.0 Lake Havasu City=-27.3/23.7/69.6 Lake Tekapo=-33.8/8.7/49.9 Las Palmas de Gran Canaria=-22.6/21.2/70.3 Las Vegas=-21.9/20.3/65.4 Launceston=-33.1/13.1/57.9 Lhasa=-39.3/7.6/50.8 Libreville=-28.0/25.9/69.4 
Lisbon=-28.0/17.5/60.4 Livingstone=-25.6/21.8/66.4 Ljubljana=-37.7/10.9/54.7 Lodwar=-16.8/29.3/80.2 Lomé=-18.6/26.9/68.5 London=-40.4/11.3/55.8 Los Angeles=-29.5/18.6/61.0 Louisville=-32.2/13.9/59.1 Luanda=-16.9/25.8/72.2 Lubumbashi=-27.2/20.8/66.9 Lusaka=-24.0/19.9/65.5 Luxembourg City=-36.7/9.3/51.7 Lviv=-34.3/7.8/57.1 Lyon=-37.4/12.5/58.5 Madrid=-29.9/15.0/59.5 Mahajanga=-22.2/26.3/70.4 Makassar=-17.1/26.7/72.3 Makurdi=-22.2/26.0/73.5 Malabo=-18.8/26.3/71.3 Malé=-13.9/28.0/75.5 Managua=-14.7/27.3/73.1 Manama=-18.8/26.5/72.8 Mandalay=-18.7/28.0/74.7 Mango=-15.7/28.1/80.4 Manila=-16.2/28.4/79.0 Maputo=-26.1/22.8/71.1 Marrakesh=-25.2/19.6/62.5 Marseille=-28.0/15.8/63.9 Maun=-23.1/22.4/68.5 Medan=-20.7/26.5/75.0 Mek\u0026#39;ele=-22.9/22.7/64.3 Melbourne=-32.6/15.1/60.9 Memphis=-24.0/17.2/60.4 Mexicali=-20.2/23.1/70.7 Mexico City=-26.7/17.5/61.6 Miami=-18.2/24.9/71.1 Milan=-34.5/12.9/58.2 Milwaukee=-35.3/8.9/55.5 Minneapolis=-40.2/7.8/51.1 Minsk=-37.7/6.7/50.7 Mogadishu=-26.3/27.1/72.3 Mombasa=-19.6/26.3/69.7 Monaco=-29.6/16.4/62.9 Moncton=-40.1/6.1/47.4 Monterrey=-24.0/22.3/72.7 Montreal=-40.8/6.8/53.3 Moscow=-36.8/5.8/51.0 Mumbai=-14.3/27.1/71.7 Murmansk=-44.2/0.6/51.2 Muscat=-17.8/28.0/73.1 Mzuzu=-34.9/17.7/61.8 N\u0026#39;Djamena=-13.8/28.3/72.5 Naha=-21.1/23.1/71.9 Nairobi=-28.7/17.8/62.3 Nakhon Ratchasima=-31.1/27.3/69.4 Napier=-31.7/14.6/58.4 Napoli=-30.2/15.9/64.8 Nashville=-30.3/15.4/58.6 Nassau=-18.1/24.6/66.3 Ndola=-24.0/20.3/70.7 New Delhi=-18.9/25.0/68.7 New Orleans=-23.3/20.7/71.7 New York City=-29.1/12.9/55.5 Ngaoundéré=-21.7/22.0/65.4 Niamey=-14.6/29.3/75.3 Nicosia=-23.9/19.7/63.2 Niigata=-36.6/13.9/61.5 Nouadhibou=-25.4/21.3/66.4 Nouakchott=-16.6/25.7/73.6 Novosibirsk=-44.5/1.7/45.8 Nuuk=-50.4/-1.4/44.2 Odesa=-33.7/10.7/56.9 Odienné=-18.0/26.0/70.1 Oklahoma City=-33.1/15.9/57.5 Omaha=-34.4/10.6/54.4 Oranjestad=-15.9/28.1/74.9 Oslo=-36.5/5.7/53.2 Ottawa=-42.3/6.6/53.2 Ouagadougou=-20.4/28.3/72.6 Ouahigouya=-17.6/28.6/81.3 Ouarzazate=-23.8/18.9/62.6 
Oulu=-41.3/2.7/49.8 Palembang=-17.9/27.4/69.5 Palermo=-30.1/18.5/64.7 Palm Springs=-17.5/24.5/71.0 Palmerston North=-34.5/13.2/57.7 Panama City=-17.0/28.0/75.4 Parakou=-23.1/26.8/70.4 Paris=-30.4/12.3/58.0 Perth=-27.3/18.7/68.1 Petropavlovsk-Kamchatsky=-46.8/1.9/55.8 Philadelphia=-42.3/13.2/60.0 Phnom Penh=-15.1/28.3/75.1 Phoenix=-19.4/23.9/68.4 Pittsburgh=-36.9/10.8/54.9 Podgorica=-27.8/15.3/59.0 Pointe-Noire=-17.5/26.1/71.9 Pontianak=-18.5/27.7/73.4 Port Moresby=-21.2/26.9/71.1 Port Sudan=-16.9/28.4/76.2 Port Vila=-20.3/24.3/78.5 Port-Gentil=-17.9/26.0/72.1 Portland (OR)=-32.4/12.4/58.2 Porto=-28.6/15.7/66.9 Prague=-39.9/8.4/54.1 Praia=-24.9/24.5/67.9 Pretoria=-25.7/18.2/65.2 Pyongyang=-37.2/10.8/58.5 Rabat=-29.0/17.2/68.8 Rangpur=-19.4/24.4/71.2 Reggane=-20.3/28.3/74.2 Reykjavík=-45.8/4.3/48.5 Riga=-40.5/6.2/49.2 Riyadh=-21.0/26.0/74.9 Rome=-28.8/15.2/60.5 Roseau=-27.9/26.2/74.2 Rostov-on-Don=-34.1/9.9/56.6 Sacramento=-29.2/16.3/71.8 Saint Petersburg=-41.4/5.9/55.0 Saint-Pierre=-36.7/5.7/56.6 Salt Lake City=-33.3/11.6/57.8 San Antonio=-22.4/20.8/65.3 San Diego=-29.5/17.8/64.8 San Francisco=-32.1/14.6/57.1 San Jose=-28.6/16.4/59.2 San José=-20.3/22.6/72.2 San Juan=-20.3/27.2/74.4 San Salvador=-24.7/23.1/75.7 Sana\u0026#39;a=-24.3/20.0/64.5 Santo Domingo=-18.1/25.9/72.0 Sapporo=-37.9/8.9/54.9 Sarajevo=-35.5/10.1/52.3 Saskatoon=-39.3/3.3/49.4 Seattle=-35.7/11.3/56.7 Seoul=-30.8/12.5/60.5 Seville=-25.6/19.2/62.9 Shanghai=-28.9/16.7/63.3 Singapore=-24.4/27.0/75.9 Skopje=-32.6/12.4/58.4 Sochi=-31.0/14.2/58.3 Sofia=-36.0/10.6/55.2 Sokoto=-14.3/28.0/77.0 Split=-29.6/16.1/59.9 St. John\u0026#39;s=-39.7/5.0/52.4 St. 
Louis=-33.0/13.9/60.5 Stockholm=-37.2/6.6/50.5 Surabaya=-19.1/27.1/73.8 Suva=-17.0/25.6/73.7 Suwałki=-36.5/7.2/49.3 Sydney=-27.2/17.7/61.4 Ségou=-18.4/28.0/71.0 Tabora=-23.3/23.0/68.3 Tabriz=-32.4/12.6/55.6 Taipei=-26.8/23.0/70.1 Tallinn=-39.9/6.4/57.4 Tamale=-18.1/27.9/77.0 Tamanrasset=-22.2/21.7/66.9 Tampa=-21.9/22.9/69.5 Tashkent=-40.1/14.8/60.7 Tauranga=-28.8/14.8/57.8 Tbilisi=-37.6/12.9/60.5 Tegucigalpa=-19.9/21.7/69.5 Tehran=-30.4/17.0/62.9 Tel Aviv=-25.1/20.0/64.7 Thessaloniki=-29.2/16.0/58.7 Thiès=-23.2/24.0/68.1 Tijuana=-36.8/17.8/60.9 Timbuktu=-17.6/28.0/79.4 Tirana=-30.9/15.2/59.6 Toamasina=-19.1/23.4/71.5 Tokyo=-33.1/15.4/61.8 Toliara=-17.9/24.1/70.8 Toluca=-30.2/12.4/54.3 Toronto=-37.3/9.4/57.0 Tripoli=-24.3/20.0/67.7 Tromsø=-45.8/2.9/49.3 Tucson=-26.5/20.9/63.6 Tunis=-27.3/18.4/69.2 Ulaanbaatar=-43.0/-0.4/51.9 Upington=-28.7/20.4/65.1 Vaduz=-32.3/10.1/58.5 Valencia=-25.2/18.3/61.6 Valletta=-24.7/18.8/60.1 Vancouver=-34.8/10.4/58.6 Veracruz=-19.0/25.4/74.8 Vienna=-31.7/10.4/51.9 Vientiane=-16.8/25.9/74.6 Villahermosa=-15.2/27.1/73.7 Vilnius=-35.7/6.0/54.1 Virginia Beach=-28.7/15.8/61.2 Vladivostok=-38.6/4.9/50.3 Warsaw=-36.9/8.5/56.9 Washington, D.C.=-36.2/14.6/61.2 Wau=-18.9/27.8/72.2 Wellington=-35.2/12.9/61.3 Whitehorse=-45.5/-0.1/44.4 Wichita=-31.0/13.9/67.7 Willemstad=-15.2/28.0/73.8 Winnipeg=-40.7/3.0/52.7 Wrocław=-33.6/9.6/57.4 Xi\u0026#39;an=-27.7/14.1/57.1 Yakutsk=-56.0/-8.8/39.7 Yangon=-20.2/27.5/76.9 Yaoundé=-19.8/23.8/68.6 Yellowknife=-46.6/-4.3/41.0 Yerevan=-36.8/12.4/54.3 Yinchuan=-36.8/9.0/53.0 Zagreb=-33.1/10.7/55.5 Zanzibar City=-22.2/26.0/66.9 Zürich=-34.7/9.3/53.0 Ürümqi=-37.8/7.4/49.4 İzmir=-26.5/17.9/61.3 INFO:brc.util:parallel2 \u0026lt;\u0026lt;\u0026lt; Elapsed time: 0:00:37.261046 INFO:brc.util:parallel2 \u0026lt;\u0026lt;\u0026lt; with 100,000,000 rows =\u0026gt; 2,683,767.901 rows/s. 
brc complete est 0:06:12.610463 An estimate of just over 6 minutes for a billion rows is decent for Python and beats our first attempt for 100mm by half. Also, profiling a single-threaded version is much easier to understand:\n302615043 function calls (302615041 primitive calls) in 87.055 seconds Ordered by: cumulative time List reduced from 51 to 5 due to restriction \u0026lt;5\u0026gt; ncalls tottime percall cumtime percall filename:lineno(function) 1 1.793 1.793 87.055 87.055 /Users/toby.devlin/dev/projects/1brc/src/brc/challenge_parallel_2.py:91(main) 999 27.433 0.027 84.822 0.085 /Users/toby.devlin/dev/projects/1brc/src/brc/challenge_parallel_2.py:33(collect_part) 99899999 24.782 0.000 39.515 0.000 /Users/toby.devlin/dev/projects/1brc/src/brc/challenge_parallel_2.py:21(do_parse) 99899999 14.733 0.000 14.733 0.000 {method \u0026#39;split\u0026#39; of \u0026#39;bytes\u0026#39; objects} 99900998 9.070 0.000 9.070 0.000 {method \u0026#39;append\u0026#39; of \u0026#39;list\u0026#39; objects} This shows the slow bits are still down to Python builtin calls (split and append), whose cost comes from the data structures we use. Changing the processing would be the way to reduce this time; however, these structures are well-designed for the problem, so some out-of-the-box thinking with lower-level stuff will be needed for these numbers to improve.
I find that probably out of scope for \u0026ldquo;pure Python\u0026rdquo;.\nThe multiprocessing result is, surprisingly, much faster than expected, even for 1 billion records!\nINFO:brc.util:parallel \u0026gt;\u0026gt;\u0026gt; Starting timer profile=False Abha=-31.4/18.0/67.0 Abidjan=-26.3/26.0/75.2 Abéché=-23.3/29.4/79.7 Accra=-21.8/26.4/74.9 Addis Ababa=-31.6/16.0/63.5 Adelaide=-42.0/17.3/71.0 Aden=-18.2/29.1/80.3 Ahvaz=-23.9/25.4/75.4 Albuquerque=-34.7/14.0/64.6 Alexandra=-40.1/11.0/60.8 Alexandria=-26.7/20.0/75.5 Algiers=-32.1/18.2/69.0 Alice Springs=-27.2/21.0/69.9 Almaty=-40.9/10.0/61.2 Amsterdam=-38.4/10.2/59.4 Anadyr=-59.4/-6.9/44.8 Anchorage=-51.0/2.8/54.2 Andorra la Vella=-40.6/9.8/59.3 Ankara=-36.9/12.0/63.1 Antananarivo=-36.2/17.9/67.9 Antsiranana=-24.8/25.2/78.0 Arkhangelsk=-49.5/1.3/52.2 Ashgabat=-35.1/17.1/68.6 Asmara=-32.8/15.6/64.8 Assab=-20.6/30.5/80.5 Astana=-46.6/3.5/50.9 Athens=-31.1/19.2/68.3 Atlanta=-37.0/17.0/66.6 Auckland=-34.0/15.2/67.0 Austin=-32.6/20.7/70.0 Baghdad=-30.5/22.8/71.3 Baguio=-29.4/19.5/67.9 Baku=-32.9/15.1/63.9 Baltimore=-34.3/13.1/62.9 Bamako=-20.0/27.8/80.2 Bangkok=-20.5/28.6/76.5 Bangui=-24.1/26.0/75.0 Banjul=-23.8/26.0/72.0 Barcelona=-35.3/18.2/70.4 Bata=-22.7/25.1/74.2 Batumi=-33.6/14.0/64.4 Beijing=-36.1/12.9/66.3 Beirut=-29.6/20.9/70.0 Belgrade=-40.9/12.5/60.0 Belize City=-21.6/26.7/77.2 Benghazi=-29.3/19.9/71.9 Bergen=-40.8/7.7/59.0 Berlin=-38.7/10.3/60.8 Bilbao=-37.2/14.7/63.8 Birao=-23.8/26.5/78.9 Bishkek=-39.8/11.3/61.8 Bissau=-27.2/27.0/77.9 Blantyre=-26.4/22.2/75.1 Bloemfontein=-34.2/15.6/64.4 Boise=-36.0/11.4/65.7 Bordeaux=-44.7/14.2/62.3 Bosaso=-17.7/30.0/77.3 Boston=-36.8/10.9/57.7 Bouaké=-24.3/26.0/74.5 Bratislava=-38.4/10.5/62.8 Brazzaville=-24.1/25.0/73.5 Bridgetown=-20.0/27.0/78.7 Brisbane=-31.2/21.4/72.6 Brussels=-38.9/10.5/59.9 Bucharest=-35.3/10.8/64.3 Budapest=-36.5/11.3/62.1 Bujumbura=-32.7/23.8/70.4 Bulawayo=-29.8/18.9/66.8 Burnie=-38.5/13.1/68.0 Busan=-35.5/15.0/61.5 Cabo San Lucas=-24.8/23.9/77.8 
Cairns=-29.9/25.0/76.7 Cairo=-28.0/21.4/71.8 Calgary=-46.8/4.4/54.4 Canberra=-35.6/13.1/60.7 Cape Town=-38.6/16.2/62.9 Changsha=-29.1/17.4/66.0 Charlotte=-32.5/16.1/63.1 Chiang Mai=-25.6/25.8/76.4 Chicago=-39.0/9.8/61.1 Chihuahua=-32.0/18.6/70.4 Chittagong=-30.0/25.9/76.8 Chișinău=-40.5/10.2/58.1 Chongqing=-34.4/18.6/67.3 Christchurch=-39.2/12.2/61.2 City of San Marino=-38.8/11.8/61.4 Colombo=-19.9/27.4/81.3 Columbus=-34.9/11.7/61.2 Conakry=-24.4/26.4/75.2 Copenhagen=-39.6/9.1/57.2 Cotonou=-23.4/27.2/78.0 Cracow=-40.2/9.3/62.0 Da Lat=-31.8/17.9/68.7 Da Nang=-24.0/25.8/74.3 Dakar=-24.0/24.0/72.7 Dallas=-31.0/19.0/66.7 Damascus=-33.4/17.0/69.7 Dampier=-23.8/26.4/77.3 Dar es Salaam=-24.8/25.8/79.6 Darwin=-22.0/27.6/81.9 Denpasar=-23.8/23.7/73.8 Denver=-38.1/10.4/62.3 Detroit=-36.2/10.0/58.3 Dhaka=-23.8/25.9/77.1 Dikson=-56.5/-11.1/40.8 Dili=-22.6/26.6/73.3 Djibouti=-24.2/29.9/81.2 Dodoma=-25.5/22.7/74.5 Dolisie=-27.3/24.0/74.0 Douala=-19.2/26.7/77.6 Dubai=-20.6/26.9/78.3 Dublin=-39.1/9.8/59.9 Dunedin=-37.4/11.1/61.9 Durban=-29.3/20.6/68.7 Dushanbe=-37.8/14.7/63.6 Edinburgh=-38.3/9.3/61.7 Edmonton=-46.1/4.2/53.2 El Paso=-35.7/18.1/69.3 Entebbe=-30.8/21.0/80.1 Erbil=-30.6/19.5/68.4 Erzurum=-44.7/5.1/57.5 Fairbanks=-55.5/-2.3/47.1 Fianarantsoa=-37.1/17.9/66.4 Flores, Petén=-24.1/26.4/74.5 Frankfurt=-37.8/10.6/57.6 Fresno=-38.2/17.9/66.2 Fukuoka=-34.2/17.0/69.1 Gaborone=-29.3/21.0/71.3 Gabès=-35.2/19.5/70.8 Gagnoa=-23.6/26.0/80.2 Gangtok=-36.2/15.2/63.2 Garissa=-22.2/29.3/78.3 Garoua=-25.9/28.3/77.9 George Town=-18.4/27.9/77.8 Ghanzi=-28.6/21.4/69.1 Gjoa Haven=-64.9/-14.4/34.7 Guadalajara=-28.6/20.9/70.7 Guangzhou=-29.4/22.4/76.6 Guatemala City=-28.9/20.4/71.2 Halifax=-43.1/7.5/57.3 Hamburg=-41.7/9.7/59.6 Hamilton=-34.9/13.8/63.4 Hanga Roa=-29.4/20.5/75.5 Hanoi=-24.3/23.6/71.8 Harare=-28.4/18.4/69.0 Harbin=-43.3/5.0/54.7 Hargeisa=-28.4/21.7/72.0 Hat Yai=-21.5/27.0/76.2 Havana=-25.9/25.2/76.8 Helsinki=-45.1/5.9/58.6 Heraklion=-29.7/18.9/70.6 Hiroshima=-34.5/16.3/67.5 Ho 
Chi Minh City=-23.2/27.4/80.8 Hobart=-39.6/12.7/63.8 Hong Kong=-25.3/23.3/75.1 Honiara=-24.5/26.5/76.0 Honolulu=-22.1/25.4/79.8 Houston=-32.3/20.8/67.8 Ifrane=-40.1/11.4/62.2 Indianapolis=-39.0/11.8/63.1 Iqaluit=-60.5/-9.3/45.3 Irkutsk=-49.2/1.0/49.9 Istanbul=-33.1/13.9/62.8 Jacksonville=-26.8/20.3/69.8 Jakarta=-25.3/26.7/76.2 Jayapura=-21.7/27.0/74.1 Jerusalem=-30.6/18.3/68.5 Johannesburg=-31.5/15.5/69.1 Jos=-27.6/22.8/71.4 Juba=-26.7/27.8/78.8 Kabul=-35.0/12.1/62.4 Kampala=-30.0/20.0/72.3 Kandi=-24.4/27.7/78.9 Kankan=-20.3/26.5/74.9 Kano=-22.1/26.4/77.1 Kansas City=-34.9/12.5/65.5 Karachi=-25.9/26.0/76.5 Karonga=-26.0/24.4/73.6 Kathmandu=-37.6/18.3/67.6 Khartoum=-21.1/29.9/81.3 Kingston=-24.1/27.4/81.6 Kinshasa=-27.8/25.3/75.0 Kolkata=-25.7/26.7/73.2 Kuala Lumpur=-22.0/27.3/78.3 Kumasi=-21.5/26.0/74.6 Kunming=-32.6/15.7/69.9 Kuopio=-49.4/3.4/51.8 Kuwait City=-25.1/25.7/76.6 Kyiv=-42.1/8.4/57.3 Kyoto=-37.9/15.8/63.3 La Ceiba=-22.1/26.2/81.9 La Paz=-24.0/23.7/72.3 Lagos=-31.0/26.8/77.0 Lahore=-22.9/24.3/78.6 Lake Havasu City=-30.2/23.7/73.0 Lake Tekapo=-43.5/8.7/61.8 Las Palmas de Gran Canaria=-29.5/21.2/70.0 Las Vegas=-29.9/20.3/72.9 Launceston=-34.0/13.1/64.6 Lhasa=-41.8/7.6/58.4 Libreville=-25.0/25.9/77.0 Lisbon=-34.7/17.5/67.3 Livingstone=-30.8/21.8/75.3 Ljubljana=-40.9/10.9/62.0 Lodwar=-27.1/29.3/78.5 Lomé=-21.9/26.9/80.6 London=-38.7/11.3/57.7 Los Angeles=-32.5/18.6/68.5 Louisville=-36.5/13.9/64.1 Luanda=-22.8/25.8/82.6 Lubumbashi=-30.9/20.8/71.3 Lusaka=-31.6/19.9/71.4 Luxembourg City=-44.6/9.3/62.2 Lviv=-40.7/7.8/55.9 Lyon=-37.9/12.5/66.8 Madrid=-36.7/15.0/62.8 Mahajanga=-32.1/26.3/78.5 Makassar=-24.4/26.7/78.3 Makurdi=-27.8/26.0/71.4 Malabo=-26.0/26.3/78.7 Malé=-19.7/28.0/81.3 Managua=-24.2/27.3/76.1 Manama=-26.7/26.5/79.2 Mandalay=-23.0/28.0/78.6 Mango=-22.4/28.1/78.5 Manila=-22.3/28.4/79.9 Maputo=-27.9/22.8/71.4 Marrakesh=-30.8/19.6/69.1 Marseille=-33.4/15.8/67.8 Maun=-28.1/22.4/68.7 Medan=-21.2/26.5/74.5 Mek\u0026#39;ele=-30.1/22.7/72.3 
Melbourne=-37.1/15.1/66.4 Memphis=-33.9/17.2/69.7 Mexicali=-26.9/23.1/74.0 Mexico City=-30.9/17.5/74.1 Miami=-28.3/24.9/80.7 Milan=-37.3/13.0/61.8 Milwaukee=-38.7/8.9/58.3 Minneapolis=-41.4/7.8/57.4 Minsk=-45.5/6.7/58.7 Mogadishu=-25.3/27.1/80.7 Mombasa=-25.6/26.3/74.5 Monaco=-32.1/16.4/71.8 Moncton=-45.5/6.1/60.4 Monterrey=-28.2/22.3/72.0 Montreal=-43.6/6.8/57.6 Moscow=-44.4/5.8/55.6 Mumbai=-23.6/27.1/78.6 Murmansk=-47.9/0.6/47.7 Muscat=-22.7/28.0/80.7 Mzuzu=-31.9/17.7/67.9 N\u0026#39;Djamena=-20.6/28.3/77.9 Naha=-33.6/23.1/72.7 Nairobi=-33.3/17.8/66.5 Nakhon Ratchasima=-21.2/27.3/75.4 Napier=-43.3/14.6/66.6 Napoli=-31.8/15.9/68.4 Nashville=-35.0/15.4/62.4 Nassau=-23.5/24.6/76.2 Ndola=-31.2/20.3/67.7 New Delhi=-27.7/25.0/75.9 New Orleans=-26.6/20.7/72.3 New York City=-35.1/12.9/60.3 Ngaoundéré=-28.9/22.0/69.5 Niamey=-18.3/29.3/78.7 Nicosia=-27.6/19.7/73.7 Niigata=-35.1/13.9/63.7 Nouadhibou=-25.0/21.3/76.7 Nouakchott=-23.4/25.7/76.1 Novosibirsk=-48.6/1.7/52.4 Nuuk=-50.0/-1.4/53.2 Odesa=-38.8/10.7/60.1 Odienné=-24.0/26.0/75.6 Oklahoma City=-34.1/15.9/66.4 Omaha=-44.2/10.6/57.8 Oranjestad=-17.6/28.1/77.9 Oslo=-42.5/5.7/55.2 Ottawa=-48.4/6.6/55.4 Ouagadougou=-22.2/28.3/77.1 Ouahigouya=-19.7/28.6/80.5 Ouarzazate=-36.1/18.9/67.5 Oulu=-47.8/2.7/50.5 Palembang=-20.9/27.3/77.3 Palermo=-29.2/18.5/64.5 Palm Springs=-23.1/24.5/71.2 Palmerston North=-34.8/13.2/61.6 Panama City=-22.0/28.0/80.2 Parakou=-22.0/26.8/74.7 Paris=-41.2/12.3/62.9 Perth=-32.9/18.7/68.5 Petropavlovsk-Kamchatsky=-51.3/1.9/53.5 Philadelphia=-36.4/13.2/66.6 Phnom Penh=-22.9/28.3/76.7 Phoenix=-25.7/23.9/75.1 Pittsburgh=-42.2/10.8/61.6 Podgorica=-30.8/15.3/63.3 Pointe-Noire=-26.2/26.1/80.2 Pontianak=-22.6/27.7/79.2 Port Moresby=-24.1/26.9/74.8 Port Sudan=-23.4/28.4/76.4 Port Vila=-25.3/24.3/77.0 Port-Gentil=-23.0/26.0/80.0 Portland (OR)=-36.5/12.4/61.7 Porto=-31.8/15.7/64.9 Prague=-45.7/8.4/56.0 Praia=-26.0/24.4/73.6 Pretoria=-28.6/18.2/72.4 Pyongyang=-37.6/10.8/59.3 Rabat=-34.8/17.2/66.6 
Rangpur=-31.0/24.4/74.4 Reggane=-22.1/28.3/80.7 Reykjavík=-44.8/4.3/54.1 Riga=-45.6/6.2/55.2 Riyadh=-28.5/26.0/75.6 Rome=-35.0/15.2/61.4 Roseau=-22.2/26.2/75.2 Rostov-on-Don=-43.9/9.9/60.8 Sacramento=-32.7/16.3/65.7 Saint Petersburg=-41.3/5.8/55.8 Saint-Pierre=-45.5/5.7/54.9 Salt Lake City=-42.0/11.6/59.1 San Antonio=-34.8/20.8/67.3 San Diego=-28.8/17.8/73.3 San Francisco=-32.7/14.6/65.4 San Jose=-33.4/16.4/68.8 San José=-27.1/22.6/71.6 San Juan=-21.9/27.2/76.6 San Salvador=-28.0/23.1/73.6 Sana\u0026#39;a=-29.7/20.0/68.7 Santo Domingo=-29.8/25.9/75.8 Sapporo=-41.4/8.9/57.4 Sarajevo=-38.8/10.1/61.0 Saskatoon=-44.9/3.3/60.7 Seattle=-37.4/11.3/66.6 Seoul=-35.2/12.5/63.2 Seville=-28.4/19.2/66.8 Shanghai=-38.4/16.7/71.4 Singapore=-24.3/27.0/75.5 Skopje=-39.3/12.4/62.6 Sochi=-38.9/14.2/62.4 Sofia=-39.0/10.6/62.5 Sokoto=-26.0/28.0/80.9 Split=-42.5/16.1/65.8 St. John\u0026#39;s=-43.3/5.0/52.7 St. Louis=-39.6/13.9/64.0 Stockholm=-42.7/6.6/55.7 Surabaya=-21.5/27.1/75.4 Suva=-29.3/25.6/76.6 Suwałki=-40.3/7.2/60.9 Sydney=-31.3/17.7/70.7 Ségou=-19.5/28.0/75.7 Tabora=-24.6/23.0/71.2 Tabriz=-36.0/12.6/64.4 Taipei=-26.0/23.0/70.9 Tallinn=-44.1/6.4/57.0 Tamale=-23.6/27.9/81.3 Tamanrasset=-27.1/21.7/70.7 Tampa=-26.3/22.9/78.2 Tashkent=-31.7/14.8/65.4 Tauranga=-33.0/14.8/65.2 Tbilisi=-34.0/12.9/62.8 Tegucigalpa=-32.0/21.7/72.2 Tehran=-37.6/17.0/68.4 Tel Aviv=-34.6/20.0/69.4 Thessaloniki=-31.4/16.0/64.6 Thiès=-25.0/24.0/70.5 Tijuana=-34.2/17.8/68.1 Timbuktu=-22.0/28.0/83.3 Tirana=-36.5/15.2/64.3 Toamasina=-24.4/23.4/71.3 Tokyo=-32.3/15.4/62.8 Toliara=-26.4/24.1/75.9 Toluca=-35.9/12.4/62.8 Toronto=-43.4/9.4/59.0 Tripoli=-27.4/20.0/71.6 Tromsø=-54.1/2.9/55.5 Tucson=-28.1/20.9/70.9 Tunis=-29.0/18.4/67.1 Ulaanbaatar=-47.8/-0.4/47.5 Upington=-35.4/20.4/68.5 Vaduz=-41.1/10.1/57.8 Valencia=-30.2/18.3/66.6 Valletta=-29.8/18.8/69.3 Vancouver=-39.0/10.4/57.7 Veracruz=-23.5/25.4/79.1 Vienna=-40.9/10.4/62.1 Vientiane=-23.9/25.9/75.5 Villahermosa=-21.7/27.1/79.0 Vilnius=-41.5/6.0/56.3 Virginia 
Beach=-43.4/15.8/66.0 Vladivostok=-43.9/4.9/51.6 Warsaw=-44.2/8.5/54.5 Washington, D.C.=-34.4/14.6/63.6 Wau=-30.3/27.8/77.2 Wellington=-40.2/12.9/63.4 Whitehorse=-52.0/-0.1/57.3 Wichita=-38.1/13.9/66.5 Willemstad=-21.4/28.0/76.1 Winnipeg=-48.3/3.0/51.3 Wrocław=-40.2/9.6/61.4 Xi\u0026#39;an=-34.8/14.1/65.9 Yakutsk=-60.3/-8.8/41.0 Yangon=-21.2/27.5/75.5 Yaoundé=-24.1/23.8/74.1 Yellowknife=-52.2/-4.3/44.8 Yerevan=-39.1/12.4/62.7 Yinchuan=-40.9/9.0/63.3 Zagreb=-42.7/10.7/60.2 Zanzibar City=-21.2/26.0/75.2 Zürich=-41.1/9.3/58.6 Ürümqi=-40.3/7.4/61.7 İzmir=-29.6/17.9/69.3 INFO:brc.util:parallel \u0026lt;\u0026lt;\u0026lt; Elapsed time: 0:01:00.662966 INFO:brc.util:parallel \u0026lt;\u0026lt;\u0026lt; with 1,000,000,000 rows =\u0026gt; 16,484,521.949 rows/s. brc complete est 0:01:00.662966 There are probably some improvements that can be made, such as grid searching for n_splits and chunksize, but most would come from the comments I have made in the code. Personally, I\u0026rsquo;m well-chuffed with the results I\u0026rsquo;ve achieved with multiprocessing and optimization at this level of Python. The next step would likely be looking at how to solve this using only the C bindings (i.e. solve it in C) or rewriting it in a faster language.\nI also allowed the Pool() to use all the cores on the machine, which resulted in a sub-1min total time of 0:00:55.958772 - beating Polars (for this one very specific simple task) by a couple of seconds!!\nOther Notable Tools # DuckDB # DuckDB took a whole 47 seconds to run the below query, of which loading the CSV took more than 40 seconds. The query itself only took 3.81 seconds on average!
This is because, once loaded, the on-disk data format takes much less time to parse than the CSV; it really shows the value of having the data in the right format to start with.\nCREATE TABLE brc AS SELECT * FROM read_csv_auto(\u0026#39;/Users/toby.devlin/dev/projects/1brc/data/data_1_000_000_000.txt\u0026#39;); ALTER TABLE brc RENAME column0 TO city; ALTER TABLE brc RENAME column1 TO temp; select city, min(temp), avg(temp), max(temp) from brc group by city As an aside I also tried this with SQLite for shits and gigs; it loaded the records in about an hour and then took over 5 minutes to compute the same query, so I gave up. Local ODBC has been won by DuckDB in this one.\nPolars with Parquet # Step one was dumping the CSV over to a parquet file with a short bit of code.\nimport polars as pl from brc.util import DATA_DIR, timit_context def sink_to_parquet(): n = 1_000_000_000 with timit_context(n, \u0026#34;scan_parquet\u0026#34;): lf = pl.scan_csv( DATA_DIR / \u0026#34;data_1_000_000_000.txt\u0026#34;, has_header=False, separator=\u0026#34;;\u0026#34;, with_column_names=lambda _: [\u0026#34;city\u0026#34;, \u0026#34;temp\u0026#34;], schema={\u0026#34;city\u0026#34;: pl.Categorical, \u0026#34;temp\u0026#34;: pl.Float64} ) with timit_context(n, \u0026#34;sink_parquet\u0026#34;): lf.sink_parquet(DATA_DIR / \u0026#34;data_1_000_000_000.parquet\u0026#34;) with timit_context(n, \u0026#34;sink_parquet_stats\u0026#34;): lf.sink_parquet(DATA_DIR / \u0026#34;data_1_000_000_000_stats.parquet\u0026#34;, statistics=True) This code is part of the benchmark_parquet.py file. The source -\u0026gt; sink time was 0:01:15.312101 for the non-stats file and 0:01:24.109170 with statistics added, and on disk both files are only 3.6GB! Parquet has much better compression and RLE to save space \u0026amp; is columnar, which means it interfaces with Arrow very well (natively by design actually). We can now rerun the benchmark code but pass in the parquet file as the source.
This gives results of 0:01:25.495492 and 0:01:28.913853 respectively, which is fascinating!\nI\u0026rsquo;m probably doing something wrong here and will look into it. I was expecting an order of magnitude speed up.\nChatGPT # I did ask ChatGPT (GPT3.5) for a solution and, very unhelpfully, it provided code that errored and was fundamentally flawed. As with other large code tasks, ChatGPT fell flat; however, it did help with smaller, more targeted feature requests and helped me with some of the aggregation code, once I gave it the gist and asked for a refactor. Personally, I find it much better for feedback and \u0026ldquo;give me a starting point\u0026rdquo; type requests.\nFinal Results # Single process: 0:06:12.610463 (estimated from 0:00:37.261046 for 100mm) Multi-process: 0:01:00.662966 Multi-process, all cores: 0:00:55.958772 Polars: 0:01:00.263331 Other tools: Very fast once data is loaded. It\u0026rsquo;s official (on my machine) boys and girls, I beat Polars in native Python. Although I did spend nearly 60x longer to write it.\nAll the code can be seen on the GitHub project\n","date":"6 January 2024","externalUrl":null,"permalink":"/blog/optimising-parallel-python-and-the-billion-row-challenge/","section":"Blog","summary":"","title":"Optimising Parallel Python and the Billion Row Challenge","type":"blog"},{"content":"","date":"21 September 2022","externalUrl":null,"permalink":"/tags/cloud-native/","section":"Tags","summary":"","title":"Cloud Native","type":"tags"},{"content":"","date":"21 September 2022","externalUrl":null,"permalink":"/tags/kubernetes/","section":"Tags","summary":"","title":"Kubernetes","type":"tags"},{"content":"Kubernetes is a production-ready open-source system for running \u0026amp; orchestrating containers. It manages instances, self-heals, adds abstractions to make networking a breeze and allows multiple logical apps to run on the same logical deployment.\nComponents of the System # Worker nodes can have multiple (100s) pods running.
Each node runs three processes: a container runtime, kubelet and kube-proxy. Without these, the node will fail. The container runtime runs any number of Container images within the Pod, such as docker or podman images. kubelet interacts with the Containers and the Kubernetes master to run the Pods needed, and kube-proxy is responsible for sending requests to other services without impacting network performance.\nThe Master node is similar; 4 services must be running. The api server is responsible for interacting with the outside world and receiving metadata changes, including validation of requests. The scheduler is responsible for the actual placement of jobs \u0026amp; pods: it decides which node each one runs on, and the kubelet on that node starts it. The controller manager is responsible for understanding the current state of the cluster, working out which jobs have failed or need restarting (i.e. are outside the resource definition bounds), and requesting that the resources get spun back up. etcd is a key-value store which acts as the \u0026lsquo;brain\u0026rsquo; of the cluster, providing the other services with the information they need to run. It is the stateful part of the k8s cluster.\nResource Types # Below are the major components of an application running on Kubernetes. There are many more types of resources k8s provides but these are the major ones. We will be creating an example deployment using the Terraform kubernetes provider.\nIt\u0026rsquo;s worth noting that all resource definitions in a k8s deployment have a metadata section and a spec section. The metadata is where you can describe information about the resource like name, labels and namespace.
The spec varies for each resource, and it is worth reading up on the specifics in the k8s docs.\nA note on tagging: Tags (labels, in Kubernetes terms) are the way k8s interacts with itself. If a Deployment has a set of labels on its pods matching those of an existing Deployment, then the new Deployment will interact with those pods, potentially overwriting the original pods.\nDeployment \u0026amp; Stateful Set # These components are for deploying replicas of images. Every deployment consists of a collection of pods, which in turn are collections of containers. Normally a Pod is a logical deployment of your application and has a single container, unless there are required service sidecars such as metric reporters and the like. The control plane schedules the Pods to run on individual Nodes. After the initial apply the control plane manages the Pod instances and makes sure they are up-to-date, running healthchecks and self-healing when nodes go down.\nOther concepts that are defined alongside the Pod include Container images, port exposures \u0026amp; update strategy. Each section in the docs for the Deployment and Stateful Set will describe what can be placed in the spec and how it affects the resource.\nThe difference between a stateful set and a deployment is to do with how the pods interact with volumes via volume claims; this blog post goes into much more depth.
There is also a DaemonSet object which enables a single Pod to be run on each of the nodes in a cluster; rather than defining application availability, it is used more as a cluster management tool.\nlocals { whoamiLabelVal = \u0026#34;whoamiExample\u0026#34; whoamiAppLabelMap = { appname = local.whoamiLabelVal } } resource \u0026#34;kubernetes_deployment\u0026#34; \u0026#34;whoami_example\u0026#34; { metadata { name = \u0026#34;whoami\u0026#34; labels = local.whoamiAppLabelMap } spec { replicas = 5 selector { match_labels = local.whoamiAppLabelMap } template { metadata { labels = local.whoamiAppLabelMap } spec { container { image = \u0026#34;traefik/whoami\u0026#34; name = \u0026#34;whoami-name\u0026#34; port { container_port = 80 } } } } } } Service \u0026amp; Ingress # Services are ways to link together Deployment pods under a single banner. These pods have IP addresses which are likely to change; a Service binds them under a single Service IP address which balances requests across the underlying pods. To know which pods to forward the requests to it leverages a selector, a collection of label pairs, and will forward to any pods which match all the labels in the selector. A Service also has a ports section which defines which port to listen on and which port to forward requests to. There is a tertiary resource called an Endpoint which keeps track of the pods matching the labels. The Endpoint is usually managed for you.\nServices can also optionally have multiple ports exposed. In this event we must name the Service ports, and can route requests to another port that exists on the matching pods. For example, a metrics collector which collects information about the main Container and exposes this data in an aggregate form.\nThere are 4 types of service, structured in layers:\nCluster IP: The default and base type. Nothing more than the above. However, a Headless Cluster IP is when you want to talk to an individual pod as a service, rather than load balancing.
This may be used when needing to talk to asymmetric replicas. Node Port: This service builds on the Cluster IP and adds a port binding to the node itself and across the cluster internally, allowing access via the node\u0026rsquo;s IP address at the port defined in the service from outside the cluster. This is similar to exposing a docker image to the host IP but across all nodes. Load Balancer: This is another extension of the Node Port Service which provides an externally facing load balancer on the cluster which routes to the same endpoints as the Cluster IP Service. This is similar to the Ingress but doesn\u0026rsquo;t have the same feature set. Ultimately, as you move up the service layers you get more and more exposed to the outside world. Thankfully there is another way to create interactions, especially if it is for public access and not for other internal consumption. An Ingress is a layer that sits on the boundary between the outside world and the cluster.\nresource \u0026#34;kubernetes_service\u0026#34; \u0026#34;whoami_service\u0026#34; { metadata { name = \u0026#34;whoami\u0026#34; labels = kubernetes_deployment.whoami_example.metadata[0].labels } spec { # type = \u0026#34;LoadBalancer\u0026#34; selector = kubernetes_deployment.whoami_example.spec[0].selector[0].match_labels port { port = 8080 target_port = kubernetes_deployment.whoami_example.spec[0].template[0].spec[0].container[0].port[0].container_port } } } A note on external to internal traffic: Because k8s is inherently self-healing \u0026amp; IPs are constantly changing, to have a nice externally facing IP on a nice port number it\u0026rsquo;s a good idea to keep any domain associated IP outside the cluster as a dedicated proxy.
This is where an Ingress allows domain routing rules on cluster Services, whether an internal Cluster IP service or an external Load Balancer.\nWhen routing traffic into a cluster for testing there is an easy way to mimic an ingress load balancer: just port-forward requests to an internal service such as the above. By running kubectl port-forward svc/whoami 80:8080 all requests to localhost:80 are forwarded to the service\u0026rsquo;s port, allowing connection as if it were a load balancing service.\nIn a production setting you can use an Ingress; a service specifically spun up to take your config and route traffic to the correct resource pods. As per the docs this is a relatively complicated matter \u0026amp; can enable things like TLS Certs and path transforms, depending on your provider.\nIt was recommended to me by a friend that, unless your business is networking, you should go with a more cloud specific approach when designing public ingress to your cluster. This means leveraging a Load Balancer type Service for each logical internal service you might spin up \u0026amp; then pointing the cloud\u0026rsquo;s native load balancer to the Service\u0026rsquo;s IP. This gives you the flexibility of not needing to manage that specific networking piece of the stack and lets the potentially more powerful service manage spikes in traffic, attacks and isolation.\nConfigMap \u0026amp; Secrets # ConfigMaps \u0026amp; Secrets are a way of defining variables that might change the runtime behaviour of a piece of code. By updating the data in these resources k8s will change the underlying data on the fly.
To refresh the Deployment with the new data, assuming the correct mappings are made in the definitions, restart the Pods.\nresource \u0026#34;kubernetes_config_map\u0026#34; \u0026#34;whoami\u0026#34; { immutable = true metadata { name = \u0026#34;whoami\u0026#34; labels = local.whoamiAppLabelMap namespace = kubernetes_namespace.whoami_ns.metadata[0].name } data = { TEST_DATA = 1 WHOAMI_NAME = \u0026#34;a name from config map\u0026#34; } } In Terraform the data can be applied to pods by adding the below to the Deployment\u0026rsquo;s definition. This will read in and apply all the values of the config map to the containers.\ncontainer { # ... env_from { config_map_ref { name = kubernetes_config_map.whoami.metadata[0].name } } # ... } Secrets work similarly, but are intended to protect the data inside (note that by default Secret values are only base64-encoded, so encryption at rest has to be enabled on the cluster). A Secret has the same form as the ConfigMap above. To apply it to a Deployment just add the below.\ncontainer { # ... env_from { secret_ref { name = kubernetes_secret.whoami.metadata[0].name } } # ... } In my opinion the best way to manage changes to a ConfigMap or Secret is to consider them always immutable. If an update must be made you should swap over to a new version of the resource and force a restart of the Pods.\nVolume # RO Volumes are simple and are useful for mounting things like certificates, secret files or other useful bits of info. Typically, the source is information already stored as a ConfigMap or Secret. With this information you can create volume mounts with the following additions to the Pod \u0026amp; Container definitions within the Deployment:\n#... volume { name = \u0026#34;imasecret\u0026#34; secret { secret_name = kubernetes_secret.whoami.metadata[0].name } } #... container { # # ... volume_mount { mount_path = \u0026#34;/secrets\u0026#34; name = \u0026#34;imasecret\u0026#34; read_only = true } # # ... } #...
A note on Pods \u0026amp; Containers: the Pod is the Kubernetes concept, and the Container is just a runtime based on docker, podman or whatever. When mounting a volume it must first be associated with the Pod, then the Container. It\u0026rsquo;s like a hotel - you must let it into the outer room (lobby) first before allowing it into the inner rooms.\nRW Volumes are less important with regular Deployments than with Stateful Sets; typically you should only use them if you need to persist something locally as a cache or as a helper to some process. Ideally applications should be stateless \u0026amp; Stateful applications should be a dependent cog in the wheel. If your application does do some sort of persistence in a distributed manner then read \u0026amp; write replicas should be leveraged to ensure data is correct.\n","date":"21 September 2022","externalUrl":null,"permalink":"/blog/kubernetes-components-and-notes/","section":"Blog","summary":"","title":"Kubernetes Components with Terraform \u0026 Notes","type":"blog"},{"content":"","date":"7 July 2022","externalUrl":null,"permalink":"/tags/dbt/","section":"Tags","summary":"","title":"Dbt","type":"tags"},{"content":"","date":"7 July 2022","externalUrl":null,"permalink":"/tags/snowflake/","section":"Tags","summary":"","title":"Snowflake","type":"tags"},{"content":"Snowflake has a few ways to interact with external data, one of which is Stages and external tables. The docs on most things in Snowflake are amazing, so I\u0026rsquo;m not going to copy them here; this will just be a guide on how to set up large JSON blobs with these tools. We will be creating an External Table in Snowflake which queries the underlying JSON blob residing in an AWS S3 bucket.\nSetup # To get started we will be operating with AWS and Snowflake, so access to both is required.
Once you have these we will be working in steps:\ncreate internal Stage, copy into a table from this Stage, create External Stage, create External Table.\nTo start we should create a PoC workspace; we will be using whichever is the default warehouse for your user for loading data:\nUSE ROLE sysadmin; CREATE OR REPLACE DATABASE stages_and_external_tables_poc; USE DATABASE stages_and_external_tables_poc; CREATE SCHEMA IF NOT EXISTS external_poc; USE SCHEMA external_poc; Now we can get started.\n1. Stages # Creating an internal stage is as simple as following the docs. We will be using some mock data I have pulled from mockaroo.\nHere\u0026rsquo;s the code to create the internal stage, to start. We will create minimal internal stages \u0026amp; copy the files in (we will add bells and whistles later).\nTable Stages\nEach table implicitly has a stage associated with it, named @%\u0026lt;table_name\u0026gt;. So we can just create a single column table in order to store our JSON blobs, explicitly telling it we\u0026rsquo;re using a JSON blob rather than a CSV. For the file path structure; I\u0026rsquo;m using a Mac, but similar file structures are found here:\nCREATE OR REPLACE TABLE src_mock_data ( file_content VARIANT ) STAGE_FILE_FORMAT = ( TYPE = JSON ); PUT file:///Users/toby.devlin/dev/data/mocks/mock_1.json @%src_mock_data; PUT file:///Users/toby.devlin/dev/data/mocks/mock_2.json @%src_mock_data; PUT file:///Users/toby.devlin/dev/data/mocks/mock_3.json @%src_mock_data; PUT file:///Users/toby.devlin/dev/data/mocks/mock_4.json @%src_mock_data; PUT file:///Users/toby.devlin/dev/data/mocks/mock_5.json @%src_mock_data; LIST @%src_mock_data; Now we can copy into this table the files we\u0026rsquo;ve just uploaded. Each file\u0026rsquo;s contents will be copied whole into the VARIANT column we created.
You can\u0026rsquo;t create more than 1 column with JSON -\u0026gt; VARIANT types, but CSVs can have multiple columns.\nCOPY INTO src_mock_data FROM @%src_mock_data; SELECT * FROM src_mock_data; Note that we don\u0026rsquo;t remove the files from the stage if we run LIST @%src_mock_data; again. There are ways to dictate how the stage files are treated and how semi-structured data is injected, see the copy-options for more info.\nNamed Stages\nNamed stages are almost the same as table stages, but you have to define them. They live in a schema and will appear when queried with SHOW STAGES and settings shown with DESCRIBE STAGE.\nCREATE OR REPLACE STAGE poc_internal_stage FILE_FORMAT = ( TYPE = JSON); SHOW STAGES; DESCRIBE STAGE poc_internal_stage; PUT file:///Users/toby.devlin/dev/data/mocks/mock_1.json @poc_internal_stage; PUT file:///Users/toby.devlin/dev/data/mocks/mock_2.json @poc_internal_stage; PUT file:///Users/toby.devlin/dev/data/mocks/mock_3.json @poc_internal_stage; PUT file:///Users/toby.devlin/dev/data/mocks/mock_4.json @poc_internal_stage; PUT file:///Users/toby.devlin/dev/data/mocks/mock_5.json @poc_internal_stage; LIST @poc_internal_stage; Again, we can copy these files into a table, but we will need to create a table if one doesn\u0026rsquo;t exist. This example also shows copy as a select statement, allowing arbitrary transform statements to be executed. $n is the \u0026ldquo;column\u0026rdquo;, always 1 in our JSON use case and :\u0026lt;element\u0026gt; refs the path in the JSON itself. We could also use selectors for specific files in the sage, such as data.$1:id from FROM @poc_internal_stag/mock_1 AS data, for example.\nCREATE OR REPLACE TABLE src_mock_data_2 ( file_content VARIANT, file_name STRING, record_id INT, record_last_name STRING ); COPY INTO src_mock_data_2 FROM (SELECT $1, metadata$filename, $1:id, $1:last_name FROM @poc_internal_stage); SELECT * FROM src_mock_data_2; 2. 
External Stages # As we progress through to externalizing the stages, the concept remains the same. We have files with data, and we want them to be data in a table. However, now these files may live in cloud storage object store. I\u0026rsquo;m most familiar with AWS, so I will create an S3 bucket to store these objects in. Below is the Terraform for creating a private bucket. It also uploads the local file data using the aws_s3_object. Be sure to replace the file path as above.\nNote: you will need to provide the Terraform user with policies in line with IAMFullAccess and AmazonS3FullAccess to complete the next steps. Either directly or via an assumed role; I\u0026rsquo;m using a shortcut and attaching these to the user.\nterraform { required_providers { aws = { source = \u0026#34;hashicorp/aws\u0026#34; version = \u0026#34;4.21.0\u0026#34; } } } provider \u0026#34;aws\u0026#34; { region = \u0026#34;us-east-1\u0026#34; access_key = \u0026#34;\u0026#34; # replace me secret_key = \u0026#34;\u0026#34; } module \u0026#34;s3_bucket\u0026#34; { source = \u0026#34;terraform-aws-modules/s3-bucket/aws\u0026#34; version = \u0026#34;3.3.0\u0026#34; bucket = \u0026#34;poc-snowflake-external-stage-bucket\u0026#34; acl = \u0026#34;private\u0026#34; versioning = { enabled = false } } resource \u0026#34;aws_s3_object\u0026#34; \u0026#34;object\u0026#34; { bucket = module.s3_bucket.s3_bucket_id key = \u0026#34;mock_${count.index+1}.json\u0026#34; source = \u0026#34;/Users/toby.devlin/dev/data/mocks/mock_${count.index+1}.json\u0026#34; etag = filemd5(\u0026#34;/Users/toby.devlin/dev/data/mocks/mock_${count.index+1}.json\u0026#34;) count = 5 } Now we have the bucket up and running we can create the stage in snowflake and provide it accesses. Along with the bucket, we will also create an IAM user to interact with the bucket rather than use our Terraform provisioner user. 
This is an optional step if you want to use your admin user you can if it has the permissions.\nNote: the file setup_stage.sql will contain the secret access keys for this user. You should ensure the values in this file remains secret; in the real world it may be worth placing this into something like AWS Secrets Manager or manually creating IAM creds as there are some security flaws using this shortcut approach.\nresource \u0026#34;aws_iam_user\u0026#34; \u0026#34;snowflake_s3_accessor\u0026#34; { name = \u0026#34;snowflake_s3_accessor\u0026#34; path = \u0026#34;/snowflake/\u0026#34; } resource \u0026#34;aws_iam_access_key\u0026#34; \u0026#34;snowflake_s3_accessor\u0026#34; { user = aws_iam_user.snowflake_s3_accessor.id } data \u0026#34;aws_iam_policy_document\u0026#34; \u0026#34;snowflake_s3_access\u0026#34; { statement { sid = \u0026#34;${replace(title(module.s3_bucket.s3_bucket_id), \u0026#34;-\u0026#34;, \u0026#34;\u0026#34;)}Access\u0026#34; actions = [ \u0026#34;s3:ListBucket\u0026#34;, \u0026#34;s3:GetBucketLocation\u0026#34; ] resources = [ \u0026#34;arn:aws:s3:::${module.s3_bucket.s3_bucket_id}\u0026#34;, ] } statement { sid = \u0026#34;${replace(title(module.s3_bucket.s3_bucket_id), \u0026#34;-\u0026#34;, \u0026#34;\u0026#34;)}ItemAccess\u0026#34; actions = [ \u0026#34;s3:PutObject\u0026#34;, \u0026#34;s3:GetObject\u0026#34;, \u0026#34;s3:GetObjectVersion\u0026#34;, \u0026#34;s3:DeleteObject\u0026#34;, \u0026#34;s3:DeleteObjectVersion\u0026#34; ] resources = [ \u0026#34;arn:aws:s3:::${module.s3_bucket.s3_bucket_id}/*\u0026#34;, ] } } resource \u0026#34;aws_iam_user_policy\u0026#34; \u0026#34;snowflake_s3_access\u0026#34; { name = \u0026#34;snowflake_s3_bucket_access\u0026#34; user = aws_iam_user.snowflake_s3_accessor.name policy = data.aws_iam_policy_document.snowflake_s3_access.json } resource \u0026#34;local_sensitive_file\u0026#34; \u0026#34;iam_creds_out\u0026#34; { filename = \u0026#34;setup_stage.sql\u0026#34; content = \u0026lt;\u0026lt;-EOT 
CREATE OR REPLACE STAGE poc_external_stage URL = \u0026#39;s3://${module.s3_bucket.s3_bucket_id}\u0026#39; CREDENTIALS = ( AWS_KEY_ID = \u0026#39;${aws_iam_access_key.snowflake_s3_accessor.id}\u0026#39; AWS_SECRET_KEY = \u0026#39;${aws_iam_access_key.snowflake_s3_accessor.secret}\u0026#39; ) FILE_FORMAT = (TYPE = JSON); EOT } Now we have everything we need to get started, we can begin the Snowflake side of things. We want to create the stage in a similar way to before but using the External syntax. All the variables needed can be found in setup_stage.sql.\nCREATE OR REPLACE STAGE poc_external_stage URL = \u0026#39;s3://poc-snowflake-external-stage-bucket\u0026#39; CREDENTIALS = (AWS_KEY_ID = \u0026#39;\u0026lt;your_access_key\u0026gt;\u0026#39; AWS_SECRET_KEY = \u0026#39;\u0026lt;your_secret_key\u0026gt;\u0026#39; ) FILE_FORMAT = ( TYPE = JSON ); SHOW STAGES; DESCRIBE STAGE poc_external_stage; LIST @poc_external_stage; Now the stage is created we can also see the existing files have been associated with the stage with LIST @poc_external_stage;. From here the same approach as before can be taken to copy in these files as if it were an internal named stage.\nCREATE OR REPLACE TABLE src_mock_data_3 ( file_content VARIANT, file_name STRING, record_id INT, record_last_name STRING ); COPY INTO src_mock_data_3 FROM (SELECT $1, metadata$filename, $1:id, $1:last_name FROM @poc_external_stage); SELECT * FROM src_mock_data_3; 3. External Tables # External tables are, syntactically, very similar to normal tables. As the docs describe they essentially read files from the remote store when requested. 
As with the above sections, there are steps we can take to improve the performance, security and isolation of these resources, but we will focus on getting it up and running.\nThe first step is to create the table itself.\nCREATE OR REPLACE EXTERNAL TABLE src_mock_data_external ( file_name STRING AS (metadata$filename), record_id INT AS (VALUE:\u0026#34;id\u0026#34;::INT), record_last_name STRING AS (VALUE:\u0026#34;last_name\u0026#34;::VARCHAR) ) LOCATION = @poc_external_stage FILE_FORMAT = (TYPE = JSON); The content will automatically be populated into a table we can query. To test out new data you can place new files into the bucket, run the refresh command ALTER EXTERNAL TABLE src_mock_data_external REFRESH; and then run the select query again, which will show the updates to the underlying data.\nSHOW EXTERNAL TABLES; SELECT * FROM src_mock_data_external; This manual refresh has to be done each time unless updates to this table are published to the AWS SQS queue that is created for the table. This SQS queue can be hit by anything, but the recommended way is by publishing S3 change events.\nCaveats \u0026amp; Considerations # When working with extremely large JSON blobs, larger than the max size (16,777,216 bytes), stages will fail their COPY INTO commands \u0026amp; External Tables will silently fail to load the data into the table. We have taken some shortcuts, such as creating the AWS integration query in a local file - this can be hardened with proper secrets management \u0026amp; Terraform state storage. Setting up change notifications on external stages is probably a very useful tool, meaning you could even process these files in multiple systems at once. Understanding the COPY INTO settings and various Stage settings allows for flexible operations on the stage\u0026rsquo;s files \u0026amp; shows how to reduce processing after load. The billing associated with Stages is part of Snowflake\u0026rsquo;s serverless billing and should be understood before heavyweight processing. 
","date":"7 July 2022","externalUrl":null,"permalink":"/blog/snowflake-stages-and-external-tables-with-json-blobs-in-aws/","section":"Blog","summary":"","title":"Snowflake Stages and External Tables with JSON Blobs in AWS","type":"blog"},{"content":"sqlmodel is a very useful proxy tool that allows pydantic models and sqlalchemy models to be combined. This allows FastAPI, built by the same author, to intuitively know more about the models\u0026rsquo; metadata.\nThis guide will show you how to combine this tool into alembic, and create auto-migrations with a set of helper tools.\nThe Setup # We will need 2 functions in 2 files, which are pretty simple. The first is the engine creator.\n# database.py from sqlmodel import create_engine from sqlalchemy.engine import Engine def get_db_engine(url) -\u0026gt; Engine: return create_engine(url) Which will create the SQLAlchemy Engine object for us. This is used for alembic to connect to the database, I also use this function, cached, to generate Session objects. The next is used for the models.\n# model.py import uuid from typing import Optional from sqlmodel import Field, SQLModel class Hero(SQLModel, table=True): id: Optional[int] = Field(default=None, primary_key=True) name: str secret_name: str age: Optional[int] = None def get_metadata(): return SQLModel.metadata This file contains all our models, registering the metadata as tables, relationships, indexes and anything else that\u0026rsquo;s constructed. The get_metadata() method must be after the model definitions so the SQLModel.metadata is complete. From here we can touch the alembic env.py function to describe the online environment.\nfrom myapp.database import get_db_internal from myapp.model import get_metadata ... 
# `context` defined as before target_metadata = get_metadata() def run_migrations_online() -\u0026gt; None: connectable = get_db_internal() with connectable.connect() as connection: context.configure( connection=connection, target_metadata=target_metadata ) with context.begin_transaction(): context.run_migrations() The important bit here is that the connection and metadata are fetched directly from your application via the imports. This will set up all the correct connection details for the environment. To create production migrations you will have to generate the migrations on a host with access to the prod environment (i.e. a script opening a PR from a prod env).\nThe Execution # Now we can continue as normal, running alembic revision --autogenerate to create revisions on the delta between the database \u0026amp; the metadata models. Running this will create a new version in the correct directory.\nAt this point you may need to manually add import sqlmodel to the generated revision file so that its column types resolve. Any other changes to the revision should also be made at this point.\nThen running alembic upgrade head will update the database up to the correct revision.\n","date":"5 June 2022","externalUrl":null,"permalink":"/blog/database-migration-with-sqlmodel-and-alembic/","section":"Blog","summary":"","title":"Database Migrations with sqlmodel and alembic","type":"blog"},{"content":"","date":"27 January 2022","externalUrl":null,"permalink":"/tags/al/","section":"Tags","summary":"","title":"Al","type":"tags"},{"content":" I\u0026rsquo;ve not written about my experience in Machine Learning at all. As a student I studied the algorithms and proofs to back up many modeling techniques. As far as my professional role as a Senior Engineer goes, I\u0026rsquo;ve at least dabbled in most mainstream tools out there. And as a Senior Data Engineer, I\u0026rsquo;ve focused recently on the applications of tools in the data space, including modelling, munging and machine learning. 
This post is more of a refresher - something I can refer back to in order to quickly get back running on a project where at least a simplistic overview is required.\nSo what will I detail in the end? I\u0026rsquo;d like to show that ML in this post (and in most rudimentary models) is essentially curve fitting for lack of better words. The machine is learning how to predict a value y, based on a set of input variables x. This post isn\u0026rsquo;t meant to break the bounds of ML, here x is just a tuple of floats, a set of numeric values that describe the underlying data. We don\u0026rsquo;t go into handling categories or more structured, contextual data such as time-series or natural language.\nBy the end, you\u0026rsquo;ll have seen, at least from a regression point of view, a way of understanding goals and how models may be more accustomed to solving certain problems. If this interests me enough I might do a second part on how the underlying models work, or we could just jump into neural models and skip the old school ML. 
Side note: I\u0026rsquo;d like to do a history of ML at some point too.\nGetting Started # The first bits of Python are always meta executions, setting up libs and the like - I\u0026rsquo;ll comment on what\u0026rsquo;s going on at each point.\nThe first cell contains commented out operations to install the notebook deps - I operate locally and use poetry for deps, the poetry.toml is in an appendix.\nIt\u0026rsquo;s worth noting this was written in a Jupyter notebook.\n# !pip install pandas==\u0026#34;^1.4.0\u0026#34; # !pip install plotly==\u0026#34;^5.5.0\u0026#34; # !pip install scikit-learn==\u0026#34;^1.0.2\u0026#34; # !pip install kaggle==\u0026#34;^1.5.12\u0026#34; # !pip install jupyter==\u0026#34;^1.0.0\u0026#34; # !pip install seaborn==\u0026#34;^0.11.2\u0026#34; # various imports, google for info import time import timeit import pandas as pd import numpy as np import seaborn as sns import matplotlib.pyplot as plt # default plot to sns and style to be nicer to the eyes in dark mode sns.set_theme( context=\u0026#39;notebook\u0026#39;, style=\u0026#39;darkgrid\u0026#39;, color_codes=True, font_scale=1.5, rc={\u0026#39;figure.figsize\u0026#39;: (20, 10)} ) # reproducibility SEED = 1234 np.random.seed(SEED) # Magic lines like this mean we can set notebook properties, see below for more # https://ipython.readthedocs.io/en/stable/interactive/magics.html # %config InlineBackend.figure_format = \u0026#39;svg\u0026#39; Data download # Data should be extracted to the /data dir; it\u0026rsquo;s from Kaggle\u0026rsquo;s housesalesprediction dataset, and Kaggle is a great resource for learning ML. I hope you stumbled across that website before this post. To set up, run the command below; this will pull down a zip of the files we want. You\u0026rsquo;ll need to unzip it and place the files in the layout below.\npoetry run kaggle datasets download -d harlfoxem/housesalesprediction\n/data /... /kc_house_data.csv /- ... /house_prices.ipynb Looking at the data # This next section looks at the data in the resulting CSV. 
As far as data goes it\u0026rsquo;s lovely, with no horrible decisions we need to make about munging. In the real world this next step is much more important and can change the results from a mediocre model to an amazing one. Data is the fuel of the ML flame; wet kindling will give shitty results.\nThere are a few things I like to look for when munging tabular records:\nOverall size - ideally this should be known before loading the data into memory, and potentially an iterative approach designed using chunking. Num columns - is our dimensionality huge? would removing a large number impact us? are there any highly correlated cols? what about synthetic variables? Column types - are we going to have trouble with certain fields (dates, non UTF-8 chars, serialized/nested fields, badly typed fields)? Column names - is there a description of what\u0026rsquo;s in that column? something I can sanity check with external data maybe? £50 house prices would raise alarms. Null/missing data - how will we deal with these records? are the null records a large chunk of the data? Ideally, this non-exhaustive list of checks should be completed upstream in our stack by tools such as dbt. 
As a Data scientist, I want to have a reproducible way of cleaning my data.\ndf = pd.read_csv(\u0026#39;data/kc_house_data.csv\u0026#39;) df.info() \u0026lt;class \u0026#39;pandas.core.frame.DataFrame\u0026#39;\u0026gt; RangeIndex: 21613 entries, 0 to 21612 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 21613 non-null int64 1 date 21613 non-null object 2 price 21613 non-null float64 3 bedrooms 21613 non-null int64 4 bathrooms 21613 non-null float64 5 sqft_living 21613 non-null int64 6 sqft_lot 21613 non-null int64 7 floors 21613 non-null float64 8 waterfront 21613 non-null int64 9 view 21613 non-null int64 10 condition 21613 non-null int64 11 grade 21613 non-null int64 12 sqft_above 21613 non-null int64 13 sqft_basement 21613 non-null int64 14 yr_built 21613 non-null int64 15 yr_renovated 21613 non-null int64 16 zipcode 21613 non-null int64 17 lat 21613 non-null float64 18 long 21613 non-null float64 19 sqft_living15 21613 non-null int64 20 sqft_lot15 21613 non-null int64 dtypes: float64(5), int64(15), object(1) memory usage: 3.5+ MB df.sort_values(\u0026#39;id\u0026#39;).head() | | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 | |:----:|:-------:|:--------------------|:------:|:--------:|:---------:|:-----------:|:--------:|:------:|:----------:|:----:|:---------:|:-----:|:----------:|:-------------:|:--------:|:------------:|:-------:|:-------:|:--------:|:-------------:|:----------:| | 2497 | 1000102 | 2015-04-22 00:00:00 | 300000 | 6 | 3 | 2400 | 9373 | 2 | 0 | 0 | 3 | 7 | 2400 | 0 | 1991 | 0 | 98002 | 47.3262 | -122.214 | 2060 | 7316 | | 2496 | 1000102 | 2014-09-16 00:00:00 | 280000 | 6 | 3 | 2400 | 9373 | 2 | 0 | 0 | 3 | 7 | 2400 | 0 | 1991 | 0 | 98002 | 47.3262 | -122.214 | 2060 | 7316 | | 6735 | 1200019 | 2014-05-08 00:00:00 | 
647500 | 4 | 1.75 | 2060 | 26036 | 1 | 0 | 0 | 4 | 8 | 1160 | 900 | 1947 | 0 | 98166 | 47.4444 | -122.351 | 2590 | 21891 | | 8411 | 1200021 | 2014-08-11 00:00:00 | 400000 | 3 | 1 | 1460 | 43000 | 1 | 0 | 0 | 3 | 7 | 1460 | 0 | 1952 | 0 | 98166 | 47.4434 | -122.347 | 2250 | 20023 | | 8809 | 2800031 | 2015-04-01 00:00:00 | 235000 | 3 | 1 | 1430 | 7599 | 1.5 | 0 | 0 | 4 | 6 | 1010 | 420 | 1930 | 0 | 98168 | 47.4783 | -122.265 | 1290 | 10320 | Looks like our data is clean - we have 21 columns where only some don\u0026rsquo;t make sense. All the values are numeric apart from date, which looks like an ISO 8601 string representation, which makes it easy for us to parse. Some of these can be classified as categories under the hood, so we could try One Hot Encoding them later after a baseline is established. There are lots of years in here that could also be better off converted to relative values, a zip code that\u0026rsquo;s represented as a number and lat-long values which could be geo-encoded too. Location data as a concept is layered and in this case we may be better off ignoring some of them; we\u0026rsquo;ll come to selecting data later. For now we are not going to touch the data other than the date to get it in the right format.\nOne thing to note is that this set of data contains multiple sales of the same house, for example house 1000102 which sold for $280,000 in 2014 then in 2015 for $300,000. This is something that should be incorporated into the model somehow, maybe in a synthetic field further down the line wrt the price - maybe some inflation adjusted value.\ndf[\u0026#39;date\u0026#39;] = pd.to_datetime(df[\u0026#39;date\u0026#39;]) Data Distribution # As I mentioned this is a very easy data source to begin with, and it\u0026rsquo;s a nice sample to understand too. 
Everyone understands (apart from the market obviously, as a first time buyer) how a house should be valued at a high level; big =\u0026gt; ££, beds =\u0026gt; $$ location =\u0026gt; €€.\ndf.describe() | | id | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 | |:------|:-----------:|:-------:|:--------:|:---------:|:-----------:|:-----------:|:--------:|:----------:|:--------:|:---------:|:-------:|:----------:|:-------------:|:--------:|:------------:|:-------:|:--------:|:--------:|:-------------:|:----------:| | count | 21613 | 21613 | 21613 | 21613 | 21613 | 21613 | 21613 | 21613 | 21613 | 21613 | 21613 | 21613 | 21613 | 21613 | 21613 | 21613 | 21613 | 21613 | 21613 | 21613 | | mean | 4.5803e+09 | 540088 | 3.37084 | 2.11476 | 2079.9 | 15107 | 1.49431 | 0.00754176 | 0.234303 | 3.40943 | 7.65687 | 1788.39 | 291.509 | 1971.01 | 84.4023 | 98077.9 | 47.5601 | -122.214 | 1986.55 | 12768.5 | | std | 2.87657e+09 | 367127 | 0.930062 | 0.770163 | 918.441 | 41420.5 | 0.539989 | 0.0865172 | 0.766318 | 0.650743 | 1.17546 | 828.091 | 442.575 | 29.3734 | 401.679 | 53.505 | 0.138564 | 0.140828 | 685.391 | 27304.2 | | min | 1.0001e+06 | 75000 | 0 | 0 | 290 | 520 | 1 | 0 | 0 | 1 | 1 | 290 | 0 | 1900 | 0 | 98001 | 47.1559 | -122.519 | 399 | 651 | | 25% | 2.12305e+09 | 321950 | 3 | 1.75 | 1427 | 5040 | 1 | 0 | 0 | 3 | 7 | 1190 | 0 | 1951 | 0 | 98033 | 47.471 | -122.328 | 1490 | 5100 | | 50% | 3.90493e+09 | 450000 | 3 | 2.25 | 1910 | 7618 | 1.5 | 0 | 0 | 3 | 7 | 1560 | 0 | 1975 | 0 | 98065 | 47.5718 | -122.23 | 1840 | 7620 | | 75% | 7.3089e+09 | 645000 | 4 | 2.5 | 2550 | 10688 | 2 | 0 | 0 | 4 | 8 | 2210 | 560 | 1997 | 0 | 98118 | 47.678 | -122.125 | 2360 | 10083 | | max | 9.9e+09 | 7.7e+06 | 33 | 8 | 13540 | 1.65136e+06 | 3.5 | 1 | 4 | 5 | 13 | 9410 | 4820 | 2015 | 2015 | 98199 | 47.7776 | -121.315 | 6210 | 871200 | heatmap 
= sns.heatmap(df.corr(), annot=True, fmt=\u0026#34;.2f\u0026#34;) heatmap pairplot = sns.pairplot(df) pairplot hist = df.hist() plt.tight_layout() Eyeballing the columns, everything looks fine, though some distributions are clearly one-sided, such as the renovation year, waterfront, view and sqft_lot. It\u0026rsquo;s worth looking into these a little more to see if they\u0026rsquo;ll be useful. There are some clear correlations between some vars; ignoring the price column as that\u0026rsquo;s our y column, the sqft columns correlate with each other and bathrooms correlates with sqft. Ultimately these make sense, so we will continue without removing/altering any of these.\nThere are some that I\u0026rsquo;d like to look into in more depth, namely\nsqft_lot - why is this all bunched up at the 0 end? waterfront - what do these numbers mean? view - what do these numbers mean? yr_renovated - how does this impact us if we move to relative dates? keys = [ \u0026#39;sqft_lot\u0026#39;, \u0026#39;waterfront\u0026#39;, \u0026#39;view\u0026#39;, \u0026#39;yr_renovated\u0026#39; ] for key in keys: unique_vals = df[key].unique() print(f\u0026#34;{key} has {len(unique_vals)} unique values, first 5: {unique_vals[:5]}\u0026#34;) # sqft_lot has 9782 unique values, first 5: [ 5650 7242 10000 5000 8080] # waterfront has 2 unique values, first 5: [0 1] # view has 5 unique values, first 5: [0 3 4 2 1] # yr_renovated has 70 unique values, first 5: [ 0 1991 2002 2010 1999] print(\u0026#34;skew:\u0026#34;, df.skew()[\u0026#39;sqft_lot\u0026#39;]) # skew: 13.060018959031755 print(df[\u0026#39;sqft_lot\u0026#39;].describe()) # count 2.161300e+04 # mean 1.510697e+04 # std 4.142051e+04 # min 5.200000e+02 # 25% 5.040000e+03 # 50% 7.618000e+03 # 75% 1.068800e+04 # max 1.651359e+06 # Name: sqft_lot, dtype: float64 print(\u0026#34;top 15 values by count\u0026#34;) # top 15 values by count print(df[\u0026#39;sqft_lot\u0026#39;].value_counts()[:15]) # 5000 358 # 6000 290 # 4000 251 # 7200 220 # 4800 120 # 7500 
119 # 4500 114 # 8400 111 # 9600 109 # 3600 103 # 9000 93 # 3000 84 # 5100 78 # 7000 76 # 8000 76 # Name: sqft_lot, dtype: int64 Ok, looks like we do have some categories in waterfront and view - though as they\u0026rsquo;re encoded already I\u0026rsquo;ll just leave them. sqft_lot looks like it\u0026rsquo;s incredibly skewed, with the majority of the values below 10,000 and a max value of over 150x that. I\u0026rsquo;m not too concerned as this probably has relevant data in, just with an interesting distribution.\nNext is to train a basic model as a baseline. As per the basic tutorial I\u0026rsquo;ll just use a decision tree as it\u0026rsquo;s relatively easy to understand.\nRunning Experiments # We will run the following experiments:\nThe first will use the default model with the original, relevant data. We will then tune the hyper-parameters for the model. Then we can apply some data transforms based on the decisions above and try steps 1) and 2) again. It\u0026rsquo;s worth noting, for now, we will only be doing regression analysis rather than classification. We could bucket the properties into categories then run a classifier on them, however that can be another post.\nThe prerequisites for doing this are to define our test/train splits and our model grading strategy, to ensure an accurate model. To do this I will define a function that can be reused.\nfrom sklearn.model_selection import train_test_split def get_data(_df=df, seed=SEED): _dim_predict = [\u0026#39;price\u0026#39;] _dim_features = list( # Typically, I\u0026rsquo;ll remove columns rather than add them together when the dimensionality is small set(_df.columns) - set(_dim_predict) - { # this is just a unique id, as noted this means we may have multiple sales of the same house in the data \u0026#39;id\u0026#39;, } ) _x = _df[_dim_features] _y = _df[_dim_predict] # easy split of the data in a reproducible way. 
_x_train, _x_test, _y_train, _y_test = train_test_split(_x, _y, random_state=seed) return _x_train, _x_test, _y_train, _y_test When it comes to getting the predictions we want to ensure we measure a relevant accuracy. There are lots of ways of doing this for regression, as discussed in this post and in the sklearn docs. Essentially the question is the same:\nhow far away from the correct answer am I?\nThis means deciding how to penalize \u0026ldquo;distance\u0026rdquo; from the true result, by squaring or just taking the absolute value of the distance, and how to aggregate all our guesses, such as taking averages or min/max of our deltas. In my case, I\u0026rsquo;ll try 3:\nMean Absolute Error (\\(MAE\\)) Max Error (\\(ME\\)) Root Mean Squared Error (\\(RMSE\\)) It\u0026rsquo;s worth noting this is just a way of comparing 2 arrays of the same length. We could go into lots of detail around skewness and how that impacts predictions on very expensive/cheap houses or the bulk of our data. Grading a regression model is really important for analysing what, exactly, you want to be predicted from an ML model.\nIn real terms this means \u0026ldquo;given all my guesses, how much did I miss the mark by if I were to guess the cost of the house?\u0026rdquo;. For example, if I had an \\(MAE\\) of 50,000 I would interpret that as \u0026ldquo;given any price guess, I would be about 50,000 dollars off the mark\u0026rdquo;; an \\(RMSE\\) reads the same way, but errors are squared before averaging (and the root taken afterwards) rather than just taking the abs value, so large misses are penalized more heavily. An \\(ME\\) of 50,000 would just mean all my guesses were within 50,000 dollars of the true price, so it\u0026rsquo;s an upper bound on the error I could expect (well kinda, this isn\u0026rsquo;t the best analysis model). 
I\u0026rsquo;d expect the following:\n$$ MAE \\le RMSE \\le ME $$\nI will also return the mean value of the true prices to compare how the model performance actually stacks up.\nfrom sklearn.metrics import mean_absolute_error, mean_squared_error, max_error def get_errors(y_true, y_predict): _mae = mean_absolute_error(y_true, y_predict) _me = max_error(y_true, y_predict) _rmse = mean_squared_error(y_true, y_predict, squared=False) mean = np.mean(y_true.values) return { \u0026#39;MAE\u0026#39;: _mae, \u0026#39;MAE_ratio\u0026#39;: _mae / mean, \u0026#39;ME\u0026#39;: _me, \u0026#39;ME_ratio\u0026#39;: _me / mean, \u0026#39;RMSE\u0026#39;: _rmse, \u0026#39;RMSE_ratio\u0026#39;: _rmse / mean, \u0026#39;y_true_mean\u0026#39;: mean # sanity check } Modelling # Step one is the basic Linear Regression. It probably won\u0026rsquo;t work super well as the data is likely non-linear.\nfrom sklearn.linear_model import LinearRegression # we expect numerical values for all dimensions df_lr = df.copy() df_lr[\u0026#39;date\u0026#39;] = pd.to_numeric(df_lr[\u0026#39;date\u0026#39;]) # this is converted to nanoseconds since the unix epoch # Fetch our data to start x_train, x_test, y_train, y_test = get_data(df_lr) # training the model linear_regressor = LinearRegression() linear_regressor.fit(x_train, y_train) # get a prediction result y_pred = linear_regressor.predict(x_test) linear_result = get_errors(y_test, y_pred) linear_result {'MAE': 162853.0350747235, 'MAE_ratio': 0.302607784870751, 'ME': 4157759.6482200027, 'ME_ratio': 7.725802817218291, 'RMSE': 238675.029580604, 'RMSE_ratio': 0.44349754962938126, 'y_true_mean': 538165.3850851221} From here let\u0026rsquo;s try a bunch of other regression modelling techniques to see what the best default is.\nimport time # this is how we test a single model, and produce a set of metrics to be measured. 
def do_basic_model_fitting(regressor): start = time.perf_counter_ns() _dt = df.copy() _dt[\u0026#39;date\u0026#39;] = pd.to_numeric(_dt[\u0026#39;date\u0026#39;]) _x_train, _x_test, _y_train, _y_test = get_data(_dt) regressor.fit(_x_train, np.ravel(_y_train)) _y_pred = regressor.predict(_x_test) return {\u0026#39;model\u0026#39;: str(regressor).strip(), \u0026#39;time_taken\u0026#39;: (time.perf_counter_ns() - start), **get_errors(_y_test, _y_pred)} # function for testing \u0026amp; aggregating multiple model tests def model_testing(_models): test_results = [] for model in _models: res = do_basic_model_fitting(model) test_results.append(res) return pd.DataFrame(data=test_results) from sklearn.gaussian_process import GaussianProcessRegressor from sklearn.tree import DecisionTreeRegressor from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor from sklearn.neighbors import KNeighborsRegressor from sklearn.svm import SVR from sklearn.neural_network import MLPRegressor # random_state is set with np.random.seed() in one of the top cells. models_1 = [ GaussianProcessRegressor(random_state=SEED), KNeighborsRegressor(), DecisionTreeRegressor(random_state=SEED), RandomForestRegressor(random_state=SEED), GradientBoostingRegressor(random_state=SEED), SVR(), MLPRegressor(random_state=SEED) ] models_1_df = model_testing(models_1) # Add original OLS regression model results to the dataframe. 
models_1_df = pd.concat( [models_1_df, pd.DataFrame([{\u0026#39;model\u0026#39;: str(linear_regressor), **linear_result}])], ignore_index=True ) models_1_df[\u0026#39;RMSE_rank\u0026#39;] = models_1_df[\u0026#39;RMSE_ratio\u0026#39;].rank() models_1_df.set_index(\u0026#39;RMSE_rank\u0026#39;, inplace=True) models_1_df.sort_index() | RMSE_rank | model | time_taken | MAE | MAE_ratio | ME | ME_ratio | RMSE | RMSE_ratio | y_true_mean | |:---------:|----------------------------------------------|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:| | 1 | RandomForestRegressor(random_state=1234) | 8.90942e+09 | 67567.8 | 0.125552 | 2.41883e+06 | 4.49458 | 125670 | 0.233516 | 538165 | | 2 | GradientBoostingRegressor(random_state=1234) | 2.66899e+09 | 76000 | 0.141221 | 2.4128e+06 | 4.48339 | 132250 | 0.245743 | 538165 | | 3 | DecisionTreeRegressor(random_state=1234) | 1.49893e+08 | 99986.1 | 0.185791 | 2.375e+06 | 4.41314 | 183278 | 0.34056 | 538165 | | 4 | LinearRegression() | nan | 162853 | 0.302608 | 4.15776e+06 | 7.7258 | 238675 | 0.443498 | 538165 | | 5 | SVR() | 1.07078e+10 | 219488 | 0.407845 | 6.6125e+06 | 12.2871 | 361966 | 0.672592 | 538165 | | 6 | KNeighborsRegressor() | 1.12232e+09 | 263538 | 0.489698 | 6.58138e+06 | 12.2293 | 390482 | 0.725581 | 538165 | | 7 | GaussianProcessRegressor(random_state=1234) | 4.29049e+10 | 538098 | 0.999874 | 7.0625e+06 | 13.1233 | 642529 | 1.19392 | 538165 | | 8 | MLPRegressor(random_state=1234) | 2.74833e+09 | 9.43668e+12 | 1.75349e+07 | 9.55167e+12 | 1.77486e+07 | 9.43691e+12 | 1.75353e+07 | 538165 | Looks like the defaults for a few models are much more effective than others, namely the Random Forest model. It\u0026rsquo;s worth looking at what each of these models actually does. I\u0026rsquo;ve selected models from different paradigms to show that some approaches work better than others, and that this can depend on the underlying data set. 
An in-depth understanding of each model helps with selection for a given data set; the sklearn docs group supervised learning approaches quite well.\nLooking at these models in more detail, descending wrt RMSE_rank, with default parameters:\n8) Neural Networks - MLPRegressor # Neural networks are pretty advanced tools that are based on the concept of layers of neurons being connected together via activation functions. They typically need to be tuned pretty heavily; it\u0026rsquo;s no surprise that this model did the worst with defaults. Typically, using more in-depth libraries such as Keras or PyTorch is more suitable than sklearn\u0026rsquo;s multi-layer perceptron, especially for more detailed models. This example is VERY off, thanks to the defaults provided.\n7) Probabilistic - GaussianProcessRegressor # A Gaussian process is a pretty heavyweight statistical tool to estimate a set of observations that assumes you can correlate them over multidimensional normal distributions. To be honest I don\u0026rsquo;t understand the process in much detail since I haven\u0026rsquo;t studied classical statistics in much depth. Long story short, it\u0026rsquo;s good at fitting when your dimensions are inherently related to these distributions and, while lacking accuracy in high dimensions, can be trained quickly with approximation methods.\n6) Grouping - KNeighborsRegressor # Nearest neighbours algorithms try to locate other samples in the vector space that is represented by the dimensionality of the data. For example, we are using 19 dimensions at this point (as per the get_data() function). This means we calculate a distance metric between each record we want to predict and the training records, keeping a select number of the closest ones to predict the output value.\n5) Support Vector Machines - SVR # SVMs are really powerful tools that essentially split up your problem space into dimensions and draw a hyperplane across them, allowing you to divide or predict observations. 
They are incredibly powerful as they cope well with high-dimensional data and can be kernelized so that non-linear functions can be used as predictors, making them many times more powerful than even transformed linear regression models. They do need to be trained in a more complex way than the defaults, which we will come on to.\n4) Linear Model - LinearRegression # It\u0026rsquo;s worth noting that this OLS model is a very simple regression and hence would not realistically be used unless the data is very simple. Many linear models can be generalized to produce polynomial regression models, typically with some sort of transform. In this case, I haven\u0026rsquo;t tried any higher-order regressions, but there may be a good reason to try: they\u0026rsquo;re quick to fit, and relationships in data are rarely linear and could be log/exp or even sin/tangential, leading to quick wins. Angular/geometric patterns are notoriously hard to fit with these methods.\n3) Tree-Based - DecisionTreeRegressor # Decision trees are understandable and intuitive as models go and can be visualized using open source tooling such as graphviz. They are created by splitting the feature space at points chosen by a loss function. This makes them quick to generate, with fitting time relative to the number of nodes needing training. Unfortunately, they can be liable to under/over-fitting with the wrong hyperparameters, so tuning \u0026amp; data balancing prior to fitting can be important. This blog post has some superb detail.\n2) Tree-Based - GradientBoostingRegressor # Gradient boosting sits in a type of model known as boosting ensemble models. These take into account multiple predictors one by one to reduce model bias. Sort of like a recommendation algorithm for sub algorithms. 
The default base predictor is actually a decision tree, so it makes sense that it outperformed the single decision tree.\n1) Tree-based - RandomForestRegressor # This is another type of ensemble model. However, this is of the second type, an averaging model. It works by taking the average over multiple models. This seems to have worked the best due to the default parameters working very well for a baseline model. The idea behind combining multiple decision trees, each with their own errors, is the hope that these errors cancel each other out over the course of many trees, forming a more accurate prediction.\nOverall we have learned that the best way, for **simple** models, is combinations. However, I am convinced we can create a much more accurate regression model from some of the more simple predictors we have tried already with parameter tuning \u0026amp; data selection. Ideally, this would be considered while making an informed decision from the data and maybe running a few heuristics to guide model selection. Ultimately the flow of ML engineers should follow a similar pattern:\nSolution space is well-defined with success metrics\nData is ingested\nData is analysed and munging steps defined\nInitial model testing\nIntelligent selection of model types trialled\nSynthetic fields are created where it makes sense\nModel hyperparameter tuning and cross-validations applied\nDownstream marts produced with clear tests, cleaning and synthetic fields replicated\nProduction Models trained \u0026amp; versioned artefacts produced\nModels deployed and monitored\nOverall, of this long list, we were mostly focused on the first 4 stages, which makes sense. This post has been about understanding the models and how that basic understanding is reflected in their predictions. The next stage will be to really start completing the models and narrowing down on a production-able model. 
To do this I\u0026rsquo;d like to apply this to previous models/data in this post and also look to find data that would be valuable to analyse in the wider world.\nNext big learning: https://www.kaggle.com/learn/intro-to-deep-learning\n","date":"27 January 2022","externalUrl":null,"permalink":"/blog/machine-learning-starting-blocks/","section":"Blog","summary":"","title":"Machine Learning Starting Blocks","type":"blog"},{"content":"This is part 2 of the off to the cloud series, where I put together some AWS and other cloud infra as a PoC into what we can do with the tools they provide. This section will cover RDS and ECS. Part 1 covered simple EC2 setup and some AWS VPC networking.\nThis is meant to be a PoC and has some security holes (looked at in detail in the last section) that should be patched before deploying.\nStep 2 - Cloud Native # This section of this tutorial assumes you\u0026rsquo;ve got experience with Docker \u0026amp; can containerize this runtime. There are ways of containerizing and running almost any application even if it needs access to a file system or some sort of specialized hardware. This post currently has no requirements for anything other than the stateless database access.\nIn this part we will migrate to ECS for the app runtime and RDS for the database. We will also take into account some resiliency. I will assume you have your provider set up and can run terraform deployments; if not, start with some terraform getting started guides.\nNote that our default tags look as below. 
It\u0026rsquo;s good practice to tag your infra for billing and general identification.\nlocals { default_tags = { version : \u0026#34;2.2.4\u0026#34; project : \u0026#34;offToTheCloud\u0026#34; } } 2.1 Networking # To start, we will need a network: enough to provide some failover, so multi-AZ, and some flow logs so we can monitor the flow of traffic on our VPC.\ndata \u0026#34;aws_region\u0026#34; \u0026#34;current\u0026#34; {} data \u0026#34;aws_caller_identity\u0026#34; \u0026#34;current\u0026#34; {} module \u0026#34;vpc\u0026#34; { source = \u0026#34;terraform-aws-modules/vpc/aws\u0026#34; version = \u0026#34;2.77.0\u0026#34; name = \u0026#34;step_2_vpc\u0026#34; cidr = \u0026#34;10.0.0.0/16\u0026#34; public_subnets = [ \u0026#34;10.0.0.0/24\u0026#34;, \u0026#34;10.0.1.0/24\u0026#34; ] azs = [ \u0026#34;${data.aws_region.current.name}a\u0026#34;, \u0026#34;${data.aws_region.current.name}b\u0026#34; ] enable_dns_support = true enable_dns_hostnames = true enable_flow_log = true create_flow_log_cloudwatch_iam_role = true create_flow_log_cloudwatch_log_group = true flow_log_cloudwatch_log_group_kms_key_id = aws_kms_key.flow_log_key.arn tags = merge(local.default_tags, { Name = \u0026#34;Step 2 VPC\u0026#34; }) } resource \u0026#34;aws_kms_key\u0026#34; \u0026#34;flow_log_key\u0026#34; { description = \u0026#34;flow_log_key\u0026#34; deletion_window_in_days = 7 policy = data.aws_iam_policy_document.kms_log_policy.json tags = local.default_tags } # aws_kms_alias does not accept a tags argument resource \u0026#34;aws_kms_alias\u0026#34; \u0026#34;flow_log_key\u0026#34; { target_key_id = aws_kms_key.flow_log_key.arn name_prefix = \u0026#34;alias/flow_log_key_\u0026#34; } data \u0026#34;aws_iam_policy_document\u0026#34; \u0026#34;kms_log_policy\u0026#34; { policy_id = \u0026#34;kms_log_policy\u0026#34; statement { principals { type = \u0026#34;AWS\u0026#34; identifiers = [data.aws_caller_identity.current.arn] } actions = [\u0026#34;*\u0026#34;] resources = [\u0026#34;arn:aws:kms:*\u0026#34;] } 
statement { principals { type = \u0026#34;Service\u0026#34; identifiers = [\u0026#34;logs.${data.aws_region.current.name}.amazonaws.com\u0026#34;] } actions = [ \u0026#34;kms:Encrypt*\u0026#34;, \u0026#34;kms:Decrypt*\u0026#34;, \u0026#34;kms:ReEncrypt*\u0026#34;, \u0026#34;kms:GenerateDataKey*\u0026#34;, \u0026#34;kms:Describe*\u0026#34; ] resources = [\u0026#34;arn:aws:kms:*\u0026#34;] } } The above is relatively complex, but does what it says on the tin. The KMS keys are for encrypting our flow logs and the DNS names are for ease of use.\n2.2 - ECS # ECS is the runtime we will be using; it\u0026rsquo;s a managed container solution by AWS. You provide the image, they will run it for you. There are some \u0026ldquo;easy mode\u0026rdquo; configs we\u0026rsquo;ve selected to get started, which will be pointed out as we go through this tutorial. ECS is formed of a few parts:\nCluster\nService\nTask\nContainer\nA Cluster can have multiple Services which are composed of Tasks and a Task can have multiple **Containers**. For example, a microservice arch might have a web API and a booking service. These two applications can run in the same Cluster as individual Services, each with a service definition. Services can then be broken down into instances of sets of containers known as Tasks. Then each of these Tasks might be composed of 1 or more containers, maybe a monitoring sidecar scraping metrics or a secrets manager sidecar providing secrets to the main container.\nWith these terms defined, we can look into our implementation. 
Let\u0026rsquo;s start with the Cluster.\nresource \u0026#34;aws_kms_key\u0026#34; \u0026#34;ecs_command_key\u0026#34; { description = \u0026#34;ecs_command_key\u0026#34; deletion_window_in_days = 7 tags = local.default_tags } # aws_kms_alias does not accept a tags argument resource \u0026#34;aws_kms_alias\u0026#34; \u0026#34;ecs_command_key\u0026#34; { target_key_id = aws_kms_key.ecs_command_key.arn name_prefix = \u0026#34;alias/ecs_command_key_\u0026#34; } resource \u0026#34;aws_cloudwatch_log_group\u0026#34; \u0026#34;ecs_logs\u0026#34; { name_prefix = \u0026#34;ecs_logs_\u0026#34; retention_in_days = 7 kms_key_id = aws_kms_key.ecs_log_key.arn tags = local.default_tags } resource \u0026#34;aws_ecs_cluster\u0026#34; \u0026#34;my_ecs_cluster\u0026#34; { name = \u0026#34;main_cluster\u0026#34; capacity_providers = [ \u0026#34;FARGATE\u0026#34;, \u0026#34;FARGATE_SPOT\u0026#34; ] configuration { execute_command_configuration { // https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_ExecuteCommandConfiguration.html kms_key_id = aws_kms_key.ecs_command_key.arn logging = \u0026#34;OVERRIDE\u0026#34; log_configuration { cloud_watch_encryption_enabled = true cloud_watch_log_group_name = aws_cloudwatch_log_group.ecs_logs.name } } } depends_on = [ module.vpc ] tags = local.default_tags } This block creates a command key to secure our data in transit to the container, a log group with 7 days of retention in CloudWatch, and a cluster that leverages these assets. The cluster itself has a name and uses Fargate as a capacity provider rather than trying to roll our own EC2 provider; this means we don\u0026rsquo;t have control over the underlying hosts afaik - advanced use cases may require known infra via the EC2 provider.\nNow we have a cluster, let\u0026rsquo;s define our task on it. We only have an API which runs in docker; you can assume it\u0026rsquo;s prebuilt and in docker hub for this example. This is another easy mode step we cover below. This means to define our task we just need a few bits and pieces. 
Namely, the rest of the definition and the IAM roles. Let\u0026rsquo;s start on the task def.\nlocals { ecs_service_name = \u0026#34;api_service\u0026#34; ecs_container_name = \u0026#34;my_first_api\u0026#34; ecs_app_port = 80 } resource \u0026#34;aws_ecs_task_definition\u0026#34; \u0026#34;api_task\u0026#34; { family = local.ecs_service_name execution_role_arn = aws_iam_role.ecs_execution_role.arn task_role_arn = aws_iam_role.ecs_task_role.arn # https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_ContainerDefinition.html container_definitions = jsonencode([ { name = local.ecs_container_name image = \u0026#34;mradjunctpanda/off-to-the-cloud:0.1.0\u0026#34; essential = true portMappings = [ { containerPort = local.ecs_app_port hostPort = local.ecs_app_port } ] environment = [ { name = \u0026#34;PORT\u0026#34; value = \u0026#34;80\u0026#34; } ] logConfiguration = { logDriver = \u0026#34;awslogs\u0026#34; options = { awslogs-group = aws_cloudwatch_log_group.ecs_logs.name awslogs-region = data.aws_region.current.name awslogs-stream-prefix = \u0026#34;ecs\u0026#34; } } } ]) network_mode = \u0026#34;awsvpc\u0026#34; requires_compatibilities = [ \u0026#34;FARGATE\u0026#34; ] cpu = 256 memory = 512 tags = local.default_tags } The majority of this is self-explanatory, with the complicated bits being the AWS-specific parts. The logConfiguration really is just the settings AWS needs to fulfil your logging requirements; we created the pieces needed earlier. The docs have some more info on how this works.\nOne significant piece to understand is how ECS uses IAM to execute and schedule tasks. 
In fact, we have a whole file dedicated to creating these roles.\ndata \u0026#34;aws_iam_policy_document\u0026#34; \u0026#34;ecs_execution_assume_policy_doc\u0026#34; { policy_id = \u0026#34;ecs_execution_assume_policy_doc\u0026#34; statement { actions = [\u0026#34;sts:AssumeRole\u0026#34;] principals { type = \u0026#34;Service\u0026#34; identifiers = [\u0026#34;ecs-tasks.amazonaws.com\u0026#34;] } } } data \u0026#34;aws_iam_policy_document\u0026#34; \u0026#34;ecs_execution_logging_policy_doc\u0026#34; { policy_id = \u0026#34;ecs_execution_logging_policy_doc\u0026#34; statement { actions = [ \u0026#34;logs:*\u0026#34; ] resources = [\u0026#34;arn:aws:logs:*\u0026#34;] } } # The Execution Role is used by ECS to start the task: what needs to be accessed to launch the containers (e.g. writing logs) resource \u0026#34;aws_iam_role\u0026#34; \u0026#34;ecs_execution_role\u0026#34; { name = \u0026#34;off_to_cloud_ecs_execution_role\u0026#34; assume_role_policy = data.aws_iam_policy_document.ecs_execution_assume_policy_doc.json tags = local.default_tags } # The Task Role is assumed by the running containers: which AWS services the application itself can access resource \u0026#34;aws_iam_role\u0026#34; \u0026#34;ecs_task_role\u0026#34; { name = \u0026#34;off_to_cloud_ecs_task_role\u0026#34; assume_role_policy = data.aws_iam_policy_document.ecs_execution_assume_policy_doc.json tags = local.default_tags } resource \u0026#34;aws_iam_role_policy\u0026#34; \u0026#34;ecs_execution_logging\u0026#34; { name = \u0026#34;ecs_logging\u0026#34; role = aws_iam_role.ecs_execution_role.id policy = data.aws_iam_policy_document.ecs_execution_logging_policy_doc.json } resource \u0026#34;aws_iam_role\u0026#34; \u0026#34;ecs_service_role\u0026#34; { name = \u0026#34;ecs_service_role\u0026#34; assume_role_policy = data.aws_iam_policy_document.ecs_execution_assume_policy_doc.json } So what is this saying? Well, we define 3 roles, two of which matter here: one for setting up the tasks (execution role) and one for running the tasks (task role). The third, ecs_service_role, is not referenced by anything else shown in this post. 
Both of these roles can be assumed via the ecs_execution_assume_policy_doc which basically says \u0026lsquo;allow the service principal ecs-tasks.amazonaws.com to use this role\u0026rsquo;. We then attach a policy to the execution role, which allows it to do anything with logs. This is so that the execution role can place our logs into CloudWatch. There are no extra requirements for our task role at this point.\nThis setup is relatively simple again. Things like access to S3 buckets, secret keys and other AWS services are configured against the task and execution roles. For example, putting a KMS key on the service will require the execution role to have access in order to place the key. Then if private S3 access is required, access should be granted to the task role for S3 and, if necessary, the vpce.\nOk, finally, after IAM and Task definitions we can define our service. The service is basically a, well, a service. It does stuff. In our case, running an API. We also want to place a few extra pointers into it though, like how it\u0026rsquo;s launched, the networking capabilities and a load balancer. 
We will come on to the load balancer in a bit but here is the service definition.\nresource \u0026#34;aws_ecs_service\u0026#34; \u0026#34;api_service\u0026#34; { name = local.ecs_service_name cluster = aws_ecs_cluster.my_ecs_cluster.arn desired_count = 3 launch_type = \u0026#34;FARGATE\u0026#34; task_definition = aws_ecs_task_definition.api_task.arn load_balancer { target_group_arn = aws_lb_target_group.rest_api_blue_group.arn container_name = local.ecs_container_name container_port = local.ecs_app_port } network_configuration { subnets = module.vpc.public_subnets assign_public_ip = true security_groups = [ aws_security_group.application_access.id ] } tags = local.default_tags } resource \u0026#34;aws_security_group\u0026#34; \u0026#34;application_access\u0026#34; { vpc_id = module.vpc.vpc_id name_prefix = \u0026#34;whoami_access_\u0026#34; ingress { cidr_blocks = [ \u0026#34;0.0.0.0/0\u0026#34; ] description = \u0026#34;http inbound\u0026#34; from_port = local.ecs_app_port to_port = local.ecs_app_port protocol = \u0026#34;tcp\u0026#34; } // bit of a hack, see below for a patch egress { from_port = 0 to_port = 0 protocol = \u0026#34;all\u0026#34; cidr_blocks = [ \u0026#34;0.0.0.0/0\u0026#34; ] ipv6_cidr_blocks = [ \u0026#34;::/0\u0026#34; ] } tags = local.default_tags } Note we\u0026rsquo;re also providing a security group - defining who can access our service. The service is not reachable only because of the security group, however. As part of the VPC definition we asked for public subnets, ones with public facing IPs. Hence, all containers spun up in this service will be assigned a public and a private IP. The max size of the service is bounded by the number of IPs available in the subnets\u0026rsquo; CIDR blocks. If we assigned it a private subnet, it would only be accessible internally in our VPC.\nTwo things of significance in the security group to note are\nThe ingress is only for the application port. 
This is the \u0026ldquo;host\u0026rdquo; port, which then has to be open and mapped to our container (see the container definition for the \u0026ldquo;host\u0026rdquo;:container mapping).\nThe egress means our containers can access anything. This is so we can pull the image from docker hub but is a **major security flaw** if the containers are compromised. Patching this requires the image to be located and pulled internally from Amazon ECR (Elastic Container Registry).\nThe last thing to look at is the Load balancer and Auto Scaling Group.\n2.3 - Getting Scale # Scalability and reliability should be thought of at every point of your stack. This API example is no different. There are a few places we have baked in reliability and some we can change. For example, baked in:\nAWS services - they all have their own SLAs\nOur code/service - should have good paths to failover and retry operations internally.\nAnother place we can add reliability is how we scale to meet demand. We automatically become more reliable if we can spin up new containers when certain thresholds are met and kill off ones that fail health checks. 
Below is the definition of our load balancer.\nresource \u0026#34;aws_lb\u0026#34; \u0026#34;api_load_balancer\u0026#34; { name = \u0026#34;api-lb\u0026#34; internal = false load_balancer_type = \u0026#34;application\u0026#34; security_groups = [ aws_security_group.application_access.id ] subnets = module.vpc.public_subnets } resource \u0026#34;aws_lb_target_group\u0026#34; \u0026#34;rest_api_blue_group\u0026#34; { name = \u0026#34;rest-api-blue-group\u0026#34; vpc_id = module.vpc.vpc_id target_type = \u0026#34;ip\u0026#34; port = local.ecs_app_port protocol = \u0026#34;HTTP\u0026#34; health_check { healthy_threshold = 3 path = \u0026#34;/hello\u0026#34; port = local.ecs_app_port } depends_on = [ aws_lb.api_load_balancer ] lifecycle { create_before_destroy = true } } resource \u0026#34;aws_lb_listener\u0026#34; \u0026#34;api_listener\u0026#34; { load_balancer_arn = aws_lb.api_load_balancer.arn port = 80 # port receiving traffic on lb protocol = \u0026#34;HTTP\u0026#34; default_action { type = \u0026#34;forward\u0026#34; target_group_arn = aws_lb_target_group.rest_api_blue_group.arn } depends_on = [ aws_lb_target_group.rest_api_blue_group ] lifecycle { create_before_destroy = true } } Notice there are 3 parts. First is the Listener, which is the far end, consumer facing part of the stack. It\u0026rsquo;s set up to listen on port 80 and send its traffic to the target group. This is where we could attach domain names and TLS certs for a nice API domain, like api.tobydevlin.com, or add other types of actions. There are plenty of ways to leverage a Load Balancer Listener; my example is a very bland one. (I should really have added a cert but that\u0026rsquo;s left as an exercise for the reader.)\nThe next stage is the Target Group. You can point to multiple target groups at one time in a blue-green deploy strategy if needed, hence the name of mine. In my case I just have a single group which listens on a port and has a health check. 
It\u0026rsquo;s configured to have a group of IPs in its pool, but can be configured with other target types. There is also capability for sticky routing and traffic distribution strategies where needed.\nFinally, we have the Load Balancer itself - the final piece of the puzzle. It\u0026rsquo;s placed on the service at creation and glues the pieces together. This LB can be shared via a vpce and has settings that tell AWS how to handle it. The main point is the type, which defines how the LB should work and what layer of the network stack it operates in.\n2.4 - RDS \u0026amp; Data # \u0026hellip; Coming soon! When I have spare time.\nWrapping up - Easy Mode Considerations # Throughout this we have taken a few \u0026ldquo;easy mode\u0026rdquo; decisions; I\u0026rsquo;ve commented on the largest below.\nSingle VPC - it is possible to secure an app with a public and a private VPC, rather than subnets as we have done. This will allow much more control over who is able to access what. Ideally, in high security situations, no one should access the prod data apart from the API itself and approved egress solutions. This can be done by spinning up our RDS cluster in a second VPC and providing it as a service via a VPC Endpoint to our service, along with all the security bells and whistles.\nFargate ECS Capacity Provider - as mentioned in section 2.2, we have defaulted to Fargate, AWS\u0026rsquo;s serverless compute option. This means that we don\u0026rsquo;t own the underlying infra and hence don\u0026rsquo;t have full control. For example, if we wanted dedicated underlying hosts or a bare metal host, Fargate means this isn\u0026rsquo;t possible. This is something to keep in mind when managing ECS containers.\nDocker Hub Repo - At the end of the day, minimizing the tools you use can save a lot of money. This strategy of building an ECS task off a docker image stored in Docker\u0026rsquo;s own registry saves us time by not having to set up a secure SDLC (I did it all manually tbh). 
In an ideal world the container image should be built in a CI and pushed to, in this case, an internal Amazon ECR (Elastic Container Registry) repository, where we have full control over who can access the images and how, and which can be reached via a more constrained security group. This is all out of scope, and maybe I\u0026rsquo;ll cover secure, scalable SDLC in another post.\nFull Egress on API Containers - Allowing egress on all ports and protocols is a hack to allow communication to docker hub (see above for how to mitigate this risk). Opening this up allows anyone with access to our servers to egress information from them. Consider closing this if you\u0026rsquo;re implementing for production. ","date":"4 October 2021","externalUrl":null,"permalink":"/blog/off-to-the-cloud-part-2/","section":"Blog","summary":"","title":"Off To The Cloud - Part 2","type":"blog"},{"content":"","date":"4 October 2021","externalUrl":null,"permalink":"/tags/terraform/","section":"Tags","summary":"","title":"Terraform","type":"tags"},{"content":"This series of posts will be covering my journey to create a \u0026ldquo;traditional\u0026rdquo; web app using IAC, python and AWS. This can be thought of as a port of a traditional enterprise app onto the cloud. The code can be found on my github in the offToTheCloud project; the real AWS part begins at part 1.\nApp Architecture # A basic web REST API with a database - a cookie cutter app most folk have written or worked on before. Adding a micro UI or integrating with a front end is as simple as updating the CORS settings, but we won\u0026rsquo;t add this just yet.\nThis is a pretty basic setup with an arbitrary app runtime; in a legacy system this is most likely just a process running on a host.\nStep 0 - Setting up Environment # Hopefully you\u0026rsquo;ve used git, python, docker and maybe terraform before, as that\u0026rsquo;s the stack for this post. 
I\u0026rsquo;ve set up the example repo over on GitHub; go through the readme to get more background on how to set up the app locally.\nAs with the rest of the steps, the code can be started from the IAC folders. In this first case we will use the local IAC folder. This uses docker compose locally and will set up the app to point to a local docker database with no data in to start. There are also instructions to set up without docker, but that requires a live postgres server or using a local/memory sqlite db.\ndocker compose --file iac\\local\\docker-compose.yml up -d These containers can also be killed with:\ndocker compose --file iac\\local\\docker-compose.yml down Now you can see the API on\nSetting up cloud env \u0026amp; Terraform intro # Moving on to the rest of the post we will be using Terraform (get started link). Install it and set up a remote back end so as not to lose the state (you\u0026rsquo;ll need to update this in the IAC code). The code will live in the iac/* files and each step can be created in the relevant step\u0026rsquo;s folder with terraform commands.\nTo get up and running with terraform in AWS, make sure you have your AWS credentials set up; follow the AWS CLI install and have that set up with your root or IAM admin creds.\nPart 1 - Migrating to cloud # Moving over to cloud will take a very simple, lift and shift, approach; an introduction to the AWS basics of cloud. We will focus on a small set of services:\nVPC Subnets Security Groups Availability Zones EC2 Before We Start # This tutorial will reduce the access to most instances from the world wide web, limiting access only to your own external facing IP address. We will also secure most access via ssh using an ssh key. More info can be learned here but the basics are 1 - DON\u0026rsquo;T SHARE YOUR PRIVATE KEY FILE! It\u0026rsquo;s the id file for your computer and sharing it means people can impersonate you. 
That\u0026rsquo;s about it; only share the id_rsa.pub part of the key.\nTo generate a new key, use the below command:\nssh-keygen -t rsa -b 4096 1.1 - Building The Network # The first step will be creating some networking resources. Initially a VPC, the isolated box to put our resources in. We will use the terraform vpc module to shortcut a bit of work for us. What this looks like is:\nmodule \u0026#34;vpc\u0026#34; { source = \u0026#34;terraform-aws-modules/vpc/aws\u0026#34; version = \u0026#34;2.77.0\u0026#34; name = \u0026#34;step_1_vpc\u0026#34; cidr = \u0026#34;10.0.0.0/22\u0026#34; public_subnets = [ \u0026#34;10.0.0.0/24\u0026#34; ] private_subnets = [ \u0026#34;10.0.1.0/24\u0026#34;, ] # only use one AZ, as its cheaper not to ship data to other subnets azs = [ \u0026#34;us-east-1a\u0026#34; ] enable_dns_support = true enable_dns_hostnames = true tags = merge(local.default_tags, { Name = \u0026#34;Step 1 VPC\u0026#34; }) } From here we will need a few more items, namely some security groups to allow access between resources.\nresource \u0026#34;aws_security_group\u0026#34; \u0026#34;my_ssh_access\u0026#34; { vpc_id = module.vpc.vpc_id ingress { from_port = 22 protocol = \u0026#34;tcp\u0026#34; to_port = 22 cidr_blocks = [ local.my_ip_cidr ] } tags = merge(local.default_tags, { Name = \u0026#34;SSH Access\u0026#34; }) } resource \u0026#34;aws_security_group\u0026#34; \u0026#34;https_access\u0026#34; { vpc_id = module.vpc.vpc_id ingress { from_port = 443 protocol = \u0026#34;tcp\u0026#34; to_port = 443 cidr_blocks = [ \u0026#34;0.0.0.0/0\u0026#34; ] } tags = merge(local.default_tags, { Name = \u0026#34;HTTPS Access\u0026#34; }) } resource \u0026#34;aws_security_group\u0026#34; \u0026#34;postgres_comms\u0026#34; { vpc_id = module.vpc.vpc_id ingress { from_port = 5432 protocol = \u0026#34;tcp\u0026#34; to_port = 5433 cidr_blocks = module.vpc.private_subnets_cidr_blocks } tags = merge(local.default_tags, { Name = \u0026#34;postgres comms\u0026#34; }) } Final part of the 
networking is the elastic IP we can associate to the public API to ensure the IP doesn\u0026rsquo;t change when rebuilding the host. This is an Amazon owned IP that we\u0026rsquo;re leasing and can associate with any of the pieces of infra we want to.\nresource \u0026#34;aws_eip\u0026#34; \u0026#34;elastic_ip\u0026#34; { instance = aws_instance.app_server.id vpc = true } 1.2 - Setting Up Hosts # The real basic lift and shift will move the code onto EC2 hosts. From here we can ssh in and execute commands to start our processes. Firstly, a pair of hosts for app and database; one in the public and one in the private subnet.\nresource \u0026#34;aws_key_pair\u0026#34; \u0026#34;personal_rsa_key\u0026#34; { key_name = \u0026#34;my_rsa_key\u0026#34; public_key = file(pathexpand(\u0026#34;~/.ssh/id_rsa.pub\u0026#34;)) } resource \u0026#34;aws_instance\u0026#34; \u0026#34;app_server\u0026#34; { # AMI is the default for us-east-1, may differ for different regions ami = \u0026#34;ami-0ff8a91507f77f867\u0026#34; instance_type = \u0026#34;t2.nano\u0026#34; subnet_id = module.vpc.public_subnets[0] key_name = aws_key_pair.personal_rsa_key.id # instances in a VPC take security group IDs via vpc_security_group_ids vpc_security_group_ids = [ aws_security_group.my_ssh_access.id, aws_security_group.https_access.id ] tags = merge(local.default_tags, { Name = \u0026#34;App Server\u0026#34; }) } resource \u0026#34;aws_instance\u0026#34; \u0026#34;db_server\u0026#34; { ami = \u0026#34;ami-0ff8a91507f77f867\u0026#34; instance_type = \u0026#34;t2.nano\u0026#34; subnet_id = module.vpc.private_subnets[0] key_name = aws_key_pair.personal_rsa_key.id vpc_security_group_ids = [ aws_security_group.my_ssh_access.id, aws_security_group.postgres_comms.id ] tags = merge(local.default_tags, { Name = \u0026#34;Database Server\u0026#34; }) } Note how both of these have the aws_security_group.my_ssh_access security group attached and use the aws_key_pair resource. 
This allows us SSH access from our machine, and requires you to have completed the \u0026ldquo;before we start\u0026rdquo; section.\nTo make this easy, we should add some outputs that provide details about what we just created. Below are a couple of outputs that will be useful.\noutput \u0026#34;app_server_ip_pub\u0026#34; { value = aws_instance.app_server.public_ip } output \u0026#34;app_server_ip_priv\u0026#34; { value = aws_instance.app_server.private_ip } output \u0026#34;db_server_ip_priv\u0026#34; { value = aws_instance.db_server.private_ip } 1.3 - Deployment # Now that all our infra IaC is set up, let\u0026rsquo;s hit deploy. This is done with terraform init \u0026amp;\u0026amp; terraform apply. Follow the steps through and enter yes, and you should see something like the below:\nApply complete! Resources: 2 added, 1 changed, 2 destroyed. Outputs: app_server_ip_priv = \u0026#34;10.0.0.xxx\u0026#34; app_server_ip_pub = \u0026#34;54.86.xxx.xxx\u0026#34; db_server_ip_priv = \u0026#34;10.0.1.xxx\u0026#34; These are the results of the apply stage! Notice how there are 2 IPs with _priv and one with _pub. This means only the public application server can be accessed from the internet. If we wanted, we could now connect to the public app server with ssh 54.86.xxx.xxx and it would perform private key authentication against the public key we provided in the aws_key_pair resource. There are, however, better ways of accessing these servers, such as a VPN connection into our VPC. The setup of this is outside the scope of this post, but once it is complete you will be able to access both hosts via the VPN with private key auth.\nOnce your infra is all set up it just takes logging into the host and executing the right commands to run this code. 
The real next steps are cloud native, so we will skip the overly complicated cli commands to finish off step 1 and move onto step 2.\n","date":"5 September 2021","externalUrl":null,"permalink":"/blog/off-to-the-cloud-part-1/","section":"Blog","summary":"","title":"Off To The Cloud - Part 1","type":"blog"},{"content":"Concurrency in python has become incredibly simple since the asyncio package was created. Any developer, with a small restructuring of flow and an extra couple of keywords, can create easily concurrent applications. With the addition of multiple processes, this can easily become parallel too with the help of the multiprocessing lib.\nBelow is a simple demo of a task that could include an IO-bound operation where the application waits on another process. There are a list of tasks and a simple execution in both synchronous and asynchronous fashion.\nimport asyncio import time class MyTask: duration: int name: str def __init__(self, name: str, duration: int = None): self.name = name self.duration = duration async def do_task(t: MyTask): print(f\u0026#39;\u0026gt;\u0026gt;\u0026gt; doing task {t.name}\u0026#39;) if t.duration: print(f\u0026#39;found wait of duration {t.duration} seconds\u0026#39;) await asyncio.sleep(t.duration) print(f\u0026#39;\u0026lt;\u0026lt;\u0026lt; finished task {t.name}\u0026#39;) async def main(): tasks = [ MyTask(name=\u0026#34;one\u0026#34;, duration=1), MyTask(name=\u0026#34;two\u0026#34;), MyTask(name=\u0026#34;three\u0026#34;, duration=3), ] # sync start = time.perf_counter() for task in tasks: await do_task(task) print(f\u0026#39;took {time.perf_counter() - start} seconds\u0026#39;) print(\u0026#34;-\u0026#34;*20) # async start = time.perf_counter() await asyncio.wait([ do_task(task) for task in tasks ]) print(f\u0026#39;took {time.perf_counter() - start} seconds\u0026#39;) if __name__ == \u0026#39;__main__\u0026#39;: asyncio.run(main()) This prints out the below.\n\u0026gt;\u0026gt;\u0026gt; doing task one found wait of 
duration 1 seconds \u0026lt;\u0026lt;\u0026lt; finished task one \u0026gt;\u0026gt;\u0026gt; doing task two \u0026lt;\u0026lt;\u0026lt; finished task two \u0026gt;\u0026gt;\u0026gt; doing task three found wait of duration 3 seconds \u0026lt;\u0026lt;\u0026lt; finished task three took 4.0182585 seconds -------------------- \u0026gt;\u0026gt;\u0026gt; doing task three found wait of duration 3 seconds \u0026gt;\u0026gt;\u0026gt; doing task two \u0026lt;\u0026lt;\u0026lt; finished task two \u0026gt;\u0026gt;\u0026gt; doing task one found wait of duration 1 seconds \u0026lt;\u0026lt;\u0026lt; finished task one \u0026lt;\u0026lt;\u0026lt; finished task three took 3.0027220999999997 seconds As we can see, by running the waits as concurrent events the total time is the execution time plus the longest wait, as opposed to the sum of the waits. Just by leveraging some of the power that asyncio provides, we can remove the majority of external processing waiting from our synchronous code. There are also options to fine-tune how these primitive awaitables are collected and when execution is handed back to the main process.\n","date":"6 June 2021","externalUrl":null,"permalink":"/blog/python-concurrency-async-await/","section":"Blog","summary":"","title":"Concurrency in Python with Async Await","type":"blog"},{"content":"This is a quick overview of the basic tools AWS provides.\nCompute Services # EC2 # A web service that provides \u0026lsquo;resizable compute capacity\u0026rsquo; to make web-scale computing easier for devs. Concepts:\nInstance Type - the compute, memory and storage definitions Root Device Type instance store - physical store on the host, ephemeral (data is lost when the instance stops). Elastic Block (EBS) - persistent and separate from the VM. Preferred. AMIs - Templates for the VM, config and os and data. 
Purchase Options On-Demand - any time to spin up or shut down Reserved - discounts for committing for a period of time (years) Savings plan - similar to above but doesn\u0026rsquo;t reserve capacity, includes Fargate and Lambda. Spot - takes advantage of low demand. Spins up when bid is above water, shuts down when bid is below water with 2 min warning. Dedicated - physical machine in the datacenter. Launching an EC2 instance takes moments. The script in the \u0026lsquo;User Data\u0026rsquo; field is run when the instance is created. Terminate means to delete the whole instance.\nElastic Beanstalk # Beanstalk is similar to EC2 but automates deployment and scaling. It deals with the underlying EC2 instances for you. Beanstalk itself is free; the underlying infra is what you pay for. Beanstalk includes monitoring, deployment, scaling \u0026amp; customization, databases, load balancing and healthchecks. It supports many languages and is typically used for web servers; all ingress is also autoconfigured. It\u0026rsquo;s essentially a Heroku with more options.\nLambda # Serverless function execution, with cost per execution. It scales automatically with demand and is available across AWS regions. Code can be deployed via an S3 bucket.\nNetwork and Content Delivery # Route 53 # DNS as a service; all DNS options can be configured here. It is highly available and has features such as failover.\nVPC \u0026amp; Direct Connect # A VPC is an isolated part of the cloud which, by default, can only be communicated with by services within the same VPC. It supports IPv4 and IPv6, subnets, IP ranges, Route Tables and Network Gateways. Both private and public subnets are available, controlled via ingress. NAT is available for private subnets.\nPlacing a service in a VPC means by default only services within that VPC can communicate with it. 
This can be changed by leveraging services such as Transit Gateway, Internet Gateway or NAT Gateway.\nDirect Connect is a dedicated network connection from your on-premises network to the AWS datacenters.\nSubnets and Availability Zones # A single VPC contains one or more Subnets, each of which exists in only one Availability Zone (basically a datacentre in a given region). A unit of compute, for example an EC2 instance, exists in one and only one Subnet; these instances also cannot be moved between Availability Zones.\nSubnets can be private or public. Public means that services within the Subnet can access the internet and be accessed by the internet; to make a Subnet public, its associated Route Table needs a route with an Internet Gateway as a target.\nPrivate is the opposite, with no internet access inbound or outbound. A NAT Gateway can be used to allow outbound connections from private subnets.\nHere is a handy association guide:\nA Region has many VPCs and Availability Zones A VPC has many Subnets A Subnet has a single associated Route Table A Route Table can have many Routes that point to things like: A NAT Gateway An Internet Gateway - making any associated Subnets public Transit Gateway Egress Only Gateway Many others\u0026hellip; API Gateway and CloudFront # CloudFront is a global CDN which runs on the edge locations. It has security features such as Shield DDoS protection and WAF (Web Application Firewall).\nAPI Gateway is an API management service; it integrates with many other services and has metrics automatically attached.\nElastic Load Balancing # Distribution of traffic across compute services (EC2, ECS, Lambda, \u0026hellip;) across availability zones. There are 3 types:\nApplication Load Balancer (ALB) Vertical Scaling - Upgrading the type of resource, like increasing memory. Horizontal Scaling - Increase the number of instances behind the load balancer. 
Network Load Balancer (NLB) Classic Load Balancer Global Accelerator # This service speeds up access to your applications; it\u0026rsquo;s based on static IPs rather than DNS. Once a request hits an edge location on the AWS network it will funnel traffic through the AWS Network rather than the public internet. It integrates with other services like EC2 and ELB. This can also provide fault tolerance via IPs rather than DNS resolution.\nThis should be used in scenarios when you\u0026rsquo;re not running HTTP; things like UDP, VOIP or MQTT. It is also useful when you need static IPs, as it runs on IPs.\nElastic IP Addresses # By bringing your own or using an Amazon Elastic IP you can provide an external static IP bound to an EC2 instance\u0026rsquo;s Network Interface, allowing it to be hit by DNS or just via the internet.\nStorage Solutions # S3 # Contains a whole load of options such as web serving, web hosting, permissions etc. It\u0026rsquo;s structured into tiers of access that dictate pricing.\nStandard Intelligent Tiering Infrequent Access - also has a single availability zone option Glacier - for archive solutions EBS and EFS and other File Systems # Elastic Block Store is connected to single EC2 instances; multiple storage types are available (ssd, hdd, \u0026hellip;) at different storage performance, with encryption at rest.\nElastic File System is a fully managed NFS file system for Linux supporting petabyte scale over multiple AZs. It provides different storage classes for frequent and infrequent access along with Lifecycle rules for content. It can also be mounted to multiple EC2 instances across zones simultaneously.\nFSx is the Windows managed file system for things like NTFS, AD and SMB.\nSnowball and Snowmobile # Both are used for loading data into AWS when an upload connection won\u0026rsquo;t cut it. Snowball is for moving petabyte scale data into AWS and processing data at the edge; Snowmobile is for even larger volumes of data. 
Snowball is a physical device that\u0026rsquo;s provided and couriered to a destination for loading. Snowmobile is more like a truck (it is a truck) that operates the same way.\nDatabase \u0026amp; Data Utils # Databases # RDS is a drop-in relational database service, giving you all the database tooling without dealing with the underlying infra. It handles provisioning, patching, backup and recovery of databases across multiple availability zones. RDS supports a host of database technologies such as MySQL and Postgres along with many others, including Aurora, which is compatible with both.\nDynamoDB is both a document and key-value oriented NoSQL database. It\u0026rsquo;s fully managed so you don\u0026rsquo;t need to manage the database layer either, featuring auto-scaling and caching.\nDatabase Migration Service # Used for moving data into RDS, either as a one-time or a continuous migration. Cost is defined by compute required for the sync.\nElastiCache # In-memory datastore cache which is fully managed. Supports both Memcached and Redis with low latency. Autoscaling is available and it handles many common use cases like database caching and sessions.\nRedshift # Data warehousing in the cloud with petabyte scale. Supports VPC isolation, columnar storage and encryption at rest. You can also use it to query data in S3.\nApp Integrations # SNS # The Simple Notification Service is essentially a pub/sub message service. It\u0026rsquo;s loosely comparable to Kafka in the cloud: event-based, and it integrates well with other AWS services.\nSQS # SQS is a message queue that can be used in conjunction with SNS but is more stateful. It can be configured with or without ordering. A basic architecture would publish to SNS, which then fans out to multiple SQS queues to manage the read part further down the line.\nStep Functions # Manages workflows through a fully managed service. Cost is based on state transitions and also the underlying architecture. 
Step Functions are good for organizing BPMN processes as an execution service.\nCloud Management # Trusted Advisor # This service allows you to look at your best practices, performance and fault tolerance. It is a single pane of glass into your AWS account.\nUsers in AWS # Root users are special - they are owners of the account. IAM users are users within the account. IAM users should be used everywhere possible. You can create access keys in the IAM service to use in SDKs and the CLI.\nCloudTrail # Log, monitor and retain account activity across the infrastructure; it outputs into S3 and also CloudWatch. It can also aggregate over many organizations for a distributed view.\nCloudWatch # CloudWatch is the metrics, logging and alarms for your runtimes, including dashboards.\nAWS Config \u0026amp; Systems Manager # AWS Config continuously evaluates infra against a set of rules to keep configuration compliant. It can be set up to monitor custom rules and has built-in rules.\nSystems Manager is a tool for managing infrastructure, specifically around things like system updates and access to servers via your own creds.\nCloudFormation # IaC for all infrastructure in AWS. It\u0026rsquo;s a free tool to deploy the services by writing JSON or YAML, and it manages dependencies of deployments for you. It can also detect config drift when it occurs. Very similar to Terraform.\nOpsWorks # Managed service for Chef and Puppet, which are config as code providers.\nControl Tower and Organizations # When you need child accounts this is it. It consolidates billing, security and compliance across all of them. 
Control Tower is the way that Organizations is configured, for example by creating child accounts by config, consolidating all the AWS accounts created and Guardrails to ensure child accounts don\u0026rsquo;t do things they shouldn\u0026rsquo;t.\n","date":"29 April 2021","externalUrl":null,"permalink":"/blog/aws-basics/","section":"Blog","summary":"","title":"AWS Basics","type":"blog"},{"content":"","date":"17 March 2021","externalUrl":null,"permalink":"/tags/java/","section":"Tags","summary":"","title":"Java","type":"tags"},{"content":"","date":"17 March 2021","externalUrl":null,"permalink":"/tags/json/","section":"Tags","summary":"","title":"Json","type":"tags"},{"content":"","date":"17 March 2021","externalUrl":null,"permalink":"/tags/logging/","section":"Tags","summary":"","title":"Logging","type":"tags"},{"content":"Adding Structured JSON logging when using SLF4J is quite simple once you understand the logger structure. This post will cover a quick implementation of how to add JSON structured logging to your app with SLF4J using Logback and the logstash-logback-encoder.\nFirst a bit of background. # Typically, all logs follow a certain pattern in every language. That is as follows, generated -\u0026gt; picked up -\u0026gt; parsed and enriched -\u0026gt; splat out in the correct place/places. For example, in Java, a logger in HelloWorld.java, set up via SLF4J (an adapter) and Logback (a log handler), can create a log event with a timestamp, level, source info and a message. This event will be picked up by a Logback adapter, parsed for level, potentially enriched with some more data then forwarded on to be spat out in a console, file or a udp/tcp stream or something else. This is common in other languages too, such as python and Javascript - only with some caveats on the semantics.\nJava Example # I\u0026rsquo;m going to assume you have a typical project set up, being built with Maven and it runs. 
Below is an example of my HelloWorld class and the initial SLF4J dependency:\npackage com.tobydevlin.examples.logging; import org.slf4j.Logger; import org.slf4j.LoggerFactory; public class HelloWorld { private static final Logger LOG = LoggerFactory.getLogger(HelloWorld.class); public static void main(String[] args) { LOG.info(\u0026#34;hello world!\u0026#34;); } } \u0026lt;dependencies\u0026gt; \u0026lt;dependency\u0026gt; \u0026lt;groupId\u0026gt;org.slf4j\u0026lt;/groupId\u0026gt; \u0026lt;artifactId\u0026gt;slf4j-simple\u0026lt;/artifactId\u0026gt; \u0026lt;version\u0026gt;1.7.30\u0026lt;/version\u0026gt; \u0026lt;/dependency\u0026gt; \u0026lt;/dependencies\u0026gt; All it does is log out a single thing, and when you run it you get this:\n[main] INFO HelloWorld - hello world! The next step is to add some more intelligent handling under the hood; at the moment we\u0026rsquo;re using SLF4J\u0026rsquo;s simple built-in logger (slf4j-simple). So let\u0026rsquo;s swap slf4j-simple for slf4j-api and add 3 dependencies to our POM: 2 for Logback, the more powerful logger that works with the SLF4J facade, and one for logstash-logback-encoder, an extension to allow us to log in JSON easily.\n\u0026lt;dependencies\u0026gt; \u0026lt;!-- Logging --\u0026gt; \u0026lt;dependency\u0026gt; \u0026lt;groupId\u0026gt;org.slf4j\u0026lt;/groupId\u0026gt; \u0026lt;artifactId\u0026gt;slf4j-api\u0026lt;/artifactId\u0026gt; \u0026lt;version\u0026gt;1.7.30\u0026lt;/version\u0026gt; \u0026lt;/dependency\u0026gt; \u0026lt;dependency\u0026gt; \u0026lt;groupId\u0026gt;ch.qos.logback\u0026lt;/groupId\u0026gt; \u0026lt;artifactId\u0026gt;logback-classic\u0026lt;/artifactId\u0026gt; \u0026lt;version\u0026gt;1.2.3\u0026lt;/version\u0026gt; \u0026lt;/dependency\u0026gt; \u0026lt;dependency\u0026gt; \u0026lt;groupId\u0026gt;ch.qos.logback\u0026lt;/groupId\u0026gt; \u0026lt;artifactId\u0026gt;logback-core\u0026lt;/artifactId\u0026gt; \u0026lt;version\u0026gt;1.2.3\u0026lt;/version\u0026gt; \u0026lt;/dependency\u0026gt; \u0026lt;!-- Json Logging--\u0026gt; 
\u0026lt;dependency\u0026gt; \u0026lt;groupId\u0026gt;net.logstash.logback\u0026lt;/groupId\u0026gt; \u0026lt;artifactId\u0026gt;logstash-logback-encoder\u0026lt;/artifactId\u0026gt; \u0026lt;version\u0026gt;6.6\u0026lt;/version\u0026gt; \u0026lt;/dependency\u0026gt; \u0026lt;/dependencies\u0026gt; From here Logback only requires a logback.xml config file on the classpath to be configured with formatting and such. Below is the first, basic, config, to log everything to a file and to the console with a given format. Place this in your resources folder.\n\u0026lt;configuration\u0026gt; \u0026lt;appender name=\u0026#34;STDOUT\u0026#34; class=\u0026#34;ch.qos.logback.core.ConsoleAppender\u0026#34;\u0026gt; \u0026lt;encoder\u0026gt; \u0026lt;pattern\u0026gt;%d{HH:mm:ss.SSS} | [%thread] | %-5level | %logger{36} | %msg%n\u0026lt;/pattern\u0026gt; \u0026lt;/encoder\u0026gt; \u0026lt;/appender\u0026gt; \u0026lt;appender name=\u0026#34;FILE\u0026#34; class=\u0026#34;ch.qos.logback.core.FileAppender\u0026#34;\u0026gt; \u0026lt;file\u0026gt;myApp.log\u0026lt;/file\u0026gt; \u0026lt;encoder\u0026gt; \u0026lt;pattern\u0026gt;%d{HH:mm:ss.SSS} | [%thread] | %-5level | %logger{36} | %msg%n\u0026lt;/pattern\u0026gt; \u0026lt;/encoder\u0026gt; \u0026lt;/appender\u0026gt; \u0026lt;root level=\u0026#34;info\u0026#34;\u0026gt; \u0026lt;appender-ref ref=\u0026#34;STDOUT\u0026#34;/\u0026gt; \u0026lt;appender-ref ref=\u0026#34;FILE\u0026#34;/\u0026gt; \u0026lt;/root\u0026gt; \u0026lt;/configuration\u0026gt; When run this should now print a more detailed message to the console and also a file myApp.log should have also been made with a copy of the output.\n18:39:14.687 | [main] | INFO | HelloWorld | hello world! Now to JSONafy these logs. The logstash libs make this incredibly easy, allowing us to just use one of their encoder classes. 
My new logback.xml file looks like the below.\n\u0026lt;configuration\u0026gt; \u0026lt;appender name=\u0026#34;STDOUT\u0026#34; class=\u0026#34;ch.qos.logback.core.ConsoleAppender\u0026#34;\u0026gt; \u0026lt;encoder class=\u0026#34;net.logstash.logback.encoder.LogstashEncoder\u0026#34;\u0026gt; \u0026lt;/encoder\u0026gt; \u0026lt;/appender\u0026gt; \u0026lt;root level=\u0026#34;info\u0026#34;\u0026gt; \u0026lt;appender-ref ref=\u0026#34;STDOUT\u0026#34;/\u0026gt; \u0026lt;/root\u0026gt; \u0026lt;/configuration\u0026gt; This one doesn\u0026rsquo;t log to a file, but can be configured very easily. A benefit of this encoder is it will automatically add information added with their net.logstash.logback.argument.StructuredArguments.* utilities. A good example of this is the value() and keyValue() methods, or v() and kv() for short. I\u0026rsquo;ve rewritten the main method to include some pairs.\npublic static void main(String[] args) { LOG.info(\u0026#34;hello world! {} {}\u0026#34;, keyValue(\u0026#34;name\u0026#34;, \u0026#34;toby\u0026#34;), value(\u0026#34;time_taken\u0026#34;, 123456)); LOG.info(\u0026#34;hello world! {} {}\u0026#34;, v(\u0026#34;name\u0026#34;, \u0026#34;toby\u0026#34;), kv(\u0026#34;time_taken\u0026#34;, 123456)); } Which will output\n{\u0026#34;@timestamp\u0026#34;:\u0026#34;2021-03-17T12:51:07.180Z\u0026#34;,\u0026#34;@version\u0026#34;:\u0026#34;1\u0026#34;,\u0026#34;message\u0026#34;:\u0026#34;hello world! name=toby 123456\u0026#34;,\u0026#34;logger_name\u0026#34;:\u0026#34;HelloWorld\u0026#34;,\u0026#34;thread_name\u0026#34;:\u0026#34;main\u0026#34;,\u0026#34;level\u0026#34;:\u0026#34;INFO\u0026#34;,\u0026#34;level_value\u0026#34;:20000,\u0026#34;name\u0026#34;:\u0026#34;toby\u0026#34;,\u0026#34;time_taken\u0026#34;:123456} {\u0026#34;@timestamp\u0026#34;:\u0026#34;2021-03-17T12:51:07.196Z\u0026#34;,\u0026#34;@version\u0026#34;:\u0026#34;1\u0026#34;,\u0026#34;message\u0026#34;:\u0026#34;hello world! 
toby time_taken=123456\u0026#34;,\u0026#34;logger_name\u0026#34;:\u0026#34;HelloWorld\u0026#34;,\u0026#34;thread_name\u0026#34;:\u0026#34;main\u0026#34;,\u0026#34;level\u0026#34;:\u0026#34;INFO\u0026#34;,\u0026#34;level_value\u0026#34;:20000,\u0026#34;name\u0026#34;:\u0026#34;toby\u0026#34;,\u0026#34;time_taken\u0026#34;:123456} Now we can add any number of parameters and they will be placed into the output as JSON keys. There are lots more meta information enrichment features and streaming add-ins to the logstash api, more info in the resources.\nResources # slf4j ref logback.xml ref logback.xml log level paths logstash encoder github ","date":"17 March 2021","externalUrl":null,"permalink":"/blog/structured-logging-in-java-with-slf4j-and-logback/","section":"Blog","summary":"","title":"Structured Logging in Java with SLF4j and Logback","type":"blog"},{"content":"","date":"14 November 2020","externalUrl":null,"permalink":"/tags/.net/","section":"Tags","summary":"","title":".NET","type":"tags"},{"content":"","date":"14 November 2020","externalUrl":null,"permalink":"/tags/ai/","section":"Tags","summary":"","title":"Ai","type":"tags"},{"content":"","date":"14 November 2020","externalUrl":null,"permalink":"/tags/ocr/","section":"Tags","summary":"","title":"Ocr","type":"tags"},{"content":" My First Code Experience # During college, I took a computing module. It was taught by an enthusiastic man who, because of this course, has influenced my career decision to move into software and taught me to know just enough to solve the problems at hand and argue every last detail until proven wrong; so thanks, Joe. In retrospect, I probably should have made more effort with this project but the name of the game was to just pass, looking back at it now this thing was full of security flaws and was built under a flawed development model. 
It was great!\nThis code was initially developed on a rubbish laptop that sometimes turned off unexpectedly, with the entirety of development isolated from any type of version control or any type of backup. However, to submit the code to the exam folk the entire piece of work: design proposal, waterfall dev cycles, code, 143 pages of manual test evidence, had to be printed, bound and shipped off - all for that vital number telling me how clever I was (am).\nThis post will be dedicated to the story of how I brought it back from the **paper ** I was forced to print it out on and into the 21st century - open-source crap dumped on Github.\nThe First Steps # The code is on paper so I need to have the entire 74-page collection of code transcribed into actual text. From there I can look into building the class structure and finally running the app.\nPictures! # I have painstakingly photographed every page of code that\u0026rsquo;s required with my phone as my data collection step. This includes all the SQL files where, for some foreshadowing reason, I picked ' chicken\u0026rsquo; for all my passwords. Security at its finest.\nMy favourite part of this, outside the code, is the mistakes I make but obviously not fixed (or fixed by pen, earlier in the project), like the well described \u0026ldquo;Center Number\u0026rdquo; I give in every header of the project doc. Can you tell I was destined for great things?\nFrom here all I now need to do is turn this into a block of text I can plop into a file and run!\nImage to Text work # Or Optical Character Recognition (OCR) as it\u0026rsquo;s known, is the next phase. This is a boring basic part - yes I could write my own OCR and have it all generate correctly but this is not the point of this blog post (and will be left as an exercise to the reader). The real solution here is using the cloud to speed up the work. 
Here are the options:\nAWS Rekognition (not amazing) Google Drive image to text (better) Azure computer vision (pretty good) This online tool (good but manual copy-paste will suck) Many of these require URLs of images to be provided; all the images for this project are open-sourced in the GitHub repo. Ultimately Azure seems to take the cake, so we will use its pretty good recognition and then tidy up after.\nTo analyse the images, we use the API with the below function:\ndef make_request(img_url: str, print_res: bool = False) -\u0026gt; list: # Make initial request recognize_handw_results = computervision_client.read(img_url, raw=True) # Parse response to get op ID operation_location_remote = recognize_handw_results.headers[\u0026#34;Operation-Location\u0026#34;] operation_id = operation_location_remote.split(\u0026#34;/\u0026#34;)[-1] # Await the response while remote does processing (nasty impl but I don\u0026#39;t care) while True: get_handw_text_results = computervision_client.get_read_result(operation_id) if get_handw_text_results.status not in [\u0026#39;notStarted\u0026#39;, \u0026#39;running\u0026#39;]: break time.sleep(1) # Print the detected text, line by line if get_handw_text_results.status == OperationStatusCodes.succeeded: for text_result in get_handw_text_results.analyze_result.read_results: if print_res: for line in text_result.lines: print(line.text) return text_result.lines Wrapping this for a given image is easy.\ndef pipe_image_to_txt(image_file_path: str) -\u0026gt; (str, str): # form known url image_url = GH_BASE + image_file_path # Make request and get data res = make_request(image_url) # Join resulting data to a string out_text = \u0026#34;\\n\u0026#34;.join(t.text for t in res) # dump into an out file named after the input image out_file_name = image_file_path.replace(\u0026#34;.webp\u0026#34;, \u0026#34;.txt\u0026#34;) with open(f\u0026#34;out/{out_file_name}\u0026#34;, \u0026#34;w\u0026#34;) as f: f.write(out_text) # util result return image_file_path, out_file_name and finally, 
loop!\nfor i, image_file_path in enumerate(images[:1]): original, test_file = pipe_image_to_txt(image_file_path) print(i, \u0026#34;|\u0026#34;, original, \u0026#34;-\u0026gt;\u0026#34;, test_file) The result is now all pretty and gives us a nice output for where the content was dumped:\nTaking a look at one of the files we can kinda see how it looks:\nImports MySql.Data. MySqlClient Public Class MainMenu #Region \u0026#34;DB connection strings and such\u0026#34; Public connectionString As String = \u0026#34;Database=Computing2;\u0026#34; \u0026amp; - \u0026#34;Data Source=localhost;\u0026#34; \u0026amp; - \u0026#34;User Id=toby; \u0026#34; \u0026amp; \u0026#34;Password=chicken; \u0026#34; \u0026#39; \u0026#34;database=Computing\u0026#34; selects database from the list of mysql databases, use \u0026#34;show databases\u0026#34; in mysql \u0026#34;data Source = localhost\u0026#34; uses the local machine ip as the data source for the database \u0026#39; \u0026#34;User ID=toby\u0026#34; selects the user from table mysql.users called toby with all the ... But from here we can update enough in order to create the actual files! We will first need to rename the images to provide a little more context - the pipeline can then do the rest. Each image is part \\(i\\) of \\(n\\); by naming them as such I can just merge them later on without too much fuss. After we have files it\u0026rsquo;s a matter of partial rewrites based off the partial VB .NET code.\nMoving to VB .NET # Now that I have managed to convert all these images to (somewhat comprehensible) text files, the move to .NET can begin. This, however, is actually a pretty manual process after dumping all the code into 1 file. The snippet below is the aggregator for the separate files, which is then renamed to .vb extension for processing. 
To do this I can leverage the power of regexp, as I did in my project, to complete this!\nfiles = glob(\u0026#34;out/*.txt\u0026#34;) f_map = dict() # Group all files by their \u0026#39;className\u0026#39; for f in files: # Funky regexp to isolate them f_key = re.search(\u0026#34;[a-zA-Z]+[0-9]+\u0026#34;, f).group()[:-1] if f_key in f_map.keys(): f_map[f_key] += [f] else: f_map[f_key] = [f] # dump them into code files for class_name in f_map.keys(): # get list of files txt_files = f_map[class_name] # extract and append to content content = \u0026#34;\u0026#34; for txt_f in sorted(txt_files): with open(txt_f, \u0026#34;r\u0026#34;) as f: content = content + f.read() # write this out to a file with open(f\u0026#34;code/{class_name}.vb\u0026#34;, \u0026#34;w\u0026#34;) as f: f.write(content) This produces a lovely set of classes I can now mess about with. After reformatting and filling in the missing characters, fixing spelling and such there\u0026rsquo;s only one issue - that is that VB.NET is a windows programming language which doesn\u0026rsquo;t have a solution on Linux. From here all I must do is port over to my other windows machine and start building!\nCheck out the repo of all the actual code here.\nTo be continued!\n","date":"14 November 2020","externalUrl":null,"permalink":"/blog/restoring-my-college-vb-net-application-p1/","section":"Blog","summary":"","title":"Restoring My College VB .Net Application - Part 1","type":"blog"},{"content":"","date":"21 October 2020","externalUrl":null,"permalink":"/tags/quiz/","section":"Tags","summary":"","title":"Quiz","type":"tags"},{"content":"Here is the Quiz round! 
Each image is a top-down view of a London landmark - can you tell me what they are?\nQuestion 1: # Question 2: # Question 3: # Question 4: # Question 5: # Question 6: # Question 7: # Question 8: # Question 9: # Question 10: # Question 11: # Question 12: # ","date":"21 October 2020","externalUrl":null,"permalink":"/blog/quiz-round-top-down-london/","section":"Blog","summary":"","title":"Toby's Quiz Round: Top Down London","type":"blog"},{"content":"Continuous deployment to Heroku from Gitlab is hidden away; previous solutions I\u0026rsquo;ve found require putting Docker acrobatics into your .gitlab-ci.yml and a REST endpoint, but no more! The solution is simple for most projects.\nMaster is Prod # Leveraging the Gitlab repo mirror tool for only protected branches, we can just provide the login for Heroku and we\u0026rsquo;re done! The steps below give more detail:\nInstall the Heroku CLI Generate an access token Set up repo mirroring from your project to http://user@git.heroku.com/your-app.git The username is ignored The password is your access token And we\u0026rsquo;re done; pushes to master should mirror into Heroku, build and deploy automagically! 
","date":"13 October 2020","externalUrl":null,"permalink":"/blog/deploying-to-heroku-from-gitlab/","section":"Blog","summary":"","title":"Deploying to Heroku from Gitlab","type":"blog"},{"content":"","date":"13 October 2020","externalUrl":null,"permalink":"/tags/gitlab/","section":"Tags","summary":"","title":"Gitlab","type":"tags"},{"content":"","date":"13 October 2020","externalUrl":null,"permalink":"/tags/heroku/","section":"Tags","summary":"","title":"Heroku","type":"tags"},{"content":"","date":"18 August 2020","externalUrl":null,"permalink":"/tags/email/","section":"Tags","summary":"","title":"Email","type":"tags"},{"content":"","date":"18 August 2020","externalUrl":null,"permalink":"/tags/homelab/","section":"Tags","summary":"","title":"Homelab","type":"tags"},{"content":"","date":"18 August 2020","externalUrl":null,"permalink":"/tags/open-media-vault/","section":"Tags","summary":"","title":"Open-Media-Vault","type":"tags"},{"content":"","date":"18 August 2020","externalUrl":null,"permalink":"/tags/zoho/","section":"Tags","summary":"","title":"Zoho","type":"tags"},{"content":"This is very useful to send automated emails with a custom domain, for example, notifications@tobydevlin.com - my notifications email I get things like Open Media Vault SMTP notifications and, eventually, client notifications if they need to interact with tobydevlin.com services.\nThe first step of this tutorial is to set up a Zoho email, potentially with a custom domain, by following this page.\nThe Settings # You\u0026rsquo;ll need an email user, I did this by adding a new user in the admin screen. Provide it with a password and log in with as the new user (or do this with your main user) Head to the \u0026ldquo;My Account\u0026rdquo; page then the security \u0026gt; app passwords page. Create a new App Password and make a note of it. Now you can set up an SMTP connection using the SMTP details that Zoho provides. 
These boil down to:\nOutgoing Server Name: smtp.zoho.eu Port: 587 Security Type: TLS Username: \u0026lsquo;\u0026lt;your mail address\u0026gt;\u0026rsquo; Password: The app password you noted earlier Here is the example from Open Media Vault:\n","date":"18 August 2020","externalUrl":null,"permalink":"/blog/zoho-smtp-setup/","section":"Blog","summary":"","title":"Zoho SMTP Setup","type":"blog"},{"content":"","date":"6 July 2020","externalUrl":null,"permalink":"/tags/django/","section":"Tags","summary":"","title":"Django","type":"tags"},{"content":" Django CI # Running isolated tests can be hard - we can solve this problem in Gitlab using the Docker images tooling provided by the runners. We will first provide the base Python image for us to run our code in and then add the Postgres container as a service\nThe Pipeline # The .gitlab-ci.yml will need to start 2 containers, the Python runtime and the Postgres Service. This is done using the following:\nimage: python:latest # must match lazydb3_api/settings/ci.py settings variables: POSTGRES_DB: test_db POSTGRES_USER: runner POSTGRES_PASSWORD: \u0026#39;ci\u0026#39; POSTGRES_HOST_AUTH_METHOD: trust # https://docs.gitlab.com/ce/ci/docker/using_docker_images.html#what-is-a-service # https://docs.gitlab.com/ce/ci/services/postgres.html services: - postgres:latest This will start the postgres using a couple settings, given in the variables. These are reusable, and we will leverage them later on. These settings must also be placed into your django database settings so you can connect. The service can be accessed with the identifier of the image, in this case postgres. 
This example provides the connection details:\npostgresql://runner:ci@postgres:5432/test_db Hence our Django settings will be:\nDATABASES[\u0026#39;default\u0026#39;] = { \u0026#39;NAME\u0026#39;: \u0026#39;test_db\u0026#39;, \u0026#39;ENGINE\u0026#39;: \u0026#39;django.db.backends.postgresql_psycopg2\u0026#39;, \u0026#39;USER\u0026#39;: \u0026#39;runner\u0026#39;, \u0026#39;PASSWORD\u0026#39;: \u0026#39;ci\u0026#39;, \u0026#39;HOST\u0026#39;: \u0026#39;postgres\u0026#39;, \u0026#39;PORT\u0026#39;: \u0026#39;5432\u0026#39;, } These are hardcoded but could be read from the env vars. The next step is to just run your tests as normal, as in the container you should now be able to connect.\n","date":"6 July 2020","externalUrl":null,"permalink":"/blog/django-tests-in-gitlab-ci/","section":"Blog","summary":"","title":"Django tests in Gitlab CI","type":"blog"},{"content":"This is a brief step-by-step guide to setting up a Plex \u0026amp; NAS enabled Pi. It assumes you kinda know what you\u0026rsquo;re doing when it comes to flashing things, talking about networks and files, and using docker/containers on a high level.\nStep 1 - Get all you need # You\u0026rsquo;ll need the following\nThe Raspberry Pi OS with no Desktop. A copy of Etcher to flash your SD card. A copy of these files to add SSH, Network from boot, no need for anything other than power. Step 2 - Flash your SD card \u0026amp; prep files. # Using Etcher flash your SD card with the Raspberry Pi OS. Once this is done you can place the files you fetched from Github into the root dir. Remember to update the WiFi password and name in the wpa_supplicant.conf. You should be able to boot the Pi now.\nStep 3 - Connect and start Open Media Vault install # Connect using SSH - Windows has a built in tool now, no need to get PuTTY unless you want to for other things. You can find the Pi IP using arp -a (it will be the new IP on the list). 
Installing OMV is easy; use the following command:\nwget -O - https://raw.githubusercontent.com/OpenMediaVault-Plugin-Developers/installScript/master/install | sudo bash This will download and install all the bits required - it takes time, so now is a good point for a coffee break.\nStep 4 - Access GUI and set up drives # Head over to your Pi\u0026rsquo;s web portal (at its IP) and log in. The default username is admin, and the default password is openmediavault:\nlog in using the admin creds Step 5 - Start up Docker and Portainer. # Open Media Vault has set up for Docker and Portainer natively, so there\u0026rsquo;s no real trouble getting started. Once you\u0026rsquo;ve done this you can also start making a \u0026lsquo;docker-compose.yml\u0026rsquo; for your apps, for example Plex, PiHole, Netdata and Heimdall are shown below:\n# =============================== # RASPBERRY PI DEFAULT SERVICES # =============================== version: \u0026#39;2.4\u0026#39; services: # Plex for self hosting video \u0026amp; content. 
# Note: plex wont work if its not on the host network # exposed at \u0026lt;host\u0026gt;:32400/web/index.html Plex: image: linuxserver/plex:latest restart: unless-stopped network_mode: host volumes: # Custom mount paths - \u0026#39;/export/LittleRedBox/data/plex/config:/config\u0026#39; - \u0026#39;/export/LittleRedBox/data:/data\u0026#39; - \u0026#39;/export/LittleRedBox/data/plex/transcode:/transcode\u0026#39; # Netdata for monitoring the host Netdata: image: netdata/netdata:latest-armhf restart: unless-stopped ports: - 19999:19999 # Pihole for setting the PiHole: image: pihole/pihole:latest restart: unless-stopped ports: - 20053:53/tcp - 20053:53/udp - 20067:67/udp - 20080:80/tcp - 20443:443/tcp environment: WEBPASSWORD: \u0026#39;newpass\u0026#39; Heimdall: image: linuxserver/heimdall:latest restart: unless-stopped volumes: # Still need to set up database - \u0026#39;/heimdall/config:/config\u0026#39; ports: - 21080:80 - 21443:443 ","date":"7 June 2020","externalUrl":null,"permalink":"/blog/piplexed-a-guide-to-setting-up-a-home-media-server-on-raspberry-pi/","section":"Blog","summary":"","title":"PiPlexed - A guide to setting up a home Media Server on Raspberry Pi","type":"blog"},{"content":"","date":"7 June 2020","externalUrl":null,"permalink":"/tags/plex/","section":"Tags","summary":"","title":"Plex","type":"tags"},{"content":"","date":"27 March 2020","externalUrl":null,"permalink":"/tags/css/","section":"Tags","summary":"","title":"Css","type":"tags"},{"content":"","date":"27 March 2020","externalUrl":null,"permalink":"/tags/ghost/","section":"Tags","summary":"","title":"Ghost","type":"tags"},{"content":"","date":"27 March 2020","externalUrl":null,"permalink":"/tags/node/","section":"Tags","summary":"","title":"Node","type":"tags"},{"content":"Recently I had my friend @ veres.tech ask for some tips with a problem involving ParticleJS and Ghost. 
He wanted to apply this to his background, but because Ghost doesn\u0026rsquo;t automatically use the required format of the API ParticleJS provides we had to add some tweaks.\nParticleJS needs an ID, so we will have to make one from a node in the document. This node has to be uniquely identifiable. The above image had the element we were interested in with a unique class, m-hero__picture. Ghost uses this unique class to style the main image over every post. Below is a snippet from the homepage of the site.\n\u0026lt;section class=\u0026#34;m-hero with-picture aos-init aos-animate\u0026#34; data-aos=\u0026#34;fade\u0026#34;\u0026gt; \u0026lt;div class=\u0026#34;m-hero__picture\u0026#34;\u0026gt; \u0026lt;canvas class=\u0026#34;particles-js-canvas-el\u0026#34; style=\u0026#34;width: 100%; height: 100%;\u0026#34; width=\u0026#34;1903\u0026#34; height=\u0026#34;565\u0026#34; \u0026gt;\u0026lt;/canvas\u0026gt; \u0026lt;/div\u0026gt; \u0026lt;div class=\u0026#34;m-hero__content aos-init aos-animate\u0026#34; data-aos=\u0026#34;fade-down\u0026#34;\u0026gt; \u0026lt;h1 class=\u0026#34;m-hero-title bigger\u0026#34;\u0026gt;veres.tech\u0026lt;/h1\u0026gt; \u0026lt;p class=\u0026#34;m-hero-description bigger\u0026#34;\u0026gt; A technical blog for sysadmins and aspiring pentesters \u0026lt;/p\u0026gt; \u0026lt;/div\u0026gt; \u0026lt;/section\u0026gt; Using the js injection tool Ghost provides we can assign this unique class an id, and fulfil the ParticlJS API of requiring an id. We can do this by passing the below as a script tag:\nconst imageEl = document.getElementsByClassName(\u0026#34;m-hero__picture\u0026#34;)[0]; imageEl.id = \u0026#34;useThisID\u0026#34;; const mySettings = { /*my settings here*/ }; particlesJS(\u0026#34;useThisID\u0026#34;, mySettings); This will allow us to point ParticleJS at the right document node:\nAnd there you have it! We can redefine the ID for any uniquely identifiable class on the page and it will just work. 
Remember to make this unique, else you\u0026rsquo;ll either get the wrong element when indexing the [0] element or end up with non-unique IDs in the document causing warnings and errors.\n","date":"27 March 2020","externalUrl":null,"permalink":"/blog/using-ghost-particlejs/","section":"Blog","summary":"","title":"Using Ghost \u0026 ParticleJS","type":"blog"},{"content":"Adding TypeScript to gatsby shouldn\u0026rsquo;t be too hard, that\u0026rsquo;s why I created this post, detailing how to make everything using typescript in gatsby. This should be super simple, code changes should be minimal and can be run alongside the tutorial too.\nThis assumes you have a graphQL query somewhere in your code, as an example I will be using a basic index.js page to migrate to TypeScript:\nimport React from \u0026#34;react\u0026#34;; import { Link } from \u0026#34;gatsby\u0026#34;; import { graphql } from \u0026#34;gatsby\u0026#34;; import Layout from \u0026#34;../containers/layout\u0026#34;; function Index(props) { return ( \u0026lt;Layout\u0026gt; \u0026lt;h3\u0026gt;{props.data.site.siteMetadata.title}\u0026lt;/h3\u0026gt; \u0026lt;h2\u0026gt;Hello world! itsa me, toby!\u0026lt;/h2\u0026gt; \u0026lt;/Layout\u0026gt; ); } export const query = graphql` query IndexPageData { site { siteMetadata { title } } } `; export default Index; This also assumes you have the siteMetadata section in your gatsby-config.js, such as the below:\nmodule.exports = { siteMetadata: { title: `Try turning it off and on again`, }, plugins: [], }; Now we can get started on our Typescript integration; step 1, add typescript! This is via the gatsby-plugin-typescript, which is incredibly simple to add:\nInstall the plugin with npm install gatsby-plugin-typescript --save. Add the plugin to the plugins list of your gatsby-config.js, as detailed in the plugin docs. Rename your index.jsx to index.tsx. and you\u0026rsquo;re done! 
Gatsby will now transpile your Typescript for you and bundle the app as before, yet now you have type safety. That was the easy bit, now we have to add types to our GraphQL query. Thankfully there\u0026rsquo;s a plugin for that too! The gatsby-plugin-graphql-codegen will save the day. Using this is just as easy as using the gatsby-plugin-typescript, with 1 less step!\nInstall the plugin with npm install gatsby-plugin-graphql-codegen --save. Add the plugin to the plugins list of your gatsby-config.js. From here gatsby will generate a graphql-types.ts file on every change of your GraphQL named queries. Remember back to the beginning of this post, we can see the query is called IndexPageData; the resulting type generated for us will be called IndexPageDataQuery. Now we can extend our Index component with propTypes:\nimport React from \u0026#39;react\u0026#39;; import {Link} from \u0026#39;gatsby\u0026#39;; import {graphql} from \u0026#39;gatsby\u0026#39;; import Layout from \u0026#39;../containers/layout\u0026#39;; import {IndexPageDataQuery} from \u0026#39;../../graphql-types\u0026#39;; interface IndexProps { data: IndexPageDataQuery; } function Index(props: IndexProps) { return ( \u0026lt;Layout\u0026gt; \u0026lt;h3\u0026gt;{props.data.site.siteMetadata.title}\u0026lt;/h3\u0026gt; \u0026lt;h2 style={{color: \u0026#39;red\u0026#39;}}\u0026gt;Hello world! itsa me, toby!\u0026lt;/h2\u0026gt; \u0026lt;Link to={\u0026#39;/about\u0026#39;}\u0026gt;home\u0026lt;/Link\u0026gt; \u0026lt;img src=\u0026#34;https://source.unsplash.com/random/100x100\u0026#34; alt=\u0026#34;\u0026#34; /\u0026gt; \u0026lt;/Layout\u0026gt; ); } export const query = graphql` query IndexPageData { site { siteMetadata { title } } } `; export default Index; And that\u0026rsquo;s all there is. Every time you write a query, name it, and the type shall appear! Things to note though:\ntypes will be refreshed if you change the query name. this works with useStaticQuery() functions too. 
const siteData: AnotherQueryQuery = useStaticQuery(graphql` query AnotherQuery { site { siteMetadata { title } } } `); You will need to pass the type under the data prop for the mapping to work.\nhello world\n","date":"27 February 2020","externalUrl":null,"permalink":"/blog/adding-typescript-to-gatsby/","section":"Blog","summary":"","title":"Adding Typescript to Gatsby","type":"blog"},{"content":"","date":"27 February 2020","externalUrl":null,"permalink":"/tags/gatsby/","section":"Tags","summary":"","title":"Gatsby","type":"tags"},{"content":"","date":"27 February 2020","externalUrl":null,"permalink":"/tags/typescript/","section":"Tags","summary":"","title":"Typescript","type":"tags"},{"content":"","date":"27 February 2020","externalUrl":null,"permalink":"/tags/web/","section":"Tags","summary":"","title":"Web","type":"tags"},{"content":"JSON is everywhere, sometimes it\u0026rsquo;s hidden from you or your user but its always there. Thousands of applications rely on correctly formatted data to work correctly with little format checking. Most of the time the data generators should have been correctly tested with a couple of unit tests, maybe an integration test here and there, and then you can assume that consumers of said data will work correctly. Plop a few tests on there to make sure data in the format you assume is correct works, and you\u0026rsquo;re done, right? no.\nWhat if there\u0026rsquo;s a strange form creating data bypassing your validation somehow, or a REST service goes wild posts data to your service completely wrong, or maybe you got your stringification just a little off? 
here steps in the subtle art of JSON validation through JSON Schema!\nUsing the JSON Schema specification one can define a data model, using JSON itself, and then use one of the many libraries to validate a document against the said schema.\nThis is great in a few ways:\nYou can make sure your producers are always producing correctly \u0026amp; your consumers always consuming what they like. Leveraging the absolute nature of the JSON Schema spec means, with some work, one can produce automated solutions to create producing and consuming code. # 2, in my opinion, is a brilliant win for the specification. For example, if you\u0026rsquo;re writing a UI it means that you no longer have to struggle with building a form that\u0026rsquo;s just right. Tools like react-jsonschema-form or uniform do this for you, you bring your spec and it will give you a UI, validation and hooks all for free!\n# 1 also means that there are packages you can run your data through when expecting certain schemas to ensure the richness of your data. There are implementations of validation code written in almost every language too, just for your convenience!\n","date":"18 September 2019","externalUrl":null,"permalink":"/blog/the-fun-fun-world-of-json-validation/","section":"Blog","summary":"","title":"JSON Validation is Best Practice","type":"blog"},{"content":"","date":"29 April 2019","externalUrl":null,"permalink":"/tags/docker/","section":"Tags","summary":"","title":"Docker","type":"tags"},{"content":"Portainer is a great docker management GUI which is open source, hosted on GitHub. We\u0026rsquo;re going to shove it onto a raspberry pi and run a few images. This guide assumes a few things: _ You have a Pi already running Rasbarian Lite, connected to your network with ssh access _ You know what Docker is * You want to make managing a home server really easy. So, here we go:\n1 - SSH in and Get HackingOnce you\u0026rsquo;re in we need to get # Docker up and running on our machine. 
Run curl -sSL https://get.docker.com | sh and by magic docker will be pulled down and installed on the Pi.\n2 - Grab the Portainer Image and Go! # We got docker running, now we can launch the Portnainer image and see what happens. The image will want to have drives so we can persist data to memory and such we need to have a couple of extra args to the command. This works out as:\ndocker run -d -p 9000:9000 --name portainer --restart always -v /var/run/docker.sock:/var/run/docker.sock -v /path/on/host/data:/data portainer/portainer\nThis will also map port 9000 in the container to port 9000 of the Pi, so we can access the dash!\n3 - Reap the Spoils # Hit up http://\u0026lt;YOUR_PI_IP\u0026gt;:9000 and you got yourself a container dash to mess about with! Log in, make an admin user, set up local deployments \u0026amp; try deploying some stuff and see what containerization is all about!\n","date":"29 April 2019","externalUrl":null,"permalink":"/blog/easy-docker-containers-on-raspery-pi-with-portainer-io/","section":"Blog","summary":"","title":"Easy Docker Containers on Raspery Pi With Portainer.io","type":"blog"},{"content":"I want to be able to develop apps in my spare time. Modern, scalable, disposable apps. The current hip thing (other than serverless functions, but I\u0026rsquo;m more of a back end dev so here we are) is to build a containerised app. I\u0026rsquo;m looking to leverage docker and kubernetes to build an app which can scale almost infinitely depending on how much money I throw at it.\nKubernetes is a portable, extensible open-source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation. It has a large, rapidly growing ecosystem. Kubernetes services, support, and tools are widely available.\nMore specifically I want to create a home Kubernetes cluster (on one Raspberry pi, ill expand to 1+ later) which will allow me to develop and run things internal to my home network. 
Heres what we\u0026rsquo;re gonna do:\n1 - Set Up Our Host # Get a Raspberry Pi and shove on a Rasbarian lite distro. This is a great tutorial; it will walk you through flashing the iso to the drive.\n2 - Attach this Pi to your network # Grab the files from the GitHub below, then edit the wpa_supplicant.conf in a text editor to include your internet id and password. Add these 2 files in the root partition of the sd card after flashing (this lets you ssh \u0026amp; wifi to start on load) and then plug the Pi in and everything will just work. Or make a file called ssh in the SD card root partition, then plug the Pi in the ethernet with a wire\u0026hellip; Now, after it starts you can SSH into your Pi, I recommend using PuTTY. You can find the IP address of the Pi in your routers connections page. UPDATE YOUR PASSWORD FOR THE PI USER NOW also write it down. 3 - Installing the container runtime # Following the wisdom about writing code of any kind:\nSomebody has most likely done this before.\nWe won\u0026rsquo;t reinvent the wheel, Alex Ellis has some awesome code already written for this, so we just run https://github.com/alexellis/k8s-on-raspbian/blob/master/script/prep.sh | sudo sh and after a while, all the tools we need will be on the Pi. Now we just reboot and we have kubectl and kubeadm ready to use.\n**Note: ** it may be useful to run sudo apt-get update and sudo apt-get upgrade at this point before the reboot too to update everything.\nkubeadm helps you bootstrap a minimum viable Kubernetes cluster that conforms to best practices. With kubeadm, your cluster should pass Kubernetes Conformance tests. Kubeadm also supports other cluster lifecycle functions, such as upgrades, downgrade, and managing bootstrap tokens.\nNow, to start we will run kubeadm config images pull prior to kubeadm init to verify connectivity to gcr.io registries. 
Then we can init our k8s cluster with sudo kubeadm init --apiserver-advertise-address=0.0.0.0 to have the master listen on the default network IP too. It may take a while, but eventually, you should get an output like this:\nIf you have other worker nodes you can connect them with the command shown. Now we need to take the key generated for cluster administration and make it available in a default location for use with kubectl by running the following 3 commands:\nmkdir -p $HOME/.kube sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config sudo chown $(id -u):$(id -g) $HOME/.kube/config After this is run we can check that everything is working correctly with kubectl get all --all-namespaces, which should look a little like the below.\nHey presto we have a k8s cluster up and running correctly! You can test it running the minikube test deployments!\nLinks # GitToby/Raspbery-Pi-Setup-Files null 0 0 alexellis/k8s-on-raspbian Kubernetes on Raspbian (Raspberry Pi) Shell 887 127 ","date":"27 April 2019","externalUrl":null,"permalink":"/blog/home-network-kubernetes-setup/","section":"Blog","summary":"","title":"Kubernetes Setup on a Raspberry Pi","type":"blog"},{"content":"","date":"6 October 2018","externalUrl":null,"permalink":"/tags/mongodb/","section":"Tags","summary":"","title":"MongoDB","type":"tags"},{"content":" This is a POC for a MongoDB interface # Nothing major, but document model databases allow the chains to be lifted on certain applications that can make them amazingly flexable. They\u0026rsquo;re not without their caviats but knowing how to use one would be a tick of a box when being asked to make an app.\nThis example MongoDB for Python using Mongo Client. 
It assumes you have all the binaries for MongoDB installed, know where they are, and have a general understanding of what mongo is.\nLet\u0026rsquo;s begin; first let\u0026rsquo;s start the server:\n# start the mongo server using this in the cmd path # !\u0026#34;C:\\dev\\tools\\MongoDB\\bin\\mongod.exe\u0026#34; --dbpath=\u0026#34;./data/\u0026#34; The Mongo DB should now be started and the port number is displayed (typically 27017). Next step is connecting to it using python. This is done using a class so our database is essentially an API. Note: we\u0026rsquo;re using a local connection but this can be configured to be an external call too\nfrom pymongo import MongoClient class Connect(object): @staticmethod def get_connection(): return MongoClient(\u0026#34;localhost\u0026#34;, 27017) connection = Connect.get_connection() This connection is to our database; we\u0026rsquo;re now able to perform CRUD operations using it. Data is stored as documents in collections in databases. The connection is to the mongo client, so from here we have 3 levels of structure to go:\ndatabase collection document All of these can be accessed using the python accessor notation. Let\u0026rsquo;s first access a (new) database:\n# test_db = connection.test # or test_db = connection[\u0026#39;test\u0026#39;] Notice this didn\u0026rsquo;t do anything; this is just a lazy evaluation of the DB, nothing\u0026rsquo;s been put in there so it\u0026rsquo;s just sitting there for now and will be created when something is added to a collection. So we now have to access a collection of this test_db:\n# test_collection = test_db.my_collection # or test_collection = test_db[\u0026#39;my_collection\u0026#39;] From here we can add a new element to our collection and start our database:\nmovie = {\u0026#34;title\u0026#34;: \u0026#34;Venom\u0026#34;, \u0026#34;year\u0026#34;: 2018, \u0026#34;description\u0026#34;: \u0026#34;When Eddie Brock acquires the powers of a symbiote, he will have to release his alter-ego \\\u0026#34;Venom\\\u0026#34; to save his life. 
\u0026#34;, \u0026#34;actors\u0026#34;: [\u0026#34;Tom Hardy\u0026#34;, \u0026#34;Michelle Williams\u0026#34;, \u0026#34;Jenny Slate\u0026#34;], \u0026#34;director:\u0026#34;: \u0026#34;Ruben Fleischer\u0026#34;, \u0026#34;tags\u0026#34;: [\u0026#34;Superhero\u0026#34;, \u0026#34;Modern\u0026#34;, \u0026#34;Thriller\u0026#34;]} x = test_collection.insert_one(movie) This addition will yield an object which can display the auto generated ID of the object it just added to the database.\nins_string = str(x.inserted_id) ins_string \u0026#39;5bc4f80f59cf703a9cd7f353\u0026#39; This ID is unique across documents and can be used as a \u0026ldquo;foreign key\u0026rdquo; type reference for this object. For example, we can search the database for elements with this id: (we\u0026rsquo;re using pretty printing so it\u0026rsquo;s nice to read)\nimport pprint from bson.objectid import ObjectId result = test_collection.find({\u0026#39;_id\u0026#39;:ObjectId(ins_string)}) pprint.pprint(list(result)) [{'_id': ObjectId('5bc4f80f59cf703a9cd7f353'), 'actors': ['Tom Hardy', 'Michelle Williams', 'Jenny Slate'], 'description': 'When Eddie Brock acquires the powers of a symbiote, he will ' 'have to release his alter-ego \u0026quot;Venom\u0026quot; to save his life. ', 'director:': 'Ruben Fleischer', 'tags': ['Superhero', 'Modern', 'Thriller'], 'title': 'Venom', 'year': 2018}] Again note the casting to list; this is because the find methods return a cursor which evaluates lazily. Since we have this data now in the database, let\u0026rsquo;s update it. We will touch on more intensive searching later\u0026hellip; maybe.\nThere are a few things we need to understand about mongo to understand how updating works. These will make things much simpler after continued use.\nalways specify if you want to update one or many. Using update_one() will only update the first document it finds with the search criteria, whereas update_many() will update every document that matches the criteria. the special keys $... 
will be very useful as they determine how the searching and adding of values is executed in different ways. The update_one() method will search with the first param and update the document with the second:\ny = test_collection.update_one({\u0026#39;_id\u0026#39;:ObjectId(ins_string)}, {\u0026#34;$set\u0026#34;: {\u0026#34;rating\u0026#34;: 4, \u0026#34;meta.user\u0026#34;:\u0026#34;toby\u0026#34;}, \u0026#34;$push\u0026#34;:{\u0026#34;tags\u0026#34;: {\u0026#34;$each\u0026#34;:[\u0026#34;Exciting\u0026#34;, \u0026#34;Dark\u0026#34;]}}, \u0026#34;$currentDate\u0026#34;: {\u0026#34;meta.lastModified\u0026#34;: True}}) y.raw_result {'n': 1, 'nModified': 1, 'ok': 1.0, 'updatedExisting': True} As you can see there\u0026rsquo;s a result about how many were updated after the query. let\u0026rsquo;s take a look at what that did to the document:\nresult = test_collection.find({\u0026#39;_id\u0026#39;:ObjectId(ins_string)}) pprint.pprint(list(result)) [{'_id': ObjectId('5bc4f80f59cf703a9cd7f353'), 'actors': ['Tom Hardy', 'Michelle Williams', 'Jenny Slate'], 'description': 'When Eddie Brock acquires the powers of a symbiote, he will ' 'have to release his alter-ego \u0026quot;Venom\u0026quot; to save his life. ', 'director:': 'Ruben Fleischer', 'meta': {'lastModified': datetime.datetime(2018, 10, 15, 20, 27, 46, 91000), 'user': 'toby'}, 'rating': 4, 'tags': ['Superhero', 'Modern', 'Thriller', 'Exciting', 'Dark', 'Exciting', 'Dark'], 'title': 'Venom', 'year': 2018}] Now we\u0026rsquo;ve created and updated a record we may as well get rid of it. 
We could in theory just tag it with a soft delete by updating the record and implementing filtering when querying, but we\u0026rsquo;re going to actually remove the document from here:\nz = test_collection.delete_one({\u0026#34;_id\u0026#34;:ObjectId(ins_string)}) z.raw_result {\u0026#39;n\u0026#39;: 1, \u0026#39;ok\u0026#39;: 1.0} Now lets just check that it worked:\nresult = test_collection.find({\u0026#39;_id\u0026#39;:ObjectId(ins_string)}) pprint.pprint(list(result)) [] And that\u0026rsquo;s the full lifecycle of the mongo data. This is a brief POC of the basic functionality, part 2 will put this into an API which is usable and queryable programmatically.\n","date":"6 October 2018","externalUrl":null,"permalink":"/blog/what-even-is-mongo/","section":"Blog","summary":"","title":"What Even Is Mongo?","type":"blog"},{"content":" This is an example of how i would use sympy to evaluate a set of analytical questions, for example: # Find \\(\\forall a \\in [0,12,24,30,99]\\) and \\(b=100\\):\n$$ \\int_a^b \\frac{\\alpha x^{3} - \\sin(2x)}{\\sqrt{\\beta e^x}} dx $$Where \\(\\alpha = 2.341\\), \\(\\beta = e^x\\)\nFirst we need to import the library (and start the printing for nice easy to read stuff):\nimport sympy as sp sp.init_printing() Now create the eqn above so we can evaluate it\na, b, alpha, beta, x = sp.symbols(\u0026#39;a b \\\\alpha \\\\beta x\u0026#39;) expr = (alpha * x ** 3 - sp.sin(2 * x)) * (1 / sp.sqrt(beta * sp.E ** x)) expr $$ \\frac{1}{\\sqrt{\\beta e^{x}}} \\left(\\alpha x^{3} - \\sin{\\left (2 x \\right )}\\right) $$This like the integrand we hav above. Now theres a few things we can do, if we know \\(\\alpha\\) and \\(\\beta\\), which we do, we can sub them in. 
then we will have a more simple expression:\nexpr2 = expr.subs({alpha: 2.341, beta: sp.E ** x}) expr2 $$ \\frac{1}{\\sqrt{e^{2 x}}} \\left(2.341 x^{3} - \\sin{\\left (2 x \\right )}\\right) $$If we wanted we could now evaluate this using the .subs() command again (this time with a value for x such as .subs(x,3)), but we\u0026rsquo;re asked to evalue the integral:\nintegral = sp.integrate(expr2, (x, a, b)) integral $$ \\frac{2.341 a^{3}}{\\sqrt{e^{2 a}}} + \\frac{7.023 a^{2}}{\\sqrt{e^{2 a}}} + \\frac{14.046 a}{\\sqrt{e^{2 a}}} - \\frac{2.341 b^{3}}{\\sqrt{e^{2 b}}} - \\frac{7.023 b^{2}}{\\sqrt{e^{2 b}}} - \\frac{14.046 b}{\\sqrt{e^{2 b}}} + \\frac{0.2}{\\sqrt{e^{2 b}}} \\sin{\\left (2 b \\right )} + \\frac{0.4}{\\sqrt{e^{2 b}}} \\cos{\\left (2 b \\right )} - \\frac{14.046}{\\sqrt{e^{2 b}}} - \\frac{0.2}{\\sqrt{e^{2 a}}} \\sin{\\left (2 a \\right )} - \\frac{0.4}{\\sqrt{e^{2 a}}} \\cos{\\left (2 a \\right )} + \\frac{14.046}{\\sqrt{e^{2 a}}} $$Now we need to calculate all the values of this expression for every combination of \\(a \\in [0,12,24,30,99]\\) and \\(b=100\\). We can do this by starting with lambdifying the expression into a callable function:\nmy_callable = sp.lambdify((a, b), integral) my_callable(1, 2) # this will sub in a=1 and b=2 to the above. 
$$ 1.6786017689393038 $$Now we just loop over the values for a and we can print out the results:\nfor a in [0, 12, 24, 30, 99]: print(\u0026#34;a =\u0026#34;, a) print(my_callable(a, 100)) print() a = 0 13.645999999999999 a = 12 0.03219056961943131 a = 24 1.3876938438535285e-06 a = 30 6.546926902186657e-09 a = 99 1.470461229758446e-37 This decreasing sequence is expected as we\u0026rsquo;re essentially taking area slices of the below plot, moving the starting point further right each time.\n%matplotlib inline import matplotlib.pyplot as plt X = [in_x for in_x in range(0, 101)] expr2_callable = sp.lambdify(x, expr2) plt.plot(X, [expr2_callable(in_x) for in_x in X]) [\u0026lt;matplotlib.lines.Line2D at 0x24a166e8668\u0026gt;] And we\u0026rsquo;re done! # ","date":"25 August 2018","externalUrl":null,"permalink":"/blog/using-sympy-for-analytical-maths/","section":"Blog","summary":"","title":"Using Sympy for Analytical Maths","type":"blog"},{"content":"This is a copy of my Game Theory Coursework I completed for the course at Cardiff. Personally I found the subject fascinating but the course very introductory.\n","date":"20 March 2018","externalUrl":null,"permalink":"/blog/maths-gt-coursework/","section":"Blog","summary":"","title":"Game Theory Coursework","type":"blog"},{"content":"","date":"20 March 2018","externalUrl":null,"permalink":"/tags/maths/","section":"Tags","summary":"","title":"Maths","type":"tags"},{"content":"MiKTeX + VS Code + Git = a semi working, compiling, version controlled version of LaTeX. Try not to break anything on your journey tho; this worked for me so it will probably work for you\u0026hellip; Here\u0026rsquo;s how to do it:\nFirst Thing: Get The Stuff: # MiKTeX Git VS Code Install all of these to wherever your preferred location is; once this is done it might be useful to add them to your path. Control Panel \u0026gt; System and Security \u0026gt; System \u0026gt; Advanced System Settings \u0026gt; Environment Variables. 
You should restart your machine before step 2.\nThen: Set up VS Code # In VS Code you\u0026rsquo;ll need to install the LaTaX Workshop using the tools in VS Code.\nOnce this is done, and you\u0026rsquo;ve restarted VS Code, the compile User Settings will need changing before it will work. (It might work; it depends on if you have perl, I don\u0026rsquo;t and I wanted to make the number of installs minimal)\n\u0026#34;latex-workshop.latex.toolchain\u0026#34;: [ { \u0026#34;command\u0026#34;: \u0026#34;latexmk\u0026#34;, \u0026#34;args\u0026#34;: [ \u0026#34;-synctex=1\u0026#34;, \u0026#34;-interaction=nonstopmode\u0026#34;, \u0026#34;-file-line-error\u0026#34;, \u0026#34;-pdf\u0026#34;, \u0026#34;%DOC%\u0026#34; ] } ] needs changing to:\n\u0026#34;latex-workshop.latex.toolchain\u0026#34;: [ { \u0026#34;command\u0026#34;: \u0026#34;pdflatex\u0026#34;, \u0026#34;args\u0026#34;: [ \u0026#34;-synctex=1\u0026#34;, \u0026#34;-shell-escape\u0026#34;, \u0026#34;-interaction=nonstopmode\u0026#34;, \u0026#34;-file-line-error\u0026#34;, \u0026#34;%DOC%\u0026#34; ] }, { \u0026#34;command\u0026#34;: \u0026#34;bibtex\u0026#34;, \u0026#34;args\u0026#34;: [ \u0026#34;%DOCFILE%\u0026#34; ] }, { \u0026#34;command\u0026#34;: \u0026#34;pdflatex\u0026#34;, \u0026#34;args\u0026#34;: [ \u0026#34;-synctex=1\u0026#34;, \u0026#34;-shell-escape\u0026#34;, \u0026#34;-interaction=nonstopmode\u0026#34;, \u0026#34;-file-line-error\u0026#34;, \u0026#34;%DOC%\u0026#34; ] }, { \u0026#34;command\u0026#34;: \u0026#34;pdflatex\u0026#34;, \u0026#34;args\u0026#34;: [ \u0026#34;-synctex=1\u0026#34;, \u0026#34;-shell-escape\u0026#34;, \u0026#34;-interaction=nonstopmode\u0026#34;, \u0026#34;-file-line-error\u0026#34;, \u0026#34;%DOC%\u0026#34; ] } ] Save this and now go to your file you want to compile (if you want an example there is some code that will compile at the end of this tutorial). 
If you\u0026rsquo;re checking out from Git you\u0026rsquo;ll need to have done this already with git clone http://url/path.git. If it\u0026rsquo;s not working there might be a problem with your git install or with adding it to your PATH.\nnote: \u0026quot;-shell-escape\u0026quot; is required for the package minted. Also \u0026quot;latex-workshop.latex.clean.enabled\u0026quot;: false, can be changed to true to delete files after the project is built\nFinally: Get everything built and looking smart: # Building your project should be easy enough; either run the build with Right Click \u0026gt; Build LaTeX project or Ctrl + Alt + L \u0026gt; Build LaTeX project or just save the file after editing it. Then, if the settings were changed, 4 steps should run and everything should work. If it fails, checking the compiler logs will tell you why. It\u0026rsquo;s probably a syntax error (them pesky buggers get me every time), or an issue with packages (make sure you tab over and install them the first time they\u0026rsquo;re being used).\nGOOD LUCK!\n\\documentclass[12pt]{article} \\begin{document} Yay it worked! \\[f(x) = \\frac{3}{2x} - 8\\] \\end{document} Or try a more complicated one!\n","date":"24 February 2018","externalUrl":null,"permalink":"/blog/installing-latex-on-windows/","section":"Blog","summary":"","title":"Installing LaTeX on Windows","type":"blog"},{"content":"","date":"24 February 2018","externalUrl":null,"permalink":"/tags/latex/","section":"Tags","summary":"","title":"LaTeX","type":"tags"},{"content":"","date":"22 February 2018","externalUrl":null,"permalink":"/tags/math/","section":"Tags","summary":"","title":"Math","type":"tags"},{"content":"To get straight to the point, part of my degree includes some data analytics using python; trawling the web for learning materials has given me a number of useful methods to use. 
Unfortunately these snippets of code never seem to be in the same place, so I\u0026rsquo;m collating them here:\nBefore I start here\u0026rsquo;s a list of the python libraries mentioned (all available using pip or included in Anaconda):\npandas (as pd) - Data representation matplotlib.pyplot (as plt) - Plotting and Visuals numpy (as np) - Numerics seaborn (as sns) - Stats Plotting and Visuals I used Jupyter Notebooks that are included in an Anaconda distribution, but they can be found on hosted services too (give it a Google). This is a very basic explanation of what can be done and is super generic so I can\u0026rsquo;t class this as a tutorial, more of a cheat page. I found good sources of data on Kaggle (you can do analysis right there on Kaggle, it\u0026rsquo;s v. cool, but for this example I\u0026rsquo;m using a csv at the end).\nThis Section is split up into 6 parts: # Creating \u0026amp; Collecting Data Exploring Data Null Category Numeric Cleaning \u0026amp; \u0026lsquo;Munging\u0026rsquo; Data Visualisation Of Data Creating Models Closing Notes Pt 1 - Creating \u0026amp; Collecting Data # Mostly using the Pandas DataFrame object, mostly from a csv (pd.DataFrame.from_csv has been removed from recent pandas versions, so pd.read_csv is used here):\ndf = pd.read_csv(file, index_col=None) or with columns:\ncols = [\u0026#34;col1\u0026#34;,\u0026#34;col2\u0026#34;,..] df = pd.read_csv(file, index_col=cols) Depending on the type of csv I\u0026rsquo;m reading. If it contains column titles then Pandas usually picks up on it quickly. 
If you\u0026rsquo;ve got a Dataframe and need to get its columns use a list: cols = list(df)\nOnce you have 1 or more DataFrames (possibly in a list) you\u0026rsquo;re able to merge them in different ways:\nPut another df(s) on the bottom if they have the same attributes: df=pd.concat([df1,df2,...],ignore_index=True) Just appening with df.append(df) (a more relaxed version of the above) Put them on the side (if they have a common column): df3 = df1.merge(df2,how='left') Add another column: df[\u0026quot;new_col\u0026quot;]=data Dropping columns (or rows with axis=0): df.drop(axis=1,labels=[\u0026quot;col\u0026quot;]) Dynamically adding data to a DataFrame can be done like this:\ndf = pd.DataFrame(columns=[\u0026#34;a\u0026#34;,\u0026#34;b\u0026#34;,\u0026#34;c\u0026#34;]) x = {\u0026#39;a\u0026#39;:123,\u0026#39;b\u0026#39;:\u0026#34;yes\u0026#34;,\u0026#34;c\u0026#34;:\u0026#34;no\u0026#34;} df = df.append(x,ignore_index=True) y = {\u0026#39;a\u0026#39;:12323,\u0026#39;b\u0026#39;:\u0026#34;afgyes\u0026#34;,\u0026#34;c\u0026#34;:\u0026#34;adfno\u0026#34;} df = df.append(y,ignore_index=True) df = df.append([x,y],ignore_index=True) This will add the rows x then y then both x \u0026amp; y: Pt 2 - Exploring Data # In order to clean data we have to have a looksee and find missing data, data that seems incorrect and such. There are techniques to finding these errors/mishaps easily:\ndf.info() df.describe() df.head() This is the first thing I do, it lets me see what my data kinda looks like \u0026amp; some summary of the info. 
If there\u0026rsquo;s anything horribly wrong you can probably find it here (like it having/not having column names or what not)\nNull Data # Looking up Null data, be it NaN, None, Null\u0026hellip; it should come up using this.\ndf.apply(lambda x: sum(x.isnull()),axis=0) This will count up the number of null values in each column, a bit like so: Categorical Data # Counting values for categories (it works with numerics but it\u0026rsquo;s not a great idea).\ndf[\u0026#34;col\u0026#34;].value_counts() This will give you the number of observations in each category of the column (it won\u0026rsquo;t tell you how many NAs there are though): Numeric Data # Before looking into distributions and stuff it\u0026rsquo;s nice to follow these steps:\nAre there any missing/inf/NaN/None/Null Data? fix that stuff (see pt3) Are there any crazy outliers? decide what you want to do about them (This has some nice theory for what to do if there\u0026rsquo;s a problem) Start looking into the data Summarizing data is easily done with the describe method; this will produce an unsaved table that covers some of the useful numbers needed to spot issues/trends.\ndf.describe() Otherwise, after fixing the problems in steps 1 \u0026amp; 2 above, you can look into pt 4 for pretty ways of plotting the data.\nBinning data is another useful technique: splitting up your range into equal size bins and giving them labels means you can look into sections of observations very easily (by using the df.groupby() method for example).\nlabel_names = [\u0026#34;very low\u0026#34;,\u0026#34;low\u0026#34;, \u0026#34;medium\u0026#34;, \u0026#34;high\u0026#34;, \u0026#34;very high\u0026#34;] df[\u0026#34;new_bin_col\u0026#34;]=pd.cut(df[\u0026#34;col\u0026#34;],5 , labels=label_names) This will create a new column that contains the bin labels depending on the scores given in column \u0026ldquo;col\u0026rdquo;; whatever an observation\u0026rsquo;s score is, it will be mapped to a label and this label will be given in this new column.\nPt 3 - Cleaning \u0026amp; 
\u0026lsquo;Munging\u0026rsquo; Data # Munging - to modify,in an easily reversible way, some data. ie, make changes to create more informative analysis, in our case.\nFilling in Data # After identifying missing data we have to do something with it; sometimes refilling these missing points with an average (numerics) or a new category (categorical) is an easy enough fix without affecting the true result too much: (depends on what you\u0026rsquo;re looking at)\ndf[\u0026#39;numeric_col\u0026#39;].fillna(df[\u0026#39;numeric_col\u0026#39;].mean(), inplace=True) df[\u0026#39;category_col\u0026#39;].fillna(\u0026#34;new_category\u0026#34;, inplace=True) This should be done after identifying what type of data you\u0026rsquo;ve got, what result you\u0026rsquo;re looking for and what the possible implications this could have. In some cases its easier and more effective to just remove it.\nDropping Data # Dropping rows(axis=0) or columns (axis=1) if they have \u0026lsquo;all\u0026rsquo; or \u0026lsquo;any\u0026rsquo; NAs, or even only if they have more NAs than a given thresh hold.\ndf.dropna(axis=0,how=\u0026#39;any\u0026#39;) df.dropna(axis=1,thresh=2) The docs page is probably a good read so you don\u0026rsquo;t drop all your data by accident.\nMapping Data # If youre wanting to create a catagory of data you can use a map to a dictionary, the example below code means we remap Males to 1 and Females to 0.\ndf[\u0026#39;Sex\u0026#39;] = df[\u0026#39;Sex\u0026#39;].map( {\u0026#39;female\u0026#39;: 0, \u0026#39;male\u0026#39;: 1} ).astype(int) Pt 4 - Visualisation Of Data # Before you start looking through the numerical data plotting you have it might be more useful to look into Pt 3 to fix any imperfections. Plots will fail or look funny if the data has some crazy outliers or missing values.\nLooking into the distribution/correlation of your data sets? 
Seaborn has pairplot which is incredibly useful for looking into these properties:\nplot = sns.pairplot(df) plot.savefig(\u0026#34;save/path.pdf\u0026#34;) note: This will eat up your memory for a few mins\u0026hellip; (Theres not really any relationship in this data, but you get the point of what it does; the middle ones are histograms and the others are scatter plots)\nIf you\u0026rsquo;re interested in just one or two columns, then Seaborn has some nice histograms and correlation visualisation tools:\nHistograms # sns.distplot(df[\u0026#34;col\u0026#34;]) Multi Linear Regression # sns.lmplot(x=\u0026#34;col1\u0026#34;, y=\u0026#34;col2\u0026#34;, data=df) Boxplots # sns.boxplot(x=\u0026#34;col\u0026#34;, y=\u0026#34;col\u0026#34;, data=df) These are just a few of the offerings by Seaborn with lots of others that can be used depending on the type of data. For example the jointplot() gives lots of information about the relationship between 2 variables.\nPt 5 - Creating Models # This part is probably the hardest\u0026hellip;\nTorture the data, and it will confess to anything. ~ Ronald Coase, Economics, Nobel Prize Laureate\nIf you apply a model to data where the model doesn\u0026rsquo;t make sense you can come away with some very strange answers. Scikit Learn is the main ML library that is used to build models in python, and it has a very useful cheat sheet:\nTheir documentation is probably the best learning material on how the code is written; but if you want to know why models work it might be more useful to do a deep dive course to understand the maths behind them.\nPt 6 - Closing Notes # This blog post is long and probably not structured as well as I think it is, so apologies for that. If it does help then I hope your data science work continues to flourish. 
Further reading would just be some useful things that have come up in my work which, once explained properly to me, explained why I wasn\u0026rsquo;t seeing results other people did and why my predictions were usually quite far off.\nFurther Reading:\np values - for significance tests (the usage section is a brief overview) Simpson\u0026rsquo;s Paradox - Sneaky hidden data correlations Confounding - More Sneaky data issues Misuse examples - for what not to do when conducting analysis ","date":"22 February 2018","externalUrl":null,"permalink":"/blog/some-really-useful-pandas-methods-for-data-analysis/","section":"Blog","summary":"","title":"Some Useful Python Methods For Data Analysis","type":"blog"},{"content":" For the AI module in the Computer Science department you have to have a basic understanding of Probability and Inference. Below is an introduction to the probability details covered.\nFirst off there are a few things we have to cover:\n\\(p(A)=0.5\\) means that event \\(A\\) has a \\(0.5\\) or \\(50\\%\\) chance of occurring \\(p(A,B)=0.5\\) means that the events \\(A\\) and \\(B\\) have a \\(0.5\\) or \\(50\\%\\) chance of both occurring at the same time \\(p(A,B|C)=0.5\\) means that the events \\(A\\) and \\(B\\) have a \\(0.5\\) or \\(50\\%\\) chance of both occurring at the same time given that event \\(C\\) has occurred. 
This can be written as \\(\\frac{p(A,B,C)}{p(C)}\\) These are pretty basic concepts, and we only really need a few identities to solve all the problems in the exam:\nOne: Bayes\u0026rsquo; Theorem\n$$ \\frac{p(A,B)}{p(B)}=p(A|B)=\\frac{p(B|A)p(A)}{p(B)} $$Two: Partition Theorem\n$$ p(A) = \\sum^{n}\\_{i=1} p(A|B_i)p(B_i)\\qquad \\text{ for some partition of the space: } S=\\cup^{n}\\_{i=1}B_i $$Three: Chain Rule\n$$ p(A_1,A_2,\\ldots,A_k) = \\prod^{k}\\_{j=1}p(A_j|A_1,\\ldots,A_{j-1}) $$Four: Naive Bayes\n$$ p(A_1|B)=\\alpha p(B|A_1) p(A_1) \\\\\\\\ p(A_2|B)=\\alpha p(B|A_2) p(A_2) \\\\\\\\ \\ldots \\\\\\\\ p(A_m|B)=\\alpha p(B|A_m) p(A_m) $$For some partition of the space: \\(S=\\cup^{m}\\_{i=1}A_i\\). Note this will probably require conditional independence, i.e. for \\(A\\) and \\(B\\) to be CI given \\(C\\) we can write: \\(p(A,B|C)=p(A|C)p(B|C)\\), and also \\(p(A|B,C)=p(A|C)\\)\nThese four identities are used whenever a question comes up. There are lots of possible questions but the required probabilities will always be given. Splitting up what the question asks for in the right way will give a line you can evaluate.\nHere we\u0026rsquo;re given:\n$$ \\begin{bmatrix} p(A)=0.75 \u0026 p(I)=0.2 \u0026 p(W)=0.05 \\\\ p(wh|A)=0.15 \u0026 p(wh|I)=0.5 \u0026 p(wh|W)=0.1 \\\\ p(b|A)=0.15 \u0026 p(b|I)=0.1 \u0026 p(b|W)=0.2 \\end{bmatrix} $$Where: \\(A\\)=android, \\(I\\)=iOS, \\(W\\)=windows, \\(wh\\)=white \u0026amp; \\(b\\)=british. We are asked to find:\nHere, notice it says being white and a british sim are independent. 
thus\n$$ p(I|wh,b) = \\alpha p(wh,b|I) p(I) = \\alpha p(wh|I) p(b|I) p(I) $$Now, we just need to find \\(\\alpha\\) by computing this for all the partition sections: \\(A\\)=android, \\(I\\)=iOS, \\(W\\)=windows:\n$$ p(A|wh,b)=\\alpha p(wh,b|A) p(A) = \\alpha p(wh|A) p(b|A) p(A) \\\\ p(W|wh,b)=\\alpha p(wh,b|W) p(W) = \\alpha p(wh|W) p(b|W) p(W) $$or:\n$$ p(A|wh,b) = \\alpha 0.15 \\times 0.15 \\times 0.75 \\\\ p(W|wh,b) = \\alpha 0.1 \\times 0.2 \\times 0.05\\\\ $$Note that this is itself a partition so they must add up to 1: $$ \\alpha \\left[ (0.5 \\times 0.1 \\times 0.2) + (0.15 \\times 0.15 \\times 0.75) + (0.1 \\times 0.2 \\times 0.05) \\right] = 1 $$thus, $$ \\alpha = \\frac{1}{(0.5 \\times 0.1 \\times 0.2) + (0.15 \\times 0.15 \\times 0.75) + (0.1 \\times 0.2 \\times 0.05)} = \\frac{1}{0.027875} = \\frac{8000}{223} $$Now we have found alpha, it\u0026rsquo;s easy to calculate whichever value we need: $$ p(I|wh,b)= \\frac{8000}{223} p(wh|I) p(b|I) p(I) = \\frac{8000}{223} (0.5 \\times 0.1 \\times 0.2) = \\frac{80}{223} \\approx 0.3587 $$And we\u0026rsquo;re done for this question. Boom, 10 marks!\n","date":"17 January 2018","externalUrl":null,"permalink":"/blog/probability-and-inference/","section":"Blog","summary":"","title":"Probability and Inference","type":"blog"},{"content":" This is an assumption proof given in the Cardiff Uni Maths Coding Theory and Data Compression Course. It makes sense if you understand what we mean by \u0026ldquo;most files\u0026rdquo;, i.e. literally any random string of data.\nSo, why, in most cases, can\u0026rsquo;t any old file be compressed? Let\u0026rsquo;s start with the obvious: let \\(A,B\\) be files, and they get compressed to \\(C,D\\) respectively. Now \\(C=D\\) if and only if \\(A=B\\). 
So every compression must be unique if its source is unique.\nNow, we can look into what happens when we consider every possible file of a given length.\nLet \\(|A|=n\\) and suppose compression shrinks the file by 10%, so \\(|C|=0.9n\\).\nFor any file with length \\(n\\) bits we have \\(2^{n}\\) possible files (for every ordering of data) each of which would need a compressed equivalent.\nThe number of possible files of length up to \\(0.9n\\) bits is:\n$$ 2^0 + 2^1 + \\ldots + 2^{0.9n} \\approx 2^{0.9n +1} $$The ratio of #of compressed possibilities to #of files: \\(2^{0.9n +1}\\)/\\(2^n\\) tends to 0 as \\(n \\rightarrow \\infty \\), so only a vanishing fraction of the files can be given a shorter compressed form.\nThe concept of Data compression means to reduce redundant data in a file, however most of the \\(2^n\\) possibilities are random and have no redundant data. Data compression will only work on files that actually contain redundancy to remove.\n","date":"14 January 2018","externalUrl":null,"permalink":"/blog/maths-why-most-files-cant-be-compressed/","section":"Blog","summary":"","title":"Coding  Theory - Why Most Files Can't Be Compressed","type":"blog"},{"content":"","date":"14 January 2018","externalUrl":null,"permalink":"/tags/coding-theory/","section":"Tags","summary":"","title":"Coding-Theory","type":"tags"},{"content":"Headless Plex server on a Raspberry Pi for all your own media to stream anywhere sound interesting? This will be brief because it\u0026rsquo;s more of a reference for myself. All the files I used are sitting on this github page! Feel free to open an issue (or google the problem 👍👍) if it\u0026rsquo;s not working for you.\nFirst, flash raspbian lite onto the drive using the image from the internet and win32 imager\nEdit wpa_supplicant.conf in a text editor to include your internet id and password. Add these 2 files in the root partition (this lets ssh \u0026amp; wifi start on load)\nI formatted any external drives to be on an ntfs format, named whatever you want.\nPlug everything in and power up the pi. 
Connect using PuTTY or some other SSH client (you can find the internal ip of the pi on your routers DHCP Table) using the default user and pass: pi \u0026amp; raspberry. (change these when you get in for security) Then we can mount our external drives using the following steps:\nCreate directories you want to keep all your mounted drives in using sudo mkdir /media/SOMETHING-1. Do this for each external drive. I have: pi@TOBY-rpi:/media $ ls -la total 16 drwxr-xr-x 3 root root 4096 Dec 27 22:57 . drwxr-xr-x 22 root root 4096 Dec 27 22:24 .. dr-xr-xr-x 1 root root 4096 Dec 26 15:09 Overflow Get the UUID of the disks by running the following command: sudo blkid to give something like: (note im only mounting Overflow, not Black-Box) pi@TOBY-rpi:/ $ sudo blkid /dev/mmcblk0p1: LABEL=\u0026#34;boot\u0026#34; UUID=\u0026#34;CDD4-B453\u0026#34; TYPE=\u0026#34;vfat\u0026#34; PARTUUID=\u0026#34;bc94ccd3-01\u0026#34; /dev/mmcblk0p2: LABEL=\u0026#34;rootfs\u0026#34; UUID=\u0026#34;72bfc10d-73ec-4d9e-a54a-1cc507ee7ed2\u0026#34; TYPE=\u0026#34;ext4\u0026#34; PARTUUID=\u0026#34;bc94ccd3-02\u0026#34; /dev/sda1: LABEL=\u0026#34;New Volume\u0026#34; UUID=\u0026#34;5C10-9CA7\u0026#34; TYPE=\u0026#34;exfat\u0026#34; PARTUUID=\u0026#34;14112f0f-01\u0026#34; /dev/sda2: LABEL=\u0026#34;Overflow\u0026#34; UUID=\u0026#34;8096E11896E11008\u0026#34; TYPE=\u0026#34;ntfs\u0026#34; PARTUUID=\u0026#34;14112f0f-02\u0026#34; /dev/sdb1: LABEL=\u0026#34;Black-Box\u0026#34; UUID=\u0026#34;D43ADA443ADA2372\u0026#34; TYPE=\u0026#34;ntfs\u0026#34; PARTUUID=\u0026#34;42751a44-01\u0026#34; Find the disk partition(s) from the list you want to add and note the UUID. Not the PARTUUID\nEdit the fstab file using the command sudo nano /etc/fstab. 
Add lines in the fstab file like this (change the #values#) for each of your drives\nUUID=#YOUR-UUID# #/media/SOMTHING-1# ntfs defaults,auto,umask=000,users,rw 0 0 I then added to the file the last line so it looked like this:\nproc /proc proc defaults 0 0 PARTUUID=bc94ccd3-01 /boot vfat defaults 0 2 PARTUUID=bc94ccd3-02 / ext4 defaults,noatime 0 1 UUID=8096E11896E11008 /media/Overflow ntfs defaults,auto,umask=000,users,rw 0 0 NOTE: If you boot up the pi with a drive in this file not attached it will fail to boot correctly and there will be no wifi, or any other connection (so no SSH). If this happens, reboot with all the drives attached and remove the UUID from the fstab file to allow a correct boot.\nRestart your pi with sudo reboot. Once you reboot the pi for the first time after making changes there might be a 3-4 min wait (while the pi looks through the disk(s)) after your reboot before the pi responds using SSH or over the web.\nYou should now be able to go check the files exist where you said they should exist. If so, CONGRATS! You can now put on the plex server, following this is how I finished off the rest of the project. No need to duplicate a good walkthrough.\nHappy Coding!\n","date":"27 December 2017","externalUrl":null,"permalink":"/blog/setting-up-external-drives-for-a-plex-server-on-a-raspberry-pi/","section":"Blog","summary":"","title":"Setting up external drives for a Plex server on a Raspberry Pi","type":"blog"},{"content":" The Data Compression course covers a variety of compression techniques that must be learned. Some are simple, and some are complicated, but all are not as hard as learning how computers actually work.\nLossless Techniques # Shannon Coding # Possibly the simplest, this is purely for research and isnt really used anywhere. 
We will start with the following properties:\n$$ A=\\{x_1,\\ x_2,\\ x_3\\} $$ $$ P=\\{\\frac{1}{2},\\frac{3}{8},\\frac{1}{8}\\} $$Now we start the steps:\nUsing the probabilities in \\(P\\) create the cumulative probabilities list \\(P_C\\), starting at \\(0\\): $$ P_C=\\{0,\\frac{1}{2},\\frac{7}{8}\\}=\\{0,\\ 0.5,\\ 0.875\\} $$ These will be used, once converted to binary, as the representations of our words. This will require us to know how many bits of the binary form to include, however; that is step 2.\nTo find the number of bits each word requires we need to find the self-information of the word. This is done using the self-information function: $$ I(x_i) = \\lceil-\\log_2(p_i)\\rceil $$ This gives: $$ I(A)=\\{1,2,3\\} $$ Which represents the lengths of each of the expansions for the probabilities found in step 1.\nNow we can convert \\(P_C\\) to binary up to the length given in \\(I(A)\\); this can be done any way you want. I use the multiplying expansion technique (not shown). This gives:\n$$ 0\\rightarrow 0\\quad | \\quad 0.5\\rightarrow 10\\quad | \\quad 0.875\\rightarrow 111 $$These are the expansions for the respective codewords: $$ x_1=0\\quad | \\quad x_2=10\\quad | \\quad x_3=111 $$ This gives a uniquely decodable prefix code but it may not be optimal. The average compression length is: $$ L_{avg}=\\sum^{|A|}_{i=1} l_i p_i = (1\\cdot \\frac{1}{2})+(2\\cdot \\frac{3}{8})+(3\\cdot \\frac{1}{8}) = \\frac{13}{8} = 1.625 $$ Shannon-Fano Coding # Shannon-Fano coding is very hard to describe mathematically. It leverages the property that binary trees will create prefix codes if the leaves represent words. 
we will start with a longer code alphabet than before: $$ A=\\{a,\\ b,\\ c,\\ d,\\ e,\\ f\\} $$ $$ P=\\{0.05,\\ 0.1,\\ 0.12,\\ 0.13,\\ 0.17,\\ 0.43\\} $$ The first step is to sort the alphabet by the probabilities: $$ A=\\{f,\\ e,\\ d,\\ c,\\ b,\\ a s\\} $$ $$ P=\\{0.43,\\ 0.17,\\ 0.13,\\ 0.12,\\ 0.1,\\ 0.05 \\} $$ Now the probabilities are sorted, make this the \u0026ldquo;root\u0026rdquo; of the tree, then split the list in half by weighting the probabilities so they\u0026rsquo;re equal on both sides. Now append 0 to the left groups members, and 1 to the right. $$ 0:[f:0.43,e:0.17]\\ |\\ 1:[d:0.13,c:0.12,b:0.1,a:0.05] $$ Repeat step 2 for each of the groups in until you get groups of order 1, this is essentially constructing a tree: $$ 00:[f:0.43] |\\ 01:[e:0.17]\\ |\\ 10:[d:0.13,c:0.12]\\ |\\ 11:[b:0.1,a:0.05] $$ $$ 00:[f:0.43]\\ |\\ 01:[e:0.17]\\ |\\ 100:[d:0.13]\\ |\\ 101:[c:0.12]\\ |\\ 110:[b:0.1]\\ |\\ 111:[a:0.05] $$ or, as a tree:\nThis gives a uniquely decodable prefix code but it may not be optimal, the average compression length is: $$ L_{avg}=\\sum^{|A|}_{i=1} l_i p_i = (2\\cdot 0.43)+(2\\cdot 0.17)+(3\\cdot 0.13)+(3\\cdot 0.12)+(3\\cdot 0.1)+(3\\cdot 0.05)= 2.4 $$ Hufman Coding # Hufman coding will always give an optimal tree, but again is hard to describe mathematically. It uses forrests of nodes that are joined in certain orders to create a tree the same as Shanon-Fano coding. We will use the same code as in the Shanon-Fano example: $$ A=\\{a,\\ b,\\ c,\\ d,\\ e,\\ f\\} $$ $$ P=\\{0.05,\\ 0.1,\\ 0.12,\\ 0.13,\\ 0.17,\\ 0.43\\} $$ A node has 2 properties; the contained elements \u0026amp; the cumulative probability of the elemens in the node. 
For all elements in the alphabet, create a node: $$ (a,\\ 0.05)\\quad (b,\\ 0.1)\\quad (c,\\ 0.12)\\quad (d,\\ 0.13)\\quad (e,\\ 0.17)\\quad (f,\\ 0.43) $$ Create a single new node, this is the parent of the 2 nodes in your set with the lowest cumulative probabilities: $$ (ab:(a,\\ 0.05)\\ (b,\\ 0.1),0.15)\\quad (c,\\ 0.12)\\quad (d,\\ 0.13)\\quad (e,\\ 0.17)\\quad (f,\\ 0.43) $$ Repeat step 2 over and over until you have a tree: $$ (ab:(a,\\ 0.05)\\ (b,\\ 0.1),0.15)\\quad (cd:(c,\\ 0.12)\\ (d,\\ 0.13),0.25)\\quad (e,\\ 0.17)\\quad (f,\\ 0.43) $$ $$ (abe:(ab:(a,\\ 0.05)\\ (b,\\ 0.1),0.15)\\ (e,\\ 0.17),0.32)\\quad (cd:(c,\\ 0.12)\\ (d,\\ 0.13),0.25)\\quad \\quad (f,\\ 0.43) $$ $$ \\ldots $$ add numbers to the tree branches and append these numbers until reaching a node. these are the codewords.\nIt is easier to see as a gif: once the final tree is given numbers can be applied to the vertacies and the codewords found. Because of the optional 0-1 choice this will not produce unique codes. We can produce the following coded alphabet: $$ a=0101,\\ b=0100,\\ c=001,\\ d=000,\\ e=011,\\ f=1 $$ (it can be seen by replacing 1s with 0s in the above that this is non unique). This wll also produce an optimal code: $$ L_{avg}=\\sum^{|A|}_{i=1} l_i p_i = (4\\cdot 0.05)+(4\\cdot 0.1)+(3\\cdot 0.12)+(3\\cdot 0.13)+(3\\cdot 0.17)+(1\\cdot 0.43) = 2.29 $$","date":"19 December 2017","externalUrl":null,"permalink":"/blog/maths-compression-techniques/","section":"Blog","summary":"","title":"Coding  Theory - Compression Techniques","type":"blog"},{"content":"So you have a ghost blog(or some other amazon web thing), and you\u0026rsquo;re on AWS ubuntu (or another Linux type instance) but you need to back it up. It would seem simple that AWS should offer you a solution, and there is one, just follow these steps:\nPt 1 - Easy Version # Using the AWS command line from the EC2 instance in question we can send files to (and from) a bucket in s3:\n1. 
# Make sure you have an AWS IAM role, I made an account just for this job and I used that.\n2. # Make sure you get the access keys for that IAM Role (put them somewhere safe) from the IAM console keys section (IAM \u0026gt; Users \u0026gt; Backup Role \u0026gt; Security Creds tab \u0026gt; Create access Key)\n3. # Make your bucket in the s3 console, name it something useful you\u0026rsquo;ll remember.\n4. # Log in to your EC2 instance, using ssh \u0026amp; the keys that were given to you when you launched the instance (I use putty, this is a good tutorial)\n5. # Make sure you have python (it comes on all the EC2 instances anyways):\npython -V 6. # Make sure you have pip:\npip -V \u0026hellip; Don\u0026rsquo;t have it? install it:\nsudo apt-get install python-pip 7. # Make sure you have aws-cli:\naws --version \u0026hellip; don\u0026rsquo;t have it? install it:\npip install awscli 8. # Now we need to set up the aws-cli for the IAM user who will be backing up the stuff\u0026hellip; Hopefully, we haven\u0026rsquo;t lost the keys from step #### 2. Now run the command\naws configure it will give you something like the below\u0026hellip; fill this in with the details you got in step 2.\nAWS Access Key ID \\[None\\]: \u0026lt;YOUR_ID\u0026gt; AWS Secret Access Key \\[None\\]: \u0026lt;YOUR_SECRET\u0026gt; Default region name \\[None\\]: \u0026lt;YOUR_REGION\u0026gt; Default output format \\[None\\]: json 9. # Right, now we are set up we can use the aws s3 cp command to copy things around the place using their API.\n10. # Navigate to the directory you want to back up, something like /var/www/ghost/content then we can run the command:\naws s3 cp ./ s3://bucket_name --recursive --dryrun 11. # This should spit a bunch of stuff onto the screen saying it\u0026rsquo;s copying the files, and you\u0026rsquo;re done! 
(not really)\u0026hellip; the --dryrun option just shows you what it would be doing; remove that, run it again and **YOU\u0026rsquo;RE DONE!!**­\nMake sure you check your bucket, it should come out like this:\nPt 2 - Use git! # I was thinking, were just moving stuff around to other storage facilities, wouldn\u0026rsquo;t it be lovely to keep it all in place I\u0026rsquo;m super good with restoring and moving files about? What do I use that does this all the time? GIT!\nAfter some thought, it was simple, in fact, if you understand git the idea alone should be enough to understand what you need to go do. If not I have provided a step by step below of the steps I took to reposetrize my ghost blog.\n1. # Go to your origin git location, like, say github, gitlab, or any of the many others (I used GitLab, its private by default \u0026amp; the logs should probably not be public) and create a repo like ghost-backup\n2. # SSH into your computer using PuTTY or some other SSH client, just like when using aws-cli.\n3. # Nav to the content files, usually cd /var/www/ghost/content\n4. # Now, we can run the following commands in turn to create a repo, set its origin (remember to change this bit), add the files to a commit, commit them and push to your origin. shall we begin?\nsudo git init sudo git remote add origin https://#git-origin-domain.com#/#username#/#project-name.git# sudo git add . sudo git commit -m \u0026ldquo;Initial commit\u0026rdquo; sudo git push -u origin master\nHey presto, and you\u0026rsquo;re done! your origin will now have a bunch of easily viewable content, this could potentially be shoved into a cron job with some SSH keys to automate this process of committing and pushing to a remote. Below has some ideas that solve this problem.\nPt 3 - Auto Backups # Crontabs are amazing, they\u0026rsquo;re little bits of code that you can make them run every now and then. 
we can leverage that to run an s3 command every week!\nFirstly we want to build the shell script which will back the site up; it can live anywhere, just remember where it is. I keep mine in the same folder as the ghost content (for reasons to do with backing up the script also when using git). Here\u0026rsquo;s what were going to do:\n1. # nav to the directory: cd /path/to/dir/\n2. # create file sudo touch ./backup.sh\n3. # Edit the file sudo nano ./backup.sh and add:\ncd /var/www/ghost/content/ aws s3 cp ./images s3://ghost-backup/images --recursive --dryrun aws s3 cp ./data s3://ghost-backup/data --recursive --dryrun This will nav to the content folder of ghost, then send the data \u0026amp; images to the s3 bucket (which has to be pre-built, see the section above on how to do that). If you remove the --dryrun command from the lines this will go live and actually push the files, for testing so it\u0026rsquo;s fine to leave it in but it won\u0026rsquo;t actually do any copying when run.\nRunning sudo sh /path/to/dir/file.sh should now dry-run your backup script \u0026amp; have a bunch of output. It will be telling you how much stuff is moving about. This should be just your images \u0026amp; data; this will cost a little bit on AWS so make sure you\u0026rsquo;re not coping with millions of items! If it\u0026rsquo;s working then great we can move on to setting up a crontab for the shell script.\nThe command sudo crontab -e will let you edit the root users crontab, so everything that runs will have the highest permissions. Here we can add the path to our shell script (or whatever other script you want) that will back up the site. By adding 0 0 * * 0 sudo sh /path/to/dir/file.sh to the crontab. We will be running the script every week on a Sunday. This question has lots of answers to how to schedule the scripts.\nPersonally, I tested the system with the --dryrun commands in place and made the crontab run my script every minute. 
Then I could check it was running using grep CRON /var/log/syslog to see if my script ran. Once it was running how I wanted I removed the --dryrun and changed the frequency. Googling how CRON works is a good idea to make sure your scripts are running as needed.\nNote: s3 has a limit on the number of free pushes you can have, CHECK THIS BEFORE JUST LEAVING IT RUNNING\u0026hellip; BILLS CAN RUN AWAY FROM YOU IF YOU\u0026rsquo;RE SENDING LOTS OF DATA\nI currently spend about $1 a year on my s3 pushes; it\u0026rsquo;s not much but there\u0026rsquo;s not a built-in thing for it yet so it\u0026rsquo;s worth it.\n","date":"14 December 2017","externalUrl":null,"permalink":"/blog/backing-up-a-ghost-blog-on-aws-ec2-to-s3/","section":"Blog","summary":"","title":"Backing up a Ghost blog (or anything) on AWS EC2 to S3","type":"blog"},{"content":"","date":"14 December 2017","externalUrl":null,"permalink":"/tags/pdes/","section":"Tags","summary":"","title":"Pdes","type":"tags"},{"content":" Ever wondered what the uses for Taylor expansions are in the field of differential equations? No? Well you should, it\u0026rsquo;s rather fascinating\u0026hellip;\nFirst, what is a Taylor expansion? Well, basically, it says that if you\u0026rsquo;re trying to evaluate a function at a point that\u0026rsquo;s \u0026ldquo;close enough\u0026rdquo; to a point you already know, you\u0026rsquo;ll be able to represent this slight difference as an infinite series:\n$$ (1)\\quad f(x+h)=f(x)+hf'(x)+\\frac{h^2}{2!}f''(x)+O(h^3) $$ Or if you\u0026rsquo;re more comfortable with summation notation: $$ (2)\\quad f(x+h)=\\sum^{\\infty}_{n=0}\\frac{h^n}{n!}f^{(n)}(x) $$Here \\(O(h^3)\\) represents an arbitrary function of order \\(h^3\\) (basically just a function in \\(h\\) with the smallest \\(h\\) term being \\(h^3\\)); this is important and we will use it in a bit\u0026hellip;\nSo how does this relate to a PDE? Well, we can estimate derivatives of a function using this method, right? 
(as long as everything is nicely behaved) Take eqn (1) above, rearrange it, get rid of a few terms \u0026amp; this will give you:\n$$ (3)\\quad f'(x)=\\frac{f(x+h)-f(x)}{h}-\\frac{h}{2!}f''(x)+O(h^2) $$ or, dropping the error terms and setting \\(F'(x)\\approx f'(x)\\) $$ (4)\\quad F'(x)=\\frac{f(x+h)-f(x)}{h} $$From here we can see that we can somewhat accurately calculate a derivative of a function using its values at 2 points that are \u0026ldquo;close enough\u0026rdquo; together (this can be done in both directions by using \\(-h\\) instead). That\u0026rsquo;s cool, but now what?\nSee this grid? This is a plane where a PDE lives. We can solve the PDE at a point on this plane if we use the above formulas in a clever way. Imagine we\u0026rsquo;re just working on the line \\(y=0\\) and we want to work out \\(f'(0.3)\\) where \\(f(x)=x^2\\), how do we do this?\nRemember that we can approximate \\(f'(x)\\) using (4). Set \\(h\\) to be 0.1, the spacing between the grid points. Then we get $$ F'(0.3)=\\frac{f(0.3+0.1)-f(0.3)}{0.1} $$ which gives \\(f'(0.3)\\approx F'(0.3)=\\frac{(0.4)^2-(0.3)^2}{0.1}= 0.7 \\approx 2(0.3)\\) and we have our approximate answer at \\(x=0.3\\). We can even make this more accurate by taking our expansion in \\(+h\\) and subtracting our expansion in \\(-h\\), essentially taking the difference between the 2 points either side of the position, \\(x\\), we\u0026rsquo;re looking at. We get:\nForward: \\(f(x+h)=f(x)+hf'(x)+\\frac{h^2}{2!}f''(x)+O(h^3)\\)\nBackward: \\(f(x-h)=f(x)-hf'(x)+\\frac{(-h)^2}{2!}f''(x)+O(h^3)\\)\nso we can take the difference:\n$$ f(x+h)-f(x-h)=2hf'(x)+O(h^3) $$$$ \\Rightarrow f'(x)=\\frac{f(x+h)-f(x-h)}{2h}+O(h^2) $$or setting \\(F'(x)\\approx f'(x)\\)\n$$ (5)\\quad F'(x)=\\frac{f(x+h)-f(x-h)}{2h} $$It\u0026rsquo;s a much nicer result because our \\(O(h^2)\\) error term will decay faster than the \\(O(h)\\) error in (4) as \\(h \\rightarrow 0\\). 
We can show that the error we get is much smaller if we try again, so let\u0026rsquo;s do the example of \\(f'(0.3)\\) where \\(f(x)=x^2\\) again:\nWith \\(F'(x)\\approx f'(x)\\) we can use eqn (5). Subbing everything in we get \\(F'(0.3)=\\frac{f(0.3+0.1)-f(0.3-0.1)}{2(0.1)}=\\frac{(0.4)^2-(0.2)^2}{0.2}=0.6\\), which is exactly \\(2(0.3)\\). TADAAA! This 2 sided approach is called the centered difference approximation and is much more accurate, so it should be used if possible. Hint: if you\u0026rsquo;re trying to calculate an approximation for \\(f'(x)\\) and \\(f(x\\pm h)\\) doesn\u0026rsquo;t exist (at the edge of the grid, say), then a problem has been encountered and must be solved. There are ways of solving these that we will come on to.\nBut first I hear you ask \u0026ldquo;Toby, is there a way of doing this for approximating a second derivative?\u0026rdquo; to which I reply: \u0026ldquo;why yes, it\u0026rsquo;s easy enough, just rearrange the Taylor expansion for the particular derivative you need!\u0026rdquo;. I have lied to you though, it is not \u0026ldquo;easy enough\u0026rdquo;; with a simple push however it can be. So what are we really looking for? We want to find this approximation to the second derivative, or \\(F''(x)\\approx f''(x)\\). And we want this \\(F''(x)\\) as a linear combination of the points \\(f(x-h),\\ f(x),\\ f(x+h)\\) or, in a more verbose definition:\n$$ F''(x)=Af(x-h) + Bf(x) + Cf(x+h) $$ which then gives, after expanding these points out using the forward and backward expansions given above: \\(F''(x)=A\\large{[} f(x) - hf'(x) + \\frac{(-h)^2}{2!} f''(x) + O(h^3) \\large{]}\\) \\(\\qquad \\quad +B\\,f(x)\\) \\(\\qquad \\quad +C\\large{[} f(x) + hf'(x) + \\frac{h^2}{2!} f''(x) + O(h^3) \\large{]}\\)\nCollecting terms in this expansion provides a nice set of coefficients to solve for: $$ F''(x)=[A+B+C]\\, f(x)+h[C-A]\\, f'(x)+\\frac{h^2}{2!}[A+C]\\, f''(x)+[A+C]\\, O(h^3) $$We want to set \\((A+B+C)=0\\), \\((C-A)=0\\) and \\(\\frac{h^2}{2!}[A+C]=1\\); the rest of the combos make up the error, so they can be ignored. 
Solving all of these we can just set:\n\\(A = C = \\frac{1}{h^2}\\) by the second two eqns. \\(B = -(A+C) = -\\frac{2}{h^2}\\) by the first. These coefficients allow the approximation for \\(f''(x) \\approx F''(x)\\) to be: $$ F''(x)=Af(x-h) + Bf(x) +Cf(x+h)=\\frac{f(x-h) - 2f(x) +f(x+h)}{h^2} $$The motive for taking centered approximations can be seen below. It\u0026rsquo;s obvious the red line has a gradient closer to \\(f'(4)\\) than the other two lines.\nFrom this point on we can increase the dimension of \\(f(x)\\) to be a function of 2 variables: \\(f(x,y)\\), and we can expand \\(f\\) in both \\(x\\) (as we did before), and \\(y\\) (by using the spacing of \\(\\pm k\\) instead of \\(\\pm h\\)). Then we can calculate partial derivatives in the same manner: expand the Taylor series and rearrange for an approximation. Only this time instead of arbitrary spacing we use the concept of creating a mesh with points spaced out over our domain; this is known as discretizing the domain (the grid above is a prime example). This then gives:\n$$ f(x\\pm h,y) = f(x,y) \\pm h \\frac{\\partial f(x,y)}{\\partial x} + \\frac{h^2}{2!} \\frac{\\partial^2 f(x,y)}{\\partial x^2} + O(h^3) $$ and $$ f(x,y \\pm k) = f(x,y) \\pm k \\frac{\\partial f(x,y)}{\\partial y} + \\frac{k^2}{2!} \\frac{\\partial^2 f(x,y)}{\\partial y^2} + O(k^3) $$Note that the partial derivatives are actually evaluated at a given \\(x\\) and \\(y\\), so it may be more obvious what\u0026rsquo;s going on using: \\(\\frac{\\partial f(x,y)}{\\partial x} = \\frac{\\partial f}{\\partial x}\\mid_{x,y}\\)\n","date":"14 December 2017","externalUrl":null,"permalink":"/blog/taylor-expansions-in-pdes/","section":"Blog","summary":"","title":"Taylor Expansions in PDEs","type":"blog"},{"content":" This post is a general discussion on how genetic algorithms work and how to model them. Typically GAs are built to solve a single problem, however the concept of genetic improvement can be extended into building functionality too 
(this isn\u0026rsquo;t a topic of study here, so check out this site for context on that.)\nDuring my studies I even built a new Python module to construct a nice framework for GAs.\nGenetic algorithms are a family of optimisation heuristics that crop up in machine learning, for example in feature selection. Techniques for generating solutions to problems with genetic algorithms typically revolve around heuristically improving members of a population which represent these solutions. The concepts of a genetic algorithm come from nature; like nature we create a survival of the fittest competition to evaluate a population, then kill off the weakest members. After this cull we create offspring from the most successful members or introduce new members from a predefined source. This process is then repeated until we stop, or forever in the case of nature.\nWe say a genetic algorithm is structured in the following way. Given a population, \\(P\\), whose members each have unique genes, and a number of generations, \\(G\\in \\mathbb{N}\\), the algorithm will run \\(G\\) loops of scoring and potentially removing each of the members of the population. It does this by using a heuristic function \\(f(p_i)\\mapsto \\mathbb{R}, \\ p_i \\in P\\). This function, \\(f\\), is defined beforehand in a way which describes the goal of our investigation. We also define a cutoff or bottleneck \\(b\u003c|P|\\), such that on conclusion of scoring the population, the top \\(b\\) ranking members by score are kept and the rest discarded. Once we have removed this part of our population we can rebuild it using a series of crossovers and mutations (and possibly by introducing new members into the population).\nCrossovers take in 2 members of the population and return a new member based on some parameters of the 2 \u0026lsquo;parents\u0026rsquo;. For example, our crossover takes the first half of a sequence from one and the second half from the other, merging them to form the third. 
Mutations allow a (possibly targeted) change in a single member of the population. A mutation has 2 parameters, a potency \\(M_p \\in \\mathbb{R}_{\u003e0}\\) and a frequency \\(M_f \\in [0,1]\\). \\(M_p\\) describes how strong the mutation is: the higher it is, the larger the change to the member. \\(M_f\\) is the fraction of the population that gets mutated. $$ x=1000 $$","date":"13 December 2017","externalUrl":null,"permalink":"/blog/genetic-algorithms/","section":"Blog","summary":"","title":"Genetic Algorithms","type":"blog"},{"content":"The tobydevlin.com website is the main product of this large experiment in web design, service building and self tutoring. Understanding web development is pretty crucial to getting a cushy dev job once you graduate so I\u0026rsquo;m teaching myself 📈📈. Hopefully, if I\u0026rsquo;m good enough, not only will the main page be up, but this blog will be around too 💪.\nPlease note: Products are mentioned here because I like using them \u0026amp; they work for me, not because anyone paid me to tell you they\u0026rsquo;re good! 🙄\nWhat Am I Making? # The website will be built using the modern SPA (single page application) design, so it\u0026rsquo;s really more of a webpage than a whole site. This page then needs some web services sitting on a server somewhere that can provide it with data and security services. By creating my site like this **I\u0026rsquo;ve totally invented a microservice structure!** Making things into a network and distributing the processes around the place means things are flexible and easily maintainable.\nServices can be built with anything, there are frameworks in lots of languages:\nNode.js Java Python Go Ruby PHP Rust The list goes on\u0026hellip; They are usually bits of code running on a server somewhere (like in the cloud) that just respond to anything that gets asked of them. 
It\u0026rsquo;s just a name though, as these **microservices could be pretty huge** codebases if the service is complex. If you build it wrong it will mostly consist of 500 errors and lots of unhappy customers, so get these bits right in the design before starting your build. Personally I like Python and Node for services because of their flexibility, though Java can be powerful when it needs to be \u0026amp; I don\u0026rsquo;t know the other languages on the list.\nThese websites (or in my case SPA sites), on the left, could be anything that needs the services on the right (tho not this blog, it\u0026rsquo;s a little different and is more like a service for access flexibility 📲). They will all be static pages \u0026amp; are conceptually different to services; there\u0026rsquo;s no server side processing built inside them and it\u0026rsquo;s easy enough to **just shove content in an AWS S3 bucket or on a GitHub page** and point your DNS to them. (Jekyll; I\u0026rsquo;ll be writing a post on Google\u0026rsquo;s new way of doing things soon)\n**For a blog site it\u0026rsquo;s even easier!** All you do is look for a framework that suits your needs, follow the steps they detail to make your content, go to your hosting provider and ask them how to point your domain at the hosted content, do that and then you\u0026rsquo;re done! 🏁🏁🏁 You will have a page on your domain, hosted by someone (maybe you), and then you can just focus on content. **How techie you want to be is up to you**; ghost is what this blog sits on and is perfect for what I want to do but has its limits.\nWhat About Making Custom Pages? # Like I mentioned previously, there are lots of parts to a website; the front end however wants to be personal and beautiful. **tobydevlin.com wants to be something infinitely customizable**. The hardest part about writing a website is actually writing it. 
For tobydevlin.com and some of its static subdomains I\u0026rsquo;ll be using a few major frameworks surrounding a JavaScript, HTML \u0026amp; CSS core:\nFramework | Reason\nYeoman | Easily allows me to scaffold out template pages with a few lines in the terminal.\nReact.js | Fully functional component based UI designer, not just limited to the web, for easily porting ideas to native apps.\nWebpack | The requirement to be static is huge, and this is a super easy way to compress and minify the dependency codebase.\nPostCSS | A powerful way to write css (we also sass)\n**Yeoman is a super cool idea**: have a tech stack you think someone has probably thought of before and is probably common because it works? Check here and if there\u0026rsquo;s a generator for the stack you\u0026rsquo;re in luck! Follow the tutorials and plop down a bunch of code in a file (I\u0026rsquo;m using this generator). If you follow the steps listed in the Yeoman tutorial you\u0026rsquo;ll get something like this:\nIf you did, nice! ✋ This is just a boilerplate; you can **mess about with the code** by writing your data structures, adding npm modules for interesting features like async data requests to your services and more! The only limitation is your imagination!! (and computing power, budget, client requests, etc.. but ignore them for now, this is for fun!). During your build there are some tools that come along with this generator that make your life super easy:\n1. npm start | Run a dev server with live reload on changes\n2. npm run dist then npm run serve:dist | Build the dist version and copy static files, then start the dev-server with the dist version\n3. npm test or npm run test:watch | Run all unit tests, or auto-run unit tests on file changes\nThis is kinda where I\u0026rsquo;m at overall, now I\u0026rsquo;m just playing around. React has a steep learning curve so putting together the basics is getting the best of me, but it will probably be worth it in the end. 
Check back in a few years and I might have a second page! 🚒🚒🚒\n","date":"25 November 2017","externalUrl":null,"permalink":"/blog/building-the-website/","section":"Blog","summary":"","title":"Building tobydevlin.com","type":"blog"},{"content":" This will mostly concern the section on linear codes in the Coding Theory \u0026amp; Data Compression course at Cardiff University. It is expected the reader knows about some sections of coding theory; there isn\u0026rsquo;t background reading on this blog\u0026hellip; yet!\nThings to know to start:\nThe alphabet we will be using is the set \\(F_q\\) where \\(q\\) is prime.\nWe will regard the vector space \\(V(n,q)\\) as the set of words \\((F_q)^n\\); a vector in \\(V(n,q)\\) denoted \\((x_1,x_2,...,x_n)\\) will be written as \\(x_1 x_2 ... x_n\\).\nA code \\(C\\) is just a subset of the space \\(V(n,q)\\). This code will be a linear code if and only if it is itself a vector space under the same operations as \\(V(n,q)\\).\nA binary code will be linear if and only if the sum of any two words is itself a word in the original set, i.e. \\( \\forall x,y \\in C \\quad (x + y) \\in C \\)\nNote:\nA \\(q\\)-ary \\([n,k,d]\\) code is also a \\(q\\)-ary \\((n,q^k,d)\\) code (by a theorem on bases of subspaces). This is not two way: \\([n,k,d] \\Rightarrow (n,q^k,d)\\) but \\( (n,q^k,d) \\not\\Rightarrow [n,k,d]\\).\nThe \\(\\bf{0}\\) vector is automatically in any linear code.\nLinear Codes may be referred to as Group Codes in some texts.\nThe terms \u0026ldquo;word\u0026rdquo; and \u0026ldquo;vector\u0026rdquo; are synonyms in the context of a linear code.\n
The weight, \\(w(\\bf{x})\\), of a word in \\(V(n,q)\\) is defined to be the number of non-zero entries in the word: \\(\\bf{x}=111010\\) has \\(w(\\bf{x})=4\\).\nOne of the most useful properties of a linear code is that \\(d(C)\\), its minimum distance, is equal to the smallest of the weights of its non-zero codewords: \\(d(C)= \\min w(\\bf{x}) \\quad \\forall \\bf{x} \\in C,\\ \\bf{x} \\neq \\bf{0}\\).\n","date":"24 November 2017","externalUrl":null,"permalink":"/blog/maths-coding-theory-linear-codes/","section":"Blog","summary":"","title":"Coding  Theory - Linear Codes","type":"blog"},{"content":" So, after (many) hours of not paying attention to my lecturer, I\u0026rsquo;ve finally managed to get this ghost thing working. Maybe I\u0026rsquo;ll write a nice little piece on it in the future. For now, I will just be using this place as a way of keeping track of things that happen during my time in uni.\nRoadmap For This Post: # Get cool maths stuff working\nPut some sweet images on this thing to make it fun\nGet some meta stuff and learn SEO (make a post about that too!)\nPut in some interesting code that works\nSo there it is, everything I\u0026rsquo;ll want to get out the way to start with. Shall we begin?\nCool Maths Stuff # As far as \u0026ldquo;cool\u0026rdquo; goes I don\u0026rsquo;t know, but here\u0026rsquo;s a nice formula I learned for a perfect number (a number equal to the sum of its divisors other than itself), which when I get around to making LaTeX work, will have a delightfully nice style to it:\n$$ x = \\sum_i a_i \\quad \\forall a_i \\in \\mathbb{N}\\backslash \\{ x\\} \\textrm{ where } \\frac{x}{a_i} \\in \\mathbb{Z} $$for example, the first perfect number, 6:\n$$ 6 = 1+2+3 \\textrm{ where } \\{ \\tfrac{6}{1} = 6,\\tfrac{6}{2} = 3, \\tfrac{6}{3} = 2 \\in \\mathbb{Z} \\} $$and voila, it just kinda works!\nSome Sweet Images # Here is a sweet image that I can add in. This even comes from my own camera!\nBefore one uploads images to their blog, one must fix all the problems related to uploading images to their blog. 
- Toby Devlin 2017\nFun fact: the Hungarian parliament building is ranked as the world\u0026rsquo;s most impressive parliament building by Wow Travel. And yes, it\u0026rsquo;s as nice on the inside as it looks on the outside.\nSo You Want To Know Code # Have you or somebody you know been suffering from Python multiprocessing issues in Windows?\nHave you or somebody you know woken up to a broken build when it wasn\u0026rsquo;t your fault?\nIf so you may be entitled to financial compensation, call us now on\u0026hellip;. WAIT NO! Just put these lines in the module/script:\nfrom multiprocessing import freeze_support\nif __name__ == \u0026#34;__main__\u0026#34;: freeze_support()\nThis will fix all your woes and everything should be fine and dandy. The docs have some pretty in depth stuff on multiprocessing and why it breaks on Windows but there\u0026rsquo;s too much to explain in this section. Happy reading!\nWant to make a certificate? Try this:\nopenssl req -newkey rsa:2048 -nodes -keyout private_key.pem -x509 -days 365 -out public_cert.pem And there we have it, the first post on the site that looks a little nerdy and should probably have been shorter based on how much revision I\u0026rsquo;m not doing. Maybe I\u0026rsquo;ll combine procrastinating and doing revision into one\u0026hellip; procrastavison! That sounds like a brilliant idea!\n","date":"22 November 2017","externalUrl":null,"permalink":"/blog/hi-there/","section":"Blog","summary":"","title":"Hello World!","type":"blog"},{"content":"Hi there! 
I\u0026rsquo;m an AI, data and platforms consultant based in London - Currently I\u0026rsquo;m the AI, data and platforms guy at Hippo Lab; building secure proactive healthcare systems for millions of people using billions of data points.\nThis site is a collection of notes, ideas, and deep dives into engineering, systems design, and platform architecture.\n","externalUrl":null,"permalink":"/","section":"","summary":"","title":"","type":"page"},{"content":"","externalUrl":null,"permalink":"/authors/","section":"Authors","summary":"","title":"Authors","type":"authors"},{"content":"","externalUrl":null,"permalink":"/blog/","section":"Blog","summary":"","title":"Blog","type":"blog"},{"content":"","externalUrl":null,"permalink":"/categories/","section":"Categories","summary":"","title":"Categories","type":"categories"}]