<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>airflow Archives - Turbolab Technologies</title>
	<atom:link href="https://turbolab.in/tag/airflow/feed/" rel="self" type="application/rss+xml" />
	<link>https://turbolab.in/tag/airflow/</link>
	<description>Big Data and News Analysis Startup in Kochi</description>
	<lastBuildDate>Fri, 26 Jul 2024 10:00:38 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://i0.wp.com/turbolab.in/wp-content/uploads/2018/03/turbo_black_trans-space.png?fit=32%2C32&#038;ssl=1</url>
	<title>airflow Archives - Turbolab Technologies</title>
	<link>https://turbolab.in/tag/airflow/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">98237731</site>	<item>
		<title>How to monitor work-flow of scraping project with Apache-Airflow</title>
		<link>https://turbolab.in/how-to-monitor-work-flow-of-scraping-project-with-apache-airflow/</link>
					<comments>https://turbolab.in/how-to-monitor-work-flow-of-scraping-project-with-apache-airflow/#respond</comments>
		
		<dc:creator><![CDATA[Vasista Reddy]]></dc:creator>
		<pubDate>Wed, 22 Dec 2021 08:16:05 +0000</pubDate>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[airflow]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[monitor]]></category>
		<category><![CDATA[scraping]]></category>
		<guid isPermaLink="false">https://turbolab.in/?p=823</guid>

					<description><![CDATA[<p>Apache Airflow is a platform to programmatically author, schedule, and monitor workflows. In this blog, we will discuss handling the workflow of scraping yelp.com with Apache Airflow. Quick setup of Airflow on Ubuntu 20.04 LTS # make sure your system is up-to-date sudo apt update sudo apt upgrade # install airflow dependencies  sudo apt-get install libmysqlclient-dev [&#8230;]</p>
<p>The post <a href="https://turbolab.in/how-to-monitor-work-flow-of-scraping-project-with-apache-airflow/">How to monitor work-flow of scraping project with Apache-Airflow</a> appeared first on <a href="https://turbolab.in">Turbolab Technologies</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p class="graf graf--p">Apache Airflow is a platform to programmatically author, schedule, and monitor workflows.</p>
<p class="graf graf--p">In this blog, we will discuss handling the workflow of scraping <strong><a class="markup--anchor markup--p-anchor" href="https://www.yelp.com/" target="_blank" rel="noopener">yelp.com</a></strong> with Apache Airflow.</p>
<h2 class="graf graf--h3">Quick setup of Airflow on Ubuntu 20.04 LTS</h2>
<p># make sure your system is up-to-date</p>
<blockquote>
<pre class=" prettyprinted"><span class="pln">sudo apt update
sudo apt upgrade</span></pre>
</blockquote>
<p><em># install airflow dependencies </em></p>
<blockquote>
<pre class="je jf jg jh ji ki kj kk"><span id="e9fa" class="he kl km gh kn b dt ko kp s kq">sudo apt-get install libmysqlclient-dev</span></pre>
<pre class="je jf jg jh ji ki kj kk"><span id="efb9" class="he kl km gh kn b dt kr ks kt ku kv kp s kq">sudo apt-get install libssl-dev</span></pre>
<pre class="je jf jg jh ji ki kj kk"><span id="01af" class="he kl km gh kn b dt kr ks kt ku kv kp s kq">sudo apt-get install libkrb5-dev</span></pre>
</blockquote>
<p class="graf graf--h3"><em># create a virtual environment and install Airflow using pip</em></p>
<blockquote>
<pre class=" prettyprinted"><span class="pln">sudo apt install python3</span><span class="pun">-</span><span class="pln">virtualenv
virtualenv airflow_test
cd airflow_test</span><span class="pun">/</span><span class="pln">
source bin/activate
</span><span class="kwd">export</span><span class="pln"> AIRFLOW_HOME</span><span class="pun">=~/</span><span class="pln">airflow # set Airflow home
pip3 install apache</span><span class="pun">-</span><span class="pln">airflow
pip3 install typing_extensions
airflow db init # initialize the db</span></pre>
</blockquote>
<p class="graf graf--p">The <strong class="markup--strong markup--p-strong">db, unittests, logs, and configuration (cfg)</strong> files will be generated inside <strong class="markup--strong markup--p-strong">AIRFLOW_HOME</strong>.</p>
<p class="graf graf--h4"># <em>Start a WebServer &amp; Scheduler</em></p>
<blockquote>
<pre class="graf graf--pre"><em>airflow webserver -p 8080 # start the webserver</em></pre>
<pre class="graf graf--pre"><em>airflow scheduler # start the scheduler
</em></pre>
</blockquote>
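<p class="graf graf--p">By default, the web server is reachable at localhost on the port given above. To use a different host or port, pass them with <code class="markup--code markup--p-code">-H</code> and <code class="markup--code markup--p-code">-p</code>, like this:</p>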
<blockquote>
<p>airflow webserver -H xxx.xxx.xxx.xxx -p 9005</p>
</blockquote>
<p class="graf graf--p">Check the quick installation guide <a href="https://airflow.apache.org/docs/apache-airflow/1.10.12/start.html"><strong>here</strong></a>.</p>
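<p class="graf graf--p">If everything goes well, the Apache Airflow web interface is available at the address below. The <code class="markup--code markup--p-code">/admin/</code> path applies to Airflow 1.10; on Airflow 2.x the interface is served from the root URL.</p>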
<pre class="graf graf--pre"><em><strong><a class="markup--anchor markup--pre-anchor" href="http://localhost:8080/admin/" target="_blank" rel="nofollow noopener">http://localhost:8080/admin/</a> # web-server</strong></em></pre>
<figure class="graf graf--figure graf--layoutOutsetCenter">
<figure style="width: 1500px" class="wp-caption aligncenter"><img data-recalc-dims="1" fetchpriority="high" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1500/1%2AOgwTYIIY1G4QghIE9TPl_Q.png?resize=800%2C411&#038;ssl=1" alt="" width="800" height="411" /><figcaption class="wp-caption-text">Airflow WebServer</figcaption></figure>
<figcaption class="imageCaption"></figcaption>
</figure>
<p class="graf graf--p">Everything in Airflow works as <strong class="markup--strong markup--p-strong">DAGs</strong> (Directed Acyclic Graphs). We need to create a DAG with a unique dag_id and attach the tasks to it. Simply put, a DAG is the collection of tasks we want to run. Parameters such as <strong>schedule_interval</strong>, <strong>start_date</strong>, and <strong>owner</strong> can also be passed to the DAG object, either directly or through <strong>default_args</strong>.</p>
<p class="graf graf--p">Create a folder named <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">dags</strong></code> inside <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">AIRFLOW_HOME</strong></code>. By default, the Scheduler checks this folder for new DAG files every 300 seconds; any new DAGs it finds appear in the web server.</p>
<figure id="attachment_832" aria-describedby="caption-attachment-832" style="width: 1077px" class="wp-caption alignnone"><img data-recalc-dims="1" decoding="async" data-attachment-id="832" data-permalink="https://turbolab.in/how-to-monitor-work-flow-of-scraping-project-with-apache-airflow/screenshot-from-2021-12-22-14-12-05/" data-orig-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?fit=1077%2C194&amp;ssl=1" data-orig-size="1077,194" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Screenshot from 2021-12-22 14-12-05" data-image-description="" data-image-caption="&lt;p&gt;Airflow Scheduler&lt;/p&gt;
" data-large-file="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?fit=800%2C144&amp;ssl=1" class="size-full wp-image-832" src="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?resize=800%2C144&#038;ssl=1" alt="" width="800" height="144" srcset="https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?w=1077&amp;ssl=1 1077w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?resize=300%2C54&amp;ssl=1 300w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?resize=768%2C138&amp;ssl=1 768w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?resize=1024%2C184&amp;ssl=1 1024w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?resize=980%2C177&amp;ssl=1 980w, https://i0.wp.com/turbolab.in/wp-content/uploads/2021/12/Screenshot-from-2021-12-22-14-12-05.png?resize=480%2C86&amp;ssl=1 480w" sizes="(max-width: 800px) 100vw, 800px" /><figcaption id="caption-attachment-832" class="wp-caption-text">Airflow Scheduler</figcaption></figure>
<p class="graf graf--p">We are going to create a workflow to scrape <strong class="markup--strong markup--p-strong">yelp.com</strong> for business listings &amp; save the data to MongoDB.</p>
<p class="graf graf--p">The code used in this tutorial to scrape <strong class="markup--strong markup--p-strong">yelp.com</strong> is available <a href="https://gist.githubusercontent.com/Vasistareddy/26a37b841e93756ab3256022e6daa09d/raw/a75b6b277ed64c953e09094e60e5f18d1789573a/yelp_search.py"><em><strong>here</strong></em></a>.</p>
<h2 class="graf graf--h3">Creation of DAG</h2>
<blockquote>
<pre class="graf graf--pre"><em>from airflow import DAG
from datetime import datetime</em></pre>
<pre class="graf graf--pre"><em># dag creation
default_args = {'owner': 'turbolab', 'start_date': datetime(2019, 1, 1), 'depends_on_past': False}
_yelp_workflow = DAG('_yelp_workflow', catchup=False, schedule_interval=None, default_args=default_args) # creating a DAG</em></pre>
</blockquote>
<figure class="graf graf--figure graf--layoutOutsetCenter">
<figure style="width: 1500px" class="wp-caption aligncenter"><img data-recalc-dims="1" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1500/1%2AoS8FS0p8EM1o3XYfnAxrNA.png?resize=800%2C218&#038;ssl=1" alt="" width="800" height="218" /><figcaption class="wp-caption-text">DAG Created</figcaption></figure>
<figcaption class="imageCaption"></figcaption>
</figure>
<p class="graf graf--p">The <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">_yelp_workflow</strong></code> DAG is created. <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">schedule_interval=None</strong></code> means the DAG runs only when it is triggered manually. Other options include presets such as <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">@daily, @weekly,</strong></code> or a cron schedule such as <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">“* * * */2 1”</strong></code>. To learn about <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">catchup, depends_on_past</strong></code>, see the Airflow documentation <a class="markup--anchor markup--p-anchor" href="https://airflow.apache.org/scheduler.html" target="_blank" rel="noopener"><strong class="markup--strong markup--p-strong">here</strong></a><strong class="markup--strong markup--p-strong">.</strong></p>
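<p class="graf graf--p">A cron schedule has five space-separated fields: minute, hour, day of month, month, and day of week. The sketch below only labels the fields of the expression quoted above so it is easier to read; the helper name <code class="markup--code markup--p-code">label_cron</code> is made up for this illustration and is not part of Airflow.</p>

```python
# Field names of a standard five-field cron expression, in order.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def label_cron(expression):
    """Return a dict mapping each cron field name to its value."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError("expected 5 cron fields, got %d" % len(parts))
    return dict(zip(CRON_FIELDS, parts))


# The expression from the text: every minute, every second month, on Mondays.
fields = label_cron("* * * */2 1")
for name in CRON_FIELDS:
    print(name, "=", fields[name])
```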
<h2 class="graf graf--h3">Task Creation</h2>
<p class="graf graf--p">With Airflow&#8217;s set of <a class="markup--anchor markup--p-anchor" href="https://airflow.apache.org/concepts.html#operators" target="_blank" rel="noopener"><strong class="markup--strong markup--p-strong">operators</strong></a><strong class="markup--strong markup--p-strong">, </strong>we can define the tasks of a DAG workflow. An operator describes a single task in a workflow. While DAGs describe how to run a workflow, <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">Operators</strong></code> determine what actually gets done. For example, <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">PythonOperator</strong></code> calls a Python function, <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">EmailOperator</strong></code> sends an email, <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">BashOperator</strong></code> runs a Bash command, and <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">MySqlOperator</strong></code> executes a SQL statement.</p>
<p class="graf graf--p">Generally, operators run independently, in the order specified, without sharing information. If sharing absolutely can’t be avoided, Airflow has a feature for operator cross-communication called <strong class="markup--strong markup--p-strong"><em class="markup--em markup--p-em">XCom.</em></strong></p>
<blockquote>
<pre class="graf graf--pre"><em>from airflow.models import Variable
from airflow.operators.python import PythonOperator  # Airflow 2.x; on 1.10 use airflow.operators.python_operator

def url_generator(**kwargs):
    <strong class="markup--strong markup--pre-strong">""" 
    generating the yelp url to find the business listings with place and search_query 
    {'place': 'Location | Address | zip code'}
    {'search_query': "Restaurants | Breakfast &amp; Brunch | Coffee &amp; Tea | Delivery | Reservations"}
    """</strong>
    # both values are read from Airflow Variables set by the user
    place = Variable.get("place")
    search_query = Variable.get("search_query")
    yelp_url = "https://www.yelp.com/search?find_desc={0}&amp;find_loc={1}".format(search_query, place)
    return yelp_url  # the return value is pushed to XCom for downstream tasks</em></pre>
<pre class="graf graf--pre"><em><strong class="markup--strong markup--pre-strong"># defining a task
yelp_url_generator = PythonOperator(
    task_id='url_generator',
    python_callable=url_generator,
    provide_context=True,  # needed on Airflow 1.10; Airflow 2 passes the context automatically
    dag=_yelp_workflow)</strong></em></pre>
</blockquote>
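<p class="graf graf--p">Search terms such as "Coffee &amp; Tea" contain characters that have a special meaning inside a URL. One way to handle that, sketched here outside Airflow, is to build the query string with the standard library. The function name <code class="markup--code markup--p-code">build_yelp_url</code> and the sample values are assumptions for illustration, not part of the original code.</p>

```python
from urllib.parse import urlencode

YELP_SEARCH = "https://www.yelp.com/search"


def build_yelp_url(search_query, place):
    """Build a Yelp search URL with both parameters percent-encoded."""
    params = {"find_desc": search_query, "find_loc": place}
    return YELP_SEARCH + "?" + urlencode(params)


# Sample values; in the DAG they would come from Variable.get(...).
print(build_yelp_url("Coffee & Tea", "New York"))
```

<p class="graf graf--p">Keeping the URL logic in a plain function like this also makes it possible to check it without starting the scheduler.</p>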
<p class="graf graf--p">The remaining tasks are defined the same way, six Python tasks in total, and they rely on the <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong"><em class="markup--em markup--p-em">variables</em></strong></code> and <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong"><em class="markup--em markup--p-em">xcom</em></strong></code> concepts.</p>
<h2 class="graf graf--h3">Concept of xcom</h2>
<blockquote>
<pre class="graf graf--pre"><em>import requests

def get_response(**kwargs):
    """
    validating the url and forwarding the response
    """
    <strong class="markup--strong markup--pre-strong">ti = kwargs['ti']
    url = ti.xcom_pull(task_ids='url_generator')  # value returned by the url_generator task
    print('url generated: ', url)</strong>
    headers = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chrome/70.0.3538.77 Safari/537.36'}
    success = False

    for retry in range(10):  # try the request up to 10 times
        response = requests.get(url, verify=False, headers=headers)
        if response.status_code == 200:
            success = True
            break
        print("Response received: %s. Retrying : %s" % (response.status_code, url))

    if not success:
        print("Failed to process the URL: ", url)
        raise ValueError("Failed to process the URL: %s" % url)
    # the returned object is stored as an XCom, so it must be serializable by the configured XCom backend
    return response</em></pre>
<pre class="graf graf--pre"><em><strong class="markup--strong markup--pre-strong">response_generator = PythonOperator(
    task_id='response_generator',
    python_callable=get_response,
    provide_context=True,
    dag=_yelp_workflow)</strong></em></pre>
</blockquote>
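<p class="graf graf--p">The retry loop in <code class="markup--code markup--p-code">get_response</code> can be tried out without the network. The sketch below pulls the same logic into a helper that takes the fetch function as an argument; <code class="markup--code markup--p-code">fetch_with_retries</code>, <code class="markup--code markup--p-code">FakeResponse</code>, and the example URL are hypothetical names used only for this demonstration.</p>

```python
import time


def fetch_with_retries(fetch, url, attempts=10, delay_seconds=0.0):
    """Call fetch(url) until it reports status 200 or the attempts run out.

    fetch is any callable returning an object with a status_code attribute.
    """
    for attempt in range(1, attempts + 1):
        response = fetch(url)
        if response.status_code == 200:
            return response
        print("Attempt %d got %s for %s" % (attempt, response.status_code, url))
        time.sleep(delay_seconds)
    # Raising makes the task fail, which stops the downstream tasks.
    raise ValueError("Failed to process the URL: %s" % url)


class FakeResponse:
    """Hypothetical stand-in for requests.Response, used only in this demo."""

    def __init__(self, status_code):
        self.status_code = status_code


# Simulate two failures followed by a success.
statuses = iter([503, 503, 200])
result = fetch_with_retries(lambda url: FakeResponse(next(statuses)), "https://example.com")
print(result.status_code)
```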
<p class="graf graf--p">The <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">url_generator</strong></code> task returns the <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">yelp_url</strong></code>, which has to be passed to the <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">response_generator</strong></code> task, where we check the response for that URL. If the status_code of the response is <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">200</strong></code>, we return the response; otherwise we raise a <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">ValueError</strong></code> to stop the pipeline.</p>
<p class="graf graf--p"><strong class="markup--strong markup--p-strong">XCom</strong> values can be viewed on the admin page after the task runs successfully.</p>
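<p class="graf graf--p">To make the hand-off between the two tasks concrete, here is a toy in-memory model of the idea: each task's return value is stored under its task_id, and a later task pulls it by that id. This is only an illustration of the concept, not how Airflow implements XCom; <code class="markup--code markup--p-code">ToyXCom</code> and <code class="markup--code markup--p-code">run_task</code> are invented for the sketch.</p>

```python
class ToyXCom:
    """Toy in-memory model of XCom: one stored return value per task_id."""

    def __init__(self):
        self._store = {}

    def xcom_push(self, task_id, value):
        self._store[task_id] = value

    def xcom_pull(self, task_ids):
        return self._store.get(task_ids)


def run_task(xcom, task_id, func):
    """Run func with a context-like 'ti' argument and store what it returns."""
    value = func(ti=xcom)
    xcom.xcom_push(task_id, value)
    return value


def url_generator(**kwargs):
    return "https://www.yelp.com/search?find_desc=Restaurants"


def get_response(**kwargs):
    ti = kwargs["ti"]
    url = ti.xcom_pull(task_ids="url_generator")
    return "would fetch " + url


xcom = ToyXCom()
run_task(xcom, "url_generator", url_generator)
outcome = run_task(xcom, "response_generator", get_response)
print(outcome)
```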
<figure class="graf graf--figure"><img data-recalc-dims="1" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1000/1%2ADFuCAJ27zH6GGOE0_vrlcg.gif?w=800&#038;ssl=1" /></figure>
<h2 class="graf graf--h3">Concept of variable</h2>
<p class="graf graf--p">Variables are used when the user has to supply input values to the tasks, much like command-line arguments in Python.</p>
<blockquote>
<pre class="graf graf--pre"><em><strong class="markup--strong markup--pre-strong">place = Variable.get("place")
search_query = Variable.get("search_query")</strong></em></pre>
</blockquote>
<p class="graf graf--p">The variables <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">place</strong></code> and <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">search_query</strong></code> are used in the <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">url_generator</strong></code> Python function of the <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">yelp_url_generator</strong></code> task.</p>
<figure class="graf graf--figure">
<figure style="width: 1178px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1000/1%2ARTGpoXRfHbC1VD5kvUsH2A.gif?resize=800%2C535&#038;ssl=1" alt="" width="800" height="535" /><figcaption class="wp-caption-text">Variables Creation</figcaption></figure>
<figcaption class="imageCaption"></figcaption>
</figure>
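<p class="graf graf--p">Besides the web interface, variables can be prepared as a JSON file of key/value pairs. The sketch below writes such a file with sample values; the file name and the values are assumptions chosen for illustration.</p>

```python
import json
import os
import tempfile

# Sample values for the two variables read by url_generator.
variables = {"place": "New York", "search_query": "Restaurants"}

# Write the file into a temporary directory so the demo leaves no clutter.
path = os.path.join(tempfile.mkdtemp(), "variables.json")
with open(path, "w") as handle:
    json.dump(variables, handle, indent=2)

# Read it back to confirm the content is valid JSON.
with open(path) as handle:
    loaded = json.load(handle)
print(loaded["place"], loaded["search_query"])
```

<p class="graf graf--p">Assuming the Airflow 2.x command line, a file like this could then be loaded with <code class="markup--code markup--p-code">airflow variables import variables.json</code>.</p>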
<h2 class="graf graf--h3">Tasks Relationship/Arrangement</h2>
<p class="graf graf--p">The DAG makes sure that the operators run in the correct order. Check <a class="markup--anchor markup--p-anchor" href="https://airflow.apache.org/concepts.html#dag-assignment" target="_blank" rel="noopener"><strong class="markup--strong markup--p-strong">here</strong></a><strong class="markup--strong markup--p-strong">.</strong></p>
<figure class="graf graf--figure graf--layoutOutsetCenter"><img data-recalc-dims="1" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1500/1%2A8DGPBXdRHVFeiXhk0LbG1Q.gif?w=800&#038;ssl=1" /></figure>
<blockquote>
<pre class="graf graf--pre"><strong class="markup--strong markup--pre-strong">end_task &lt;&lt; validate_db &lt;&lt; writing_to_db &lt;&lt; validate_data &lt;&lt; get_data &lt;&lt; response_generator &lt;&lt; yelp_url_generator &lt;&lt; start_task</strong></pre>
</blockquote>
<p class="graf graf--p">The tasks are arranged with Airflow&#8217;s upstream operator (<code class="markup--code markup--p-code">&lt;&lt;</code>), which reads right to left: the task on the right runs before the task on the left. <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">start_task and end_task</strong></code> are dummy tasks (<strong class="markup--strong markup--p-strong"><em class="markup--em markup--p-em">optional</em></strong>). The others, <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">yelp_url_generator →response_generator →get_data →validate_data →writing_to_db →validate_db</strong></code>, are Python tasks.</p>
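<p class="graf graf--p">The chained arrangement can be modelled in a few lines of plain Python. In this toy version, <code class="markup--code markup--p-code">a &lt;&lt; b</code> records that b runs before a and returns b so the chain can continue, which mirrors how the expression above reads. <code class="markup--code markup--p-code">ToyTask</code> and <code class="markup--code markup--p-code">execution_order</code> are invented for this sketch and handle only a straight chain, not a general graph.</p>

```python
class ToyTask:
    """Toy task that records which tasks must run before it."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = []

    def __lshift__(self, other):
        # self << other: other must run before self.
        self.upstream.append(other)
        return other

    def __rshift__(self, other):
        # self >> other: self must run before other.
        other.upstream.append(self)
        return other


def execution_order(last_task):
    """Walk upstream links from the final task and return ids in run order."""
    order = []
    task = last_task
    while task is not None:
        order.append(task.task_id)
        task = task.upstream[0] if task.upstream else None
    return list(reversed(order))


names = ["start_task", "yelp_url_generator", "response_generator", "get_data",
         "validate_data", "writing_to_db", "validate_db", "end_task"]
t = {name: ToyTask(name) for name in names}

# Same arrangement as in the DAG file.
t["end_task"] << t["validate_db"] << t["writing_to_db"] << t["validate_data"] << t["get_data"] << t["response_generator"] << t["yelp_url_generator"] << t["start_task"]

print(execution_order(t["end_task"]))
```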
<p class="graf graf--p"><a class="markup--anchor markup--p-anchor" href="https://gist.githubusercontent.com/Vasistareddy/f0b5f7d73efc900f269e0aa81d04e81b/raw/8cc88cd5bd31fe219368b522b3dea3945e21caf4/yelp_business_listings.py" target="_blank" rel="noopener"><strong class="markup--strong markup--p-strong">Check the complete code here</strong></a><strong class="markup--strong markup--p-strong"> </strong></p>
<h2 class="graf graf--h3">Triggering the DAG</h2>
<p class="graf graf--p">Since we kept <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">schedule_interval=None,</strong></code> we have to manually trigger the DAG. Let’s see how to do that →</p>
<figure class="graf graf--figure graf--layoutOutsetCenter"><img data-recalc-dims="1" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1500/1%2AsUBKsJsFllw4_xwpfA2EMg.gif?w=800&#038;ssl=1" /></figure>
<figure class="graf graf--figure graf--layoutOutsetCenter">
<figure style="width: 1500px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1500/1%2ADsa6AxsvP_YrNdarurz1rA.png?resize=800%2C219&#038;ssl=1" alt="" width="800" height="219" /><figcaption class="wp-caption-text">MongoDB data</figcaption></figure>
<figure style="width: 1000px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1000/1%2A7L589dgoMSEiOumFTh5adQ.png?resize=800%2C204&#038;ssl=1" alt="" width="800" height="204" /><figcaption class="wp-caption-text">Tasks Successfully Completed</figcaption></figure>
</figure>
<h2>Tree View of each DAG run</h2>
<figure class="graf graf--figure graf--layoutOutsetCenter">
<figure style="width: 1500px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1500/1%2AedTrxUAskSUHKaVRyMNNlg.png?resize=800%2C257&#038;ssl=1" alt="" width="800" height="257" /><figcaption class="wp-caption-text">Tree View of each DAG run</figcaption></figure>
</figure>
<h2 class="graf graf--h3">Handling Cases</h2>
<p class="graf graf--p">You may be wondering why we use an Airflow setup for simple scraping. The reasons are:</p>
<ol>
<li class="graf graf--p">We can break a single large job into multiple tasks and keep control over each task at any point.</li>
<li class="graf graf--p">We get clear logs at every level.</li>
<li class="graf graf--p">We can easily connect to other servers with Airflow operators to execute the script.</li>
</ol>
<h3 class="graf graf--p">Here are a few cases handled in the workflow</h3>
<ul class="postList">
<li class="graf graf--li">When multiple DAG runs try to write the same set of data into the database.</li>
</ul>
<figure class="graf graf--figure graf--layoutOutsetCenter">
<figure style="width: 1344px" class="wp-caption aligncenter"><img data-recalc-dims="1" loading="lazy" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1500/1%2AjOghgSV1knE290aFknHQIQ.png?resize=800%2C209&#038;ssl=1" alt="" width="800" height="209" /><figcaption class="wp-caption-text">Duplicate Key Error</figcaption></figure>
<figcaption class="imageCaption"></figcaption>
</figure>
<p class="graf graf--p">The task with <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">task_id=writing_to_db</strong></code> handles this case.</p>
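<p class="graf graf--p">The article does not show how the task deals with duplicates, so the following is only one possible approach: skip any record whose unique key is already stored, so repeated DAG runs leave the data unchanged. A plain dict stands in for a MongoDB collection with a unique index, and the <code class="markup--code markup--p-code">url</code> key field and the sample records are hypothetical.</p>

```python
def write_listings(collection, listings, key="url"):
    """Insert listings into a dict-backed collection, skipping known keys.

    Returns the number of records that were actually inserted.
    """
    inserted = 0
    for listing in listings:
        unique_value = listing[key]
        if unique_value in collection:
            continue  # already stored by an earlier run
        collection[unique_value] = listing
        inserted += 1
    return inserted


# Hypothetical scraped records.
scraped = [
    {"url": "https://www.yelp.com/biz/example-one", "name": "Example One"},
    {"url": "https://www.yelp.com/biz/example-two", "name": "Example Two"},
]

collection = {}
first_run = write_listings(collection, scraped)
second_run = write_listings(collection, scraped)
print(first_run, second_run, len(collection))
```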
<ul class="postList">
<li class="graf graf--li">When the scraped data and the data pushed to the database don&#8217;t match.</li>
</ul>
<figure class="graf graf--figure graf--layoutOutsetCenter"><img data-recalc-dims="1" decoding="async" class="graf-image" src="https://i0.wp.com/cdn-images-1.medium.com/max/1500/1%2AcoOCTtnVRr_svwadeJIDPA.png?w=800&#038;ssl=1" /></figure>
<p class="graf graf--p">The task with <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">task_id=validate_db</strong></code> handles this case. If an anomaly is detected, it raises a <code class="markup--code markup--p-code"><strong class="markup--strong markup--p-strong">ValueError</strong></code>.</p>
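<p class="graf graf--p">A check of this kind might look like the sketch below, which assumes the comparison is made on record counts; the real task may compare the data differently. The function name <code class="markup--code markup--p-code">validate_counts</code> is made up for this example.</p>

```python
def validate_counts(scraped_count, stored_count):
    """Raise ValueError when the database does not hold what was scraped."""
    if scraped_count != stored_count:
        raise ValueError(
            "scraped %d listings but found %d in the database"
            % (scraped_count, stored_count)
        )
    return True


# Matching counts pass; a mismatch raises and would fail the Airflow task.
print(validate_counts(30, 30))
try:
    validate_counts(30, 28)
except ValueError as error:
    print("validation failed:", error)
```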
]]></content:encoded>
					
					<wfw:commentRss>https://turbolab.in/how-to-monitor-work-flow-of-scraping-project-with-apache-airflow/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">823</post-id>	</item>
	</channel>
</rss>
