Scott Hanselman

Dynamically generating robots.txt for ASP.NET Core sites based on environment

June 19, 2019 Posted in ASP.NET | DotNetCore | Open Source

I'm moving some of the older WebForms portions of my site that still run on bare metal to ASP.NET Core and Azure App Services, and while doing that I realized I want to make sure my staging sites don't get indexed by Google/Bing.

I already have a robots.txt, but I want one that's specific to production and others that are specific to development or staging. I thought about a number of ways to solve this. I could have a static robots.txt and another robots-staging.txt and conditionally copy one over the other during my Azure DevOps CI/CD pipeline.

Then I realized the simplest possible thing would be to just make robots.txt dynamic. I thought about writing custom middleware, but that sounded like a hassle and more code than needed. I wanted to see just how simple this could be.

  • You could do this as a single inline middleware, and just lambda and func and linq the heck out of it all on one line
  • You could write your own middleware with lots of options, then activate it based on env.IsStaging(), etc.
  • You could make a single Razor Page with environment taghelpers.
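For comparison, here is a minimal sketch of what the first option (inline middleware) might look like. It assumes the ASP.NET Core 2.x-style Startup class used elsewhere in this post. Unlike the Razor Page version, it treats every environment other than Production as "block everything", which is an assumption of this sketch rather than something the post prescribes.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddMvc();
    }

    public void Configure(IApplicationBuilder app, IHostingEnvironment env)
    {
        // Branch the pipeline for /robots.txt; all other requests fall through to MVC.
        app.Map("/robots.txt", robots => robots.Run(async context =>
        {
            context.Response.ContentType = "text/plain";

            // Assumption of this sketch: Production gets the selective rules,
            // any other environment tells crawlers to stay away entirely.
            var rules = env.IsProduction()
                ? "Disallow: /blog/private\nDisallow: /blog/secret\nDisallow: /blog/somethingelse\n"
                : "Disallow: /\n";

            await context.Response.WriteAsync("User-agent: *\n" + rules);
        }));

        app.UseMvc();
    }
}
```

This works without any extra route mapping, but the rules live in compiled code, so changing them means a rebuild and redeploy.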

The last one seemed easiest and would also mean I could change the cshtml without a full recompile, so I made a single Razor Page called RobotsTxt.cshtml. No page model, no code behind. Then I used the built-in environment tag helper to conditionally generate parts of the file. Note also that I forced the MIME type to text/plain and I don't use a Layout page, as this needs to stand alone.

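@page
@{
    // No shared layout: the output must be the bare robots.txt content.
    Layout = null;
    // Crawlers expect plain text, not HTML.
    this.Response.ContentType = "text/plain";
}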
# /robots.txt file for http://www.hanselman.com/
User-agent: *
<environment include="Development,Staging">Disallow: /</environment>
<environment include="Production">Disallow: /blog/private
Disallow: /blog/secret
Disallow: /blog/somethingelse</environment>

I then make sure that my Staging and/or Production systems have ASPNETCORE_ENVIRONMENT variables set appropriately.

ASPNETCORE_ENVIRONMENT=Staging

I also want to point out what may look like odd spacing and how some text is butted up against the TagHelpers. Remember that a TagHelper's tag sometimes "disappears" (is elided) when it's done its thing, but the whitespace around it remains. So I want User-agent: * to have a line, and then Disallow to show up immediately on the next line. While it might be prettier source code to have that start on another line, it's not a correct file then. I want the result to be tight and above all, correct. This is for staging:

User-agent: *
Disallow: /

This now gives me a robots.txt at /robotstxt but not at /robots.txt. See the issue? Robots.txt is a file (or a fake one) so I need to map a route from the request for /robots.txt to the Razor page called RobotsTxt.cshtml.

Here I add a RazorPagesOptions in my Startup.cs with a custom PageRoute that maps /robots.txt to /robotstxt. (I've always found this API annoying, as the parameters should, IMHO, be reversed like ("from","to"), so watch out for that, lest you waste ten minutes like I just did.)

public void ConfigureServices(IServiceCollection services)
{
    services.AddMvc()
        .AddRazorPagesOptions(options =>
        {
            // Page name comes first, then the route it should answer to.
            options.Conventions.AddPageRoute("/robotstxt", "/Robots.Txt");
        });
}

And that's it! Simple and clean.

You could also add caching if you wanted, either as a larger middleware, or even in the cshtml Page, like

context.Response.Headers.Add("Cache-Control", $"max-age=SOMELARGENUMBEROFSECONDS");

but I'll leave that small optimization as an exercise to the reader.
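As one possible take on that exercise, here is a minimal sketch of the middleware route. The UseRobotsTxtCaching extension method and its maxAgeSeconds parameter are hypothetical names made up for this example, not part of ASP.NET Core, and the lifetime you pass in is yours to choose.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;

public static class RobotsTxtCachingExtensions
{
    // Hypothetical helper: adds a Cache-Control header to /robots.txt requests.
    public static IApplicationBuilder UseRobotsTxtCaching(
        this IApplicationBuilder app, int maxAgeSeconds)
    {
        return app.Use(async (context, next) =>
        {
            // StartsWithSegments compares case-insensitively by default.
            if (context.Request.Path.StartsWithSegments("/robots.txt"))
            {
                // Set the header before the Razor Page starts writing the body;
                // headers cannot be changed once the response has started.
                context.Response.Headers["Cache-Control"] =
                    $"public, max-age={maxAgeSeconds}";
            }

            await next();
        });
    }
}
```

You would register it in Configure ahead of app.UseMvc(), for example app.UseRobotsTxtCaching(86400); where 86400 (one day) is just an illustrative value.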

UPDATE: After I was done I found this robots.txt middleware and NuGet up on GitHub. I'm still happy with my code and I don't mind not having an external dependency, but it's nice to file this one away for future more sophisticated needs and projects.

How do you handle your robots.txt needs? Do you even have one?



About Scott

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.

June 21, 2019 9:44
I would also recommend applying this through an on-page meta tag, as well as an HTTP header for non-HTML resources. We would like to assume that all crawlers obey robots.txt, but we can't guarantee it.

In a worst-case scenario, I would also canonicalise those environments back to the production environment.

My experience with this is building and maintaining a CMS for a digital marketing agency where there is a heavy reliance on strict SEO control.
June 21, 2019 16:27
I use a simple rewrite rule. It looks at http_host (domain) and says that if it doesn't start with www, then rewrite to the non-live robots.txt; otherwise carry on and serve the regular robots.txt.


<rewrite>
  <rules>
    <clear />
    <rule name="Robots.txt">
      <match url="robots.txt" />
      <conditions>
        <add input="{HTTP_HOST}" negate="true" pattern="^www" />
      </conditions>
      <action type="Rewrite" url="non-live-robots.txt" />
    </rule>
  </rules>
</rewrite>


You can of course tweak the conditions to match whatever you want.
June 26, 2019 13:41
First thing I thought of when I started reading this post was a Razor page like you ended up using. Why? Because I'm awesome.
June 28, 2019 16:34
How about a T4 template?


Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.